SpectoResNet: Advancing Speech Emotion Recognition through Deep Learning and Data Augmentation on the CREMA-D Dataset
* 1 , 2 , 3 , 3
1  IL3CUB Laboratory, University of Mohamed Khider, Biskra, 07000, Algeria
2  VSC Laboratory, University of Mohamed Khider, Biskra, 07000, Algeria
3  VSC Laboratory, Department of Electrical Engineering, University of Mohamed Khider Biskra, Algeria
Academic Editor: Eugenio Vocaturo

Abstract:

Speech emotion recognition (SER) is challenging because emotional expression in audio signals is intricate and non-linear. In this work, we introduce SpectoResNet, a modified ResNet architecture tuned to classify emotions from audio features of the CREMA-D dataset. CREMA-D, provided by the Speech and Emotion Research Group from New York University (NYU), is a crowdsourced dataset of 7,442 audio-visual recordings from 91 actors expressing happiness, sadness, anger, and neutrality, among other emotions. Although the dataset offers opportunities for research on emotion recognition, its intrinsic variety and the subtle differences arising from individual traits and contextual environments pose significant obstacles to precise classification. To address these obstacles, we converted the voice signals into 2D spectrograms so that ResNet's deep convolutional neural network (CNN) could analyze and classify the emotions. ResNet was originally developed for image recognition and relies on residual connections to train very deep networks effectively. Data augmentation (adding noise and shifting pitch) was used to simulate the variability of real-world speech and to make the model robust to different acoustic environments. Trained on the augmented spectrogram data, our model achieved 65.20% classification accuracy, a state-of-the-art result in vocal emotion recognition using deep learning. The success of SpectoResNet highlights the ability of deep CNNs to extract detailed patterns and subtleties from emotional audio expressions, paving the way for more advanced models for multimodal emotion recognition.

Keywords: Automatic Speech Emotion Recognition, ASER, CREMA-D, Deep Learning, CNNs, ResNet, Spectrograms.
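
The preprocessing described in the abstract (noise and pitch augmentation followed by conversion to a 2D spectrogram) can be illustrated with a minimal NumPy sketch. This is not the authors' pipeline: the function names, the 20 dB signal-to-noise ratio, the FFT size, and the hop length are all assumptions chosen for illustration, and the pitch shift is a naive resampling stand-in that also changes the clip duration.

```python
import numpy as np


def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise at a target signal-to-noise ratio in dB (assumed value)."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


def naive_pitch_shift(signal, semitones):
    """Shift pitch by resampling with linear interpolation.

    Hypothetical stand-in: unlike a phase-vocoder pitch shifter,
    this also shortens (or lengthens) the signal.
    """
    factor = 2.0 ** (semitones / 12.0)
    n_out = int(round(len(signal) / factor))
    positions = np.arange(n_out) * factor
    return np.interp(positions, np.arange(len(signal)), signal)


def log_spectrogram(signal, n_fft=512, hop=128):
    """Return a log-magnitude spectrogram of shape (freq_bins, frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop: i * hop + n_fft] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return 20.0 * np.log10(magnitude + 1e-10).T


# Usage on a synthetic 1-second, 220 Hz tone sampled at 16 kHz (not CREMA-D data).
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = np.sin(2.0 * np.pi * 220.0 * t)
augmented = naive_pitch_shift(add_noise(clean, snr_db=20.0, rng=rng), semitones=2.0)
image = log_spectrogram(augmented)  # 2D array that a CNN could take as input
```

The resulting 2D array plays the role of an image, which is what allows an architecture designed for image recognition, such as ResNet, to be applied to speech.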

 
 