SpectoResNet: Advancing Speech Emotion Recognition through Deep Learning and Data Augmentation on the CREMA-D Dataset
1  IL3CUB Laboratory, University of Mohamed Khider, Biskra, 07000, Algeria
2  VSC Laboratory, University of Mohamed Khider, Biskra, 07000, Algeria
3  VSC Laboratory, Department of Electrical Engineering, University of Mohamed Khider, Biskra, Algeria
Academic Editor: Eugenio Vocaturo

Abstract:

Speech emotion recognition (SER) is a particularly challenging task due to the intricate, non-linear character of emotional expression in audio signals. In this work, we introduce SpectoResNet, a modified ResNet architecture tuned for classifying emotions from audio features of the CREMA-D dataset. CREMA-D (the Crowd-sourced Emotional Multimodal Actors Dataset) consists of 7,442 audio-visual recordings from 91 actors, covering happiness, sadness, anger, and neutrality, among other emotions. While the dataset offers rich opportunities for emotion-recognition research, its intrinsic variety and the subtle differences introduced by individual traits and contextual environments pose significant obstacles to precise classification. To address this, we convert voice signals into 2D spectrograms so that the deep convolutional layers of ResNet can analyze and classify the emotions. ResNet was originally developed for image recognition and relies on residual connections to train very deep networks effectively. Data augmentation (adding noise and shifting pitch) was used to simulate the variability of real speech and make the model robust across acoustic environments. Trained on the augmented spectrogram data, our model achieved 65.20% classification accuracy, a strong result for vocal emotion recognition with deep learning. The success of SpectoResNet underscores the ability of deep CNNs to extract detailed patterns and subtleties from emotional audio expressions, paving the way toward more advanced models for multimodal emotion recognition.
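The pipeline described above (spectrogram conversion plus noise and pitch augmentation) can be sketched with plain numpy. This is a minimal illustration, not the authors' implementation: the function names and parameters (`n_fft`, `hop`, `snr_db`, the pitch factor) are assumptions chosen for the example, and a real system would more likely use a dedicated audio library.

```python
import numpy as np

def spectrogram(signal, n_fft=256, hop=128):
    """Magnitude spectrogram via a windowed short-time Fourier transform."""
    window = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * window
              for i in range(0, len(signal) - n_fft + 1, hop)]
    stft = np.fft.rfft(np.stack(frames), axis=1)
    return np.abs(stft).T  # shape: (freq_bins, time_frames)

def add_noise(signal, snr_db=20.0, rng=None):
    """Additive white noise at a target signal-to-noise ratio (in dB)."""
    rng = rng or np.random.default_rng(0)
    sig_power = np.mean(signal ** 2)
    noise_power = sig_power / (10 ** (snr_db / 10))
    return signal + rng.normal(0.0, np.sqrt(noise_power), len(signal))

def pitch_shift(signal, factor=1.05):
    """Crude pitch shift by resampling (note: also changes duration)."""
    idx = np.arange(0, len(signal), factor)
    return np.interp(idx, np.arange(len(signal)), signal)

# Example: augment a synthetic 1-second 440 Hz tone at 16 kHz,
# then turn it into the 2D spectrogram a CNN would consume.
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
tone = np.sin(2 * np.pi * 440 * t)
spec = spectrogram(add_noise(pitch_shift(tone)))
print(spec.shape)  # (freq_bins, time_frames)
```

The resulting 2D array plays the role of an image, which is what lets an image-recognition architecture such as ResNet be applied to speech.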

Keywords: Automatic Speech Emotion Recognition (ASER), CREMA-D, Deep Learning, CNNs, ResNet, Spectrograms.
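The residual connections mentioned in the abstract are the key idea ResNet contributes. A toy numpy sketch (a fully connected block, purely illustrative and not the SpectoResNet architecture) shows why they help: when the learned transform F is near zero, the block reduces to an identity-like mapping, so adding depth cannot easily hurt.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, w1, w2):
    """Toy residual block: y = ReLU(x + F(x)).
    The skip connection means the block only learns a residual on top
    of the identity, which is what makes very deep networks trainable."""
    h = relu(x @ w1)           # F's inner layer
    return relu(x + h @ w2)    # add the input back before the activation

rng = np.random.default_rng(42)
x = rng.standard_normal(8)

# With zero weights, F(x) = 0 and the block collapses to ReLU(x):
# "do nothing" is trivially learnable, unlike in a plain deep stack.
w_zero = np.zeros((8, 8))
y = residual_block(x, w_zero, w_zero)
print(np.allclose(y, np.maximum(x, 0.0)))  # True
```

In the real architecture the dense layers are 2D convolutions over the spectrogram, but the skip-connection arithmetic is the same.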