Speech emotion recognition (SER) is a challenging task because emotional expression in audio signals is intricate and non-linear. In this work, we introduce SpectoResNet, a modified ResNet architecture tuned to classify emotions from audio features in the CREMA-D dataset. CREMA-D, provided by the Speech and Emotion Research Group from New York University (NYU), is a crowdsourced dataset of 7,442 audio-visual recordings from 91 actors displaying happiness, sadness, anger, and neutrality, among other emotions. Although the dataset supports research into emotion recognition, its intrinsic variety and the subtle differences arising from individual traits and contextual environments pose significant obstacles to precise classification. To address these obstacles, we converted the voice signals into 2D spectrograms so that ResNet's deep CNN could analyze and classify the emotions. ResNet was originally developed for image recognition and relies on residual connections to train very deep networks effectively. Data augmentation, specifically adding noise and changing pitch, was used to simulate the variability found in real-time speech and to make the model robust across different acoustic environments. Our model, trained on augmented spectrogram data, achieved 65.20% classification accuracy, which we present as a state-of-the-art result in vocal emotion recognition using deep learning. The success of SpectoResNet highlights the ability of deep CNNs to extract detailed patterns and subtleties from emotional audio expressions, paving the way for more advanced models for multimodal emotion recognition.
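The abstract does not give implementation details, so the following is only a minimal sketch of two of the steps it names: noise augmentation and conversion of a waveform into a 2D spectrogram. Everything specific here is an assumption made for illustration: the function names `add_noise` and `log_spectrogram`, the 16 kHz sample rate, the 20 dB signal-to-noise ratio, the FFT size and hop length, and the use of a plain log-magnitude STFT. A synthetic tone stands in for a CREMA-D recording.

```python
import numpy as np


def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise at a target signal-to-noise ratio in dB."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


def log_spectrogram(signal, n_fft=512, hop=128):
    """Log-magnitude STFT, returned with shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Slice the waveform into overlapping, windowed frames.
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # A small offset avoids log(0); transpose to (frequency, time).
    return 20 * np.log10(magnitude + 1e-10).T


# Hypothetical stand-in for one recording: 1 second of a 220 Hz tone.
rng = np.random.default_rng(0)
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
clean = 0.5 * np.sin(2 * np.pi * 220 * t)

noisy = add_noise(clean, snr_db=20, rng=rng)  # augmented copy
spec = log_spectrogram(noisy)                 # 2D input for a CNN
print(spec.shape)
```

The resulting array can be treated like a single-channel image and passed to an image-style CNN such as ResNet. Pitch changing, the other augmentation the abstract mentions, is not shown; a library routine such as `librosa.effects.pitch_shift` is one option for it.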
SpectoResNet: Advancing Speech Emotion Recognition through Deep Learning and Data Augmentation on the CREMA-D Dataset
Published: 02 December 2024
by MDPI
in The 5th International Electronic Conference on Applied Sciences
session Computing and Artificial Intelligence
Abstract:
Keywords: Automatic Speech Emotion Recognition, ASER, CREMA-D, Deep Learning, CNNs, ResNet, Spectrograms.