Multimodal Sentiment Analysis with Transformer Networks: Bridging Speech, Text, and Facial Expressions
Published: 03 December 2025 by MDPI in The 6th International Electronic Conference on Applied Sciences, session Computing and Artificial Intelligence
Abstract: Conventional approaches to sentiment analysis typically rely on a single modality, such as text or speech, and therefore fail to capture the full richness of human emotional expression. With the rapid advance of deep learning, Multimodal Sentiment Analysis (MSA) has emerged as a powerful technology for merging diverse data sources to deepen emotional understanding. This paper introduces a transformer-based architecture that integrates speech, text, and facial expressions to improve the accuracy of sentiment classification. Through self-attention, transformer networks capture long-range dependencies and cross-modal interactions that traditional recurrent or convolutional models struggle to represent. The proposed system extracts textual embeddings from a pre-trained language model, acoustic features from spectrogram-based encoders, and visual features from facial landmark and expression recognition systems. A cross-modal attention fusion mechanism aligns and dynamically weights features across modalities, yielding richer, more context-aware sentiment cues. Experiments on benchmark datasets, including CMU-MOSEI and IEMOCAP, show that the proposed model achieves 87.6% accuracy and an 86.9% F1-score, surpassing unimodal and early-fusion baselines by 6.4 and 5.8 percentage points, respectively. The architecture also remains accurate on subtle or ambiguous emotions. These results demonstrate the potential of transformer-based MSA systems in real-world scenarios such as human–computer interaction, healthcare, social robotics, and digital learning environments, paving the way toward emotionally intelligent and responsive AI systems.
Keywords: Multimodal Sentiment Analysis; Transformer Networks; Speech Emotion Recognition; Textual Sentiment Classification; Facial Expression Analysis; Cross-Modal Fusion
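To make the fusion step concrete, the sketch below is a minimal, hypothetical PyTorch illustration of cross-modal attention fusion, not the authors' implementation. It assumes feature sequences for each modality have already been extracted and projected to a common dimension; the module name `CrossModalFusion`, the dimensions, and the gating scheme are illustrative assumptions.

```python
# Minimal sketch of cross-modal attention fusion (hypothetical, not the
# authors' code). Assumes text, audio, and visual feature sequences have
# already been projected to a shared dimension d_model.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_classes=3):
        super().__init__()
        # Text tokens act as queries over the audio and visual streams.
        self.attn_ta = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_tv = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # A learned gate dynamically balances the two cross-modal streams.
        self.gate = nn.Linear(2 * d_model, d_model)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, text, audio, visual):
        # Each input: (batch, seq_len_modality, d_model)
        ta, _ = self.attn_ta(text, audio, audio)    # text attends to audio
        tv, _ = self.attn_tv(text, visual, visual)  # text attends to visual
        g = torch.sigmoid(self.gate(torch.cat([ta, tv], dim=-1)))
        fused = g * ta + (1 - g) * tv               # gated blend of streams
        pooled = fused.mean(dim=1)                  # temporal average pooling
        return self.classifier(pooled)

# Usage with random tensors standing in for encoder outputs.
model = CrossModalFusion()
text = torch.randn(2, 20, 256)    # e.g., pre-trained LM token embeddings
audio = torch.randn(2, 50, 256)   # e.g., spectrogram encoder frames
visual = torch.randn(2, 30, 256)  # e.g., facial landmark/expression features
logits = model(text, audio, visual)
print(logits.shape)  # torch.Size([2, 3])
```

In this sketch, the sigmoid gate is one simple way to realize the dynamic balancing of modalities described in the abstract; the actual fusion strategy, pooling, and class count in the paper may differ.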
