Comparing GPT and Human Affective Evaluations of Social Images

Published: 27 March 2026 by MDPI in The 1st International Online Conference on Behavioral Sciences, session Cognition

Abstract: Understanding how multimodal large language models interpret emotion is essential for evaluating their psychological plausibility. This study investigated how GPT encodes affect in complex social contexts by comparing its emotion ratings with human normative data. Using a database of 274 images depicting diverse interpersonal interactions, we examined model–human correspondence across both dimensional (valence, arousal) and categorical (angry, disgusted, fearful, happy, neutral, sad) measures. GPT-4o was instructed to generate continuous ratings on the same scales used in human assessments. At the dimensional level, GPT systematically assigned higher valence and arousal scores than humans, producing an affective landscape that appeared more positive and more activated overall. The model also exhibited greater variability within semantic categories, suggesting a looser or less constrained mapping of image features to affective dimensions. At the categorical level, confusion-matrix analyses revealed close alignment with human labels for highly prototypical expressions, particularly happiness, anger, and sadness. In contrast, the model frequently misclassified images associated with disgust, fear, and neutrality, indicating difficulty distinguishing among emotions that share overlapping contextual or semantic cues. Together, these findings suggest that GPT relies on a coarse, semantically organized evaluative axis when interpreting emotions in social scenes—an axis that only partially captures the structure of human affective representations.

Keywords: GPT; valence; arousal; social images
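The categorical comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' analysis code; the toy labels, category list, and helper function below are illustrative assumptions. It shows how human and model labels over the six categories would be tallied into a confusion matrix (rows = human label, columns = model label) and how per-image agreement would be summarized:

```python
from collections import Counter

# The six categorical labels used in the study.
CATEGORIES = ["angry", "disgusted", "fearful", "happy", "neutral", "sad"]

def confusion_matrix(human, model, categories=CATEGORIES):
    """Count co-occurrences of (human label, model label) pairs.

    Returns a nested list: rows indexed by human label, columns by model label.
    """
    counts = Counter(zip(human, model))
    return [[counts[(h, m)] for m in categories] for h in categories]

# Toy labels (invented for illustration, not the study's data): the model
# agrees on prototypical items but confuses "disgusted" with "angry" and
# "fearful" with "neutral", mirroring the error pattern the abstract reports.
human = ["happy", "happy", "angry", "disgusted", "fearful", "neutral"]
model = ["happy", "happy", "angry", "angry", "neutral", "neutral"]

cm = confusion_matrix(human, model)
# Overall agreement = sum of the diagonal / number of images.
accuracy = sum(cm[i][i] for i in range(len(CATEGORIES))) / len(human)
```

Off-diagonal cells localize the disagreements: here `cm[1][0]` (human "disgusted", model "angry") and `cm[2][4]` (human "fearful", model "neutral") each hold one miscount, while the diagonal captures the agreed-upon prototypical items.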
