A Three-Stage Transformer-Based Approach for Food Mass Estimation

Sinda Besrour; Ghazal Rouhafzay; Jalila Jbilou

doi:10.3390/ECSA-12-26521

Sinda Besrour*¹, Ghazal Rouhafzay*¹, Jalila Jbilou²,³

¹ Department of Computer Science, Université de Moncton
² Department of Psychology, Université de Moncton
³ Centre de Formation Médicale du Nouveau-Brunswick

Academic Editor: Francisco Falcone

Published: 07 November 2025 by MDPI in The 12th International Electronic Conference on Sensors and Applications, session: Sensor Networks, IoT, Smart Cities and Health Monitoring

https://doi.org/10.3390/ECSA-12-26521

Abstract:

Accurate food mass estimation is a key component of automated calorie estimation tools, and there is growing interest in using image analysis for this purpose because of its ease of use and scalability. However, current methods face important limitations. Some rely on 3D sensors for depth estimation, which are not widely accessible, while others depend on camera intrinsic parameters to estimate volume, which reduces their adaptability across devices. Furthermore, AI-based approaches that bypass these parameters often generalize poorly to images captured with different sensors or camera settings. To overcome these challenges, we introduce a three-stage, transformer-based method for estimating food mass from RGB images that balances accuracy, computational efficiency, and scalability. The first stage applies the Segment Anything Model (SAM 2) to segment food items in images from the SUECFood dataset. Next, we use the Global-Local Path Network (GLPN) to perform monocular depth estimation (MDE) on the Nutrition5k dataset, inferring depth information from a single image. These outputs are then combined through alpha compositing to generate enhanced composite images with precise object boundaries. Finally, a Vision Transformer (ViT) model processes the composite images to estimate food mass by extracting relevant visual and spatial features. Our method achieves notable improvements in accuracy compared to previous approaches, with a mean squared error (MSE) of 5.61 and a mean absolute error (MAE) of 1.07. Notably, this pipeline does not require specialized hardware such as depth sensors or multi-view imaging, making it well suited for practical deployment. Future work will explore the integration of ingredient recognition to support a more comprehensive dietary assessment system.
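
The abstract does not spell out how the segmentation and depth outputs are composited, so the following is only a minimal sketch of the two generic building blocks it names: the standard "over" alpha-compositing rule and the MSE/MAE error metrics. The choice of which layer is the foreground, the 50% opacity, the single-channel images stored as nested lists, and the helper names `alpha_composite` and `mse_mae` are all assumptions made for illustration, not details taken from the paper.

```python
from typing import List, Sequence, Tuple

Grid = List[List[float]]  # a single-channel image as rows of floats in [0, 1]


def alpha_composite(foreground: Grid, background: Grid, alpha: Grid) -> Grid:
    """Blend two single-channel images with the standard "over" rule.

    Per pixel: out = alpha * foreground + (1 - alpha) * background.
    """
    return [
        [a * f + (1.0 - a) * b for f, b, a in zip(f_row, b_row, a_row)]
        for f_row, b_row, a_row in zip(foreground, background, alpha)
    ]


def mse_mae(predicted: Sequence[float], actual: Sequence[float]) -> Tuple[float, float]:
    """Return (mean squared error, mean absolute error) for paired values."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [p - t for p, t in zip(predicted, actual)]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return mse, mae


# Toy 2x2 example; every value below is made up for illustration.
gray = [[0.2, 0.4], [0.6, 0.8]]    # grayscale stand-in for the RGB image
depth = [[1.0, 0.5], [0.5, 0.0]]   # stand-in for a normalized depth map
mask = [[1, 0], [0, 1]]            # stand-in for a binary food mask

# Assumption: depth is blended at 50% opacity, and only inside the mask.
alpha = [[0.5 * m for m in row] for row in mask]
composite = alpha_composite(depth, gray, alpha)

# Hypothetical predicted vs. reference masses, just to exercise the metrics.
mse, mae = mse_mae([100.0, 52.0], [98.0, 55.0])
```

In this toy case the masked pixels move halfway toward the depth value while pixels outside the mask keep their original intensity. A real pipeline would more likely operate on full-resolution arrays with an image or array library.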

Keywords: Food Mass Estimation; Calorie Estimation; Vision Transformer; Monocular Depth Estimation; Segment Anything Model
