A Three-Stage Transformer-Based Approach for Food Mass Estimation
1  Department of Computer Science, Université de Moncton
2  Department of Psychology, Université de Moncton
3  Centre de Formation Médicale du Nouveau-Brunswick
Academic Editor: Francisco Falcone

https://doi.org/10.3390/ECSA-12-26521 (registering DOI)
Abstract:

Accurate food mass estimation is a key component of automated calorie estimation tools, and there is growing interest in leveraging image analysis for this purpose due to its ease of use and scalability. However, current methods face important limitations. Some rely on 3D sensors for depth estimation, which are not widely accessible to all users, while others depend on camera intrinsic parameters to estimate volume, reducing their adaptability across different devices. Furthermore, AI-based approaches that bypass these parameters often struggle with generalizability when applied to images captured using diverse sensors or camera settings. To overcome these challenges, we introduce a three-stage, transformer-based method for estimating food mass from RGB images, balancing accuracy, computational efficiency, and scalability. The first stage applies the Segment Anything Model (SAM 2) to segment food items in images from the SUECFood dataset. Next, we use the Global-Local Path Network (GLPN) to perform monocular depth estimation (MDE) on the Nutrition5k dataset, inferring depth information from a single image. These outputs are then combined through alpha compositing to generate enhanced composite images with precise object boundaries. Finally, a Vision Transformer (ViT) model processes the composite images to estimate food mass by extracting relevant visual and spatial features. Our method achieves notable improvements in accuracy compared to previous approaches, with a mean squared error (MSE) of 5.61 and a mean absolute error (MAE) of 1.07. Notably, this pipeline does not require specialized hardware like depth sensors or multi-view imaging, making it well-suited for practical deployment. Future work will explore the integration of ingredient recognition to support a more comprehensive dietary assessment system.
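
The alpha compositing step that merges the segmentation and depth outputs can be illustrated with a minimal sketch. The abstract does not specify how the layers are blended, so everything here is an assumption made for illustration: the function name `composite_depth_over_rgb`, the choice of the depth map as the foreground layer, the min-max depth normalisation, and the opacity value `alpha=0.5`. It applies the standard "over" operator, with the depth layer visible only inside the food mask.

```python
import numpy as np


def composite_depth_over_rgb(rgb, depth, mask, alpha=0.5):
    """Blend a depth map over an RGB image inside a food mask.

    rgb:   (H, W, 3) float array with values in [0, 1]
    depth: (H, W) float array on any scale (e.g. a monocular depth estimate)
    mask:  (H, W) boolean array, True on food pixels (e.g. a segmentation mask)
    alpha: opacity of the depth layer; 0.5 is an assumed, illustrative value
    """
    rgb = np.asarray(rgb, dtype=np.float64)
    depth = np.asarray(depth, dtype=np.float64)
    mask = np.asarray(mask, dtype=bool)

    # Normalise depth to [0, 1] so it can be blended with colour values.
    span = depth.max() - depth.min()
    if span > 0:
        depth_norm = (depth - depth.min()) / span
    else:
        depth_norm = np.zeros_like(depth)

    # Grey-scale depth layer replicated across the three colour channels.
    depth_layer = np.repeat(depth_norm[..., None], 3, axis=2)

    # Per-pixel opacity: `alpha` on food pixels, 0 everywhere else.
    a = (alpha * mask)[..., None]

    # "Over" operator with an opaque background:
    # out = a * foreground + (1 - a) * background
    return a * depth_layer + (1.0 - a) * rgb
```

In this sketch, pixels outside the mask keep their original colour, while food pixels become a blend of colour and normalised depth. The resulting array has the same shape as the input image, so it could be passed to an image regressor in place of the plain RGB input.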

Keywords: Food Mass Estimation; Calorie Estimation; Vision Transformer; Monocular Depth Estimation; Segment Anything Model

 
 