SightSeeingGemma: Enhancing Assistive AI for the Visually Impaired via Object Detection and Monocular Depth Estimation with Language-Based Scene Understanding
Unlimited Research Group of AI (URA), Ho Chi Minh City University of Technology (HCMUT), Vietnam National University – Ho Chi Minh City (VNU-HCM), Vietnam
Academic Editor: Lucia Billeci

Abstract:

This paper presents an integrated assistive approach that combines multimodal vision-language models with advanced computer vision techniques to support visually impaired individuals. The proposed system performs object detection with a YOLOv8 model custom-trained on an expanded dataset, together with MiDaS for monocular depth estimation. This combination extracts visual cues about surrounding objects and their relative distances, enabling near real-time hazard recognition and contextual descriptions delivered as voice feedback. The system is deployed on a cloud/server infrastructure using a lightweight prototype (Google Colab + Ngrok), which introduces an average latency of 15–17 seconds per response. A dedicated Vietnamese dataset of annotated images, warnings, and context-specific descriptions was developed. Semantic alignment between model-generated and human-written descriptions was evaluated with cosine similarity over SBERT embeddings, reaching approximately 0.95, well above the 0.5 threshold. Natural Language Inference (NLI) was used to assess logical consistency: for overall descriptions, 72.5% were labeled neutral, 14.5% entailment, and 13.0% contradiction; for hazard warnings, 73.5% were entailment, 25.5% contradiction, and only 1.0% neutral. These results indicate that the model produces reliable hazard warnings, while general summaries may omit minor contextual details. In conclusion, integrating vision-language models with object detection and depth estimation offers a scalable and effective assistive solution for the visually impaired. The system achieves high semantic fidelity in descriptive tasks and robust hazard communication, demonstrating its potential for real-world deployment in accessibility technologies.
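For readers who want a concrete picture of the sensing pipeline summarized above, the sketch below approximates it with off-the-shelf components. It is a minimal illustration, not the authors' implementation: the public yolov8n.pt weights stand in for the paper's custom-trained detector, MiDaS_small is loaded from torch.hub, and the describe_scene helper and its near_fraction threshold are illustrative assumptions.

```python
# Minimal sketch of a YOLOv8 + MiDaS detection-and-depth pipeline
# (assumed components; the paper's custom-trained detector and deployment code differ).
import cv2
import numpy as np
import torch
from ultralytics import YOLO

# Public YOLOv8 weights as a stand-in for the custom-trained model.
detector = YOLO("yolov8n.pt")

# MiDaS small model and its matching preprocessing transform from torch.hub.
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = midas_transforms.small_transform

def describe_scene(image_path: str, near_fraction: float = 0.7):
    """Return (label, relative_depth, is_near) cues for detected objects.

    MiDaS predicts *relative* inverse depth, so larger values mean closer;
    the near_fraction threshold is an illustrative assumption.
    """
    img_bgr = cv2.imread(image_path)
    img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)

    # 1) Object detection on the frame.
    det = detector(img_rgb, verbose=False)[0]

    # 2) Monocular depth estimation over the whole frame.
    with torch.no_grad():
        depth = midas(transform(img_rgb))
        depth = torch.nn.functional.interpolate(
            depth.unsqueeze(1), size=img_rgb.shape[:2],
            mode="bicubic", align_corners=False,
        ).squeeze().cpu().numpy()

    # 3) Fuse: median relative depth inside each detected bounding box.
    cues = []
    for box, cls_id in zip(det.boxes.xyxy.cpu().numpy(), det.boxes.cls.cpu().numpy()):
        x1, y1, x2, y2 = box.astype(int)
        label = det.names[int(cls_id)]
        rel_depth = float(np.median(depth[y1:y2, x1:x2]))
        is_near = rel_depth > near_fraction * depth.max()  # closer objects flag a hazard cue
        cues.append((label, rel_depth, is_near))
    return cues
```

In the reported system, cues of this kind are passed to the vision-language model, which generates the Vietnamese descriptions and hazard warnings delivered to the user as voice feedback.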

Keywords: Assistive technology, Computer vision, YOLOv8, Depth estimation, Multimodal vision-language models, Semantic similarity, Natural Language Inference, Visual impairment support
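The semantic-alignment check reported in the abstract can be sketched with SBERT embeddings and cosine similarity. The example below is a hedged illustration: the paraphrase-multilingual-MiniLM-L12-v2 model is an assumption chosen because the descriptions are Vietnamese; the abstract does not name the exact SBERT variant used.

```python
# Sketch of the SBERT cosine-similarity evaluation described in the abstract.
# The embedding model name is an assumption (multilingual, to handle Vietnamese text);
# the paper's exact SBERT variant is not specified here.
from sentence_transformers import SentenceTransformer, util

sbert = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def semantic_alignment(generated: str, reference: str, threshold: float = 0.5):
    """Cosine similarity between SBERT embeddings of a model-generated description
    and a human-written reference, plus whether it clears the 0.5 threshold."""
    emb = sbert.encode([generated, reference], convert_to_tensor=True,
                       normalize_embeddings=True)
    score = float(util.cos_sim(emb[0], emb[1]))
    return score, score >= threshold
```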