SightSeeingGemma: Enhancing Assistive AI for the Visually Impaired via Object Detection and Monocular Depth Estimation with Language-Based Scene Understanding
Unlimited Research Group of AI (URA), Ho Chi Minh City University of Technology (HCMUT), Vietnam National University – Ho Chi Minh City (VNU-HCM), Vietnam
Academic Editor: Lucia Billeci

Abstract:

This paper presents an integrated assistive approach that combines multimodal vision-language models with advanced computer vision techniques to support visually impaired individuals. The proposed system performs object detection with a YOLOv8 model custom-trained on an expanded dataset, together with MiDaS for monocular depth estimation. This combination extracts visual cues about surrounding objects and their relative distances, enabling near real-time hazard recognition and contextual descriptions delivered as voice feedback. The system is deployed on a cloud/server infrastructure using a lightweight prototype (Google Colab + Ngrok), which introduces an average latency of 15–17 seconds per response. A dedicated Vietnamese dataset of annotated images, warnings, and context-specific descriptions was developed. Semantic alignment between model-generated and human-written descriptions was evaluated with cosine similarity over SBERT embeddings, reaching approximately 0.95, well above the 0.5 threshold. Natural Language Inference (NLI) was used to assess logical consistency: for overall descriptions, 72.5% were labeled neutral, 14.5% entailment, and 13.0% contradiction; for hazard warnings, 73.5% were entailment, 25.5% contradiction, and only 1.0% neutral. These results indicate that the model produces reliable hazard warnings, while general summaries may omit minor contextual details. In conclusion, integrating vision-language models with object detection and depth estimation offers a scalable and effective assistive solution for the visually impaired. The system achieves high semantic fidelity in descriptive tasks and robust hazard communication, demonstrating its potential for real-world deployment in accessibility technologies.
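For readers who want a concrete picture of the sensing pipeline summarized above, the sketch below approximates it with off-the-shelf components. It is a minimal illustration, not the authors' implementation: the public yolov8n.pt weights stand in for the paper's custom-trained detector, MiDaS_small is loaded from torch.hub, and the describe_scene helper and its near_fraction threshold are illustrative assumptions.

```python
# Minimal sketch of a YOLOv8 + MiDaS detection-and-depth pipeline
# (assumed components; the paper's custom-trained detector and deployment code differ).
import cv2
import numpy as np
import torch
from ultralytics import YOLO

# Public YOLOv8 weights as a stand-in for the custom-trained model.
detector = YOLO("yolov8n.pt")

# MiDaS small model and its matching preprocessing transform from torch.hub.
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = midas_transforms.small_transform

def describe_scene(image_path: str, near_fraction: float = 0.7):
    """Return (label, relative_depth, is_near) cues for detected objects.

    MiDaS predicts *relative* inverse depth, so larger values mean closer;
    the near_fraction threshold is an illustrative assumption.
    """
    img_bgr = cv2.imread(image_path)
    img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)

    # 1) Object detection on the frame.
    det = detector(img_rgb, verbose=False)[0]

    # 2) Monocular depth estimation over the whole frame.
    with torch.no_grad():
        depth = midas(transform(img_rgb))
        depth = torch.nn.functional.interpolate(
            depth.unsqueeze(1), size=img_rgb.shape[:2],
            mode="bicubic", align_corners=False,
        ).squeeze().cpu().numpy()

    # 3) Fuse: median relative depth inside each detected bounding box.
    cues = []
    for box, cls_id in zip(det.boxes.xyxy.cpu().numpy(), det.boxes.cls.cpu().numpy()):
        x1, y1, x2, y2 = box.astype(int)
        label = det.names[int(cls_id)]
        rel_depth = float(np.median(depth[y1:y2, x1:x2]))
        is_near = rel_depth > near_fraction * depth.max()  # closer objects flag a hazard cue
        cues.append((label, rel_depth, is_near))
    return cues
```

In the reported system, cues of this kind are passed to the vision-language model, which generates the Vietnamese descriptions and hazard warnings delivered to the user as voice feedback.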

Keywords: Assistive technology, Computer vision, YOLOv8, Depth estimation, Multimodal vision-language models, Semantic similarity, Natural Language Inference, Visual impairment support
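The semantic-alignment check reported in the abstract can be sketched with SBERT embeddings and cosine similarity. The example below is a hedged illustration: the paraphrase-multilingual-MiniLM-L12-v2 model is an assumption chosen because the descriptions are Vietnamese; the abstract does not name the exact SBERT variant used.

```python
# Sketch of the SBERT cosine-similarity evaluation described in the abstract.
# The embedding model name is an assumption (multilingual, to handle Vietnamese text);
# the paper's exact SBERT variant is not specified here.
from sentence_transformers import SentenceTransformer, util

sbert = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def semantic_alignment(generated: str, reference: str, threshold: float = 0.5):
    """Cosine similarity between SBERT embeddings of a model-generated description
    and a human-written reference, plus whether it clears the 0.5 threshold."""
    emb = sbert.encode([generated, reference], convert_to_tensor=True,
                       normalize_embeddings=True)
    score = float(util.cos_sim(emb[0], emb[1]))
    return score, score >= threshold
```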