SightSeeingGemma: Enhancing Assistive AI for the Visually Impaired via Object Detection and Monocular Depth Estimation with Language-Based Scene Understanding
Published: 03 December 2025 by MDPI in The 6th International Electronic Conference on Applied Sciences, session Computing and Artificial Intelligence
Abstract: This paper presents an integrated assistive approach that combines multimodal vision-language models with advanced computer vision techniques to support visually impaired individuals. The proposed system performs object detection with a YOLOv8 model custom-trained on an expanded dataset, together with MiDaS for monocular depth estimation. This enables the extraction of visual cues about surrounding objects and their relative distances, allowing near real-time hazard recognition and contextual scene descriptions delivered as voice feedback. The system is deployed on cloud/server infrastructure as a lightweight prototype (Google Colab + Ngrok), which introduces an average latency of 15–17 seconds per response. A dedicated Vietnamese dataset of annotated images, hazard warnings, and context-specific descriptions was developed. To evaluate semantic alignment between model-generated and human-written descriptions, cosine similarity over SBERT embeddings averaged approximately 0.95, well above the 0.5 acceptance threshold. Natural Language Inference (NLI) was used to assess logical consistency: for overall scene descriptions, 72.5% of pairs were labeled neutral, 14.5% entailment, and 13.0% contradiction; for hazard warnings, 73.5% were entailment, 25.5% contradiction, and only 1.0% neutral. These results indicate that the model produces reliable hazard warnings, while general descriptions may omit minor contextual details. In conclusion, integrating vision-language models with object detection and depth estimation offers a scalable and effective assistive solution for the visually impaired. The system achieves high semantic fidelity in descriptive tasks and robust hazard communication, demonstrating its potential for real-world deployment in accessibility technologies.
Keywords: Assistive technology, Computer vision, YOLOv8, Depth estimation, Multimodal vision-language models, Semantic similarity, Natural Language Inference, Visual impairment support
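
To make the detection-plus-depth pipeline concrete, the following is a minimal Python sketch of how YOLOv8 detections can be fused with MiDaS relative depth into per-object proximity cues. It is an illustration under stated assumptions, not the authors' implementation: the public yolov8n.pt checkpoint stands in for their custom-trained model, MiDaS_small is loaded via torch.hub, and the hazard_depth threshold is a hypothetical parameter.

```python
# Minimal sketch of the detection + monocular depth pipeline described in the
# abstract. Assumptions (not from the paper): the pretrained "yolov8n.pt"
# checkpoint stands in for the custom-trained YOLOv8 model, and the hazard
# threshold of 0.6 is illustrative.
import cv2
import torch
from ultralytics import YOLO

detector = YOLO("yolov8n.pt")  # stand-in for the custom-trained model
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas_tf = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform
midas.eval()

def describe_frame(frame_bgr, hazard_depth=0.6):
    """Return (label, closeness, is_hazard) cues for one camera frame."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    with torch.no_grad():
        depth = midas(midas_tf(rgb)).squeeze()
        # Resize the depth map to the frame resolution and normalize to [0, 1].
        depth = torch.nn.functional.interpolate(
            depth[None, None], size=rgb.shape[:2], mode="bicubic"
        ).squeeze()
        depth = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)
    cues = []
    for box in detector(rgb)[0].boxes:
        x1, y1, x2, y2 = map(int, box.xyxy[0])
        label = detector.names[int(box.cls)]
        closeness = float(depth[y1:y2, x1:x2].mean())  # 0 = far, 1 = near
        cues.append((label, closeness, closeness >= hazard_depth))
    return cues
```

Because MiDaS predicts relative (inverse) depth, the normalized score only ranks objects by closeness within a frame; converting it to metric distance would require additional calibration.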
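The semantic-alignment check reported in the abstract can likewise be sketched in a few lines with sentence-transformers. The multilingual checkpoint below is an assumption (the dataset is Vietnamese and the abstract does not name the exact SBERT model); the 0.5 threshold is taken from the abstract.

```python
# Sketch of the SBERT cosine-similarity evaluation. The checkpoint is an
# assumption; the 0.5 acceptance threshold comes from the abstract.
from sentence_transformers import SentenceTransformer, util

sbert = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def semantic_match(generated: str, reference: str, threshold: float = 0.5):
    """Return (cosine score, passes_threshold) for a generated/reference pair."""
    emb = sbert.encode([generated, reference], convert_to_tensor=True)
    score = util.cos_sim(emb[0], emb[1]).item()
    return score, score >= threshold
```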
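For the NLI-based consistency labels (entailment, neutral, contradiction), one possible setup uses a Hugging Face text-classification pipeline, treating the human-written text as the premise and the generated text as the hypothesis. The multilingual NLI checkpoint named below is an assumption; the abstract does not specify which model produced the reported label distributions.

```python
# Sketch of the NLI consistency check: human-written reference as premise,
# model-generated text as hypothesis. The checkpoint is an assumption; the
# paper does not name the NLI model behind its reported label distributions.
from transformers import pipeline

nli = pipeline(
    "text-classification",
    model="MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7",
)

def nli_label(premise: str, hypothesis: str) -> str:
    """Return "entailment", "neutral", or "contradiction" for the pair."""
    return nli([{"text": premise, "text_pair": hypothesis}])[0]["label"]
```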
