Inclusive Multilingual Assistive Technology using a Computer Vision- and Transformer-based NLP Approach for the Visually Impaired
1  School of Sciences, Christ University, Bengaluru, 560029, India
Academic Editor: Francesco Arcadio

Abstract:

Inclusivity has become a central theme in the development of digital technologies. Designing with inclusivity in mind requires that systems remain accessible and valuable to users irrespective of their age, gender, or socio-economic background. However, most applications remain restricted to a single language—primarily English—thereby marginalizing large groups of non-English-speaking individuals. This work explores the integration of Computer Vision (CV) and Natural Language Processing (NLP) to enhance accessibility for visually impaired users, with a particular emphasis on multilingual support in assistive technologies. Through a review of the existing literature and user experiences, this study identifies language barriers as a major obstacle in accessing essential services. To address this, we employ a multi-stage methodology for multilingual image captioning. Image–caption pairs were extracted from the MS COCO dataset, reformatted into JSON, and translated from English to the local language (currently Hindi) to generate a bilingual corpus. The model combines Recurrent Neural Networks (RNNs) for image feature extraction with Long Short-Term Memory (LSTM) units for sequence generation, enabling the system to capture temporal dependencies inherent to natural language. Experimental outcomes indicate that the model can generate Hindi captions with about 80% accuracy, effectively describing visual scenes despite some grammatical limitations. Real-time camera integration and a text-to-voice module further enhance usability by delivering immediate audio captions to visually impaired users. Future work will focus on transformer-based multilingual architectures and larger datasets to improve accuracy, contextual richness, and language coverage, moving towards a robust, speech-enabled assistive platform.
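The corpus-building stage described above (extracting image–caption pairs, reformatting them into JSON, and pairing each English caption with a Hindi translation) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tiny in-memory sample stands in for MS COCO annotations, and `translate_to_hindi` is a hypothetical stub for whatever machine-translation model or service was actually used.

```python
import json

def translate_to_hindi(text):
    # Hypothetical placeholder: a real pipeline would call an MT model/API here.
    lookup = {"a dog runs on the grass": "एक कुत्ता घास पर दौड़ता है"}
    return lookup.get(text, text)

def build_bilingual_corpus(pairs):
    """Reformat (image_id, English caption) pairs into a bilingual JSON corpus."""
    corpus = [
        {
            "image_id": image_id,
            "caption_en": caption,
            "caption_hi": translate_to_hindi(caption),
        }
        for image_id, caption in pairs
    ]
    # ensure_ascii=False keeps Devanagari text readable in the output file
    return json.dumps(corpus, ensure_ascii=False, indent=2)

# Tiny stand-in for a slice of MS COCO annotations
sample = [(42, "a dog runs on the grass")]
print(build_bilingual_corpus(sample))
```

Each JSON record keeps the English source alongside the translated caption, so the same corpus can train either monolingual or bilingual captioning models.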

Keywords: Image captioning; Scene recognition; Language processing; Assistive technology; Computer vision; Natural language processing; Transformer-based multilingual architectures; Recurrent Neural Network (RNN); Long Short-Term Memory (LSTM); Transformer models

