Background: Individuals with speech impairments often face significant challenges in daily communication, limiting their ability to interact effectively. Traditional communication aids, though helpful, can be costly or inflexible. Recent advances in computer vision and deep learning offer new opportunities to develop practical, real-time, and affordable assistive technologies. Objective: This study aims to design and implement a low-cost, vision-based gesture-to-speech system that enables non-verbal individuals to communicate through hand gestures. The goal is to translate recognized gestures into audible speech, bridging the communication gap and enhancing quality of life. Methods: The system uses a standard webcam to capture hand gestures, which are processed in real time using OpenCV. A convolutional neural network (CNN) built with TensorFlow is trained on a custom dataset to classify hand signs. The workflow comprises image preprocessing, data augmentation, model training, and deployment. Each recognized gesture is mapped to corresponding text, which is then converted into speech by a text-to-speech (TTS) engine. Results: Each captured hand image is first passed through a filter, and the filtered image is fed to the CNN classifier, which predicts the gesture class; the corresponding word is then displayed and converted into audible speech. The system achieved 98% accuracy across 26 alphabetic gestures and performed reliably in real time with minimal latency under varying lighting and background conditions. Conclusion: The proposed system is an effective and affordable communication aid for individuals with speech impairments. Its modular, real-time design makes it suitable for deployment in resource-constrained settings.
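The capture → filter → classify → text → speech workflow described in the abstract can be sketched as follows. This is a minimal illustration only, not the authors' implementation: every function name here is hypothetical, the 26-class TensorFlow CNN is stood in for by a toy `classify` stub so the flow is runnable end to end, and the TTS engine call is a placeholder (a real system might use OpenCV's `cv2.VideoCapture` for frames and a TTS library such as pyttsx3 for speech).

```python
# Hypothetical sketch of the gesture-to-speech pipeline from the abstract.
# The real system uses OpenCV for capture/filtering and a TensorFlow CNN for
# classification; a stub classifier stands in here so the flow is runnable.

GESTURE_CLASSES = [chr(ord("A") + i) for i in range(26)]  # 26 alphabetic gestures

def preprocess(frame):
    """Stand-in for the filtering step: scale 8-bit pixels to [0, 1]."""
    return [[px / 255.0 for px in row] for row in frame]

def classify(image):
    """Stub for the CNN classifier: returns a class index in 0..25.
    A real model would call e.g. a tf.keras model's predict() on the
    filtered image; here mean brightness selects a class, purely to
    make the data flow concrete."""
    flat = [px for row in image for px in row]
    mean = sum(flat) / len(flat)
    return int(mean * 25)

def gesture_to_text(class_index):
    """Map the predicted gesture class to its output text."""
    return GESTURE_CLASSES[class_index]

def speak(text):
    """Placeholder for the TTS step (e.g. pyttsx3's engine.say(text))."""
    print(f"Speaking: {text}")

# One pass through the pipeline on a dummy all-white 4x4 "hand image".
frame = [[255] * 4 for _ in range(4)]
label = gesture_to_text(classify(preprocess(frame)))
speak(label)
```

The key design point the abstract implies is the clean separation of stages: because classification emits only a class index, the text mapping and TTS back end can be swapped (different vocabularies, different speech engines) without retraining the model.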
Deep Learning-Based Vision System for Real-Time Gesture Recognition and Speech Synthesis to Assist Non-Verbal Users
Published:
03 December 2025
by MDPI
in The 6th International Electronic Conference on Applied Sciences
session Computing and Artificial Intelligence
Keywords: gesture recognition, non-verbal communication, hand sign detection, convolutional neural network (CNN), computer vision, OpenCV, real-time system, speech synthesis, assistive technology, TensorFlow, human-computer interaction (HCI), low-cost communication
