Improving Hand Pose Recognition using Localization and Zoom Normalizations over MediaPipe Landmarks
1  Speech Technology and Machine Learning Group (T.H.A.U. Group), Information Processing and Telecommunications Center, E.T.S.I. de Telecomunicación, Universidad Politécnica de Madrid, 28040, Madrid, Spain
2  Speech Technology Group, Information Processing and Telecommunications Center, E.T.S.I. Telecomunicación, Universidad Politécnica de Madrid
Academic Editor: Stefano Mariani


Hand pose recognition is hindered by varying lighting conditions and complex backgrounds, which degrade the accuracy and robustness of hand pose estimation. Using MediaPipe to extract representative hand landmarks from static images, combined with Convolutional Neural Networks, mitigates the impact of both factors. However, landmark extraction does not address variability in the location and size of the hands. Processing modules that normalize the landmarks with respect to the location of the wrist and the zoom of the hand can significantly reduce the effects of this variability. In all experiments performed in this work, based on American Sign Language alphabet datasets of 870, 27,000, and 87,000 images, applying the proposed normalizations significantly improved model performance in a resource-limited scenario. In particular, under conditions of high variability, applying both normalizations increased accuracy by 45.08 percentage points, from 43.94 ± 0.64 % to 89.02 ± 0.40 %.
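
The two normalizations described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, landmarks are assumed to be (x, y) pairs with the wrist at index 0 (as in the MediaPipe Hands landmark layout), and the zoom reference used here (the largest distance from the wrist) is an assumption chosen only for illustration.

```python
import math

WRIST = 0  # the wrist is landmark 0 in the MediaPipe Hands layout


def normalize_location(landmarks):
    """Translate all landmarks so that the wrist sits at the origin."""
    wx, wy = landmarks[WRIST]
    return [(x - wx, y - wy) for x, y in landmarks]


def normalize_zoom(centred):
    """Rescale wrist-centred landmarks so the farthest point lies at distance 1.

    Assumption: using the maximum distance from the wrist as the scale
    reference is an illustrative choice, not necessarily the paper's.
    """
    scale = max(math.hypot(x, y) for x, y in centred)
    if scale == 0:  # degenerate case: all landmarks coincide with the wrist
        return list(centred)
    return [(x / scale, y / scale) for x, y in centred]


def normalize_hand(landmarks):
    """Apply location normalization followed by zoom normalization."""
    return normalize_zoom(normalize_location(landmarks))


# Toy example: the same three-point shape at two positions and sizes.
small_hand = [(0.5, 0.5), (0.6, 0.5), (0.5, 0.7)]
large_hand = [(0.1, 0.2), (0.3, 0.2), (0.1, 0.6)]
norm_small = normalize_hand(small_hand)
norm_large = normalize_hand(large_hand)
```

After both steps, the two toy hands map to the same coordinates up to floating-point error, which is the invariance to location and size that the normalizations aim to provide to the classifier.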

Keywords: deep learning; computer vision; human activity recognition; hand pose recognition; landmarks; location normalization; zoom normalization