Vision and touch are fundamental sensory modalities that enable humans to perceive and interact with objects in their environment. Vision facilitates the perception of attributes such as shape, color, and texture from a distance, while touch provides detailed information at the contact level, including fine textures and material properties. Despite their distinct roles, the processing of visual and tactile information shares underlying similarities, presenting a unique opportunity to enhance artificial systems that integrate these modalities. However, existing methods for combining vision and touch often rely on data fusion at the decision level, requiring extensive labeled data and facing challenges in generalizing to novel situations.
In this paper, we leverage contrastive learning to train a convolutional neural network on textile data using both visual and tactile inputs. Our objective is to develop a network that extracts unified representations from both modalities without requiring extensive labeled datasets. We explore two distinct contrastive loss functions to optimize the learning process, and we validate our approach through a series of experiments in which hyperparameters are tuned to maximize performance. Our results demonstrate that the shared representations capture essential structures and features of both sensory modalities, enabling successful differentiation between object classes based on vision as well as touch. The findings suggest that extracting shared representations for vision and touch not only enhances the integration of visual and tactile information but also provides a robust framework for multimodal perception in artificial systems.
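To make the training setup concrete, the sketch below shows one way such cross-modal contrastive learning can be implemented: a small CNN encoder maps visual and tactile inputs to normalized embeddings, and an InfoNCE-style objective pulls paired vision/touch embeddings together while pushing mismatched pairs apart. The encoder architecture, the shared weights across modalities, the specific loss form, the temperature, and the toy random inputs are all illustrative assumptions for this sketch, not the exact configuration used in the paper.

```python
# Minimal sketch of contrastive alignment between visual and tactile inputs.
# Assumptions (not from the paper): a small shared-weight CNN encoder, an
# InfoNCE-style contrastive loss over paired vision/touch samples, and toy
# random tensors standing in for real textile data.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder(nn.Module):
    """Small CNN mapping an image-like input to an L2-normalized embedding."""

    def __init__(self, in_channels: int = 3, embed_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.conv(x).flatten(1)       # (B, 64) pooled features
        z = self.proj(h)                  # (B, embed_dim)
        return F.normalize(z, dim=1)      # unit-norm embeddings


def info_nce(z_vision: torch.Tensor, z_touch: torch.Tensor,
             temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE loss: matching vision/touch pairs are positives,
    all other pairs in the batch act as negatives."""
    logits = z_vision @ z_touch.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(z_vision.size(0))        # i-th vision matches i-th touch
    # Symmetric loss: vision-to-touch and touch-to-vision retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    encoder = Encoder()  # shared weights for both modalities (an assumption)
    optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)

    # Toy stand-ins for paired visual and tactile "images" of the same textile.
    vision_batch = torch.randn(16, 3, 64, 64)
    touch_batch = torch.randn(16, 3, 64, 64)

    for step in range(5):
        loss = info_nce(encoder(vision_batch), encoder(touch_batch))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        print(f"step {step}: loss = {loss.item():.4f}")
```

After training on real paired textile data, embeddings from either modality could be compared (e.g., by nearest-neighbor retrieval in the shared space) to differentiate object classes, which is the kind of evaluation the paper's experiments target.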