Introduction:
Recent developments in self-supervised learning for representation learning have attracted considerable attention in remote sensing, particularly as a way to reduce the substantial cost of annotating large satellite image datasets. In multimodal data fusion, contrastive learning has become the dominant approach for bridging domain discrepancies between sensor types. However, contrastive methods rely heavily on data augmentation, which demands significant domain expertise, especially for multispectral remote sensing data. A promising yet often overlooked alternative is masked image modeling (MIM) pretraining, which bypasses these challenges.
Methods:
In this research, we introduce FusionX-Net, a self-supervised learning framework built on masked autoencoders that uses cross-attention for early and feature-level fusion of synthetic aperture radar (SAR) and multispectral optical data. These two modalities typically exhibit a significant domain gap, which complicates fusion. FusionX-Net addresses this challenge through its cross-attention design, improving representation learning; a minimal sketch of such a fusion step is given below.
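To make the fusion mechanism concrete, the following sketch (PyTorch) illustrates one plausible reading of this design: MAE-style random masking of patch tokens from each modality, followed by a bidirectional cross-attention block in which SAR tokens attend to optical tokens and vice versa. The module names (CrossAttentionFusion, random_masking), the token dimensions, the 75% mask ratio, and the bidirectional choice are illustrative assumptions, not the authors' exact architecture.

    # Minimal sketch of cross-attention fusion between SAR and multispectral
    # token embeddings within a masked-autoencoder pipeline. All names,
    # dimensions, and design choices here are illustrative assumptions.
    import torch
    import torch.nn as nn


    class CrossAttentionFusion(nn.Module):
        """Fuse two modality token streams with bidirectional cross-attention."""

        def __init__(self, dim: int = 768, num_heads: int = 8):
            super().__init__()
            # SAR tokens query optical tokens, and vice versa.
            self.sar_to_opt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.opt_to_sar = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm_sar = nn.LayerNorm(dim)
            self.norm_opt = nn.LayerNorm(dim)

        def forward(self, sar_tokens: torch.Tensor, opt_tokens: torch.Tensor):
            # Each stream attends to the other; residual connections keep
            # modality-specific information intact while injecting
            # cross-modal context.
            fused_sar, _ = self.sar_to_opt(self.norm_sar(sar_tokens),
                                           opt_tokens, opt_tokens)
            fused_opt, _ = self.opt_to_sar(self.norm_opt(opt_tokens),
                                           sar_tokens, sar_tokens)
            return sar_tokens + fused_sar, opt_tokens + fused_opt


    def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
        """MAE-style random masking: keep a random subset of patch tokens."""
        b, n, d = tokens.shape
        n_keep = int(n * (1.0 - mask_ratio))
        # Random permutation per sample; keep the first n_keep indices.
        idx = torch.rand(b, n, device=tokens.device).argsort(dim=1)[:, :n_keep]
        return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, d))


    if __name__ == "__main__":
        sar = torch.randn(2, 196, 768)   # e.g. 14x14 patch tokens from SAR
        opt = torch.randn(2, 196, 768)   # matching multispectral patch tokens
        sar_vis = random_masking(sar)    # mask before encoding, as in MAE
        opt_vis = random_masking(opt)
        fusion = CrossAttentionFusion()
        fused_sar, fused_opt = fusion(sar_vis, opt_vis)
        print(fused_sar.shape, fused_opt.shape)  # (2, 49, 768) each

The residual connections around each cross-attention call are one common way such a design can narrow the SAR-optical domain gap: each modality retains its own features while absorbing context from the other, and the masking replaces the hand-crafted augmentations that contrastive methods depend on.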
Results:
FusionX-Net achieves state-of-the-art performance on several benchmarks. On BigEarthNet-MM, it reaches 95.2% mean average precision (mAP), outperforming Dino-MM (88.1%) and SatViT (90.4%). Even in the low-label regime (1% of labels), it attains 92.0% mAP, surpassing Dino-MM (81.5%) and SatViT (79.8%) by a wide margin. On SEN12MS, FusionX-Net achieves 96.1% Top-1 accuracy, substantially outperforming competitive baselines.
Conclusions:
The proposed approach offers an effective alternative to contrastive learning, which typically requires extensive data augmentation. It demonstrates the potential of self-supervised learning, in particular masked autoencoders, for addressing the challenges of multimodal remote sensing data fusion.
