The accuracy of large-area forest mapping is often compromised by label noise in global land cover products such as ESA WorldCover. This study introduces a semi-supervised framework that mitigates this issue by using a small set of manually curated clean labels to refine a large, noisy dataset.
Our approach employs a modified ResNet-18 architecture in a two-stage training process. First, the model is trained exclusively on the high-quality, manually labeled clean dataset. This initial "teacher" model is then used to generate high-confidence pseudo-labels for the extensive but noisy WorldCover data, effectively filtering out uncertain samples and re-labeling incorrect regions. In the second stage, the model is fine-tuned on a composite dataset containing both the original clean labels and the newly generated pseudo-labels. This strategy uses the accuracy of the clean data to unlock the utility of the noisy data, improving model robustness and generalization. The methodology was tested using Sentinel-2 imagery and Digital Elevation Model (DEM) data in a case study covering the diverse forest ecosystems of North Africa.
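The confidence-filtering step between the two stages can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the 0.95 threshold, and the NumPy-array interface are all assumptions introduced here for clarity.

```python
import numpy as np

def generate_pseudo_labels(teacher_probs, threshold=0.95):
    """Filter noisy samples by teacher confidence.

    teacher_probs : (N, C) array of class probabilities predicted by
                    the stage-1 "teacher" model for N noisy samples.
    threshold     : hypothetical confidence cutoff; samples below it
                    are discarded rather than re-labeled.
    Returns the indices of retained samples and their pseudo-labels.
    """
    confidence = teacher_probs.max(axis=1)          # peak class probability
    keep = np.where(confidence >= threshold)[0]     # confident samples only
    pseudo = teacher_probs[keep].argmax(axis=1)     # re-label with teacher's class
    return keep, pseudo

def build_stage2_dataset(clean_x, clean_y, noisy_x, teacher_probs,
                         threshold=0.95):
    """Combine clean labels with confident pseudo-labels for fine-tuning."""
    keep, pseudo = generate_pseudo_labels(teacher_probs, threshold)
    x = np.concatenate([clean_x, noisy_x[keep]], axis=0)
    y = np.concatenate([clean_y, pseudo], axis=0)
    return x, y
```

Raising the threshold trades dataset size for label reliability: a stricter cutoff keeps fewer noisy samples but makes the retained pseudo-labels more trustworthy.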
Our semi-supervised methodology achieved a final classification accuracy of 98.50% on a combined validation set. The initial training on clean data converged rapidly, underscoring the value of a high-quality seed dataset. This research offers a practical and effective strategy for improving land cover classification in any region where large, noisy datasets are available alongside limited high-quality ground truth, providing a scalable approach to support global conservation efforts.
