LUNG CANCER DETECTION BASED ON MACHINE LEARNING WITH DATA AUGMENTATION APPROACHES

RAJAA BOUZIDI IDRISSI; Nabila Zrira; Rajaa Sebihi; Hamid Ez-Zahraouy

RAJAA BOUZIDI IDRISSI *,¹, Nabila Zrira ², Rajaa Sebihi ³, Hamid Ez-Zahraouy

¹ LAMCSCI, Faculty of Science, Mohammed V University in Rabat, Rabat 10106, Morocco
² ADOS Team, LISTD Laboratory, National Superior School of Mines, Rabat 10100, Morocco
³ Laboratory of High Energy Physics: Modeling and Simulation (LHEP-MS), Faculty of Science, Mohammed V University in Rabat, Rabat, 10106, Morocco

Academic Editor: Nicola Amodio

Published: 05 June 2026 by MDPI in The 5th International Electronic Conference on Cancers, session "Causes, Diagnosis and Treatment of Cancer"

Abstract:

Class imbalance is a common challenge in medical data analysis and can significantly affect the performance of machine learning models by favoring majority classes while neglecting minority cases. This issue is particularly critical in lung cancer datasets, where uneven class distributions may lead to biased predictions and reduced diagnostic reliability.

In this study, we investigate the effectiveness of data augmentation techniques for improving lung cancer classification using machine learning approaches. The dataset used in this work consists of 283 positive lung cancer cases and 38 negative cases, resulting in a highly imbalanced class distribution. To address this problem, the Synthetic Minority Oversampling Technique (SMOTE) is employed as a data preprocessing method to enhance the representation of the minority class and improve model learning.
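The interpolation idea behind SMOTE can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `smote_sketch`, the neighbour count `k=5`, the two-dimensional Gaussian toy features, and the random seeds are all assumptions made for illustration. Only the class counts (283 and 38) come from the abstract. In practice a library implementation such as `SMOTE` from imbalanced-learn would normally be used instead.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (hypothetical helper)."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("need at least two minority samples")
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy features (assumed); only the class sizes are taken from the abstract.
data_rng = random.Random(42)
majority = [(data_rng.gauss(2.0, 1.0), data_rng.gauss(2.0, 1.0)) for _ in range(283)]
minority = [(data_rng.gauss(-1.0, 1.0), data_rng.gauss(-1.0, 1.0)) for _ in range(38)]

# Oversample the minority class until both classes have the same size.
new_points = smote_sketch(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

Each synthetic point lies on the segment between two real minority samples, so the minority class grows from 38 to 283 samples without copying any record exactly. A common precaution, not stated in the abstract, is to oversample only the training split so that synthetic points derived from test samples do not leak into evaluation.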

Several machine learning classifiers are trained and evaluated before and after applying SMOTE to assess the impact of data balancing on classification performance. The experimental results show that balancing the data with SMOTE improves the predictive capability of the models. Among the evaluated classifiers, Logistic Regression performs best, reaching an accuracy of 96.03% and a precision of 96.36% before optimization. After fine-tuning, its performance improves further, to an accuracy of 98.15%, an area under the ROC curve (AUC) of 99.92%, and a precision of 100%.
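To make the reported metrics concrete, the sketch below computes accuracy, precision, and AUC from scratch and applies them to a trivial baseline that labels every case positive, using the class counts from the abstract (283 positive, 38 negative). The helper names, the baseline itself, and the recall check on the negative class are assumptions added for illustration. None of this reproduces the authors' evaluation code or results.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision(y_true, y_pred, positive=1):
    """Fraction of predicted positives that are truly positive."""
    predicted_pos = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not predicted_pos:
        return 0.0
    return sum(t == positive for t in predicted_pos) / len(predicted_pos)


def recall(y_true, y_pred, label=1):
    """Fraction of samples with the given true label that are recovered."""
    preds_for_label = [p for t, p in zip(y_true, y_pred) if t == label]
    if not preds_for_label:
        return 0.0
    return sum(p == label for p in preds_for_label) / len(preds_for_label)


def roc_auc(y_true, scores, positive=1):
    """Probability that a random positive scores above a random negative
    (ties count as 0.5); equivalent to the area under the ROC curve."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Class counts from the abstract; the "always positive" baseline is assumed.
y_true = [1] * 283 + [0] * 38
always_positive = [1] * len(y_true)

baseline_acc = accuracy(y_true, always_positive)
baseline_prec = precision(y_true, always_positive)
negative_recall = recall(y_true, always_positive, label=0)
baseline_auc = roc_auc(y_true, [1.0] * len(y_true))
print(f"{baseline_acc:.4f} {baseline_prec:.4f} {negative_recall:.4f} {baseline_auc:.2f}")
```

On this class distribution the baseline reaches about 88% accuracy and precision (283/321) while recovering none of the 38 negative cases and scoring an AUC of 0.5. This is why accuracy and precision are best read alongside a ranking metric such as AUC or a per-class measure when the classes are this uneven.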

These results highlight the importance of addressing class imbalance in lung cancer datasets and confirm the effectiveness of data augmentation strategies in improving machine learning-based diagnostic systems. The proposed approach contributes to the development of more reliable and robust tools for lung cancer detection, with potential applications in computer-aided diagnosis and clinical decision support.

Keywords: Lung cancer; Machine learning; Data augmentation; Class imbalance; SMOTE; Medical data analysis; Cancer diagnosis
