LUNG CANCER DETECTION BASED ON MACHINE LEARNING WITH DATA AUGMENTATION APPROACHES
1  LAMCSCI, Faculty of Science, Mohammed V University in Rabat, Rabat 10106, Morocco
2  ADOS Team, LISTD Laboratory, National Superior School of Mines, Rabat 10100, Morocco
3  Laboratory of High Energy Physics: Modeling and Simulation (LHEP-MS), Faculty of Science, Mohammed V University in Rabat, Rabat 10106, Morocco
Academic Editor: Nicola Amodio

Abstract:

Class imbalance is a common challenge in medical data analysis and can significantly affect the performance of machine learning models by favoring majority classes while neglecting minority cases. This issue is particularly critical in lung cancer datasets, where uneven class distributions may lead to biased predictions and reduced diagnostic reliability.

In this study, we investigate the effectiveness of data augmentation techniques for improving lung cancer classification using machine learning approaches. The dataset used in this work consists of 283 positive lung cancer cases and 38 negative cases (321 records in total, roughly 7.4 positives for every negative), so the negative class is the minority. To address this imbalance, the Synthetic Minority Oversampling Technique (SMOTE) is applied as a data preprocessing step that generates synthetic minority-class samples, improving the representation of that class during model training.
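The interpolation idea behind SMOTE can be illustrated with a minimal sketch. This is not the implementation used in the study: the function name `smote`, the neighbour count `k=5`, the random seeds, and the two random features are all illustrative assumptions; only the class counts (283 and 38) come from the abstract.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples
    and one of their k nearest minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base point.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        # Pick a random position on the segment from base to neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy stand-in data (assumption): random 2-D features with the abstract's class sizes.
data_rng = random.Random(42)
majority = [[data_rng.gauss(2.0, 1.0), data_rng.gauss(2.0, 1.0)] for _ in range(283)]
minority = [[data_rng.gauss(0.0, 1.0), data_rng.gauss(0.0, 1.0)] for _ in range(38)]

# Generate enough synthetic minority samples to match the majority class size.
new_points = smote(minority, n_synthetic=len(majority) - len(minority))
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))  # 283 283
```

A common practice, assumed here rather than stated in the abstract, is to oversample only the training split so that synthetic points do not leak into the data used for evaluation.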

Several machine learning classifiers are trained and evaluated before and after applying SMOTE to assess the impact of data balancing on classification performance. The experimental results show that data augmentation substantially improves the predictive capability of the models. Among the evaluated classifiers, Logistic Regression achieves the best performance, reaching an accuracy of 96.03% and a precision of 96.36% before fine-tuning. After fine-tuning, its performance improves further, with an accuracy of 98.15%, an area under the ROC curve (AUC) of 99.92%, and a precision of 100%.
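To put the reported accuracy and precision in context, the following sketch computes both metrics from confusion-matrix counts for a hypothetical baseline that labels every case as positive. The helper names and the baseline itself are illustrative assumptions; only the class counts are taken from the dataset described above.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary classification problem."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, fp, tn, fn


def safe_ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


# Class counts from the abstract: 283 positive and 38 negative cases.
y_true = [1] * 283 + [0] * 38
# Hypothetical trivial baseline: predict "positive" for every case.
y_pred = [1] * len(y_true)

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
baseline_accuracy = safe_ratio(tp + tn, tp + fp + tn + fn)
baseline_precision = safe_ratio(tp, tp + fp)
baseline_specificity = safe_ratio(tn, tn + fp)  # share of negatives correctly identified

print(f"accuracy={baseline_accuracy:.4f}")        # accuracy=0.8816
print(f"precision={baseline_precision:.4f}")      # precision=0.8816
print(f"specificity={baseline_specificity:.4f}")  # specificity=0.0000
```

On this class distribution, a model that never identifies a negative case still scores about 88% on accuracy and precision, which is why class-aware measures are worth reading alongside the headline figures.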

These results highlight the importance of addressing class imbalance in lung cancer datasets and confirm the effectiveness of data augmentation strategies in improving machine learning-based diagnostic systems. The proposed approach contributes to the development of more reliable and robust tools for lung cancer detection, with potential applications in computer-aided diagnosis and clinical decision support.

Keywords: Lung cancer; Machine learning; Data augmentation; Class imbalance; SMOTE; Medical data analysis; Cancer diagnosis

 
 