Class imbalance is a common challenge in medical data analysis and can significantly affect the performance of machine learning models by favoring majority classes while neglecting minority cases. This issue is particularly critical in lung cancer datasets, where uneven class distributions may lead to biased predictions and reduced diagnostic reliability.
In this study, we investigate the effectiveness of data augmentation for improving lung cancer classification with machine learning. The dataset consists of 283 positive lung cancer cases and 38 negative cases, a ratio of roughly 7.4:1 in which the negative class is the minority. To address this imbalance, the Synthetic Minority Oversampling Technique (SMOTE) is applied as a preprocessing step: it generates synthetic minority-class samples so that the minority class is better represented during model training.
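The interpolation idea behind SMOTE can be illustrated with a minimal pure-Python sketch. It is not the implementation used in this study: the function name `smote_sketch`, the two toy features, the neighbour count `k=5`, and the random seeds are all assumptions made for illustration. Only the class sizes (283 and 38) come from the dataset described above.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples.

    Hypothetical helper for illustration only, not a library API.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        partner = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> partner
        synthetic.append([b + gap * (q - b) for b, q in zip(base, partner)])
    return synthetic


# Toy stand-in for the 38 negative cases (two made-up features each).
toy_rng = random.Random(42)
negatives = [[toy_rng.gauss(0.0, 1.0), toy_rng.gauss(0.0, 1.0)] for _ in range(38)]

# Add enough synthetic negatives to match the 283 positive cases.
new_negatives = smote_sketch(negatives, n_synthetic=283 - 38)
balanced_negatives = negatives + new_negatives
print(len(balanced_negatives))
```

Each synthetic sample lies on the line segment between a real minority sample and one of its nearest minority neighbours, so the new points stay within the region already occupied by the minority class. A maintained implementation is available as the `SMOTE` class in the imbalanced-learn package, which exposes a `fit_resample` method.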
Several machine learning classifiers are trained and evaluated before and after applying SMOTE to assess the impact of data balancing on classification performance. The experimental results show that balancing the data improves the predictive capability of the models. Among the evaluated classifiers, Logistic Regression achieves the best performance, reaching an accuracy of 96.03% and a precision of 96.36% before fine-tuning. After fine-tuning, its performance improves further, with an accuracy of 98.15%, an area under the ROC curve (AUC) of 99.92%, and a precision of 100%.
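For context on how accuracy and precision behave under this class distribution, the sketch below computes both from confusion-matrix counts. The counts are not results from this study: they describe a hypothetical baseline that labels every one of the 283 positive and 38 negative cases as positive, and the helper names are illustrative.

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all cases that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def specificity(tn, fp):
    """Fraction of true negatives that were correctly identified."""
    return tn / (tn + fp) if (tn + fp) else 0.0


# Hypothetical baseline: predict "positive" for every case in the dataset.
tp, fp, tn, fn = 283, 38, 0, 0

baseline_accuracy = accuracy(tp, fp, tn, fn)
baseline_precision = precision(tp, fp)
baseline_specificity = specificity(tn, fp)

print(f"accuracy:    {baseline_accuracy:.4f}")
print(f"precision:   {baseline_precision:.4f}")
print(f"specificity: {baseline_specificity:.4f}")
```

Such a baseline reaches about 88% accuracy and precision while identifying none of the negative cases, which is why the reported scores are best read against the class distribution and alongside a threshold-independent measure such as AUC.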
These results highlight the importance of addressing class imbalance in lung cancer datasets and confirm the effectiveness of data augmentation strategies in improving machine learning-based diagnostic systems. The proposed approach contributes to the development of more reliable and robust tools for lung cancer detection, with potential applications in computer-aided diagnosis and clinical decision support.
