Missing values impede data analysis and machine learning, especially in healthcare, where complete data is vital. They can reduce the performance of predictive models, making robust imputation essential. Traditional methods such as mean and median substitution often perform poorly when missingness is high. This study compares traditional statistical imputers with machine learning models for handling missing data. Seven machine learning algorithms were tested on four datasets with substantial missing values; both the statistical and the ML-based imputation methods degraded when missingness was high. To address this, the study proposes a stacking ensemble combining Random Forest, Linear Regression, and Ridge Regression to improve predictive accuracy and reduce error. The proposed model was evaluated using standard metrics, including Accuracy and Root Mean Squared Error (RMSE), and was compared against individual models and traditional imputation methods. On the breast cancer dataset, the ensemble achieved an accuracy of 98.2% and an RMSE of 0.2093, outperforming all seven individual machine learning models and the statistical methods. Random Forest (RF, 97.08% accuracy) and XGBoost (95.9% accuracy) also consistently outperformed the statistical imputers across all datasets. Notably, the Decision Tree model performed poorly across all datasets, with high RMSE and low accuracy. These findings highlight the importance of selecting appropriate imputation strategies and algorithms to improve predictive accuracy in the presence of missing data. This work contributes to the growing body of research on machine learning-based imputation and predictive modeling in healthcare and other domains.
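As a rough illustration of the stacking idea described in the abstract, the sketch below imputes one artificially masked column with a stacked regressor and compares its RMSE against plain mean substitution. It is not the authors' pipeline. The following are all assumptions made for this example: the use of scikit-learn's bundled breast cancer data, the 30% missing-completely-at-random mask, the choice of "mean radius" as the column to impute, the hyperparameters, and the use of Ridge as the meta-learner on top of the three named models.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# Assumption: scikit-learn's bundled breast cancer data stands in for the
# study's dataset; only the ten "mean ..." features are used to keep it small.
X = load_breast_cancer().data[:, :10]
target_col = 0  # "mean radius", picked arbitrarily as the column to impute

# Assumption: hide roughly 30% of the target column completely at random.
missing = rng.random(X.shape[0]) < 0.30
truth = X[missing, target_col]
others = np.delete(X, target_col, axis=1)  # predictors for the imputer

# Baseline: fill every hidden entry with the mean of the observed entries.
mean_fill = np.full(truth.shape, X[~missing, target_col].mean())

# Stacking ensemble: three base learners, with a Ridge meta-learner
# (the meta-learner choice is an assumption of this sketch).
stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
        ("lr", LinearRegression()),
        ("ridge", Ridge(alpha=1.0)),
    ],
    final_estimator=Ridge(alpha=1.0),
    cv=5,
)
# Train on rows where the target is observed, then predict the hidden rows.
stack.fit(others[~missing], X[~missing, target_col])
stack_fill = stack.predict(others[missing])


def rmse(pred, actual):
    """Root mean squared error between imputed and true values."""
    return float(np.sqrt(np.mean((pred - actual) ** 2)))


rmse_mean = rmse(mean_fill, truth)
rmse_stack = rmse(stack_fill, truth)
print(f"mean imputation RMSE:     {rmse_mean:.4f}")
print(f"stacking imputation RMSE: {rmse_stack:.4f}")
```

Because the hidden values are known here, the RMSE measures imputation error directly. With real missing data, the comparison would instead have to rely on downstream model performance, such as the accuracy figures the abstract reports.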
Ensemble-Based Imputation for Handling Missing Values in Healthcare Datasets: A Comparative Study of Machine Learning Models
Published: 03 December 2025
by MDPI
in The 6th International Electronic Conference on Applied Sciences
session Computing and Artificial Intelligence
Abstract:
Keywords: Breast Cancer Prediction; Machine Learning; Imputation Methods; Predictive Performance; Missing Values; Model Comparison; Data Preprocessing.
