In applied insurance data science, problems introduced at the data preparation stage can undermine the statistical validity and reliability of subsequent actuarial modeling. This study shows that conventional data preparation procedures, particularly random train–test partitioning, often yield unreliable and unstable results on highly imbalanced insurance loss data. Such naive splitting fails to keep the partitions representative, especially of rare high-loss events, which compromises the validity of model evaluation. To mitigate these limitations, we propose a data preparation framework built on two recent statistical advances. First, we employ Support Points to obtain data splits that are more representative of the underlying data distribution than simple random sampling. Second, we use the Chatterjee Correlation Coefficient for an initial, non-parametric screening of feature relevance and dependence structure. We integrate these methods into a unified, efficient framework that also incorporates advanced handling of missing data, and we embed the framework within our custom InsurAutoML pipeline (https://github.com/PanyiDong/InsurAutoML). The proposed approach is evaluated on both simulated datasets and datasets often cited in the academic literature. The empirical assessment focuses on quantifying gains in computational efficiency and model stability, both key concerns in industry-scale workflows. Our findings indicate that statistically rigorous data preparation improves model robustness and interpretability and also reduces computational resource requirements across diverse insurance loss modeling tasks. This work provides a methodological upgrade for achieving reliable results in high-stakes insurance applications.
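To make the screening step concrete, the following is a minimal sketch of the Chatterjee Correlation Coefficient in its standard no-ties form, xi_n = 1 - 3 * sum|r_(i+1) - r_i| / (n^2 - 1), where r_i is the rank of the response after sorting the observations by the feature. It is not taken from the InsurAutoML code base; the function name `chatterjee_xi` and the toy data are illustrative assumptions, and tied values (common in insurance data, for example zero claims) would need the tie-adjusted formula instead.

```python
import random


def chatterjee_xi(x, y):
    """Chatterjee's rank correlation xi_n(X, Y), assuming no ties in x or y."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    # Reorder the responses so that the feature values are increasing.
    order = sorted(range(n), key=lambda i: x[i])
    y_sorted = [y[i] for i in order]
    # Rank (1..n) of each reordered response among all responses.
    ranks = [0] * n
    for rank, idx in enumerate(sorted(range(n), key=lambda i: y_sorted[i]), start=1):
        ranks[idx] = rank
    # Sum of absolute differences between consecutive ranks.
    total = sum(abs(ranks[i + 1] - ranks[i]) for i in range(n - 1))
    return 1.0 - 3.0 * total / (n * n - 1)


# Hypothetical toy data: one non-monotone signal and one pure-noise feature.
rng = random.Random(0)
x_signal = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
y = [v * v for v in x_signal]                    # y depends on x, but not monotonically
x_noise = [rng.random() for _ in range(1000)]    # independent of y

xi_signal = chatterjee_xi(x_signal, y)
xi_noise = chatterjee_xi(x_noise, y)
```

Unlike Pearson or Spearman correlation, the coefficient is close to 1 for the non-monotone signal and close to 0 for the noise feature, which is why it suits an initial, model-free relevance screen. Note that it is asymmetric: `chatterjee_xi(x, y)` measures how well y is determined by x, not the reverse.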
Starting Off on the Wrong Foot: Pitfalls in Data Preparation
Published: 01 July 2026 by MDPI in The 1st International Online Conference on Risks, session Actuarial Science
Abstract:
Keywords: Data preparation; missing data; AutoML; insurance data analytics; imbalance learning
