Starting Off on the Wrong Foot: Pitfalls in Data Preparation
1  Actuarial and Risk Management Sciences, University of Illinois Urbana-Champaign, 1409 W. Green Street (MC-382), Urbana, IL 61801, USA
Academic Editor: Hailiang Yang

Published: 01 July 2026 by MDPI in The 1st International Online Conference on Risks session Actuarial Science
Abstract:

In applied insurance data science, practitioners routinely face challenges during data preparation that can undermine the statistical validity and reliability of subsequent actuarial modeling. This study addresses a fundamental issue: conventional data preparation procedures, particularly random train–test partitioning, often yield unreliable and unstable results when applied to highly imbalanced insurance loss data. Such naive splitting fails to maintain representativeness, especially of rare high-loss events, which compromises the validity of model evaluation. To mitigate these limitations and establish a foundation for more robust modeling, we propose a novel data preparation framework that leverages two recent statistical advances. First, we employ Support Points to achieve data splits that are more representative of the underlying data distribution than simple random sampling. Second, we use the Chatterjee Correlation Coefficient for an initial, non-parametric screening of feature relevance and dependence structure. We integrate these methods into a unified, efficient framework that also incorporates advanced handling of missing data, and we embed the framework within our custom InsurAutoML pipeline (https://github.com/PanyiDong/InsurAutoML). The performance of the proposed approach is evaluated using both simulated datasets and datasets often cited in the academic literature. The empirical assessment focuses on quantifying the gains in computational efficiency and model stability, both key concerns in industry-scale workflows. Our findings show that incorporating statistically rigorous data preparation methods not only significantly enhances model robustness and interpretability but also substantially reduces computational resource requirements across diverse insurance loss modeling tasks.
This work provides a crucial methodological upgrade for achieving reliable results in high-stakes insurance applications.
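
To make the feature-screening step concrete, the following is a minimal sketch of the Chatterjee Correlation Coefficient as it is commonly defined (rank-based, with the tie-aware denominator). It is an illustration only, not the InsurAutoML implementation: the function name `chatterjee_xi`, the `seed` parameter, and the synthetic data are assumptions introduced here.

```python
import numpy as np


def chatterjee_xi(x, y, seed=0):
    """Sketch of Chatterjee's xi: how well y is a function of x.

    Hypothetical helper for illustration; not taken from InsurAutoML.
    Note that xi is asymmetric: xi(x, y) generally differs from xi(y, x).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    if n != y.size or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")

    # Sort observations by x; ties in x are broken uniformly at random.
    rng = np.random.default_rng(seed)
    order = np.lexsort((rng.random(n), x))  # last key (x) is the primary key
    y_by_x = y[order]

    # r_i = #{j : y_j <= y_i} and l_i = #{j : y_j >= y_i}
    y_asc = np.sort(y)
    r = np.searchsorted(y_asc, y_by_x, side="right")
    l = n - np.searchsorted(y_asc, y_by_x, side="left")

    denom = 2.0 * np.sum(l * (n - l))
    if denom == 0:
        raise ValueError("y is constant; xi is undefined")
    return 1.0 - n * np.sum(np.abs(np.diff(r))) / denom


if __name__ == "__main__":
    # Synthetic data only: a non-monotone dependence and an independent pair.
    rng = np.random.default_rng(42)
    x = rng.uniform(-1.0, 1.0, size=2000)
    noise = rng.normal(size=2000)
    print("xi(x, x^2)   =", chatterjee_xi(x, x**2))
    print("xi(x, noise) =", chatterjee_xi(x, noise))
    print("Pearson(x, x^2) =", np.corrcoef(x, x**2)[0, 1])
```

In this sketch a feature whose coefficient with the response is near zero would be a candidate for removal during screening, while a value near one indicates that the response is close to a (possibly non-monotone) function of the feature, a pattern that a linear correlation can miss.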

Keywords: Data preparation; missing data; AutoML; insurance data analytics; imbalance learning
