Published

This submission belongs to session "S2. Actuarial Science" of The 1st International Online Conference on Risks.

Published date

01 Jul, 2026

Academic Editor

Hailiang Yang

Citation

Zhiyu Quan, Jiayi Guo, Panyi Dong, Starting Off on the Wrong Foot: Pitfalls in Data Preparation, in Proceedings of The 1st International Online Conference on Risks, 6 July–7 July 2026, MDPI: Basel, Switzerland

Starting Off on the Wrong Foot: Pitfalls in Data Preparation

Jiayi Guo ¹

Panyi Dong ¹

Zhiyu Quan ¹

1. Actuarial and Risk Management Sciences, University of Illinois Urbana-Champaign, 1409 W. Green Street (MC-382), Urbana, IL 61801, USA

Abstract

In applied insurance data science, practitioners routinely face challenges at the data preparation stage that can undermine the statistical validity and reliability of subsequent actuarial modeling. This study demonstrates that conventional data preparation procedures, particularly random train–test partitioning, often yield unreliable and unstable results on highly imbalanced insurance loss data. Such naive splitting fails to preserve representativeness, especially of rare high-loss events, and so compromises the validity of model evaluation. To mitigate these limitations and establish a foundation for more robust modeling, we propose a novel data preparation framework that leverages two recent statistical advances. First, we employ Support Points to obtain data splits that are more representative of the underlying data distribution than those produced by simple random sampling. Second, we use the Chatterjee Correlation Coefficient for an initial, non-parametric screening of feature relevance and dependence structure. We integrate these methods into a unified, efficient framework that also incorporates advanced handling of missing data, and we embed the framework within our custom InsurAutoML pipeline (https://github.com/PanyiDong/InsurAutoML). We evaluate the proposed approach on both simulated datasets and datasets frequently cited in the academic literature. The empirical assessment focuses on quantifying the gains in computational efficiency and model stability, both key concerns in industry-scale workflows. Our findings demonstrate that incorporating statistically rigorous data preparation methods not only significantly enhances model robustness and interpretability but also substantially reduces computational resource requirements across diverse insurance loss modeling tasks. This work provides a methodological upgrade for achieving reliable results in high-stakes insurance applications.
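
The feature-screening step mentioned above can be illustrated with a minimal sketch of Chatterjee's rank correlation coefficient. This is not the InsurAutoML implementation: the function name `chatterjee_xi` is hypothetical, the code relies only on NumPy, and it assumes ties in x are broken by input order (the original definition breaks them at random). The coefficient is close to 0 when y is independent of x and approaches 1 when y is a measurable function of x, whether or not the relationship is monotone.

```python
import numpy as np


def chatterjee_xi(x, y):
    """Chatterjee's xi for paired samples (hypothetical helper, illustration only).

    Uses the general formula that allows ties in y:
        xi = 1 - n * sum_i |r_{i+1} - r_i| / (2 * sum_i l_i * (n - l_i))
    where, after sorting the pairs by x,
        r_i = #{j : y_j <= y_i}  and  l_i = #{j : y_j >= y_i}.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    if y.size != n or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")

    # Sort the pairs by x. Assumption: ties in x keep their input order
    # (stable sort) rather than being broken at random.
    order = np.argsort(x, kind="stable")
    y_by_x = y[order]

    # Count-based ranks computed against the sorted y values.
    y_sorted = np.sort(y)
    r = np.searchsorted(y_sorted, y_by_x, side="right")      # #{j : y_j <= y_i}
    l = n - np.searchsorted(y_sorted, y_by_x, side="left")   # #{j : y_j >= y_i}

    denominator = 2.0 * np.sum(l * (n - l))
    if denominator == 0:
        raise ValueError("xi is undefined when y is constant")

    numerator = n * np.abs(np.diff(r)).sum()
    return 1.0 - numerator / denominator


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=2000)
    noise = rng.normal(size=2000)
    print("independent:  ", chatterjee_xi(x, noise))
    print("non-monotone: ", chatterjee_xi(x, x ** 2))
```

In a screening setting, one would compute this value for each candidate feature against the loss variable and rank features by it. Note that xi is not symmetric in its arguments, so `chatterjee_xi(feature, loss)` and `chatterjee_xi(loss, feature)` answer different questions.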

Keywords

Data preparation

missing data

AutoML

insurance data analytics

imbalanced learning
