Published

This submission belongs to session S1. Insurance of The 1st International Online Conference on Risks.

Published date

01 Jul, 2026

Academic Editor

Annamaria Olivieri

Citation

Alfiyyah Hasanah, A Comparative Study of Logistic Regression, Random Forest, and Gradient Boosting for Motor Insurance Lapse Prediction, in Proceedings of The 1st International Online Conference on Risks, 6 July–7 July 2026, MDPI: Basel, Switzerland

A Comparative Study of Logistic Regression, Random Forest, and Gradient Boosting for Motor Insurance Lapse Prediction

Alfiyyah Hasanah ¹

1. Secondary School, Regents School Bali, Denpasar 80237, Indonesia

Abstract

This study examines the application of machine learning techniques to predicting policyholder renewal behavior in motor vehicle insurance. Accurately identifying customers who are likely to lapse is crucial for pricing strategy and customer retention in actuarial practice. Three classification models, Logistic Regression, Random Forest, and LightGBM, were developed and compared on a dataset of motor insurance policies. Exploratory analysis revealed a moderate class imbalance, with approximately 79.6% renewal and 20.4% lapse observations. Feature engineering was performed to construct variables such as age and driving experience. The models were evaluated using multiple performance metrics: accuracy, precision, recall, specificity, F1-score, and the area under the ROC curve (AUC). Logistic Regression achieved the highest accuracy (79.4%) and recall (99.5%) but extremely low specificity (2.1%), indicating poor performance in identifying lapse cases. Random Forest was slightly more balanced, with an AUC of 0.663 and a higher but still limited specificity (7.1%). LightGBM achieved the best overall discrimination, with the highest AUC (0.683) and a balanced trade-off between recall (63.0%) and specificity (63.3%), despite lower overall accuracy. These findings suggest that traditional models such as Logistic Regression may perform well on aggregate metrics, yet those metrics can be misleading on imbalanced insurance datasets. Ensemble methods, particularly gradient boosting, offer superior capability in capturing complex patterns and improving classification balance. The study highlights the importance of using appropriate evaluation metrics beyond accuracy and demonstrates the practical relevance of machine learning methods in actuarial modeling and policyholder retention analysis.
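The point about accuracy being misleading under class imbalance can be illustrated with a short sketch. It is not the author's code. It only recombines figures quoted in the abstract, under two assumptions: renewal is treated as the positive class (which the abstract's reading of specificity as lapse detection suggests), and the evaluation set has the same 79.6% / 20.4% class mix as the full dataset. The function names are hypothetical. Random Forest is omitted because its recall is not reported.

```python
def implied_accuracy(positive_share, recall, specificity):
    """Accuracy implied by the class mix, recall (TPR) and specificity (TNR)."""
    return positive_share * recall + (1.0 - positive_share) * specificity


def balanced_accuracy(recall, specificity):
    """Mean of recall and specificity; insensitive to the class mix."""
    return (recall + specificity) / 2.0


# Assumption: renewal is the positive class and the evaluation set
# has the same class mix as the full dataset.
RENEWAL_SHARE = 0.796

# (recall, specificity) pairs quoted in the abstract.
reported = {
    "Logistic Regression": (0.995, 0.021),
    "LightGBM": (0.630, 0.633),
}

# Reference point: a rule that predicts "renew" for every policy.
baseline = implied_accuracy(RENEWAL_SHARE, recall=1.0, specificity=0.0)
print(f"Always predict renewal: accuracy {baseline:.3f}")

for name, (rec, spec) in reported.items():
    acc = implied_accuracy(RENEWAL_SHARE, rec, spec)
    bal = balanced_accuracy(rec, spec)
    print(f"{name}: implied accuracy {acc:.3f}, balanced accuracy {bal:.3f}")
```

Under these assumptions, the implied accuracy of Logistic Regression is close to that of the always-renew rule, while its balanced accuracy is near 0.5. LightGBM gives up accuracy but reaches a balanced accuracy of about 0.63. The implied values need not match the reported accuracies exactly, since the actual test split may differ from the assumed class mix.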

Keywords

Motor insurance

Policy lapse prediction

Machine learning

LightGBM

Imbalanced data

Poster

Poster Alfiyyah Hasanah.pdf
