Predictive Modeling and Mutational Biomarker Identification for Invasive Ductal Carcinoma Recurrence using Machine Learning
1  Menlo-Atherton High School, 555 Middlefield Road, Atherton, 94027, California, United States of America
Academic Editor: Angeliki Magklara

Abstract:

Invasive ductal carcinoma (IDC) constitutes approximately 80% of breast cancer cases. After treatment, there is a 3-15% chance that IDC will recur. The aim of this study was to predict IDC recurrence accurately, with a low false-negative rate, and to identify mutated genes associated with IDC recurrence. We hypothesized that IDC recurrence can be predicted from genomic, clinical, and phenotypic patient data and that certain genes, when mutated, are highly influential in the recurrence of IDC. To understand the underlying causes of IDC recurrence, genomic data such as gene expression and mutation variant data were aggregated with clinical and phenotypic data. An XGBoost framework was used to classify each patient as disease-free or cancer-recurrent. XGBoost's XGBClassifier algorithm trains an ensemble of weak learners, such as individual decision trees, sequentially, so that each new learner corrects errors made by the previous ones and the ensemble can capture complex relationships between features. The XGBClassifier algorithm predicted IDC recurrence with over 99% accuracy and a 0% false-negative rate across nearly 2200 IDC patient samples, approximately half of which were recurrent cases. These results were obtained after fine-tuning the model based on the generated ROC curves, precision-recall curves, and confusion matrices. Feature importance scores, here representing the significance of mutated genes in IDC recurrence, were calculated with the XGBoost model for the 40 most commonly mutated genes among IDC patients. GATA3, CDH1, and BRCA1 received the highest feature importance scores despite relatively low mutation occurrences, indicating that our model did not simply associate gene importance with mutation frequency.
Our results prompt further inspection, using ontological tools, both of genes given high importance scores despite low mutation occurrence and of genes given low importance scores despite high mutation occurrence. The feature importance scores can aid in biomarker identification, cancer diagnosis, and drug development. Our approach can be generalized to other cancers to identify common biomarkers and deepen our understanding of cancer recurrence in general.
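
The sequential error-correction idea behind boosted tree ensembles, and the false-negative rate reported above, can be illustrated with a minimal sketch. This is not the study's pipeline and not XGBoost itself: it is a plain squared-error boosting loop over one-split trees (stumps), written with the Python standard library only. The features `gene_a`, `gene_b`, and `gene_c`, the toy labels, and all function names are hypothetical and chosen purely for illustration; XGBoost additionally uses regularization, second-order gradients, and deeper trees.

```python
from typing import List, Tuple

Row = List[int]
# (feature index, threshold, value if feature <= threshold, value otherwise)
Stump = Tuple[int, float, float, float]


def fit_stump(X: List[Row], residuals: List[float]) -> Tuple[Stump, float]:
    """Return the one-split tree that best fits the residuals, plus its squared error."""
    best: Stump = (0, 0.0, 0.0, 0.0)
    best_err = float("inf")
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            lv = sum(left) / len(left) if left else 0.0
            rv = sum(right) / len(right) if right else 0.0
            err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if err < best_err:
                best, best_err = (j, float(t), lv, rv), err
    return best, best_err


def stump_value(stump: Stump, row: Row) -> float:
    j, t, lv, rv = stump
    return lv if row[j] <= t else rv


def fit_boosted(X: List[Row], y: List[int], n_rounds: int = 20, lr: float = 0.5):
    """Fit stumps one after another, each on the errors left by the previous ones."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps: List[Stump] = []
    gain = [0.0] * len(X[0])  # gain-like importance: squared-error reduction per feature
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump, err = fit_stump(X, residuals)
        gain[stump[0]] += sum(r * r for r in residuals) - err
        stumps.append(stump)
        preds = [p + lr * stump_value(stump, row) for p, row in zip(preds, X)]
    return base, stumps, gain


def predict(base: float, stumps: List[Stump], row: Row,
            lr: float = 0.5, cutoff: float = 0.5) -> int:
    score = base + lr * sum(stump_value(s, row) for s in stumps)
    return int(score >= cutoff)


# Hypothetical toy data: 1 = gene mutated; label 1 = recurrence.
# By construction, recurrence depends on gene_a or gene_b only; gene_c is noise.
feature_names = ["gene_a", "gene_b", "gene_c"]
X = [[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)]
y = [int(a or b) for a, b, c in X]

base, stumps, gain = fit_boosted(X, y)
y_hat = [predict(base, stumps, row) for row in X]

# Confusion-matrix counts for the positive (recurrent) class.
tp = sum(1 for t, p in zip(y, y_hat) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y, y_hat) if t == 1 and p == 0)
false_negative_rate = fn / (fn + tp)

print("false-negative rate:", false_negative_rate)
print("importance:", dict(zip(feature_names, gain)))
```

In this toy setting the importance score tracks how much each feature reduces prediction error rather than how often it is mutated, which is the distinction the abstract draws for GATA3, CDH1, and BRCA1. The metrics here are computed on the training rows only; a real evaluation would use held-out data.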

Keywords: cancer; invasive ductal carcinoma; recurrence; prediction; diagnosis; machine learning; xgboost; IDC; biomarker

 
 