Background
Various machine learning (ML) methods are applied to predict the individual clinical efficacy of cancer drugs and therapeutic regimens. Different multi-omics data may serve as ML features, such as genomic, transcriptomic, proteomic, and interactomic (activation levels of intracellular molecular pathways) profiles.
Methods
We proposed a next-generation ML approach termed FloWPS (FLOating-Window Projective Separator) that pre-processes (trims and filters) multi-omics features when building ML models, in order to preclude extrapolation in the feature space. In addition, FloWPS disregards preceding cases from the training dataset that lie too far in the feature space from the case to be classified. Both extrapolation and overly distant training instances can cause model overtraining and thereby decrease ML accuracy.
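The two trimming steps described above can be illustrated with a minimal sketch. It assumes one possible reading of the idea: a feature is kept for a given query case only if at least `m` training values lie on each side of the query value (so no extrapolation is needed along that axis), and only the `k` training cases nearest to the query are kept. The function name `trim_for_sample`, the parameters `m` and `k`, the Euclidean distance, and the toy data are all illustrative assumptions, not details stated in this abstract.

```python
import math

def trim_for_sample(train_X, query, m=1, k=3):
    """Return (kept_feature_indices, kept_training_row_indices) for one query case.

    Assumed rule, for illustration only:
    - keep feature j if at least m training values are below AND at least m
      are above the query value (the query is not extrapolated along j);
    - keep the k training rows closest to the query in the kept features.
    """
    kept_features = []
    for j in range(len(query)):
        below = sum(1 for row in train_X if row[j] < query[j])
        above = sum(1 for row in train_X if row[j] > query[j])
        if below >= m and above >= m:
            kept_features.append(j)

    def distance(row):
        # Euclidean distance restricted to the retained features
        return math.sqrt(sum((row[j] - query[j]) ** 2 for j in kept_features))

    order = sorted(range(len(train_X)), key=lambda i: distance(train_X[i]))
    kept_rows = sorted(order[:k])
    return kept_features, kept_rows

# Toy data: 4 training cases, 2 features (hypothetical values)
train_X = [
    [1.0, 5.0],
    [2.0, 6.0],
    [3.0, 7.0],
    [9.0, 8.0],
]
query = [2.5, 9.0]  # the second feature lies above every training value
features, rows = trim_for_sample(train_X, query, m=1, k=2)
```

In this toy case the second feature is dropped because classifying the query would require extrapolation along it, and the distant fourth training case is excluded. A global classifier would then be trained on the trimmed data for that query only.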
Results
Using the Gene Expression Omnibus (GEO), The Cancer Genome Atlas (TCGA), and Tumor Alterations Relevant for GEnomics-driven Therapy (TARGET) project databases, we selected 27 gene expression datasets for cancer patients, annotated with clinical response status. Within each dataset, all patients had the same cancer type and treatment regimen. The largest dataset included 235 patient cases and the smallest only 41. To form a robust set of marker features (gene expression levels), we applied a leave-one-out (LOO) cross-validation test that selected the genes with the highest AUC values for good-vs-poor responder discrimination.
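A LOO-based selection of marker genes by AUC can be sketched as follows. The abstract does not specify the exact procedure, so this is only one plausible reading: in each LOO fold, genes are ranked by their single-gene AUC on the remaining cases, and a gene is retained only if it is among the `top_n` genes in every fold. The helper names `auc` and `robust_genes`, the `top_n` parameter, the direction-agnostic AUC, and the toy expression values are assumptions made for illustration.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def robust_genes(X, y, top_n):
    """Keep genes that rank in the top_n by AUC in every leave-one-out fold."""
    n_samples, n_genes = len(X), len(X[0])
    selected = set(range(n_genes))
    for held_out in range(n_samples):
        rows = [i for i in range(n_samples) if i != held_out]
        labels = [y[i] for i in rows]

        def gene_auc(g):
            a = auc([X[i][g] for i in rows], labels)
            return max(a, 1.0 - a)  # treat up- and down-regulated markers alike

        ranked = sorted(range(n_genes), key=gene_auc, reverse=True)
        selected &= set(ranked[:top_n])
    return sorted(selected)

# Toy expression matrix: 6 patients x 3 genes (hypothetical values)
X = [
    [1.0, 5.0, 2.0],
    [1.2, 3.0, 3.5],
    [0.9, 4.0, 4.8],
    [3.0, 4.5, 3.0],
    [3.2, 3.5, 4.5],
    [2.9, 5.5, 5.0],
]
y = [0, 0, 0, 1, 1, 1]  # 0 = poor responder, 1 = good responder
markers = robust_genes(X, y, top_n=1)
```

Requiring a gene to survive every fold is a deliberately strict criterion chosen for this sketch; it makes the selected set insensitive to any single patient case.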
When the blind (agnostic) LOO approach was used for data trimming, the ML quality metrics (AUC, sensitivity, and specificity) of FloWPS-based clinical response classifiers improved substantially for all global ML methods applied: support vector machines (SVM), random forest (RF), binomial naïve Bayes (BNB), adaptive boosting (ADA), and multi-layer perceptron (MLP). Specifically, the AUC of the treatment response classifiers increased from the 0.61–0.88 range to 0.70–0.97.
Discussion
To rule out possible overtraining effects of data trimming, we evaluated the relative importance of gene expression features for the ML models. Our results showed that pre-processing significantly increases the correlation of feature importance metrics among different ML methods. Because different ML methods produce geometrically different models of class separation in the feature space, this increase in correlation indicates that FloWPS reveals essential features rather than adapting to random noise, and thus increases classifier accuracy.
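The agreement of feature importance across ML methods can be quantified, for example, as the mean pairwise correlation of per-gene importance vectors. The sketch below assumes Pearson correlation and uses made-up importance values; the abstract does not state which correlation measure or importance metric was used, and the names `pearson` and `mean_pairwise_correlation` are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def mean_pairwise_correlation(importances):
    """Average Pearson correlation over all pairs of methods.

    importances maps a method name to its per-gene importance vector.
    """
    names = sorted(importances)
    pairs = [(p, q) for i, p in enumerate(names) for q in names[i + 1:]]
    return sum(pearson(importances[p], importances[q]) for p, q in pairs) / len(pairs)

# Toy importance vectors for 4 genes (illustrative numbers, not results)
toy = {
    "SVM": [0.1, 0.2, 0.3, 0.4],
    "RF": [0.2, 0.4, 0.6, 0.8],
    "BNB": [0.4, 0.3, 0.2, 0.1],
}
agreement = mean_pairwise_correlation(toy)
```

Computing this quantity once for models built without trimming and once for models built with trimming gives a single number per condition that can be compared directly.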
Conclusion
In our ML trial with 27 clinically annotated cancer gene expression datasets, the BNB method showed the best performance with data trimming and was the most effective for classifying clinical response using multi-omics features, with minimum, median, and maximum AUC values of 0.77, 0.86, and 0.97, respectively.