Introduction
Metastasis is the leading cause of cancer-related mortality, highlighting the need for molecular markers that distinguish primary from metastatic tumors. Genes involved in actin cytoskeleton organization and cell mechanics regulate cell migration and invasion and may reflect metastatic potential. This study investigated copy number alterations (CNAs) in a mechanobiological gene panel and evaluated their ability to predict metastatic status across multiple cancer types using statistical analysis and machine learning.
Methods
CNA data for FSCN1, GSN, ACTN4, EZR, CFL1, MYH9, PFN1, TAGLN, ECM1, ANXA2, and ANXA6 were obtained from cBioPortal for breast, pancreatic, and gastric tumors. The breast cancer datasets (TCGA Cell 2015; Provisional 2021; INSERM PLoS Med 2016) comprised 930 primary and 465 metastatic samples, together with their cBioPortal annotations. The pancreatic and gastric datasets (Broad 2019; TCGA Firehose Legacy; TCGA Nature 2014) were class-balanced using the Synthetic Minority Over-sampling Technique (SMOTE), which generates synthetic minority samples by interpolation; the balanced sets contained 212 and 306 samples per class, respectively, and were used for training only. CNA differences were assessed using the Kruskal–Wallis test. Machine learning models (MLMs), namely k-nearest neighbors, decision tree, logistic regression, and random forest, were trained on an 80%/20% training/test split, with hyperparameters optimized by random search. Two strategies were compared: breast cancer-only training and combined multi-cancer training, with evaluation on an independent breast cancer cohort not used in training.
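The interpolation step behind SMOTE can be illustrated with a minimal, standard-library sketch. It is not the study's implementation, which is not specified here; the function name `smote`, its parameters, and the toy CNA-like vectors are hypothetical and chosen only for illustration.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic samples interpolated between minority samples.

    Illustrative sketch only: each synthetic point lies on the segment
    between a randomly chosen minority sample and one of its k nearest
    minority neighbours (Euclidean distance).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by distance to the base sample.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: 4 minority samples with 3 gene-level CNA values each.
metastatic = [[1, 0, 2], [2, 1, 1], [1, 1, 2], [0, 0, 1]]
primary_count = 10  # assumed size of the majority class

# Oversample the minority class until both classes have the same size.
balanced_minority = metastatic + smote(
    metastatic, primary_count - len(metastatic), k=3
)
print(len(balanced_minority))  # 10
```

Because every synthetic sample is a convex combination of two observed samples, it stays within the range of the observed minority values. Restricting oversampling to the training data, as described above, keeps synthetic samples out of the evaluation set.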
Results
Most genes showed statistically significant CNA differences between primary and metastatic tumors, with cancer-specific patterns. All MLMs achieved stable classification performance, with random forest performing best. Breast cancer-only training yielded an accuracy of 0.78, precision of 0.77, recall of 0.78, and F1 score of 0.76, while multi-cancer training improved performance to an accuracy of 0.85, precision of 0.84, recall of 0.85, and F1 score of 0.84.
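The four reported metrics can be computed as in the sketch below. The averaging scheme is not stated here; support-weighted averaging is assumed because it makes recall equal to accuracy by construction, which is consistent with the reported values (0.78/0.78 and 0.85/0.85). The function name `weighted_scores` and the toy labels are hypothetical.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1.

    Assumption: per-class scores are averaged with weights equal to each
    class's share of the true labels.
    """
    n = len(y_true)
    support = Counter(y_true)
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / n
    precision = recall = f1 = 0.0
    for label in sorted(support):
        tp = sum(t == label and p == label for t, p in pairs)
        predicted = sum(p == label for p in y_pred)
        prec = tp / predicted if predicted else 0.0
        rec = tp / support[label]
        f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        weight = support[label] / n
        precision += weight * prec
        recall += weight * rec
        f1 += weight * f
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical toy labels: 6 primary and 4 metastatic tumors.
y_true = ["primary"] * 6 + ["metastatic"] * 4
# All primaries are classified correctly; half of the metastases are missed.
y_pred = ["primary"] * 6 + ["metastatic"] * 2 + ["primary"] * 2

scores = weighted_scores(y_true, y_pred)
print(scores)
```

In this toy case the weighted recall coincides with accuracy while precision and F1 differ, the same pattern seen in the reported figures.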
Conclusions
CNA profiling of a mechanobiological gene panel, analyzed using MLMs, represents a promising approach for distinguishing between primary and metastatic tumors. The improved performance under multi-cancer training indicates the pan-cancer relevance of this gene set and supports its potential utility as a tool for metastasis prediction.
