Mass Spectral Matching for Compound Identification in Metabolomics: An Open-Source Python/Shiny Package

¹ Department of Oncology, School of Medicine, Wayne State University, Detroit, MI, USA
² Biostatistics and Bioinformatics Core, Karmanos Cancer Institute, Wayne State University, Detroit, MI, USA
³ Department of Pathology, School of Medicine, Wayne State University, Detroit, MI, USA
⁴ Epidemiology Division, Population Sciences in the Pacific Program, University of Hawaii Cancer Center, Honolulu, HI, USA
⁵ Pharmacology and Metabolomics Core, Karmanos Cancer Institute, Wayne State University, Detroit, MI, USA
⁶ Department of Chemistry, University of Louisville, Louisville, KY, USA

Academic Editor: Hunter Moseley

Published: 10 October 2025 by MDPI in The 4th International Electronic Conference on Metabolomics session Advanced Metabolomics and Data Analysis Approaches

Abstract:

Compound identification is a critical step in mass spectrometry-based metabolomics. Two widely adopted approaches are chemical structure library-based and mass spectral library-based compound identification. Both rely on calculating similarity scores between a query mass spectrum and reference spectra that are either experimentally acquired or generated in silico, so an accurate similarity measure for mass spectral matching is a key component of successful compound identification. To meet this need, we developed a user-friendly Python/Shiny package with a graphical interface for mass spectral matching-based compound identification. The package supports both nominal-resolution mass spectrometry data (e.g., GC-MS) and high-resolution data (e.g., LC-MS/MS). It provides comprehensive preprocessing options ahead of spectral matching, including filtering, weight factor transformation, low-entropy transformation, centroiding, and noise removal, and allows users to customize the order of these steps. The package also incorporates recently proposed entropy-based similarity measures, such as Shannon, Tsallis, and Rényi entropy correlations, alongside traditional cosine and binary similarity measures (1,2). Users can combine these similarity measures into mixture similarity measures with custom-defined weights. When a mass spectral library with known compound names is used, the parameters of specific preprocessing steps can be fine-tuned beyond their default settings. The package supports both untargeted and targeted metabolomics workflows and includes a command-line interface for batch processing. The implementation was validated on two reference libraries: a Webbook-based NIST GC-MS library and a GNPS-derived LC-MS/MS library.
On the Webbook-based NIST GC-MS library, the cosine similarity measure outperformed the three entropy-based similarity measures with respect to accuracy (Cosine: 84.28% [83.75%, 84.73%]; Shannon: 80.69% [80.17%, 81.25%]; Rényi: 81.09% [80.56%, 81.63%]; Tsallis: 81.68% [81.15%, 82.19%]). On the GNPS LC-MS/MS library, the cosine similarity measure slightly outperformed the three entropy-based similarity measures with respect to accuracy, although the 95% CIs overlap (Cosine: 69.07% [67.26%, 71.01%]; Shannon: 67.92% [66.02%, 69.95%]; Rényi: 68.63% [66.72%, 70.66%]; Tsallis: 68.94% [67.03%, 70.79%]).
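
The kind of scoring described above can be illustrated with a minimal sketch for nominal-resolution spectra. It is not the package's code or API: every function name here is hypothetical, the weight factor form (m/z raised to `a` times intensity raised to `b`) and its default exponents are assumptions, and the Shannon measure uses one common entropy-similarity formulation from the spectral entropy literature, which may differ from the package's definition. Tsallis and Rényi variants are omitted.

```python
import numpy as np


def align(query, reference):
    """Put two nominal-mass spectra ({m/z: intensity}) on a shared m/z axis."""
    mzs = sorted(set(query) | set(reference))
    q = np.array([query.get(mz, 0.0) for mz in mzs], dtype=float)
    r = np.array([reference.get(mz, 0.0) for mz in mzs], dtype=float)
    return np.array(mzs, dtype=float), q, r


def weight_factor(mzs, intensities, a=0.5, b=1.0):
    """Assumed weight factor transformation: (m/z)^a * intensity^b."""
    return (mzs ** a) * (intensities ** b)


def cosine_similarity(q, r):
    """Cosine of the angle between two aligned intensity vectors."""
    denom = np.linalg.norm(q) * np.linalg.norm(r)
    return float(q @ r / denom) if denom > 0 else 0.0


def shannon_entropy(p):
    """Shannon entropy (natural log) of a normalized intensity vector."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())


def shannon_similarity(q, r):
    """Entropy similarity: 1 - (2*S_merged - S_q - S_r) / ln(4)."""
    if q.sum() == 0 or r.sum() == 0:
        return 0.0
    p, s = q / q.sum(), r / r.sum()
    merged = (p + s) / 2.0
    gap = 2.0 * shannon_entropy(merged) - shannon_entropy(p) - shannon_entropy(s)
    return float(1.0 - gap / np.log(4.0))


# Hypothetical registry of measures available to the mixture score.
MEASURES = {"cosine": cosine_similarity, "shannon": shannon_similarity}


def mixture_similarity(q, r, weights):
    """Weighted average of several measures; weights are renormalized to sum to 1."""
    total = sum(weights.values())
    return sum(w * MEASURES[name](q, r) for name, w in weights.items()) / total


# Usage: score a made-up query spectrum against a made-up reference spectrum.
mzs, q, r = align({41: 12.0, 43: 100.0, 57: 48.0}, {41: 10.0, 43: 100.0, 71: 30.0})
q_w, r_w = weight_factor(mzs, q), weight_factor(mzs, r)
print(cosine_similarity(q_w, r_w))
print(shannon_similarity(q, r))
print(mixture_similarity(q, r, {"cosine": 0.7, "shannon": 0.3}))
```

In this sketch identical spectra score 1 and spectra with no shared peaks score 0 under both measures, so the mixture score stays on the same 0 to 1 scale regardless of the weights chosen.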

Keywords: liquid chromatography-mass spectrometry; gas chromatography-mass spectrometry; compound identification; similarity measure; cosine similarity; Shannon entropy; Rényi entropy; Tsallis entropy; Python


Jing Li