Information-Content-Informed Kendall-tau Correlation Methodology: Interpreting Missing Values as Useful Information

Robert Flight; Praneeth Bhatt; Hunter Moseley


Robert M Flight *,¹, Praneeth S Bhatt ², Hunter NB Moseley

¹ Department of Molecular and Cellular Biochemistry, College of Medicine, University of Kentucky, Lexington, 40536, USA
² University of Washington Graduate School, University of Washington, Seattle, 98195, USA

Academic Editor: Reza Salek

Published: 10 October 2025 by MDPI in The 4th International Electronic Conference on Metabolomics session Advanced Metabolomics and Data Analysis Approaches

Abstract:

Introduction: Almost all currently available correlation measures cannot handle missing values directly. Missing values are either ignored or imputed before the correlation coefficient is calculated. Both approaches treat the missing data as carrying no useful information, and the resulting correlation values are affected accordingly. Missing values occur in real metabolomics datasets for a variety of reasons, but frequently because specific analytes fall below the limit of detection of the analytical instrumentation (left-censored values). These values are not missing at random; by virtue of their "missingness" at one end of the data distribution, they represent potentially useful information.

Methods: To use this left-censored missingness in correlation, we propose the information-content-informed Kendall-tau (ICI-Kt) methodology. We show how left-censored missing values can be included within the definition of the Kendall-tau correlation coefficient, and how that inclusion can be interpreted as adding information to the correlation. We also implement calculations for additional measures of theoretical maxima and pairwise completeness that provide additional interpretive perspectives on the methodology.
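As an illustration of the idea of including left-censored values in a rank correlation, the following minimal sketch ranks every missing value below all observed values before computing a Kendall tau-b coefficient. It is not the authors' R or Python package: the function names (`kendall_tau_b`, `censored_kendall_tau`), the use of `None` or NaN to mark missing values, the choice to drop pairs missing in both vectors, and the tau-b tie correction are all assumptions made for this example.

```python
import math
from itertools import combinations


def is_missing(value):
    """Treat None and NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def kendall_tau_b(x, y):
    """Kendall tau-b for two equal-length sequences without missing values."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: contributes to neither term
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = math.sqrt(
        (concordant + discordant + ties_x_only)
        * (concordant + discordant + ties_y_only)
    )
    return (concordant - discordant) / denom if denom else math.nan


def censored_kendall_tau(x, y):
    """Kendall tau-b with missing values ranked below every observed value."""
    # Assumption: a pair missing in both vectors carries no ordering
    # information for this comparison, so it is dropped.
    pairs = [(a, b) for a, b in zip(x, y)
             if not (is_missing(a) and is_missing(b))]
    observed = [v for pair in pairs for v in pair if not is_missing(v)]
    if not observed:
        return math.nan
    # Any value below the smallest observed value works, because only the
    # ordering within each vector matters for a rank correlation.
    floor = min(observed) - 1.0
    filled_x = [floor if is_missing(a) else a for a, _ in pairs]
    filled_y = [floor if is_missing(b) else b for _, b in pairs]
    return kendall_tau_b(filled_x, filled_y)


# Hypothetical intensities for one analyte pair across five samples.
x = [None, 1.0, 2.0, 3.0, 4.0]
y = [0.5, None, 2.5, 3.5, 4.5]
tau = censored_kendall_tau(x, y)
```

In this toy example the two samples with a missing value still contribute concordant and discordant pairs, whereas dropping them would leave only three samples to compare.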

To test and validate the ICI-Kt methodology, we used a variety of simulated datasets with different types and amounts of missingness, as well as real metabolomics experiments from Metabolomics Workbench (MW). These include (but are not limited to) lipidomics of non-small cell lung carcinoma and small-molecule metabolomics of rat stamina (MW ID PR000016). In each case, we compared ICI-Kt to Kendall-tau, Spearman, and Pearson correlations.

Results: Tests on both simulated and real metabolomics datasets demonstrate that ICI-Kt performs better than Kendall-tau, Spearman, or Pearson correlation methods when left-censored values are present, both in detecting outlier samples and in constructing metabolite–metabolite networks.

We provide explicitly parallel implementations as R and Python packages, enabling fast calculations on large datasets.

Keywords: missing data; left-censored; correlation
