Metabolomics generates large datasets whose interpretation requires advanced, complementary statistical tools to extract as much useful information as possible. Traditionally, various unsupervised and supervised pattern recognition methods have been employed in food traceability and authentication, including principal component analysis (PCA), partial least squares discriminant analysis (PLS-DA) and soft independent modelling of class analogy (SIMCA), among others. In addition, newer machine learning algorithms such as random forest (RF) and support vector machines (SVM) have emerged in food metabolomics in recent years owing to their excellent performance in the analysis of complex datasets. In this work, we show the advantages, limitations and complementarities of these statistical tools in food analysis, based on data acquired in various traceability studies performed in our research group with strawberry and extra virgin olive oil (1-4).
(1) I. Akhatou, R. González-Domínguez, Á. Fernández-Recamales. Investigation of the effect of genotype and agronomic conditions on metabolomic profiles of selected strawberry cultivars with different sensitivity to environmental stress. Plant Physiol. Biochem. 101 (2016) 14-22
(2) I. Akhatou, A. Sayago, R. González-Domínguez, Á. Fernández-Recamales. Application of targeted metabolomics to investigate optimum growing conditions to enhance bioactive content of strawberry. J. Agric. Food Chem. 65 (2017) 9559-9567
(3) A. Sayago, R. González-Domínguez, R. Beltrán, Á. Fernández-Recamales. Combination of complementary data mining methods for geographical characterization of extra virgin olive oils based on mineral composition. Food Chem. 261 (2018) 42-50
(4) A. Sayago, R. González-Domínguez, J. Urbano, Á. Fernández-Recamales. Combination of vintage and new-fashioned analytical approaches for varietal and geographical authentication of olive oils. In preparation