Machine Learning (ML) is used to learn the behavior of a system, and one of its purposes is the construction of new computational models. ML has shown success in areas as diverse as recommender systems, brain-computer interfaces, robotics, and chemistry.
Recently, Perturbation Theory (PT) operators and ML techniques have been combined to create powerful PTML (PT+ML) models, which have been applied to complex biological systems, e.g., to predict drug-protein interactions for target proteins involved in the dopamine pathway, as well as to problems in nanotechnology, materials science, etc.
This PTML method was developed by our group to search for models capable of predicting the values v_{ij} of multiple properties of an i^{th} system measured under different experimental conditions c_{j}. In general, PTML tries to predict an objective function f(v_{ij})_{obs}, obtained from the experimental value v_{ij}, expressed as a function f(v_{ij}) = f(s_{i}, c_{j})_{k} of the structure of the system (s_{i}) and the conditions c_{j} for a property of type k. PTML models can predict multiple properties of the system at the same time (multi-output and multi-objective), taking into account the variations (perturbations) with respect to a reference or expected value in the multiple input variables used to quantify the experimental conditions c_{j} = (c_{0}, c_{1}, c_{2}, …, c_{n}) and in the structural variables or molecular descriptors D_{ki} = (D_{0}, D_{1}, D_{2}, …, D_{n}) used to quantify the structure of the system (s_{i}).
The main application of this method is the study of molecular systems (drugs, proteins, vaccines, biomarkers, nanoparticles, etc.) with multiple values v_{ij} of the parameters to optimize, measured in numerous assays under different test conditions c_{j}. Through this model it is possible to obtain directly the values of a calculated function f(v_{ij})_{calc} = f(s_{i}, c_{j})_{calc} from a reference function f(v_{ij})_{ref} and the perturbation operators PTO(s_{i}, c_{j})_{k}.
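The additive structure f(v_{ij})_{calc} = f(v_{ij})_{ref} + PT operator terms can be sketched as a short function. This is a minimal illustration, not the authors' implementation; the coefficient values and operator values in the usage example are hypothetical.

```python
def ptml_predict(f_ref, pt_operators, coefficients, intercept=0.0):
    """Minimal sketch of a linear PTML prediction:
    f(v_ij)_calc = f(v_ij)_ref + a0 + sum_k a_k * PTO_k(s_i, c_j).
    Coefficients a_k would come from fitting (e.g., MLR); here they are inputs."""
    assert len(pt_operators) == len(coefficients)
    perturbation = sum(a * op for a, op in zip(coefficients, pt_operators))
    return f_ref + intercept + perturbation


# Hypothetical example: reference value 50.0 corrected by two PT operators.
print(ptml_predict(50.0, [2.0, -1.0], [0.5, 1.0]))  # 50.0 + 1.0 - 1.0 = 50.0
```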
Classification models return the probability that a specific system i, in a specific assay under known test conditions c_{j}, shows the desired levels of the values v_{ij} of the parameter to be optimized.
For a chemical reaction, the properties to be studied could be the yield (%) and the enantiomeric excess ee(%). In our case we focus only on ee(%), because our α-amidoalkylation reactions are enantioselective and their yield (%) is usually high. If the PTML model sought is formulated with the objective function f(v_{ij}) = ee(%), we are dealing with a regression model and the probability p(f(ee(%)_{ij}) = 1) is not calculated. If, instead, the model attempts to classify the reactions as having high excess (ee(%) > cut-off) or low excess, we are dealing with a classification model, with cut-off being a threshold value defined by the researcher. This model has the objective function f(v_{ij}) = f(ee(%) > cut-off) = 1 or 0. In that case we can obtain the probability p(f(v_{ij}) = 1) = p(f(ee(%) > cut-off)) that the system reaches the level ee(%) > cut-off. In the development of the MATEO program for regression models, the objective function f(v_{ij}) = ee_{R}(%) = dq·ee(%) was used, which quantifies the ee(%) of product R, where dq = 1 when R is the majority enantiomer and dq = -1 when S is the majority enantiomer.
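The two objective functions just described can be sketched directly from their definitions: the signed regression target ee_{R}(%) = dq·ee(%) and the Boolean classification target f(ee(%) > cut-off). The cut-off value below is only an example of a researcher-chosen threshold.

```python
def ee_r(ee_percent, majority_is_R):
    """Signed regression target ee_R(%) = dq * ee(%):
    dq = +1 when R is the majority enantiomer, dq = -1 when S is."""
    dq = 1 if majority_is_R else -1
    return dq * ee_percent


def classify(ee_percent, cutoff):
    """Boolean classification target: 1 when ee(%) > cut-off, else 0."""
    return 1 if ee_percent > cutoff else 0


# Example with a hypothetical cut-off of 70%.
print(ee_r(80.0, majority_is_R=False))  # -80.0 (S-major reaction)
print(classify(85.0, cutoff=70.0))      # 1 (high-excess reaction)
```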
The PTML models add the values of the perturbation operators to the values of the reference function f(v_{ij})_{ref}. Therefore, we need to calculate the values of the PTOs (Perturbation Theory Operators) in the data-processing step. This allows us to carry out a process of merging information with variables and conditions from different sources. Moving Averages (MA), multi-condition MA (MMA), double MA, and covariance operators are some examples of useful PTOs. Then we can use Multiple Linear Regression (MLR), Linear Discriminant Analysis (LDA), or other linear ML techniques to find the PTML model.
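As an illustration of the simplest PTO mentioned above, a moving-average operator can be computed as the deviation of each case's descriptor value from the mean of all cases measured under the same condition. This is a sketch of one common MA definition, not the exact operator set used in MATEO.

```python
from collections import defaultdict


def moving_average_operators(values, conditions):
    """For each case, return MA = D_i - <D>_cj, the deviation of its
    descriptor value from the mean over all cases sharing the same
    experimental condition c_j (one common moving-average PTO)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, c in zip(values, conditions):
        sums[c] += v
        counts[c] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    return [v - means[c] for v, c in zip(values, conditions)]


# Hypothetical descriptor values grouped by two conditions 'a' and 'b'.
print(moving_average_operators([1.0, 3.0, 10.0], ["a", "a", "b"]))
# [-1.0, 1.0, 0.0]: mean of 'a' is 2.0, mean of 'b' is 10.0
```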
The MATEO software that we have verified is based on regression models; it does not implement classification models. In the case of chemical reactions, classification models are often desirable to minimize the effect of possible errors in the experimental measurements and/or to obtain a final answer on the interest of the reaction. For this reason, in addition to the regression models implemented in MATEO, we set out to develop PTML classification models. To create the PTML-LDA model, Linear Discriminant Analysis was used, a statistical technique for classifying cases. This technique is of special interest when there are precision problems in the measurement of the observed experimental variable, such as ee_{R}(%), that make it difficult to obtain regression models. To use this technique, we must discretize the continuous variable ee_{R}(%), transforming it into a discrete or Boolean variable. Initially, two alternatives were proposed for the development of the model. The first is based on the classification of the data sets into three classes, defined by an objective variable or function f(ee_{R}(%))_{obs} that takes the values 1, 0, or -1. In this case, the observed function is f(ee_{R}(%))_{obs} = 1 when ee_{R}(%) > cut-off, f(ee_{R}(%))_{obs} = -1 when ee_{R}(%) < -cut-off, and f(ee_{R}(%))_{obs} = 0 otherwise (cut-off > ee_{R}(%) > -cut-off). The distribution into these groups is achieved by introducing two limit (cut-off) values, one positive and one negative. The "-1" group indicates an excess of the S enantiomer; the "0" group corresponds to inefficient reactions, including racemic mixtures (cut-off > ee_{R}(%) > -cut-off); and the value f(ee_{R}(%))_{obs} = 1 corresponds to an excess of the R enantiomer. Furthermore, the reference function (first input variable) was not transformed in this model.
Therefore, the same reference variable was used as in the previous regression models: f(ee_{R}(%))_{ref} = ee_{R}(%)_{ref}.
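The three-class discretization of the first strategy follows directly from the definitions above; only the cut-off value in the example is hypothetical.

```python
def three_class_label(ee_r_obs, cutoff):
    """Three-class objective of the first strategy:
    +1 when ee_R(%) >  cut-off  (excess of the R enantiomer),
    -1 when ee_R(%) < -cut-off  (excess of the S enantiomer),
     0 otherwise (inefficient reactions, including racemic mixtures)."""
    if ee_r_obs > cutoff:
        return 1
    if ee_r_obs < -cutoff:
        return -1
    return 0


# Example with a hypothetical cut-off of 70%.
print([three_class_label(e, 70.0) for e in (85.0, -85.0, 10.0)])  # [1, -1, 0]
```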
With this strategy, model development was tedious and it proved unfeasible to reach a percentage of correct classification greater than 70% in both the training and confirmation series while keeping the specificity and sensitivity percentages balanced. For this reason, this option was discarded.
The second possibility, like the previous one, makes use of the cut-off. This option is simpler because it orders the data set into two large classes defined by the objective function f(ee_{R}(%))_{obs} = 0 or 1. The case f(ee_{R}(%))_{obs} = 0 corresponds to reactions with a low ee_{R}(%) (cut-off > ee_{R}(%) > -cut-off). When f(ee_{R}(%))_{obs} = 1, there are two sub-cases: an excess of R or an excess of S. In order to differentiate these two sub-groups of f(ee_{R}(%))_{obs} = 1, the reference function at the input of the model was modified. In this new PTML-LDA model, the new reference function is f(ee_{R}(%))_{ref} = dq·ee_{R}(%)_{ref}.
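The two-class scheme and the modified reference function of this second strategy can be sketched as follows; again, the cut-off in the example is a hypothetical value.

```python
def two_class_label(ee_r_obs, cutoff):
    """Two-class objective of the second strategy: 1 when the excess exceeds
    the cut-off in either direction (R-major or S-major sub-cases),
    0 for low-excess reactions (cut-off > ee_R(%) > -cut-off)."""
    return 1 if abs(ee_r_obs) > cutoff else 0


def modified_reference(ee_ref, majority_is_R):
    """Modified reference input f(ee_R(%))_ref = dq * ee_R(%)_ref, used to
    distinguish the R (dq = +1) and S (dq = -1) sub-groups within class 1."""
    dq = 1 if majority_is_R else -1
    return dq * ee_ref


# Example with a hypothetical cut-off of 70%: an S-major, high-excess reaction.
print(two_class_label(-85.0, 70.0))              # 1
print(modified_reference(60.0, majority_is_R=False))  # -60.0
```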
These models were obtained with the STATISTICA software, which implements multiple techniques for variable selection. The most noteworthy are "All effects", which lets the user choose among the different variables to include based on expert criteria, and "Forward Stepwise", which selects variables automatically, the software choosing them by means of a Fisher (F) test. For the construction of the chemoinformatic model of the present work, the first option was used, making it possible to choose the most important perturbations under the different reaction conditions. In addition, the model uses 75% of the data set for training and the remaining 25% for confirmation.
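The 75/25 split and the LDA fit were carried out in STATISTICA; as a rough stand-in only, the same workflow can be sketched with scikit-learn. The data below are synthetic and for illustration, not the actual reaction data set.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the reference function and PT operator inputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # synthetic two-class target

# 75% of cases for training, 25% held out for confirmation, as in the text.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
print(lda.score(X_te, y_te))  # fraction of confirmation cases classified correctly
```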
On the other hand, the model returns a linear discriminant function, which provides a numerical result and the coefficients of the variables entered. Additionally, to achieve a good model, the following points must be taken into account:
· The percentages of correctly predicted cases should be in the 70-95% range for both the training and confirmation series.
· The percentages of specificity (class 0) and sensitivity (class 1) must be balanced. Otherwise the model may be misleading, since it would predict one of them (either specificity or sensitivity) correctly and the other poorly.
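The two acceptance criteria above can be sketched as a simple check. The 70-95% range follows the text; the tolerance used to judge whether sensitivity and specificity are "balanced" (10 percentage points here) is an assumption, since the text does not fix a value.

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = correctly predicted 1s / all observed 1s;
    specificity = correctly predicted 0s / all observed 0s."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = sum(1 for t in y_true if t == 0)
    return tp / pos, tn / neg


def is_acceptable(sens, spec, low=0.70, high=0.95, max_gap=0.10):
    """Both rates inside the 70-95% range and balanced within a tolerance
    (max_gap is an assumed value, not one given in the text)."""
    in_range = low <= sens <= high and low <= spec <= high
    return in_range and abs(sens - spec) <= max_gap


# Example confusion outcome: 3/4 true positives and 3/4 true negatives.
print(sensitivity_specificity([1, 1, 1, 1, 0, 0, 0, 0],
                              [1, 1, 1, 0, 0, 0, 0, 1]))  # (0.75, 0.75)
```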