SMILES Testing in Chemoinformatics Software for Prediction of Intermolecular α-Amidoalkylation Reactions
1  Department of Organic and Inorganic Chemistry, Faculty of Science and Technology, University of the Basque Country UPV/EHU, P.O.Box 644, 48080 Bilbao, Spain.
2  IKERDATA S.L., ZITEK, UPV/EHU, Rectorate Building, nº 6, Leioa, Greater Bilbao, Basque Country, Spain.
Academic Editor: Humbert G. Díaz


SMILES (Simplified Molecular Input Line Entry System) codes are a line-notation specification that describes the structure of chemical species using short ASCII (American Standard Code for Information Interchange) strings.

A SMILES string contains the same information as an extended connection table, but it is more convenient because it is a linguistic construct rather than a computer data structure. SMILES is also considerably more compact than most other ways of representing structures and requires less storage space. These properties open many possibilities for programmers working with chemical information, for instance:

· Keys to access the database

· Mechanism for researchers to exchange chemical information

· Chemical data entry system

· Part of languages for artificial intelligence or chemistry expert systems.

In this work, the SMILES codes of all compounds that take part in the intermolecular α-amidoalkylation reaction are used to calculate Markov chain molecular descriptors, which are then substituted into the equation of the regression model implemented in the MATEO software to predict ee(%). Correct recognition and identification of SMILES is therefore of vital importance when testing the software. In addition, verification and testing of the program revealed some weaknesses in the way the software is programmed.

Among the errors found in the MATEO software when identifying specific SMILES (Figure 1), a ring-closure error stands out in case 1. This defect arose in the Excel learning procedure, when the same SMILES was extended to the rest of the reactions. Because the SMILES in question ends in "1", Excel treated it as a number and propagated the error to the rest of the reactions. Although this was a slip by the experimentalist, the program was unable to detect the invalid SMILES.
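The kind of check that would catch this slip can be sketched in a few lines. This is a minimal illustration and not MATEO's own validation: the function name `unmatched_ring_closures` is hypothetical, the example strings are made up, and it assumes the Excel fill changed the trailing ring-closure digit so that a ring label is left without its partner.

```python
import re


def unmatched_ring_closures(smiles: str) -> set[int]:
    """Return ring-closure labels that are opened but never closed.

    Hypothetical helper: a purely textual check, not a full SMILES parser.
    """
    # Digits inside [...] are isotopes, hydrogen counts or charges,
    # not ring labels, so bracket atoms are masked first.
    outside_brackets = re.sub(r"\[[^\]]*\]", "*", smiles)
    open_labels: set[int] = set()
    # Ring labels are single digits or two-digit labels written as %nn.
    for token in re.findall(r"%\d{2}|\d", outside_brackets):
        label = int(token.lstrip("%"))
        # A label toggles: the first occurrence opens a ring, the second closes it.
        open_labels ^= {label}
    return open_labels


# Made-up example: the same lactam written correctly and with a changed last digit.
for candidate in ("O=C1CCCN1", "O=C1CCCN2"):
    leftover = unmatched_ring_closures(candidate)
    status = "ok" if not leftover else f"unmatched ring labels {sorted(leftover)}"
    print(candidate, "->", status)
```

Running such a check on every row before computing descriptors would flag the corrupted entries instead of passing them silently to the model.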

The SMILES of alkenes pose no serious problems in this work, because alkenes are absent from the reactions studied, but they should be taken into account if the model is extended to other types of reactions.

For case 3, the software initially failed to recognize the SMILES because of the hash symbol (#), which denotes a triple bond. This problem has since been corrected by Carracedo-Reboredo et al.
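To illustrate why a bond symbol can break recognition, the sketch below tokenizes a SMILES string with a regular expression that lists the triple-bond symbol explicitly. It is an assumption-laden illustration, not MATEO's parser and not the fix applied by Carracedo-Reboredo et al.; the names `TOKEN` and `tokenize` are hypothetical, and the grammar covers only a common subset of SMILES.

```python
import re

# Simplified SMILES token grammar (illustrative subset only).
TOKEN = re.compile(
    r"\[[^\]]+\]"      # bracket atom, e.g. [N+] or [O-]
    r"|Br|Cl"          # two-letter organic-subset atoms
    r"|[BCNOPSFI]"     # one-letter aliphatic atoms
    r"|[bcnops]"       # aromatic atoms
    r"|[-=#$:/\\]"     # bond symbols; '#' is the triple bond
    r"|%\d{2}|\d"      # ring-closure labels
    r"|[().*]"         # branches, component separator, wildcard
)


def tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into tokens, failing loudly on unknown characters."""
    tokens, pos = [], 0
    while pos < len(smiles):
        match = TOKEN.match(smiles, pos)
        if match is None:
            raise ValueError(
                f"unrecognized character {smiles[pos]!r} at position {pos}"
            )
        tokens.append(match.group())
        pos = match.end()
    return tokens


print(tokenize("CC#N"))  # acetonitrile, which contains one triple bond
```

If `#` were missing from the bond-symbol class, `tokenize` would raise at that position, which is the kind of explicit failure that makes such gaps easy to find during testing.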

Furthermore, the program does not take the chirality of the molecules into account (case 4). This has been partially remedied by multiplying the predicted ee(%) by the chirality of the catalyst (+1 or -1). The limitation has little effect on the prediction of ee(%) for the reactions studied in this work, since no chiral substrates have been reported in the literature for enantioselective intermolecular α-amidoalkylation reactions. Nevertheless, it is suggested that the software be optimized so that its use can be extended to future reactions with chiral reagents.
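The sign correction described above amounts to a single multiplication. The sketch below shows it under stated assumptions: the function name `signed_ee` is hypothetical, and which catalyst enantiomer is assigned +1 is a convention that is not specified here.

```python
def signed_ee(predicted_ee: float, catalyst_sign: int) -> float:
    """Apply the catalyst chirality sign to a predicted ee(%).

    Hypothetical helper: catalyst_sign is +1 or -1; the mapping of each
    catalyst enantiomer to a sign is an assumed convention.
    """
    if catalyst_sign not in (1, -1):
        raise ValueError("catalyst_sign must be +1 or -1")
    return catalyst_sign * predicted_ee


# Made-up value: the same predicted magnitude with opposite catalyst chirality.
print(signed_ee(92.0, +1), signed_ee(92.0, -1))
```

Because the sign comes only from the catalyst, this correction cannot account for chiral substrates or reagents, which is the limitation noted above.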

In addition, alternative SMILES representations of the same compound were tested, specifically for the nitro group and for aromatic groups. The software recognized only the first option for the nitro group, whereas both alternatives were recognized for aromatic groups.
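One way to work around a parser that accepts only one nitro notation is to rewrite the input into that notation beforehand. The sketch below converts the pentavalent form, `N(=O)=O`, into the charge-separated form, `[N+](=O)[O-]`, by plain string replacement. Which of the two MATEO accepts is not stated in this text, so the direction of the conversion is an assumption, as are the names `NITRO_REWRITES` and `to_charge_separated_nitro`; the list of textual variants is illustrative and not exhaustive.

```python
# Textual variants of the pentavalent nitro group and their charge-separated
# equivalents. Assumed, non-exhaustive list for illustration.
NITRO_REWRITES = [
    ("N(=O)(=O)", "[N+](=O)([O-])"),
    ("N(=O)=O", "[N+](=O)[O-]"),
    ("O=N(=O)", "O=[N+]([O-])"),
]


def to_charge_separated_nitro(smiles: str) -> str:
    """Rewrite pentavalent nitro groups into the charge-separated notation.

    Purely textual: it does not parse the molecule, so unusual spellings of
    the nitro group are left unchanged.
    """
    for pentavalent, charged in NITRO_REWRITES:
        smiles = smiles.replace(pentavalent, charged)
    return smiles


# Nitrobenzene written two ways in the pentavalent notation.
for candidate in ("c1ccccc1N(=O)=O", "O=N(=O)c1ccccc1"):
    print(candidate, "->", to_charge_separated_nitro(candidate))
```

A structure-aware toolkit would be more robust than string replacement; the point here is only that both notations can be mapped onto one before the descriptors are calculated.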

Finally, the recognition of SMILES for non-covalently linked compounds, with hydrogen bonds (case 7) and ionic bonds (case 8), was examined, since such entries are common in the original database. They arise either from residual solvent associated with the substrate, which is easy to identify because the substrate is much larger than the solvent, or from the solvent grouped with an impurity, where it is more laborious to recognize which portion is the solvent because the two molecules are similar in size. This study showed that the program cannot handle this type of SMILES, so they were cleaned manually in a previous work; even so, it is impossible to identify which part of the complex corresponds to the solvent and which to the impurity.

In view of this drawback, automated cleaning by the software is proposed, since SMILES codes can be used not only for α-amidoalkylation reactions but also for other types of reactions. One way of expanding the use of MATEO is to develop new chemoinformatic models for predicting the chemical reactivity of other reactions. In this sense, although this SMILES error is not critical in this master's thesis, because cases of non-covalently linked compounds are scarce, it needs to be solved because it will remain a source of problematic SMILES codes for our research group.
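The proposed automated cleaning could start from the fact that SMILES separates disconnected components with a dot. The sketch below splits a multi-component SMILES and keeps the largest fragment only when it is clearly larger than the runner-up, returning `None` when sizes are similar, which is the solvent/impurity situation described above. It is a hedged sketch rather than a MATEO feature: the names `heavy_atom_count` and `pick_main_component`, the `min_ratio` threshold of 2.0, and the example strings are all assumptions.

```python
import re
from typing import Optional

# Atom tokens: bracket atoms, two-letter halogens, one-letter aliphatic
# and aromatic atoms. Simplified subset for counting purposes only.
ATOM = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")


def heavy_atom_count(fragment: str) -> int:
    """Approximate fragment size by counting atom tokens in the string."""
    return len(ATOM.findall(fragment))


def pick_main_component(smiles: str, min_ratio: float = 2.0) -> Optional[str]:
    """Return the dominant component of a dot-separated SMILES, or None.

    None means the two largest components are too similar in size for the
    choice to be made automatically (min_ratio is an assumed threshold).
    """
    parts = sorted(
        (p for p in smiles.split(".") if p), key=heavy_atom_count, reverse=True
    )
    if not parts:
        raise ValueError("empty SMILES")
    if len(parts) == 1:
        return parts[0]
    largest, runner_up = heavy_atom_count(parts[0]), heavy_atom_count(parts[1])
    if largest < min_ratio * runner_up:
        return None  # ambiguous: leave for manual inspection
    return parts[0]


# Made-up entries: substrate with residual solvent, then two similar-sized species.
print(pick_main_component("CCO.c1ccccc1C(=O)O"))
print(pick_main_component("CCO.CC#N"))
```

Entries that return `None` would still need manual review, so this approach automates only the easy case in which the substrate is much larger than the accompanying solvent.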

Keywords: MATEO software; α-amidoalkylation reactions; SMILES; ee(%); prediction