Chemoinformatics in antibacterial drug discovery: Simultaneous modeling of anti-enterococci activities and ADMET profiles through the use of probabilistic quadratic indices

Enterococci are Gram-positive bacteria responsible for multiple nosocomial infections in humans. Chemoinformatics could be a great ally of medicinal chemistry in the search for efficacious anti-enterococci drugs. Current methods cannot model anti-enterococci activity and ADMET (absorption, distribution, metabolism, elimination, toxicity) properties at the same time. We created the first multitasking model for quantitative-structure biological effect relationships (mtk-QSBER), focused on the simultaneous prediction of anti-enterococci activities and ADMET profiles of compounds. The mtk-QSBER model was constructed from a large and heterogeneous dataset of chemicals, and exhibited accuracy higher than 95% in both training and prediction sets. We provide physicochemical interpretations of the molecular descriptors (probabilistic quadratic indices) that entered the model. To demonstrate the practical utility of our model, we predicted multiple biological profiles of the investigational antibacterial drug oritavancin, and the virtual predictions converged strongly with the experimental evidence. To date, this is one of the most promising attempts to use a unified in silico model to guide drug discovery in antimicrobial research by predicting antibacterial potency against enterococci, as well as safety in laboratory animals and humans.


Introduction
Enterococci belong to a group of Gram-positive, facultative anaerobic bacteria that can occur both as single cocci and in chains. [1][3] When compared with other Gram-positive cocci such as bacteria of the genera Staphylococcus and Streptococcus, enterococci exhibit a lower degree of pathogenicity in terms of mortality, but they are reservoirs of antibiotic resistance genes. [4] Two problems hinder the elimination of enterococci: the lack of an efficient antimicrobial chemotherapy, and the undesirable ADMET (absorption, distribution, metabolism, elimination, toxicity) profiles that an antibacterial agent may have. On the one hand, medicinal chemistry has brought great benefits to drug discovery, as a discipline focused on the design, identification and preparation of biologically active compounds, and on the study of their mechanisms of action at the molecular level as well as their ADMET parameters. [7] At the same time, even with the alliance of medicinal chemistry with powerful experimental techniques such as high-throughput screening (HTS) and combinatorial chemistry, the chemical space to be covered in the search for new therapeutic agents with the desired properties is huge ($10^{63}$ small and medium-size molecules). [8] On the other hand, even if a drug candidate is discovered, serious concern remains over a possible lack of adequate ADMET properties, [9][10][11] which is still one of the principal reasons why drugs fail to gain approval.
Chemoinformatics has offered very useful theoretical/computational tools in drug discovery, [12][13][14][15] helping to rationalize chemical synthesis as well as the evaluation of biological and/or ADMET profiles, and consequently strengthening its link with medicinal chemistry. In fact, in recent years, promising chemoinformatic multitarget methodologies for quantitative-structure activity relationships (mt-QSAR) have been used to predict diverse biological profiles against dissimilar targets (biomolecules, microorganisms, cell lines, mammals) by using large and heterogeneous datasets of molecules. These methodologies have made possible the integration of different types of biological and chemical data under many sets of experimental conditions. [41][42][43][44][45][46] Until now, no methodology or approach has been able to predict anti-enterococci activity and ADMET properties at the same time. Such a methodology would be of prime interest in drug discovery because it would guide the drug design process through several stages, from in vitro assays to clinical studies. For this reason, this work introduces a promising chemoinformatic approach focused on constructing a multitasking model for quantitative-structure biological effect relationships (mtk-QSBER). The mtk-QSBER model performs simultaneous predictions of anti-enterococci activities and ADMET parameters, with the aim of searching for safer antibacterial drugs against the aforementioned pathogens.

Dataset, calculation of molecular descriptors and creation of the mtk-QSBER model
The chemoinformatic methodology used for the extraction of the dataset, the generation of the molecular descriptors, and the construction of the mtk-QSBER model has been reported in previous works of Gonzalez-Diaz and coworkers, [41][47] so only the essential details are given here (Fig. 1). The dataset was composed of 29309 different drugs/chemicals extracted from CHEMBL, [48] a public source available at https://www.ebi.ac.uk/chembl/. Each of the 29309 drugs/chemicals was assayed by considering at least 1 out of 19 measures of biological effect ($m_e$), against at least 1 out of 163 biological targets ($b_t$). These biological targets include biomacromolecules, bacterial strains, cell lines, and higher organisms such as mice, rats, and humans. All the experiments were carried out by considering at least 1 out of 3 types of assay information ($a_i$), with at least 1 out of 9 categories of target mapping ($t_m$), and at least 1 out of 3 levels of curation/reliability of the assays ($l_c$). Here, the combination of the elements $m_e$, $b_t$, $a_i$, $t_m$, and $l_c$ represents a unique experimental condition, which can be defined by the ontology $c_j \rightarrow (m_e, b_t, a_i, t_m, l_c)$. In our dataset, several chemicals were tested under more than one experimental condition. For this reason, the dataset contained 46822 cases resulting from the combination of the aforementioned elements of $c_j$. Each of the 46822 cases was annotated as positive [$BE_i(c_j) = 1$] or negative [$BE_i(c_j) = -1$] according to certain cutoff values (Table 1), where $BE_i(c_j)$ is a binary (categorical) variable that defines the specific biological effect of a compound $i$ under the experimental condition $c_j$.
A *.txt file containing the SMILES codes of the cases was manually renamed to *.smi, and then converted to *.sdf with the program OpenBabel 2.3.0. [49] The molecular descriptors were calculated from the *.sdf file with the software TOMOCOMD-CARDD. [50][53][54][55][56] The descriptors, mutual probability quadratic indices, can be defined according to the following expression:

$$Pq_k(x) = \sum_{i=1}^{n} \sum_{j=1}^{n} {}^{k}p_{ij}\, x_i x_j = [X]^T \mathbf{P}^k [X] \qquad (1)$$

In Eq. 1, $Pq_k(x)$ is the mutual probability quadratic index of order $k$, weighted by the atomic physicochemical property $x$. Here, $x$ is a molecular vector, and $[X]^T$ is the transpose of $[X]$, the latter being a column vector ($n \times 1$ matrix) with components $x_1, \ldots, x_n$ (the values of the physicochemical property for the $n$ atoms). In addition, the terms ${}^{k}p_{ij}$ represent the mutual probabilities of the adjacent vertices (atoms) $i$ and $j$, and are the elements of the $k$th power of the matrix $\mathbf{P}$. The elements ${}^{k}p_{ij}$ can be calculated according to the expression (Eq. 2) given in refs. [53][54][55][56].
Descriptors of the type $Pq_k(x)$ can only take into account the chemical structures of the molecules. By definition, a time series is a sequence of data points, typically measured over an interval of time. In this context, some formulations based on the Box-Jenkins approach transform a series by subtracting the mean of the series from the value of each data point. Applying this approach makes it possible to create new molecular descriptors that account for both the chemical structure and the diverse elements of the experimental condition/ontology $c_j$ under which compounds have been assayed. Thus, we can write an equation of the following form:

$$\mathrm{avg}Pq_k(x)c_j = \frac{1}{n(c_j)} \sum_{m=1}^{n(c_j)} Pq_k(x)_m \qquad (3)$$

In Eq. 3, $\mathrm{avg}Pq_k(x)c_j$ is the arithmetic mean of the $Pq_k(x)_m$ descriptors over all the compounds $m$ in a subset of size $n(c_j)$. In this context, $n(c_j)$ is the number of compounds that were assayed under the same element of the experimental condition $c_j$ and were also annotated as positive. It is important to emphasize that Eq. 3 was applied to each element of $c_j$. The subsequent equation can be written according to the following formalism:

$$\Delta Pq_k(x)c_j = Pq_k(x) - \mathrm{avg}Pq_k(x)c_j \qquad (4)$$

Here, Eq. 4 shows the deviation terms $\Delta Pq_k(x)c_j$, which are the Box-Jenkins moving averages; they take into account both the chemical structure of a compound and the element of the experimental condition (for instance, the biological target) under which the compound was assayed.
The training set was used to construct the mtk-QSBER model and contained 35212 cases, 18347 positive and 16865 negative. The prediction (test) set was employed to validate the model and comprised 11610 cases, 6052 positive and 5558 negative. Linear discriminant analysis (LDA) was used as the pattern classification technique to find the best model, with a forward stepwise procedure as the variable selection strategy. The program STATISTICA was used for this task. [59] The mtk-QSBER model follows an expression of the form:

$$BE_i(c_j) = a_0 + \sum b_i\, \Delta Pq_k(x)c_j \qquad (5)$$
In Eq. 5, $a_0$ is the constant, and $b_i$ represents the coefficients of the variables. It should be emphasized that the program STATISTICA takes the categorical variable $BE_i(c_j)$ and transforms it into a real-valued score that predicts the propensity of a drug/chemical $i$ to exhibit a certain biological effect under the experimental condition $c_j$. That score is then converted into the predicted categorical value of $BE_i(c_j)$. [62] The last five statistical indices were determined for both training and prediction (test) sets.
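The descriptor pipeline described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the TOMOCOMD-CARDD implementation: the matrix P is taken as given, and the function names, the dictionary layout of a case, and all toy values are hypothetical. The sketch evaluates the quadratic form of Eq. 1, the mean over positive cases that share one element of the experimental condition (Eq. 3), and the deviation of Eq. 4.

```python
from collections import defaultdict


def mat_power(p, k):
    """Return the k-th power of a square matrix given as a list of rows."""
    n = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = [
            [sum(result[i][t] * p[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)
        ]
    return result


def quadratic_index(p, x, k):
    """Eq. 1 as a quadratic form: sum over i, j of (P^k)_ij * x_i * x_j."""
    pk = mat_power(p, k)
    n = len(x)
    return sum(pk[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def condition_means(cases, element):
    """Eq. 3: mean descriptor over the positive cases sharing one element value."""
    sums, counts = defaultdict(float), defaultdict(int)
    for case in cases:
        if case["label"] == 1:  # only positive cases enter the average
            key = case["condition"][element]
            sums[key] += case["pq"]
            counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


def deviation(case, means, element):
    """Eq. 4: descriptor minus the mean for the case's element value."""
    return case["pq"] - means[case["condition"][element]]


# Toy two-atom example; P and x are illustrative values only.
P = [[0.0, 0.5], [0.5, 0.0]]
x = [1.0, 2.0]
pq1 = quadratic_index(P, x, 1)

# Hypothetical cases tested against the same biological target (b_t).
cases = [
    {"pq": 2.0, "label": 1, "condition": {"b_t": "target A"}},
    {"pq": 4.0, "label": 1, "condition": {"b_t": "target A"}},
    {"pq": 9.0, "label": -1, "condition": {"b_t": "target A"}},
]
means = condition_means(cases, "b_t")
deltas = [deviation(c, means, "b_t") for c in cases]
print(pq1, means, deltas)
```

Note that `deviation` raises a `KeyError` when no positive case exists for an element value; how such conditions were handled in the original workflow is not stated in the text.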

Mtk-QSBER model
With the aim of finding the most appropriate model, the principle of parsimony was applied. This means that the model exhibiting the highest statistical quality with as few descriptors as possible was selected. In this sense, the best mtk-QSBER model we found had five descriptors. The symbols of the molecular descriptors, together with their corresponding definitions, can be found in Table 2. The relatively small values of λ and p-level, and the large $\chi^2$, demonstrate the statistical quality of our mtk-QSBER model. [60] The mtk-QSBER model correctly classified 33755 out of 35212 cases in the training set, an accuracy of 95.86%, while in the prediction (test) set 11098 out of 11610 cases were correctly classified, an accuracy of 95.59%. Specific details regarding the percentages of correct classification are given in Table 3, while other important information concerning the chemical and biological data of all molecules, as well as their respective classifications, can be found in Supplementary Information 1 (Suppl. Inf. 1), available upon request to the authors. Furthermore, the descriptors of type $\mathrm{avg}Pq_k(x)c_j$ depending on the elements $m_e$, $b_t$, $a_i$, $t_m$, and $l_c$ appear in Supplementary Information 2 (Suppl. Inf. 2), available upon request to the authors. As final evidence of the quality and predictive power of the mtk-QSBER model, the areas under the ROC curves were determined, showing a value of 0.994 for both training and prediction sets (Fig. 2). This value demonstrates that our mtk-QSBER model is very different from a random classifier (area = 0.5). By analyzing Table 3, the ROC curves, as well as Suppl.3][44][45][46][47]
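The accuracies quoted in this paragraph follow directly from the reported counts, and the other indices named in the footnotes of Table 3 (sensitivity, specificity, Matthews correlation coefficient) are standard functions of the confusion matrix. The sketch below recomputes the two accuracies from the counts in the text; because the per-class counts of Table 3 are not reproduced here, the confusion matrix used for the remaining indices is a toy example with hypothetical values, and the function names are illustrative.

```python
import math


def accuracy(correct, total):
    """Percentage of correctly classified cases."""
    return 100.0 * correct / total


def sensitivity(tp, fn):
    """Percentage of positive cases predicted as positive."""
    return 100.0 * tp / (tp + fn)


def specificity(tn, fp):
    """Percentage of negative cases predicted as negative."""
    return 100.0 * tn / (tn + fp)


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 if a marginal is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Counts reported in the text for the training and prediction (test) sets.
train_acc = accuracy(33755, 35212)
test_acc = accuracy(11098, 11610)
print(f"training accuracy: {train_acc:.2f}%")
print(f"prediction accuracy: {test_acc:.2f}%")

# Toy confusion matrix (hypothetical values, not taken from Table 3).
tp, fn, tn, fp = 9, 1, 8, 2
print(sensitivity(tp, fn), specificity(tn, fp), matthews_cc(tp, tn, fp, fn))
```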

Fig. 2. Pictorial representation of the areas under the ROC curves.
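The area under a ROC curve equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, which is why a random classifier gives 0.5. The sketch below computes the area from that pairwise definition; it is a generic illustration with hypothetical scores, not the procedure used in STATISTICA, and it scales quadratically with the number of cases.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve from pairwise comparisons (ties count 0.5).

    labels use the coding of the text: 1 for positive, -1 for negative.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == -1]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical scores: one positive case is ranked below one negative case.
toy_scores = [0.9, 0.8, 0.4, 0.3]
toy_labels = [1, -1, 1, -1]
print(roc_auc(toy_scores, toy_labels))
```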
An interesting feature of our mtk-QSBER model, represented by Eq. 5, is that the molecular descriptors can be interpreted in terms of simple physicochemical and/or structural properties. First, it is necessary to emphasize that all these molecular descriptors based on mutual probabilities indicate that the global property of a molecule is influenced by the atomic contributions, expressed as quadratic functions of the physicochemical properties and depending on the number of times that certain connections (bonds) occur in the whole molecule. For this reason, the molecular descriptors involved in the construction of the mtk-QSBER model consider the atoms together with their corresponding chemical environments. Bond multiplicity is also accounted for by these descriptors. Thus, $\Delta Pq_2(R)m_e$ describes the increment in the molecular refractivity, and consequently the increment in the molecular polarizability and/or size, depending on both the chemical structure and the measure of the biological effect. This increment should occur in regions where atoms are placed at topological distances equal to 2 (two bonds between the atoms). The descriptor $\Delta Pq_0(H)b_t$ represents the diminution of the global hydrophobicity, taking into consideration the structure of the molecules and the biological targets against which they were tested. This last descriptor is constrained by the variable $\Delta Pq_2(H)a_i$, which expresses the increment of the hydrophobicity in regions where atoms are placed at topological distances equal to 2.
Thus, $\Delta Pq_2(H)a_i$ characterizes the structure of the molecules and the assay information, which means that the experiments exhibit different hydrophobicity requirements when they are carried out by assessing affinity (binding), by measuring the effects of the compound on a pathway, system or whole organism (functional), or by determining ADMET properties that involve key metabolic enzymes, cells, tissues and even organisms.
The increment in the hydrophobicity explained by $\Delta Pq_2(H)a_i$ is consistent with the diminution of the polar surface area in the same molecular regions, expressed by the descriptor $\Delta Pq_2(\mathrm{PSA})t_m$, which depends on the chemical structure and the target mapping, i.e., the degree of knowledge about whether an assay is directed at a general type of biological target. Finally, the diminution of the polar surface area is confirmed by the descriptor $\Delta Pq_5(\mathrm{PSA})l_c$, which characterizes the molecular regions where any two atoms are placed at topological distances equal to 5. This descriptor provides information about the variation in the molecular structure and the level of curation of the biological tests.

Oritavancin. Prediction of multiple biological effects
So far, from the analysis of Eq. 5, the tables, and the supplementary materials, we have demonstrated that our mtk-QSBER model can integrate different kinds of chemical and biological data. In fact, the dataset used to construct the model contains many chemical families of compounds, for which multiple biological effects associated with anti-enterococci activities and ADMET profiles have been predicted by considering dissimilar experimental conditions $c_j$. The purpose here is to show the practical applicability of the mtk-QSBER model. In this sense, we performed simultaneous predictions of many biological effects for the investigational antibacterial drug oritavancin (Fig. 3), which was originally discovered and developed by Eli Lilly. Later, The Medicines Company ran clinical trials toward a possible new FDA (Food and Drug Administration) application in 2013. [63] Very recently, in 2014, this antibacterial drug was approved by the FDA for the treatment of skin infections in the United States. Oritavancin has exhibited high antibacterial activity against sensitive, drug-resistant, and multidrug-resistant (MDR) strains. More specifically, this antibacterial drug has been reported with values of MIC$_{90}$ = 1.00 μg/ml (557.69 nM) against different strains belonging to the genus Enterococcus with dissimilar degrees of resistance to vancomycin. [64] The same value has been reported against Enterococcus faecium exhibiting resistance to vancomycin and ciprofloxacin. [65] Other reports indicate that oritavancin showed MIC$_{50}$ = 0.25 μg/ml (139.42 nM) and MIC$_{90}$ = 0.50 μg/ml (278.85 nM) against diverse enterococci strains. [66] According to all this experimental evidence, if the cutoff values given in Table 1 are used, oritavancin should be classified as positive (active against enterococci).
In the case of preclinical studies, experimental results show that the half-life ($t_{1/2}$) of oritavancin in rats with venous catheter-associated infection was 10 hours. [67] In a neutropenic mouse infection model, the $t_{1/2}$ was 33 hours and the area under the curve (AUC) was 562.17 μg.hr/ml (313.52 μM.hr). [68] All these parameters were determined after intravenous administration in laboratory animals. We did not find data reported for oritavancin in healthy laboratory animals. However, by analyzing the experimental evidence and the criteria regarding the use of cutoff values in Table 1, we can assume that, given the large values of half-life and AUC, oritavancin should be classified as positive (safe) for the ADMET parameters mentioned above in healthy mice and rats. On the other hand, a review devoted to clinical studies indicated that after intravenous administration in healthy human volunteers, oritavancin had a volume of distribution at steady state ($V_{ss}$) as large as 1.92 L/kg, with $t_{1/2}$ = 356 hours and AUC = 1111 μg.hr/ml (619.60 μM.hr). [69] These experimental ADMET values clearly demonstrate that, according to the cutoff values, the investigational antibacterial drug mentioned above can be considered as positive, i.e., oritavancin can be considered a safe drug.
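The molar values given in parentheses above can be reproduced from the mass-based values with a single division by the molar mass. The sketch below assumes a molar mass of about 1793.1 g/mol for oritavancin; this figure is not stated in the text and is used here because it is consistent with the conversions reported above. The function names are illustrative.

```python
# Assumption: molar mass of oritavancin, about 1793.1 g/mol (not stated in the text).
ORITAVANCIN_MW = 1793.1


def ug_per_ml_to_nm(value, mw):
    """Convert a concentration in ug/ml to nM: (ug/ml) / (g/mol) * 1e6."""
    return value / mw * 1e6


def ug_hr_per_ml_to_um_hr(value, mw):
    """Convert an AUC in ug.hr/ml to uM.hr: (ug.hr/ml) / (g/mol) * 1e3."""
    return value / mw * 1e3


# MIC values quoted in the text (ug/ml).
for mic in (1.00, 0.50, 0.25):
    print(f"{mic:.2f} ug/ml = {ug_per_ml_to_nm(mic, ORITAVANCIN_MW):.2f} nM")

# AUC values quoted in the text (ug.hr/ml).
for auc in (562.17, 1111.0):
    print(f"{auc} ug.hr/ml = {ug_hr_per_ml_to_um_hr(auc, ORITAVANCIN_MW):.2f} uM.hr")
```

The factor 1e6 arises because 1 μg/ml equals 1e-3 g/L, and 1 mol/L equals 1e9 nM; the factor 1e3 for the AUC follows in the same way with 1 mol/L equal to 1e6 μM.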
Predictions of diverse biological effects of oritavancin were performed with the mtk-QSBER model, considering 1311 different experimental conditions (combinations of the elements of $c_j$). The results of these predictions are given in Supplementary Information 3 (Suppl. Inf. 3), available upon request to the authors. These results suggest that oritavancin is very active against drug-sensitive and MDR strains belonging to different enterococci, including Enterococcus faecium and Enterococcus faecalis. Regarding the ADMET profiles in preclinical studies, the analysis of our predictions confirms that oritavancin exhibits desirable pharmacokinetic parameters, which converge strongly with the experimental reports. We could not find toxicological data for this drug, but the virtual assessment of several measures of toxicity such as LD$_{50}$ and TD$_{50}$ indicated that the appearance of toxic effects depends on the strain of mice and/or rats used in the assays.
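Predicting one molecule under many experimental conditions amounts to re-evaluating the linear score of Eq. 5 with the condition-specific means of Eq. 3. The sketch below illustrates this for a single condition. It is a hedged illustration only: the intercept, the coefficients, the descriptor values, the means, and the condition labels are placeholders, not the fitted model or data of this work, and the zero threshold used for classification is an assumption (the text states only that STATISTICA converts the score into a categorical value). The pairing of each descriptor with one condition element follows Table 2.

```python
from collections import namedtuple

# One experimental condition c_j = (m_e, b_t, a_i, t_m, l_c).
Condition = namedtuple("Condition", ["m_e", "b_t", "a_i", "t_m", "l_c"])

# (descriptor, condition element) pairs of the five model variables (Table 2).
MODEL_TERMS = [
    ("Pq2(R)", "m_e"),
    ("Pq0(H)", "b_t"),
    ("Pq2(H)", "a_i"),
    ("Pq2(PSA)", "t_m"),
    ("Pq5(PSA)", "l_c"),
]


def score(pq, means, coeffs, intercept, condition):
    """Linear score: intercept + sum of coefficient * (descriptor - condition mean)."""
    total = intercept
    for (descriptor, element), b in zip(MODEL_TERMS, coeffs):
        element_value = getattr(condition, element)
        delta = pq[descriptor] - means[(descriptor, element_value)]
        total += b * delta
    return total


def classify(value):
    """Assumed decision rule: positive (1) above zero, negative (-1) otherwise."""
    return 1 if value > 0 else -1


# Placeholder inputs (hypothetical; not the fitted coefficients or real data).
pq = {"Pq2(R)": 2.0, "Pq0(H)": 1.0, "Pq2(H)": 3.0, "Pq2(PSA)": 0.5, "Pq5(PSA)": 0.25}
condition = Condition("MIC90", "Enterococcus faecium", "functional", "tm_1", "lc_1")
means = {
    ("Pq2(R)", "MIC90"): 1.0,
    ("Pq0(H)", "Enterococcus faecium"): 2.0,
    ("Pq2(H)", "functional"): 2.0,
    ("Pq2(PSA)", "tm_1"): 1.0,
    ("Pq5(PSA)", "lc_1"): 0.75,
}
coeffs = [1.0, -1.0, 1.0, -1.0, -1.0]

value = score(pq, means, coeffs, 0.0, condition)
print(value, classify(value))
```

Looping this call over a list of `Condition` tuples would give one prediction per experimental condition for the same molecule.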
Finally, very useful information was extracted from the predictions made for ADMET profiles in clinical studies. Our analysis supports the safety of oritavancin because this drug was predicted to have good absorption and bioavailability (including measures such as Papp, F(%), and AUC), with excellent volume of distribution and good elimination. In terms of metabolism, the predictions indicate that oritavancin is not metabolized by major cytochromes P450 (CYPs) such as CYP1A2, CYP2C19, CYP2C9, CYP2D6 and CYP3A4. According to our predictions, only a few CYPs and other metabolizing enzymes are inhibited by oritavancin. These theoretical results complement the experiments, which have demonstrated that the metabolism of the drug mentioned above is very limited. [70] Consequently, the metabolism of oritavancin should be analyzed with caution.

Conclusion
With the fast emergence of drug-resistant enterococci strains, more innovative approaches for the rational discovery of antibacterial agents are needed. To address this problem, we have applied a chemoinformatic methodology through the generation of an mtk-QSBER model based on probabilistic quadratic indices. Our model was devoted to the virtual search for potent and safer anti-enterococci agents. The predictions of multiple biological effects performed for the antibacterial drug oritavancin demonstrate that the present mtk-QSBER model can serve as a guide for pharmaceutical and medicinal chemists through the different stages of drug discovery, from in vitro tests to preclinical and clinical studies. Our work also suggests the possibility of extending the present chemoinformatic methodology to integrate other pharmacological activities with dissimilar ADMET profiles. This constitutes a new horizon for the application of promising and innovative in silico tools in medicinal chemistry to support the design of new molecular entities with desired properties.

Fig. 1. Descriptive overview of the main steps involved in the development of the mtk-QSBER model.

Table 2. List of molecular descriptors that entered the final mtk-QSBER model.

| Symbol | Definition |
| --- | --- |
| $\Delta Pq_2(R)m_e$ | Deviation of the mutual probability quadratic index of order 2, weighted by the refractivity, depending on the molecular structure and the measure of biological effect |
| $\Delta Pq_0(H)b_t$ | Deviation of the mutual probability quadratic index of order 0, weighted by the hydrophobicity, depending on the molecular structure and the biological target |
| $\Delta Pq_2(H)a_i$ | Deviation of the mutual probability quadratic index of order 2, weighted by the hydrophobicity, depending on the molecular structure and the assay information |
| $\Delta Pq_2(\mathrm{PSA})t_m$ | Deviation of the mutual probability quadratic index of order 2, weighted by the polar surface area, depending on the molecular structure and the target mapping |
| $\Delta Pq_5(\mathrm{PSA})l_c$ | Deviation of the mutual probability quadratic index of order 5, weighted by the polar surface area, depending on the molecular structure and the level of curation of the experimental information |

Table 1. Summary of the cutoff values for the different measures of biological effects.
a The cutoff values represent the conditions under which a compound/case was assigned to the group of the positive cases.

Table 3. Performance of the mtk-QSBER model.
a NC: number of cases. b CCC: correctly classified cases. c Sensitivity. d Specificity. e Accuracy. f Matthews correlation coefficient.