Entropy Multi-Activity QSAR Models for Anti-Parasite Drugs Using Markov Entropy Indices

Parasite species differ widely in their susceptibility to antiparasite drugs. Computational methods in biology and chemistry that predict biological activity from Quantitative Structure-Activity Relationships (QSAR) substantially increase the potential of such studies by avoiding time- and resource-consuming experiments. Unfortunately, almost all QSAR models are unspecific or predict activity against only one species. To solve this problem, we developed a multispecies QSAR classification model (ms-QSAR). To do so, we used Markov chain theory to calculate new multi-target spectral moments and to fit, for the first time, an ms-QSAR model for 500 drugs tested in the literature against 16 parasite species and another 207 drugs not tested in the literature, using entropy-type indices. The data were processed by an Artificial Neural Network (ANN) that classifies drugs as active or nonactive against the different parasite species tested. The best ANN found was MLP 23:23-18-1:1. Overall classification accuracy was 85.65% (211/244 cases) in training. The model was validated with an external predicting series, in which it correctly classified 81.85% (275/357 cases).


Introduction
There is great interest in rational approaches to antiparasite drug discovery. In this context, theoretical studies such as quantitative structure-activity relationships (QSAR) may play an important role. Disappointingly, QSAR studies are generally based on databases that contain only structurally related compounds acting against a single microbial species (1-7). As a consequence, to predict the antiparasite activity of a given series of compounds, one has to use or seek as many QSAR models as there are species whose drug susceptibility is to be predicted. It is therefore very important to report a single unified equation that calculates the probability of activity of a given drug against different parasite species. The basic concept of parasite infection is that parasites are a very primitive life form that has been on this planet for much longer than humans. They have been very successful at what they do: they are not an endangered species in any place where they are known to exist (8,9). In particular, the spread and present distribution of many parasites throughout the world has largely been the result of human activities, and the advent of AIDS has added a new chapter to the history of parasitology. Infections caused by parasites have increased dramatically during the past decades and are among the most important infectious diseases in the world. The most important is the infection caused by Plasmodium spp.: millions of people are infected each year, and millions die each year.
To date, there are nearly 5000 molecular descriptors that may, in principle, be generalized and used to solve this problem. In addition, other QSAR approaches with demonstrated utility in medicinal chemistry have been introduced recently. In any case, none of these indices has yet been extended to encode information beyond chemical structure. Our group has introduced elsewhere a Markov model (MM) that encodes molecular backbone information, with several applications in bioorganic medicinal chemistry. The method was named the MARCH-INSIDE approach (MARkovian CHemicals IN SIlico DEsign). It allowed us to introduce matrix invariants such as stochastic entropies and spectral moments for the study of molecular properties. Specifically, the stochastic spectral moments introduced by our group have been widely used for small-molecule QSAR problems, including the design of flukicidal, anticancer, and antihypertensive drugs. Applications to macromolecules have been restricted to the field of RNA, without applications to proteins (10). The entropy-like molecular descriptors have demonstrated flexibility in many bioorganic and medicinal chemistry problems, such as estimating anticoccidial activity, modeling the interaction between drugs and HIV packaging-region RNA, and predicting protein and virus activity (11-13).
In recent studies, the MARCH-INSIDE method has been extended to encompass information about the molecular environment in addition to molecular structure. This new interpretation allows molecular thermodynamic entropy to be calculated for many physicochemical and biological processes. For instance, the approach can take into consideration not only the molecular structure of the drug but also the entropy of its interaction with the specific parasite organism that the drug has to eliminate. The present study develops a single linear equation based on these ideas to predict the antiparasite activity of drugs against different species.

Markov model for drug-target step-by-step interaction
Using the Chapman-Kolmogorov equations, we can calculate multi-target k Cθ,s(j) values for the atoms (nodes) of molecular graphs. As mentioned above, multi-target here means that we obtain different k Cθ,s(j) values for the same atom in the same molecule when the molecular target (bacterium, virus, parasite, receptor, enzyme, etc.) changes. First, we calculate the absolute probabilities s pk(j) with which the different j-th atoms interact with the specific target after k steps. Here, the targets are only different microbial species (s); for this reason, we insert the superscript s in the symbol of the centrality. These values are the elements of the vectors k π(s). These vectors belong to a Markov chain based on the stochastic matrix 1 Π, whose elements are the probabilities of interaction s p1(i,j) of the j-th atom given that another, i-th, atom has previously interacted with the target.
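Written out, the step-by-step probabilities described above follow the standard Chapman-Kolmogorov recursion for a Markov chain. The rendering below is only a sketch: the placement of the indices k and s and the symbol for the initial vector (k = 0) are assumptions about the notation, not a reproduction of the original formulas.

```latex
% Assumed notation: left superscripts k (step) and s (species) are typeset explicitly.
\begin{align}
  {}^{s}p_{k}(j) &= \sum_{i} {}^{s}p_{k-1}(i)\,{}^{s}p_{1}(i,j), \\
  {}^{k}\boldsymbol{\pi}(s) &= {}^{0}\boldsymbol{\pi}(s)\,\bigl({}^{1}\boldsymbol{\Pi}\bigr)^{k}.
\end{align}
```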
The specificity for one target is obtained by using target-specific weights in the definition of the elements of the matrix 1 Π. The theoretical foundations of the method have been given in previous works, so we do not detail them here but refer the reader to those works (14,15). The entropy centrality is then very easy to calculate: applying Shannon's formula to each element s pk(j) of the vectors k π(s) gives the entropy centrality measures k Cθ,s(j). As in example 1, we can sum the k Cθ,s(j) values over specific atom sets (AS), that is, groups of nodes, to create local molecular descriptors for the drug-target interaction. The AS used herein were: halogens (X), unsaturated carbons (Cins), saturated carbons (Csat), heteroatoms (Het), and hydrogens bound to heteroatoms (H-Het). The corresponding symbols of the local entropy centrality for these AS are: k Cθ,s(X), k Cθ,s(Cins), k Cθ,s(Csat), k Cθ,s(Het), k Cθ,s(H-Het), and k Cθ,s(T). In this study, we calculated the first six classes of entropy centrality (k = 0 to 5) for the 5 AS, giving 6 × 5 = 30 local molecular centralities for each drug (15). Below, we give the formulas for both the transition probabilities (elements of the matrix) and the atom-set entropy centrality measures. This methodology has been successfully tested previously; see the works of Gonzalez-Diaz, H. et al. (16-25).
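The calculation described above can be illustrated with a minimal Python sketch. It is not the MARCH-INSIDE implementation. The weighting scheme p1(i,j) = w_j a_ij / Σ_m w_m a_im, the self-loops in the adjacency matrix, the initial distribution proportional to the weights, the natural logarithm, and the three-atom toy molecule with its weights are all assumptions made for illustration; the function names are hypothetical.

```python
import math

def transition_matrix(adjacency, weights):
    """Row-stochastic matrix with ASSUMED form p1(i,j) = w_j*a_ij / sum_m(w_m*a_im)."""
    n = len(adjacency)
    matrix = []
    for i in range(n):
        row = [weights[j] * adjacency[i][j] for j in range(n)]
        total = sum(row)
        matrix.append([value / total for value in row])
    return matrix

def absolute_probabilities(adjacency, weights, k_max):
    """Return the vectors pi_k for k = 0..k_max (Chapman-Kolmogorov steps)."""
    n = len(weights)
    total_weight = sum(weights)
    pi = [w / total_weight for w in weights]  # assumed initial distribution (k = 0)
    matrix = transition_matrix(adjacency, weights)
    vectors = []
    for _ in range(k_max + 1):
        vectors.append(pi)
        # pi_{k+1}(j) = sum_i pi_k(i) * p1(i, j)
        pi = [sum(pi[i] * matrix[i][j] for i in range(n)) for j in range(n)]
    return vectors

def shannon_terms(pi):
    """Per-atom Shannon terms -p * ln(p); a zero probability contributes zero."""
    return [-p * math.log(p) if p > 0 else 0.0 for p in pi]

def atom_set_sum(per_atom, atom_indices):
    """Local descriptor: sum of the per-atom values over one atom set."""
    return sum(per_atom[j] for j in atom_indices)

# Hypothetical toy molecule: a C-C-O chain, with self-loops on the diagonal.
adjacency = [[1, 1, 0],
             [1, 1, 1],
             [0, 1, 1]]
weights = [2.5, 2.5, 3.5]                 # hypothetical target-specific atom weights
atom_sets = {"Csat": [0, 1], "Het": [2]}  # two of the five atom sets, for brevity

probs = absolute_probabilities(adjacency, weights, k_max=5)
descriptors = {
    (k, name): atom_set_sum(shannon_terms(pi), indices)
    for k, pi in enumerate(probs)
    for name, indices in atom_sets.items()
}
```

With two atom sets and k = 0 to 5 the sketch yields 12 descriptors; with the five atom sets listed above, the same loop would yield the 30 local centralities per drug.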

ANN models
ANN models are non-linear models useful for predicting the biological activity of a large data set of molecules. This technique is an alternative to linear methods such as LDA (26,27). Figure 1 depicts the network maps for some of the ANN models. In general, at least one ANN of every type tested was statistically significant. However, one must note that the profiles of each network indicate that these are highly nonlinear and complicated models (28-30).
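To make the non-linearity concrete, the following sketch shows a forward pass through a multilayer perceptron with 23 inputs, 18 hidden neurons, and one output. It is an illustration under stated assumptions, not the trained network: the logistic activation, the random placeholder weights, the 0.5 decision threshold, and all function names are hypothetical.

```python
import math
import random

def logistic(x):
    """Logistic activation (an assumed choice of non-linearity)."""
    return 1.0 / (1.0 + math.exp(-x))

def init_layer(n_in, n_out, rng):
    """Placeholder weights and zero biases for a dense layer; not fitted values."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def dense(inputs, weights, biases):
    """One dense layer: weighted sum plus bias, passed through the activation."""
    return [
        logistic(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def mlp_forward(descriptors, hidden_layer, output_layer):
    """Map a descriptor vector to a score in (0, 1)."""
    hidden = dense(descriptors, *hidden_layer)
    return dense(hidden, *output_layer)[0]

rng = random.Random(0)
hidden_layer = init_layer(23, 18, rng)   # 23 descriptors -> 18 hidden neurons
output_layer = init_layer(18, 1, rng)    # 18 hidden neurons -> 1 output
x = [rng.random() for _ in range(23)]    # hypothetical descriptor vector
score = mlp_forward(x, hidden_layer, output_layer)
label = "active" if score >= 0.5 else "nonactive"
```

Because each layer applies a non-linear activation to a weighted sum, the score is not a linear function of the descriptors, which is the main difference from a method such as LDA.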

Figure 1.
Network maps for some of the ANN models used in this manuscript.

Data set
The data set consisted of marketed and/or very recently reported antiparasite drugs with reported MIC50 values against different parasite species. It comprised 500 different drugs, each experimentally tested against some species from a list of 16. The data sets used were as follows. Training series: 115 active compounds plus 129 non-active compounds (244 in total). Predicting series: 114 + 243 = 357 in total.

Results and discussion
The data were processed by an Artificial Neural Network (ANN) that classifies drugs as active or nonactive against the different parasite species tested. The best ANN found was MLP 23:23-18-1:1. Overall classification accuracy was 85.65% (211/244 cases) in training. The model was validated with an external predicting series, in which it correctly classified 81.85% (275/357 cases). We compared different types of networks to obtain a better model; Table 1 shows the classification matrix of the different networks. MLP 23:23-18-1:1 was taken as the main network because it covered a wider range of variables, with 23 inputs in the first layer and 23 neurons in the second layer, and two sets of cases (training and validation). The other networks tested were: MLP 8:8-10-1:1, which presented high accuracy but classified only protein variables; PNN 190:190-14891-2-2:1, which had a very low percentage of PP, leading to possible errors in the model, although its accuracy was very good; and RBF 1:1-1-1:1, which had poor accuracy and uses only one variable, also leading to possible errors in the model (see Table 1). The network found was LNN, and it showed a training performance higher than 92.8%. The summary of results is shown in Table 1. After direct inspection of the results reported in Table 1 for the ANN methods, we can conclude that a complex ANN method predicts the activity better than LDA.
In Figure 2, we depict the ROC curve (31,32) for MLP 23:23-18-1:1 to show how reliable the network model is. Notably, the model presented an area under the curve higher than 0.5 (the value for a random classifier). A curve lying on the 0.5 line (black line in Figure 2) would mean that the model classifies no better than chance, that is, 50% correct. The area under the ROC curve of our model is instead close to one, which means that the model predicts correctly. The validity of this type of procedure for developing ANN-QSAR models has been demonstrated before (33); see, for instance, the work of Fernandez and Caballero (34). The same is true of the ANNs tested here, as illustrated by the ROC curve of the MLP, with an area higher than 0.98. We processed our data with ANNs looking for a better model. In general, the ANN MLP tested was statistically significant (27).
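The two quality measures discussed above, classification accuracy and the area under the ROC curve, can be computed as in the following sketch. The labels and scores are made-up toy values, not data from this study, and the function names are hypothetical. The area is computed from its rank interpretation: the probability that a randomly chosen active compound receives a higher score than a randomly chosen nonactive one, with ties counted as one half.

```python
def accuracy(y_true, y_pred):
    """Fraction of cases whose predicted class equals the observed class."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of actives and nonactives."""
    actives = [s for t, s in zip(y_true, scores) if t == 1]
    nonactives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for a in actives:
        for n in nonactives:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(actives) * len(nonactives))

# Toy example: 1 = active, 0 = nonactive; scores are hypothetical network outputs.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

toy_accuracy = accuracy(y_true, y_pred)   # 4 of 6 cases correct
toy_auc = roc_auc(y_true, scores)         # 8 of 9 active/nonactive pairs ranked correctly
```

A classifier that gives every compound the same score obtains an area of 0.5 under this definition, and one that ranks every active above every nonactive obtains 1.0, which matches the reading of the ROC curve given in the text.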

Figure 2. ROC curve for training and prediction of the LNN network

Comparison with previous ML models
The ANN model shows excellent results with a relatively small number of parameters (only 23) compared with some previously published Machine Learning (ML) models. To assess the importance of this result, we compared our model with other ML models used to address the same problem. For example, a notably more complicated ML model has been reported, which included a non-linear SVM model, a large number of parameters, and many class-to-class trials rather than the single QSAR equation used in this work (35,36). The other models included fewer than 20 input parameters or an unknown number of parameters, some used 1000 or more (5000+) cases, and several relied on non-linear techniques such as Support Vector Machines (SVM) (37-41). Our model is notably simpler and is the only one developed with 3D parameters (proteins). However, some of these other models have low accuracy, or report only the ROC curve or the correlation coefficient as the measure of classification quality, which makes the comparison more difficult (42-48); see Table 2 for details.

Conclusion
Using the MI approach, it is possible to obtain an ms-QSAR classifier that predicts the probability of antiparasite activity of drugs against more than 16 different parasite species. The model can be used as a tool for preliminary screening of drugs without relying on geometrical optimization of the drug, receptor, and drug-receptor complex structures, and without receptor alignment. We compared our model with different published models and concluded that a specific antiprotozoal mt-QSAR model is much more accurate than models that cover only a single target, and than mt-QSAR models that draw information from many targets but are not specific to protozoa. Hence the need to develop new mt-QSAR methodologies specific to antiprotozoal activity, in order to find and design better drugs.

Table 1.
Comparison of ANN classification models.