Mol2Net. Complex Networks of anti-HIV Drugs Activity vs. Prevalence of AIDS in US Counties Using Symmetry Information Indices

Different aspects of the epidemiology, drugs, targets, chem-bioinformatics, and systems biology methods related to AIDS/HIV have been reviewed. Next, we developed a new model to predict complex networks of AIDS prevalence in U.S. counties, taking into consideration the Gini coefficient (income inequality) and activity/structure data of anti-HIV drugs in preclinical assays. First, we trained different Artificial Neural Networks (ANNs) using as input Markov and Symmetry information indices of social networks and of molecular graphs, respectively. We obtained the data on AIDS prevalence and the Gini coefficient from the AIDSVu database of the Rollins School of Public Health at Emory University, and the data on anti-HIV compounds from the ChEMBL database. To train/validate the model and predict the complex network, we analyzed 43,249 data points, including values of AIDS prevalence in 2,310 U.S. counties vs. ChEMBL results for 21,582 unique drugs, 9 viral or human protein targets, 4,856 protocols, and 10 possible experimental measures. The best model found was a Linear Neural Network (LNN) with Accuracy, Specificity, Sensitivity, and AUROC above 0.72-0.73 in training and external validation series. The new linear equation was shown to be useful for generating complex network maps of drug activity vs. AIDS/HIV epidemiology in the U.S. at county level.


Introduction
Human immunodeficiency virus (HIV) is a retrovirus belonging to the family of lentiviruses that causes AIDS. Retroviruses [1] can use their RNA and host DNA to make viral DNA, and are known for their long incubation periods. There are two types of HIV: HIV type 1 and HIV type 2. Despite progress, HIV [2] remains a public health challenge. After thirty years of the AIDS epidemic, there are over 34 million people living with HIV [3], and still 2.5 million new infections and 1.7 million deaths each year.
A useful chemoinformatics-pharmacoepidemiology model must be multilevel to account for both molecular and population structure, so it needs to process diverse types of input data. Initially, we need information about the anti-HIV drugs, such as the chemical structure of the drug (level i) and preclinical information, like biological targets (level ii), organisms (level iii), or assay protocols (level iv). Afterwards, we need to incorporate population structure descriptors (level v) that quantify the epidemiological and socioeconomic factors affecting the population selected for the study.
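The five levels of input data can be pictured as one record per data point. The following Python sketch is only an illustration of that layout: the class name, field names, and example values are hypothetical and do not come from the original dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MultilevelCase:
    """One data point combining molecular and population information.

    All field names are hypothetical labels for the levels named in the text.
    """
    drug_id: str                 # level i: which compound
    structure_descriptor: float  # level i: numeric descriptor of its structure
    target: str                  # level ii: biological target of the assay
    organism: str                # level iii: organism used in the assay
    assay_protocol: str          # level iv: assay protocol
    county: str                  # level v: population under study
    gini_coefficient: float      # level v: income inequality of that county


# Made-up example values, for illustration only.
example = MultilevelCase(
    drug_id="drug_A",
    structure_descriptor=1.5,
    target="target_A",
    organism="organism_A",
    assay_protocol="protocol_A",
    county="county_X",
    gini_coefficient=0.45,
)
```

Keeping the record immutable (`frozen=True`) is a design choice of this sketch; it makes each case safe to use as a key when grouping by assay condition or county.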

Results and Discussion
After analysis of the previous results, we decided to test the predictive power of these indices in a simpler model using the STATISTICA 6.0 software [4]. In so doing, we trained the LNN predictors using only each family of information indices of drugs (qIC5f) of order 5, their MA operators (ΔqIC5fj), and the fifth MA operator of the U.S. counties (ΔIa5s). The LNN model based on qIC51 (LNN-IC51) presented the highest values of Sn = 72.04/72.81 and Sp = 72.38/72.50 in the training/external validation sets (see Table 1). LNN-IC51 also presented the highest AUROC values in the training and validation series (0.73 and 0.74, respectively). Analyzing all the previous results for this dataset, we found that the ICk index appears to be the most important for predicting the drug structure-activity relationships; the other indices give lower classification values. The equation of the LNN-IC51 model is the following:
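As a reminder of how the Sn, Sp, and AUROC statistics quoted above are defined, the following Python sketch computes them for a toy set of observed and predicted classes. The function names, the toy data, and the 0.5 cutoff are assumptions made for illustration; they are not the STATISTICA procedure or the data used in this work.

```python
def classification_metrics(y_true, y_pred):
    """Sensitivity, specificity and accuracy (in %) for binary classes 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "Sn": 100.0 * tp / (tp + fn),  # fraction of true links recovered
        "Sp": 100.0 * tn / (tn + fp),  # fraction of true non-links recovered
        "Ac": 100.0 * (tp + tn) / len(y_true),
    }


def auroc(y_true, scores):
    """Probability that a random positive case outscores a random negative one.

    Ties count as one half; this pairwise count equals the area under the ROC curve.
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy data (made up): observed classes and model scores.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.7, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed cutoff of 0.5

metrics = classification_metrics(y_true, y_pred)  # Sn = Sp = Ac = 75.0
area = auroc(y_true, scores)                      # 15/16 = 0.9375
```

Reporting each statistic separately for the training and external validation series, as done in the text, simply means calling these functions once per series.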
Last, we used this LNN-ALMA model to generate/predict a complex network of the prevalence of AIDS in the United States at county level with respect to the preclinical activity of anti-HIV drugs (Figure 1). The bipartite network has two types of nodes (counties vs. drugs). Thus, this is a multiscale network similar to the bipartite networks of drugs vs. target proteins reported by other groups [5][6][7]. However, the drug nodes in the present network contain information about the molecules, i.e., chemical structure as well as assay conditions (target protein, organism, experimental measure, etc.). Additionally, the other set of nodes contains information about socioeconomic factors, such as the income inequality in the county. Multiscale networks of this type have been discussed by Barabasi et al. [8] as one of the more important tools to perform trans-disciplinary research. The links of this complex network are the outputs Laq(cj)pred = 1 of our model. In Figure 1, we illustrate the sub-network of AIDS prevalence vs. anti-HIV drug preclinical activity for the state of Florida. For instance, the model predicts high effectiveness for the drug Zidovudine to treat AIDS in Nassau County.
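The construction of the bipartite network from the model outputs can be sketched in a few lines of Python. This is a minimal illustration, assuming the predictions are available as (county, drug, predicted link) triples; the function name and every entry except the Nassau County/Zidovudine pair mentioned above are hypothetical.

```python
from collections import defaultdict


def build_bipartite_network(predictions):
    """Keep only pairs with predicted link = 1 and index them from both sides."""
    county_to_drugs = defaultdict(set)
    drug_to_counties = defaultdict(set)
    for county, drug, link in predictions:
        if link == 1:  # an edge exists only when the model predicts a link
            county_to_drugs[county].add(drug)
            drug_to_counties[drug].add(county)
    return dict(county_to_drugs), dict(drug_to_counties)


# Toy predictions; "county_X" and "drug_B" are made-up placeholders.
predictions = [
    ("Nassau County, FL", "Zidovudine", 1),
    ("Nassau County, FL", "drug_B", 0),
    ("county_X", "Zidovudine", 1),
    ("county_X", "drug_B", 1),
]

county_to_drugs, drug_to_counties = build_bipartite_network(predictions)
# Node degree on each side is the size of the corresponding set.
county_degree = {c: len(d) for c, d in county_to_drugs.items()}
```

Storing the edges as two dictionaries of sets keeps both node types explicit, which matches the counties vs. drugs layout of the network and makes a state-level sub-network a simple filter on county names.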

Materials and Methods
In the present paper, we replaced the Balaban information indices (Iqk) with Symmetry information content indices (qICkf) [9]. These indices are calculated for the H-included molecular graph and are based on neighbor degrees and edge multiplicity [10,11]. The symmetry information indices are calculated by partitioning graph vertices into equivalence classes; two vertices are topologically equivalent if their neighborhoods of the kth order are the same. However, we used the Iak(s) indices to characterize the different populations. We used the software DRAGON [12] to calculate the qICkf indices of the molecules in the ChEMBL dataset of anti-HIV drugs. In this case, we calculated a total of Nindices = Nk·Nf = 6·5 = 30 values of qICkf indices, with Nk = 6 different orders (k) that belong to Nf = 5 different families of descriptors (f). We have used Markov chains to calculate Shannon information indices of different systems, including simulations of disease spreading relevant to epidemiology [13].
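The partition-based calculation can be illustrated with a short Python sketch. It is a simplified illustration, not the DRAGON algorithm: it assumes that at order 0 two atoms are equivalent when they are the same element, that each higher order refines the classes using the classes and bond multiplicities of the direct neighbors, and that the index is the Shannon entropy of the class sizes. The function names and the ethanol example are choices made here.

```python
import math
from collections import Counter


def class_sizes(atoms, bonds, order):
    """Sorted sizes of the vertex equivalence classes at a given order.

    atoms: list of element symbols, hydrogens included.
    bonds: list of (i, j, multiplicity) tuples over atom indices.
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j, multiplicity in bonds:
        neighbors[i].append((j, multiplicity))
        neighbors[j].append((i, multiplicity))

    labels = list(atoms)  # order 0 (assumed): same element, same class
    for _ in range(order):
        # Two atoms stay equivalent only if they were equivalent before and
        # see the same multiset of (bond multiplicity, neighbor class).
        refined = [
            (labels[i], tuple(sorted((m, labels[j]) for j, m in neighbors[i])))
            for i in range(len(atoms))
        ]
        # Map each distinct refined label to a small integer.
        ids = {label: n for n, label in enumerate(sorted(set(refined)))}
        labels = [ids[label] for label in refined]
    return sorted(Counter(labels).values())


def information_content(sizes):
    """Shannon entropy (bits) of a partition with the given class sizes."""
    total = sum(sizes)
    return -sum((s / total) * math.log2(s / total) for s in sizes)


# Ethanol, CH3-CH2-OH, with explicit hydrogens.
atoms = ["C", "C", "O", "H", "H", "H", "H", "H", "H"]
bonds = [(0, 1, 1), (1, 2, 1), (0, 3, 1), (0, 4, 1), (0, 5, 1),
         (1, 6, 1), (1, 7, 1), (2, 8, 1)]

sizes_by_order = {k: class_sizes(atoms, bonds, k) for k in range(3)}
# order 0: [1, 2, 6]; order 1: [1, 1, 1, 1, 5]; order 2: [1, 1, 1, 1, 2, 3]
ic_by_order = {k: information_content(s) for k, s in sizes_by_order.items()}
```

In this toy case the partition becomes finer as the order grows (the methyl and methylene hydrogens separate at order 2), so the information content increases with k, which is why indices of several orders carry different structural information.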
The codification of the chemical structure of the compounds is the first step here. We have data about a large number of assays carried out under very different conditions (cj) for the same or different targets (molecular or not). The non-structural information here refers to different assay conditions (cj) like concentrations, temperature, targets, organisms, etc. A solution may rely upon the idea of Moving Average (MA) operators, used in time series analysis with a similar purpose. We have developed a similar approach called ALMA (Assessing Links with Moving Averages), which also uses MA operators. ALMA models resemble the ARIMA models of time series analysis [14]. They are adaptable to all molecular descriptors and/or graph invariants or descriptors for complex networks. In consonance with the previous section, we use a similar terminology. The inputs of an ALMA model are the descriptors Dqk of the kth type for the qth system (compound or drug dq in this case), represented by a matrix M. On the other hand, the outputs of an ALMA model are the links (Laq = 1 or Laq = 0) of a complex network with Boolean matrix L, formed by different pairs of input systems. We developed different ANN models using the full set of parameters as well as simpler models using different subsets of descriptors. The new ALMA model developed using this other set of indices has the following general form:
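A minimal Python sketch of a model of this kind is given below. It rests on two assumptions that are not spelled out in the text: that each MA operator is the deviation of a compound's descriptor from the mean of that descriptor over all cases sharing the same assay condition cj, and that the score is a plain linear combination of the descriptor and its MA operators. The function names, condition labels, coefficients, and data are hypothetical placeholders, not fitted values from this work.

```python
from collections import defaultdict


def ma_operators(cases, descriptor, conditions):
    """Per case, deviation of `descriptor` from its mean under each condition."""
    totals = {c: defaultdict(float) for c in conditions}
    counts = {c: defaultdict(int) for c in conditions}
    for case in cases:
        for c in conditions:
            totals[c][case[c]] += case[descriptor]
            counts[c][case[c]] += 1
    return [
        {
            c: case[descriptor] - totals[c][case[c]] / counts[c][case[c]]
            for c in conditions
        }
        for case in cases
    ]


def linear_score(value, deltas, coefficients, intercept):
    """Assumed linear form: intercept + a0*D + sum over conditions of aj*deltaDj."""
    score = intercept + coefficients["descriptor"] * value
    for condition, delta in deltas.items():
        score += coefficients[condition] * delta
    return score


# Toy cases (made up): one descriptor value and two assay conditions each.
cases = [
    {"IC5": 2.0, "target": "target_A", "measure": "measure_1"},
    {"IC5": 4.0, "target": "target_A", "measure": "measure_2"},
    {"IC5": 6.0, "target": "target_B", "measure": "measure_1"},
]
deltas = ma_operators(cases, "IC5", ["target", "measure"])

# Hypothetical coefficients, chosen only to make the arithmetic easy to follow.
coefficients = {"descriptor": 1.0, "target": 0.5, "measure": -0.5}
scores = [
    linear_score(case["IC5"], d, coefficients, intercept=0.0)
    for case, d in zip(cases, deltas)
]
```

A population-level operator such as ΔIa5s would enter the same way, as one more additive term computed over counties instead of assay conditions; a link Laq = 1 would then be predicted when the score exceeds a chosen cutoff.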

Conclusions
This work presents a review of several aspects of the disease, including the epidemiology, pathophysiology, treatments, etc. We also developed a model called LNN-ALMA to generate complex networks of the prevalence of AIDS in the counties of the U.S. with respect to the preclinical activity of anti-HIV drugs. The best classifier found was LNN-IC51; this classifier has only six inputs based on neighborhood information content indices. Compared with the indices used in the other models, the ICk index seems to be the most important for predicting the drug structure-activity relationships. The new model has similar performance but is notably simpler than a previous model based on Balaban's information indices with >20 inputs.

Figure 1. Sub-network of AIDS prevalence vs. anti-HIV drug activity for the U.S. state of Florida (FL)

Table 1. LNN classifier for symmetry information indices of order 5