Mol 2 Net Machine Learning and Atom-Based Quadratic Indices for Proteasome Inhibition Prediction

Atom-based quadratic indices are used in this work together with several machine learning techniques: support vector machine, artificial neural network, random forest and k-nearest neighbor. This methodology is used to develop two quantitative structure-activity relationship (QSAR) studies for the prediction of proteasome inhibition. A first set consisting of active and non-active classes was predicted with model accuracies above 85% and 80% in the training and validation series, respectively. These results provide new approaches to proteasome inhibitor identification through virtual screening procedures.


Introduction
SciForum http://sciforum.net/conference/mol2net-1
The ubiquitin-proteasome pathway (UPP) is responsible for the selective degradation of the majority of intracellular proteins in eukaryotic cells and regulates nearly all cellular processes [1]. Dysfunction of the ubiquitination machinery or of the proteolytic activity of the proteasome is associated with many human diseases [2]. Proteasome inhibitors have been developed and are effective for some disorders, but they sometimes show detrimental effects and resistance. Therefore, efforts are currently directed to the development of new therapeutics with adequate potency and safety properties that target enzyme components of the UPP [3,4].
Ligand-based molecular design and QSAR approaches are promising fields with several applications in drug development, which use a battery of novel molecular descriptors and different classification algorithms for in silico virtual drug screening studies [5,6]. In the present research, we use and compare a set of machine learning (ML) techniques, with the 2D atom-based quadratic indices as attributes, to perform the QSAR modeling of two datasets. The first dataset allows molecules with proteasome inhibitory activity to be separated from inactive ones, and the second provides the numerical prediction of the EC50.

Results and Discussion
In our classification study, we reduced the inactive subset by removing all the cases that fall outside the applicability domain of our model. The dataset therefore contains 705 chemicals, of which 258 are active and the remaining 447 are inactive. This 705-compound dataset, used for the classification studies, was split into 529 compounds in the training set (TS) and 176 compounds in the prediction set (PS). Based on the aspects mentioned above, a first step of non-supervised feature reduction filtering was done, using Shannon's entropy as a measure and keeping ca. 30% of the features (4143). In a second step, a supervised feature reduction filtering was done for the class problem, reducing the features by 70% and keeping a total of 1248 for the class data. These feature selection processes were carried out with the IMMAN software, an in-house program. Later, for the two-class data, the best subset search was done, resulting in 43 selected variables. Then wrapper methods associated with the ML techniques were applied to reduce the data sets, giving different combinations of data subsets. Finally, all these subsets were used to generate diverse ML-QSAR models, keeping those with the best results for each algorithm. The results for each ML technique used to develop classification QSAR models to predict proteasome inhibitors are shown in Fig. 1.
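The non-supervised filtering step can be illustrated with a small sketch. It is not the IMMAN implementation: it assumes each descriptor is discretized into equal-width bins and that descriptors are ranked by the Shannon entropy of the resulting histogram, keeping the top fraction. The function names, the bin count, and the toy descriptors are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values, n_bins=10):
    """Shannon entropy (in bits) of one descriptor after equal-width binning."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant descriptor carries no information
    width = (hi - lo) / n_bins
    # Map each value to a bin index; the maximum value goes into the last bin.
    bins = [min(int((v - lo) / width), n_bins - 1) for v in values]
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(bins).values())


def entropy_filter(columns, keep_fraction=0.30, n_bins=10):
    """Return the names of the highest-entropy descriptors.

    columns: dict mapping descriptor name -> list of values (one per molecule).
    """
    ranked = sorted(
        columns,
        key=lambda name: shannon_entropy(columns[name], n_bins),
        reverse=True,
    )
    n_keep = max(1, round(keep_fraction * len(ranked)))
    return ranked[:n_keep]


# Toy data (assumption): three made-up descriptors for eight molecules.
toy = {
    "q_const": [1.0] * 8,                  # constant column
    "q_binary": [0, 0, 0, 0, 1, 1, 1, 1],  # two equally filled bins
    "q_spread": [0, 1, 2, 3, 4, 5, 6, 7],  # every molecule in its own bin
}
kept = entropy_filter(toy, keep_fraction=0.34, n_bins=8)
```

Because the ranking ignores the class labels, this kind of filter only discards low-variability descriptors; relevance to activity is addressed by the later supervised and wrapper steps.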
As can be observed in Fig. 1, for the TS the models fitted with the random forest (RF) and multilayer perceptron (MLP) techniques showed the best accuracies (Ac = 90.17% and Ac = 89.22%), with Matthews correlation coefficient (MCC) values of 0.79 and 0.77, respectively. For the PS, the accuracy of these two QSAR models was 86.36% (MCC = 0.70) and 83.52% (MCC = 0.64), respectively. Moreover, low false positive rates can be observed, which ensures good performance in virtual high-throughput screening by diminishing the wrong evaluation of predicted positive cases. Fig. 1 also shows that RF outperforms the other models in most of the quality parameters. The rest of the models also showed adequate performance, with accuracy values above 85% for the TS and 80% for the PS.
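The quality parameters discussed here all derive from the confusion matrix of a two-class model. The sketch below shows the standard formulas for accuracy, MCC, and false positive rate; the function name and the counts in the example are hypothetical and are not the counts behind Fig. 1.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Accuracy (%), Matthews correlation coefficient, and false positive rate (%)."""
    total = tp + tn + fp + fn
    accuracy = 100.0 * (tp + tn) / total
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal sum is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    # Fraction of truly inactive compounds wrongly predicted as active.
    fpr = 100.0 * fp / (fp + tn) if (fp + tn) else 0.0
    return {"accuracy": accuracy, "mcc": mcc, "fpr": fpr}


# Made-up counts for 100 compounds (assumption, for illustration only).
metrics = classification_metrics(tp=30, tn=55, fp=5, fn=10)
```

MCC is useful alongside accuracy because the dataset is unbalanced (258 actives against 447 inactives): a model that favors the majority class can reach a reasonable accuracy while its MCC stays low.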

Materials and Methods
In this study, the atom-based quadratic indices molecular descriptors were calculated using the TOMOCOMD software version 1.0 [7]. We also tried the different feature selection methods implemented in the IMMAN software [8]. Moreover, the attribute selection method based on the Best Subset Search (BSS) of linear discriminant analysis (LDA) was used [9]. Later, the wrapper and ranker methods of the Waikato Environment for Knowledge Analysis (WEKA) [10] were considered. As a final stage, parameter tuning was performed for each ML technique to find the best ML-QSAR models.
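For readers unfamiliar with these descriptors, the sketch below assumes the quadratic-form definition commonly given for total atom-based quadratic indices, q_k(x) = sum over i and j of a_ij(k) * x_i * x_j, where a_ij(k) are the entries of the k-th power of the atom adjacency matrix and x_i is a property of atom i. This is an illustration in plain Python, not the TOMOCOMD implementation; the three-atom graph and the atomic property values are hypothetical.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][t] * b[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, k):
    """k-th power of a square matrix (k = 0 gives the identity)."""
    n = len(m)
    result = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, m)
    return result


def quadratic_index(adjacency, x, k):
    """Quadratic form x^T M^k x for adjacency matrix M and atomic properties x."""
    mk = mat_power(adjacency, k)
    n = len(x)
    return sum(mk[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


# Hypothetical three-atom chain A-B-C with made-up atomic property values.
adjacency = [[0, 1, 0],
             [1, 0, 1],
             [0, 1, 0]]
x = [2.0, 3.0, 1.0]
indices = [quadratic_index(adjacency, x, k) for k in range(3)]
```

Each order k yields one descriptor, so combining several orders with several atomic properties produces the large descriptor pool that the feature selection steps then have to reduce.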
A dataset derived from a luminescent cell-based dose titration retest counterscreen assay to identify inhibitors of the proteasome pathway was selected from PubChem BioAssay (AID 2486), where the names, structures, compound identifiers (CID), and activities can be found. First, the database was curated by removing salts and inorganic compounds. The main difficulty of ML approaches is selecting attributes from a large list of candidates to describe the data, because the complete set of molecular descriptors is not needed to describe proteasome inhibition. In this sense, the addition of non-relevant attributes can introduce noise into the ML systems [10]. Feature selection approaches are therefore very suitable for this kind of problem. In this work, different schemes of attribute selection, including filter and wrapper approaches implemented in WEKA [10], are examined to select the best attribute subset for each ML technique. Details, advantages and drawbacks of the two approaches are reviewed in many works dealing with this subject [11][12][13].
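Unlike a filter, a wrapper scores each candidate attribute subset with the learning algorithm itself. The sketch below is a generic greedy forward selection, not WEKA's implementation; `forward_select`, the nearest-centroid scorer, and the toy data are hypothetical stand-ins, and a real study would score subsets by cross-validation rather than by training accuracy.

```python
def centroid_accuracy(rows, labels, features):
    """Training accuracy of a nearest-centroid classifier on the chosen feature indices."""
    classes = sorted(set(labels))
    centroids = {}
    for c in classes:
        members = [r for r, y in zip(rows, labels) if y == c]
        centroids[c] = [sum(r[f] for r in members) / len(members) for f in features]
    correct = 0
    for r, y in zip(rows, labels):
        point = [r[f] for f in features]
        pred = min(
            classes,
            key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
        )
        correct += pred == y
    return correct / len(rows)


def forward_select(rows, labels, score, max_features=None):
    """Greedily add the feature that most improves score(rows, labels, subset)."""
    remaining = list(range(len(rows[0])))
    selected, best = [], 0.0
    while remaining and (max_features is None or len(selected) < max_features):
        trials = [(score(rows, labels, selected + [f]), f) for f in remaining]
        top_score, top_feature = max(trials, key=lambda t: t[0])
        if top_score <= best:
            break  # no remaining feature improves the score
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best


# Toy data (assumption): feature 0 is uninformative, feature 1 separates the classes.
rows = [[0.5, 0.1], [0.4, 0.2], [0.6, 0.0], [0.5, 0.9], [0.4, 1.0], [0.6, 0.8]]
labels = [0, 0, 0, 1, 1, 1]
selected, best = forward_select(rows, labels, centroid_accuracy)
```

The cost of this approach is one model fit per candidate subset, which is why the cheaper filter steps are usually applied first to shrink the attribute pool.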
Machine learning methods show impressive performance in a wide diversity of studies involving automated text classification and drug design [14][15][16]. On this basis, the machine learning approaches selected were support vector machine, artificial neural network and k-nearest neighbor, which are also included in the list of the top ten algorithms used in data mining [17]. In addition, the random forest technique was included because it is a fast and robust approach with recent successful applications to many problems [18][19][20]. For each ML method applied in this study, various attribute selection schemes were examined, and for each selected subset, various models were developed and checked.
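Of the methods listed, k-nearest neighbor is the simplest to show in full. The following is a minimal sketch assuming Euclidean distance and unweighted majority voting; it is not the WEKA implementation used in the study, and the two-descriptor vectors and labels are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    neighbors = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Toy descriptor vectors (assumption): two made-up descriptors per compound.
train_x = [[0.1, 0.2], [0.0, 0.3], [0.2, 0.1], [0.9, 0.8], [1.0, 0.9], [0.8, 1.0]]
train_y = ["inactive", "inactive", "inactive", "active", "active", "active"]

pred_a = knn_predict(train_x, train_y, [0.85, 0.9])
pred_b = knn_predict(train_x, train_y, [0.05, 0.15])
```

Because the prediction depends directly on distances in descriptor space, irrelevant attributes distort the neighborhoods, which is one reason attribute selection was examined separately for each ML method.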

Conclusions
In this work, a QSAR study on a diverse and enlarged proteasome inhibitor database collected from PubChem BioAssay is shown for the first time. The random forest algorithm proved to be the best technique for modeling proteasome inhibitory activity, with high accuracy values in the training and test sets. The low false positive rates observed validate the presented ML-QSAR workflow for discriminating active proteasome inhibitor compounds from inactive ones.

Figure 1. Performance of the ML-based QSAR classifiers.