Development of QSAR Models for Identification of CYP3A4 Substrates and Inhibitors

The pharmacokinetic properties of absorption, distribution, metabolism and excretion (ADME) play a crucial role in drug discovery and development, since many drug candidates fail due to an inappropriate pharmacokinetic profile. Cytochrome P450 enzymes are predominantly involved in Phase 1 metabolism of xenobiotics. It is therefore important to better understand and predict substrate binding and inhibition of CYP450. The goal of this study was to obtain QSAR (Quantitative Structure-Activity Relationship) models to identify substrates and inhibitors of CYP3A4. The datasets were collected and curated from publicly available online databases and the literature. Several QSAR models were obtained and validated according to the recommendations of the Organisation for Economic Co-operation and Development (OECD). The combination of different descriptors and machine learning methods led to robust and predictive QSAR models with high coverage. The developed models were interpreted using predicted probability maps (PPMs). These maps help to identify the major structural fragments that classify compounds as inhibitors or non-inhibitors of CYP3A4. In conclusion, the obtained models can reliably identify substrates and non-substrates, and inhibitors and non-inhibitors of CYP3A4, which is very important in the early stages of the development of new


SciForum http://sciforum.net/conference/mol2net-1

Introduction
Many drug candidates fail in clinical trials because of an inappropriate pharmacokinetic profile. For this reason, studying the absorption, distribution, metabolism, excretion, and toxicity (ADME/Tox) properties of a drug candidate is important to reduce time and increase the chances of success during drug discovery and development [1]. ADME/Tox properties are major contributors to the failure of new drugs in the development pipeline, and the underlying biological mechanism of toxicity is often related to metabolism. Metabolic liability can lead to a number of diverse issues, including drug-drug interactions, in particular enzyme inhibition and induction, which in turn may cause therapeutic failure, toxicity, and adverse effects [2]. Cytochrome P450 (CYP) enzymes are predominantly involved in Phase 1 metabolism of xenobiotics. CYP3A4 is the most abundant cytochrome isoenzyme in the liver and is responsible for the metabolism of more than 50% of marketed drugs [3]. The main goal of this study was to develop robust and predictive models that classify compounds as inhibitors/non-inhibitors or substrates/non-substrates of CYP3A4, in order to identify and discard drug candidates with potential metabolism issues.

Results and Discussion
The statistical results of the QSAR models generated for substrates of CYP3A4 (dataset I), using the test set compounds, are summarized in Figure 1.

For inhibitors of CYP3A4 (dataset II), the two best binary and multiclass models were generated using the Morgan-SVM and Morgan-RF combinations (Figure 2). The binary models showed equal accuracy values of 0.76, which corresponds to the fraction of molecules correctly classified by the model. Furthermore, they showed sensitivity values of 0.74 and 0.77, respectively. The accuracy of these models was 0.77 and 0.78, respectively, whereas F1 was 0.76 for both models. The multiclass models were also generated using Morgan-SVM and Morgan-RF. The Morgan-RF model had a precision of 0.69, while the Morgan-SVM model had a precision of 0.66. The Morgan-RF model also had a slightly higher F1 value, 0.69 compared with 0.66 for the Morgan-SVM model. Overall, however, the multiclass and binary QSAR models showed similar statistical results. Therefore, both were considered the best models to evaluate the inhibition of CYP3A4.

In addition, predicted probability maps (PPMs) were generated using the Morgan-RF models. The maps for the drugs ketoconazole, tioconazole and miconazole are presented in Figure 3. Miconazole, ketoconazole and tioconazole are antifungal drugs and CYP3A4 inhibitors. The binary model classified all three drugs as CYP3A4 inhibitors, and the multiclass model classified them as strong inhibitors with high probability. The imidazole fragment in their structures is outlined in green, which indicates that this fragment contributes favorably to the investigated property. This fragment has atoms that are capable of coordinating with the iron of the heme group. The phenyl and thiophene rings are outlined in gray, which indicates a neutral contribution to the property. Gray isolines demarcate the separation between regions of favorable and unfavorable contribution.
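As a point of reference for the statistics reported above, the sketch below shows how accuracy, sensitivity, specificity, precision and F1 are conventionally derived from a binary confusion matrix. It is an illustration only: the names `ConfusionCounts` and `binary_metrics` and the counts used are hypothetical and are not taken from the study's KNIME workflow or its results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Confusion-matrix counts, with 'inhibitor' as the positive class."""
    tp: int  # inhibitors predicted as inhibitors
    fp: int  # non-inhibitors predicted as inhibitors
    tn: int  # non-inhibitors predicted as non-inhibitors
    fn: int  # inhibitors predicted as non-inhibitors


def binary_metrics(c: ConfusionCounts) -> dict:
    """Return accuracy, sensitivity, specificity, precision and F1.

    Assumes every denominator is non-zero, i.e. both classes are present
    and at least one compound is predicted positive.
    """
    total = c.tp + c.fp + c.tn + c.fn
    sensitivity = c.tp / (c.tp + c.fn)  # recall on the positive class
    precision = c.tp / (c.tp + c.fp)
    return {
        "accuracy": (c.tp + c.tn) / total,
        "sensitivity": sensitivity,
        "specificity": c.tn / (c.tn + c.fp),
        "precision": precision,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Hypothetical counts for illustration only; not results from the study.
example = ConfusionCounts(tp=75, fp=25, tn=80, fn=20)
metrics = binary_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because F1 is the harmonic mean of precision and sensitivity, it can differ from accuracy when the two classes are not equally represented in the test set.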

Materials and Methods
In this study, two large datasets were collected for profiling CYP3A4 activity. Dataset I contained 8,214 compounds, of which 475 are substrates of CYP3A4 and 7,739 are non-substrates (inactive). The annotated dataset was gathered from the literature [4] and a PubChem bioassay (Assay ID: 1851). Dataset II contained 9,186 compounds, of which 4,962 are inhibitors of CYP3A4 and 4,224 are non-inhibitors. The annotated dataset was gathered from the ChEMBL340 assay. All molecular modeling studies were performed using a workflow on the KNIME platform developed in our laboratory. Dataset curation (removal of duplicates, structural conversion, normalization of specific chemotypes, etc.) was performed using the Indigo Open Source Standardizer following the workflow described by Fourches et al. [5], including the duplicate analysis. Binary and multiclass QSAR models were developed and validated according to the OECD principles. To generate the QSAR models, we used the qsaR package, fully integrated into a KNIME 2.9 workflow [6]. A 5-fold cross-validation procedure was used to estimate the robustness of the models on the training set, while the test set was used to validate and estimate the predictive power of the generated models.

Because dataset I was highly unbalanced, building binary QSAR models on the entire dataset was not recommended. Therefore, a linear undersampling strategy was used to find the most adequate dataset balancing. We generated five under-sampled datasets with substrates-to-non-substrates ratios of 1:1, 1:2, 1:3, 1:4 and 1:8, in addition to the unbalanced dataset. Of the six datasets generated, the 1:1 balanced dataset and the full unbalanced dataset were selected because they gave the best statistical results and covered the largest chemical space. Various QSAR models were then generated using different types of descriptors and algorithms, in order to use more information from QSAR models. Four types of molecular fingerprints (Atom Pair [7], PubChem [8], MACCS [9] and FeatMorgan [10]) and four ML algorithms (SVM [11], GBM [12], PLSDA [13] and kNN [14]) were used for model generation, for a total of 16 different QSAR models.

For dataset II, the models for CYP3A4 inhibitors were generated using a 5-fold technique, i.e., splitting the dataset into a modeling set and an external validation set. We used only one type of molecular descriptor (Morgan) and two ML methods (SVM and RF [15]). For the multiclass models, the activity thresholds were defined as follows: strong inhibitor, ≤ 1 µM; weak-moderate inhibitor, between 1 µM and 10 µM; non-inhibitor, ≥ 10 µM [16]. PPMs [17] were generated to visualize the favorable (positive) and unfavorable (negative) structural fragments for a compound to be an inhibitor or non-inhibitor of CYP3A4.
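Two of the data-preparation steps described above, balancing dataset I at fixed substrate-to-non-substrate ratios and assigning multiclass labels from the activity thresholds, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the KNIME workflow used in the study: the function names `undersample` and `inhibition_class` are hypothetical, plain random sampling without replacement is swapped in because the details of the "linear" undersampling strategy are not given here, the compounds are placeholder integer IDs, and the potency values are invented.

```python
import random


def undersample(minority, majority, ratio, seed=0):
    """Keep every minority item and draw ratio * len(minority) majority items.

    Random sampling without replacement is an assumption made for this sketch.
    """
    n_keep = min(len(majority), ratio * len(minority))
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return list(minority), rng.sample(list(majority), n_keep)


def inhibition_class(potency_um):
    """Map a potency value in micromolar (uM) to one of the three classes."""
    if potency_um <= 1.0:
        return "strong inhibitor"
    if potency_um < 10.0:
        return "weak-moderate inhibitor"
    return "non-inhibitor"


# Placeholder IDs sized like dataset I (475 substrates, 7,739 non-substrates).
substrates = list(range(475))
non_substrates = list(range(475, 475 + 7739))

subset_sizes = {}
for ratio in (1, 2, 3, 4, 8):
    kept_pos, kept_neg = undersample(substrates, non_substrates, ratio)
    subset_sizes[ratio] = (len(kept_pos), len(kept_neg))
    print(f"1:{ratio} -> {len(kept_pos)} substrates, {len(kept_neg)} non-substrates")

# Hypothetical potency values in uM, chosen to land on each class boundary.
for value in (0.3, 1.0, 4.2, 10.0):
    print(value, "uM ->", inhibition_class(value))
```

Keeping all 475 substrates and sampling only the non-substrates means the 1:8 subset still uses 3,800 of the 7,739 inactive compounds, which is why the larger ratios retain more of the chemical space than the 1:1 subset.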

Conclusions
The largest publicly available datasets for substrates and inhibitors of CYP3A4 were collected, prepared and balanced. Robust and predictive QSAR models were generated for the identification of substrates (binary models) and inhibitors (binary and multiclass models). The obtained models can be used to identify substrates and inhibitors of CYP3A4 in the early stages of drug development. The PPMs highlighted the important contribution of some fragments that are probably responsible for interaction with the heme group of CYP3A4.

References
17. RINIKER, S.; LANDRUM, G. A. Similarity maps - a visualization strategy for molecular fingerprints and machine-learning methods. Journal of Cheminformatics, 2013, 1, p. 43.

© 2015 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions defined by MDPI AG, the publisher of the Sciforum.net platform. Sciforum papers authors retain the copyright to their scholarly works. Hence, by submitting a paper to this conference, you retain the copyright, but you grant MDPI AG the non-exclusive and unrevocable license right to publish this paper online on the Sciforum.net platform. This means you can easily submit your paper to any scientific journal at a later stage and transfer the copyright to its publisher (if required by that publisher). (http://sciforum.net/about).

Figure 1. Statistical results of predictions of QSAR models for CYP3A4 substrates evaluated by 5-fold external cross-validation.

Figure 2. Statistical results of predictions for the best binary and multiclass QSAR models for CYP3A4 inhibitors evaluated by 5-fold external cross-validation.

Figure 3. PPMs for selected antifungal drugs generated using Morgan-RF models. Green atoms/fragments contribute favorably to the property (CYP3A4 inhibition); gray atoms/fragments make no contribution; pink atoms/fragments contribute unfavorably to the property (CYP3A4 non-inhibition). The Morgan bit vector size was 1024 bits.