Ligand-based and structure-based virtual screening for the discovery of natural larvicides against Aedes aegypti

Abstract. The Aedes aegypti mosquito belongs to the order Diptera and is one of the main vectors of the etiological agents of several diseases, including dengue, yellow fever, Zika, and chikungunya. The aim of this study was to combine structure-based and ligand-based virtual screening (VS) techniques to select potentially larvicidal molecules against Ae. aegypti from an in-house secondary metabolite dataset (SistematX). From the ChEMBL database, we selected a set of 161 chemical structures with larvicidal activity against Ae. aegypti to build random forest models with accuracy higher than 82% for the cross-validation and test sets. The ligand-based virtual screening then selected 38 secondary metabolites, which were subsequently submitted to structure-based virtual screening. Finally, using a consensus analysis combining ligand-based and structure-based VS, five molecules were selected as potential larvicides against Ae. aegypti.


Introduction
The Aedes aegypti mosquito belongs to the order Diptera and is one of the main vectors of the etiological agents of several diseases [1]. According to the World Health Organization (WHO), the diseases transmitted by this insect are classified as neglected tropical diseases, as they mainly affect socioeconomically vulnerable populations, where there is little investment in their control or in the development of treatments [1][2][3]. This mosquito can transmit diseases such as dengue, yellow fever, Zika, and chikungunya, among others [1][2][3].
The life cycle of Ae. aegypti starts in the egg, from which larvae emerge. After going through four stages, the larvae turn into pupae and then into adult mosquitoes [1,4]. The eggs of this mosquito can remain viable for more than a year even without water, which poses a great challenge to the control of Ae. aegypti [1,4]. The main method for preventing the spread of these diseases is vector control, especially during the larval and adult stages [3,5].
Chemical pesticides, despite being effective, can cause several unwanted effects for both humans and the environment. In addition, resistance of Ae. aegypti to various chemical pesticides has already been reported in the literature [1,2,6,7,8]. Thus, the search for alternatives to combat the vector is extremely important. Secondary metabolites from plants can be an excellent source of new insecticides.
From this perspective, a combination of ligand-based and structure-based virtual screening techniques was applied to a dataset of secondary metabolites from Annonaceae to select the most promising larvicidal molecules against Ae. aegypti.

Dataset
From the ChEMBL database, a dataset of 161 chemical structures with larvicidal activity against Ae. aegypti was selected for the construction of predictive models. The compounds were classified using pIC50 = -log IC50 (mol/L), where IC50 represents the concentration required for 50% inhibition of Ae. aegypti. Compounds were classified as active (85; pIC50 > 4.15) or inactive (76; pIC50 < 4.15). After a literature search, 11 flavonoids and palmitic acid with known activity against Ae. aegypti larvae were added; based on the same pIC50 cutoff, six were identified as active and six as inactive.
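The classification rule described above can be sketched as follows. This is a minimal illustration using only the definition pIC50 = -log IC50 (mol/L) and the 4.15 cutoff given in the text; the function names and the example IC50 values are hypothetical, and treating a value exactly equal to the cutoff as inactive is an assumption, since the text does not cover that case.

```python
import math

PIC50_CUTOFF = 4.15  # activity cutoff stated in the text


def pic50_from_molar(ic50_molar: float) -> float:
    """Return pIC50 = -log10(IC50), with IC50 expressed in mol/L."""
    if ic50_molar <= 0:
        raise ValueError("IC50 must be a positive concentration")
    return -math.log10(ic50_molar)


def classify(pic50: float, cutoff: float = PIC50_CUTOFF) -> str:
    """Label a compound as active above the cutoff, inactive otherwise.

    Assumption: a value exactly at the cutoff is labeled inactive.
    """
    return "active" if pic50 > cutoff else "inactive"


# Hypothetical IC50 values, not taken from the dataset
for ic50 in (1e-5, 1e-4):
    value = pic50_from_molar(ic50)
    print(f"IC50 = {ic50:.0e} mol/L, pIC50 = {value:.2f}, class = {classify(value)}")
```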
A dataset of 1885 secondary metabolite structures from Annonaceae was extracted from our in-house database SistematX, available at http://sistematx.ufpb.br [9,10]. This dataset was used in the virtual screening to select the molecules with the highest probability of being active against Ae. aegypti. For all structures, SMILES codes were used as input data in Marvin [Marvin 19.27.0, 2019; ChemAxon (http://www.chemaxon.com)] [11], and the Standardizer software [JChem 19.27.0, 2019; ChemAxon (http://www.chemaxon.com)] [12] was used to canonize structures, add hydrogens, perform aromatic form conversions, clean the molecular graph in three dimensions, and save the compounds in sdf format.
Molecular descriptors encode the physicochemical properties of the molecules in each set. To obtain the molecular descriptors, the DRAGON 7.0 program was used [13].
The DRAGON 7.0 software can calculate 5270 molecular descriptors, covering several approaches. These molecular descriptors are arranged in 30 logical blocks [13]. The calculation was performed for all sets of chemical structures.

Predictive Model
The Knime 4.4.2 software (Knime 4.4.2, the Konstanz Information Miner, Copyright 2003-2021, www.knime.org) [14] was used to perform the analyses and to generate the in silico model. The datasets of molecules, along with their calculated descriptors and class variables, were imported from the Dragon 7.0 software. The dataset was divided using the "Partitioning" tool, with the "stratified sample" option, to create a training set and an external test set, which represented 80% and 20% of the compounds, respectively. Although the compounds were selected randomly, the same proportion of active and inactive samples was maintained in both sets. The predictive model was built with the Random Forest algorithm, using 50 trees and a seed of 1 for the random number generator.
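The stratified 80/20 partition described above can be illustrated with a short standard-library sketch. It is not the Knime "Partitioning" node and will not reproduce its exact assignment; the function name is hypothetical, the class sizes (85 active, 76 inactive) are those of the ChEMBL set, and whether the 12 literature compounds were part of the partitioned set is not stated, so they are left out here.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.8, seed=1):
    """Return (train_idx, test_idx) with similar class proportions in both sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)  # random selection within each class
        n_train = round(train_fraction * len(members))
        train_idx.extend(members[:n_train])
        test_idx.extend(members[n_train:])
    return sorted(train_idx), sorted(test_idx)


# Class sizes taken from the Dataset section
labels = ["active"] * 85 + ["inactive"] * 76
train, test = stratified_split(labels)
print(len(train), "training compounds,", len(test), "test compounds")
```

Splitting each class separately is what keeps the active/inactive ratio nearly identical in the training and test sets, which a plain random split does not guarantee for a dataset of this size.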
For internal validation, we employed cross-validation with 10 randomly selected, stratified groups; the distribution of the activity classes was maintained in all validation groups and in the training set. Descriptors were selected, and a model was generated from the training set with the Random Forest (RF) algorithm, using the WEKA nodes [15,16].
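As an illustration of the stratified 10-group cross-validation, the sketch below deals the compounds of each class round-robin into ten folds. It is a simplified stand-in for the Knime/WEKA workflow, not its implementation; the function name and the training-set class sizes are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=1):
    """Assign every sample index to one of n_folds, balancing the classes."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    offset = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal members round-robin, continuing where the previous class stopped
        for k, idx in enumerate(members):
            folds[(offset + k) % n_folds].append(idx)
        offset += len(members)
    return folds


# Hypothetical training set: 68 active and 61 inactive compounds
labels = ["active"] * 68 + ["inactive"] * 61
folds = stratified_folds(labels)
for number, fold in enumerate(folds, start=1):
    n_active = sum(1 for i in fold if labels[i] == "active")
    print(f"fold {number}: {len(fold)} compounds, {n_active} active")
```

In each round, one fold is held out for validation and the model is fitted on the remaining nine.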
The internal and external performances of the selected models were analyzed in terms of sensitivity (true positive rate, i.e., the rate of correctly classified active compounds), specificity (true negative rate, i.e., the rate of correctly classified inactive compounds), and accuracy (overall predictability). In addition, the Receiver Operating Characteristic (ROC) curve, which relates sensitivity and specificity, was used because it describes the true performance more clearly than accuracy alone. The most important descriptors in the generation of the prediction model were evaluated using Knime nodes.
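The three rates defined above follow directly from the confusion matrix. The sketch below uses hypothetical counts for a 32-compound test set; they are not the values reported in Table 1.

```python
def classification_rates(tp, tn, fp, fn):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # fraction of active compounds predicted active
    specificity = tn / (tn + fp)  # fraction of inactive compounds predicted inactive
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy


# Hypothetical counts, for illustration only
sens, spec, acc = classification_rates(tp=14, tn=12, fp=3, fn=3)
print(f"sensitivity = {sens:.3f}, specificity = {spec:.3f}, accuracy = {acc:.3f}")
```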
The model was also analyzed using the Matthews Correlation Coefficient (MCC), which evaluates the model globally from the results of the confusion matrix. The MCC is a correlation coefficient between the observed and predicted binary classifications. It takes a value between -1 and +1, where +1 represents a perfect prediction, 0 is no better than a random prediction, and -1 indicates total disagreement between prediction and observation [17]. The MCC can be calculated with the following formula (Equation 1):
$$\mathrm{MCC} = \frac{VP \times VN - FP \times FN}{\sqrt{(VP + FP)(VP + FN)(VN + FP)(VN + FN)}} \quad (1)$$
where VP is the number of true positives, VN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives.
The applicability domain (APD) was used to assess whether the predictions for the compounds of the test set were reliable. The APD is based on Euclidean distances: similarity measures between the descriptors of the training set compounds define the domain limit, so if a test set compound lies beyond this limit, its prediction is considered unreliable. The APD is calculated with the following formula (Equation 2):
$$\mathrm{APD} = d + Z\sigma \quad (2)$$
where d and σ are the mean and the standard deviation, respectively, of the Euclidean distances between the compounds in the training set. Z is an empirical cutoff value, set to 0.5 in this work [18,19].
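Both quantities in this paragraph can be sketched in a few lines. The MCC function uses the standard confusion-matrix formula with the VP, VN, FP, FN counts defined above. The applicability-domain part is a simplified reading of Equation 2: it assumes d and σ are taken over all pairwise training-set distances and that a compound is inside the domain when its distance to the nearest training compound does not exceed the threshold. Some published variants filter the distances first; the text does not say which variant was used, and the descriptor vectors below are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def mcc(vp, vn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((vp + fp) * (vp + fn) * (vn + fp) * (vn + fn))
    return (vp * vn - fp * fn) / denominator if denominator else 0.0


def apd_threshold(train, z=0.5):
    """APD = d + Z * sigma over all pairwise training-set distances (assumed variant)."""
    distances = [math.dist(a, b) for a, b in combinations(train, 2)]
    return mean(distances) + z * pstdev(distances)


def in_domain(compound, train, threshold):
    """True if the nearest training compound lies within the APD threshold."""
    return min(math.dist(compound, t) for t in train) <= threshold


# Hypothetical two-descriptor training set
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
limit = apd_threshold(train)
print(in_domain((0.5, 0.5), train, limit))  # close to the training compounds
print(in_domain((5.0, 5.0), train, limit))  # far from the training compounds
```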

Molecular Docking
The structure of the Ae. aegypti target protein (PDB ID 1PZ4) [20], together with its co-crystallized ligand, was downloaded from the Protein Data Bank (http://www.rcsb.org/pdb/home/home.do). All water molecules were deleted from the enzyme structure, and the enzyme and compound structures were prepared using the same default parameter settings in the same software package (score function: MolDock Score; ligand evaluation: internal ES, internal H-bond, sp2-sp2 torsions, all checked; number of runs: 10; algorithm: MolDock SE; maximum iterations: 1500; maximum population size: 50; maximum steps: 300; neighbor distance factor: 1.00; maximum number of conformations returned: 5). The docking procedure was performed using a grid with a radius of 15 Å and a resolution of 0.30 Å to cover the ligand-binding site in the structure of the enzyme.

Results and Discussion
For the Ae. aegypti model, the internal cross-validation and the external test showed similar statistical performance, with accuracy higher than 81%, indicating a model with good performance. Table 1 summarizes the statistical rates of the RF model. Two parameters were used to evaluate the quality of this binary model: the Receiver Operating Characteristic (ROC) curve and the Matthews correlation coefficient (MCC) [17]. The area under the curve was greater than 94% for the cross-validation set and greater than 91% for the test set, showing that the model achieves good classification and prediction rates. Figure 1 shows the ROC curves of the test and cross-validation sets for the model. Of the 1885 secondary metabolites analyzed, 1300 chemical structures fell within the applicability domain of the RF model, so their predictions are considered reliable. These 1300 molecules obtained predicted probabilities of activity between 50 and 91%.
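The area under the ROC curve reported above has a simple interpretation: it is the probability that a randomly chosen active compound receives a higher predicted score than a randomly chosen inactive one. The sketch below computes it by direct pairwise comparison; the scores and labels are hypothetical and are not the model outputs.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (active, inactive) pairs ranked correctly; ties count 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted probabilities (1 = active, 0 = inactive)
scores = [0.90, 0.80, 0.35, 0.60, 0.20]
labels = [1, 1, 1, 0, 0]
auc = roc_auc(scores, labels)
print(f"AUC = {auc:.3f}")
```

The pairwise loop is quadratic in the number of compounds, which is acceptable for a test set of a few dozen molecules.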
Thirty-eight molecules had a predicted probability equal to or greater than 80% and were selected for structure-based virtual screening. The chosen target was Sterol Carrier Protein-2 (PDB ID 1PZ4), a protein present in the intestine of Ae. aegypti, which makes it a relevant target for analyzing the larvicidal potential of secondary metabolites from Annonaceae.
The molecular docking (structure-based virtual screening) was first validated by redocking the original ligand into the 1PZ4 protein. The MolDock scores are listed in Table 2, along with the respective RMSD values and the energies from the PDB. Molecular docking was then performed for the 38 molecules with the best predictions (higher than 80%) in the RF model. Based on the binding energy values, all tested molecules were ranked using a probability calculation (Equation 3), where ps = structure-based probability; ETM = docking energy of the test molecule, with TM ranging from 1 to 38 (secondary metabolites dataset); EM = lowest energy value obtained among the tested molecules; and EL = energy of the ligand from the protein crystal structure. This equation normalizes the scores obtained from molecular docking (structure-based virtual screening) so that they can be compared with the active probability values from the ligand-based virtual screening [21][22][23]. In addition, a selection criterion is that the structures must have an energy lower than the value obtained for the ligand in the crystallographic study. The secondary metabolites were classified as active if their structure-based probability was greater than or equal to 0.5. The number of molecules with probability values greater than 0.5 and binding energies lower than that of the ligand was 53; only five molecules were not predicted as active in the molecular docking.
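Equation 3 is not reproduced in the text. The sketch below shows one normalization that is consistent with the definitions given (ps, ETM, EM, EL): each docking energy is divided by the lowest energy in the set, and a molecule is kept only if its energy is also lower than that of the crystallographic ligand. The form ps = ETM / EM is an assumption, not the confirmed equation, and the molecule names and energies are hypothetical.

```python
def structure_based_probability(energies):
    """Assumed normalization: ps = E_TM / E_M, with E_M the lowest docking energy."""
    e_min = min(energies.values())
    if e_min >= 0:
        raise ValueError("expected at least one negative docking energy")
    return {name: energy / e_min for name, energy in energies.items()}


def docking_actives(energies, ligand_energy, cutoff=0.5):
    """Keep molecules with ps >= cutoff and an energy below the ligand energy E_L."""
    ps = structure_based_probability(energies)
    return [
        name
        for name, energy in energies.items()
        if ps[name] >= cutoff and energy < ligand_energy
    ]


# Hypothetical MolDock energies, for illustration only
energies = {"mol_A": -120.0, "mol_B": -90.0, "mol_C": -50.0}
ligand_energy = -80.0  # hypothetical E_L

print(structure_based_probability(energies))
print(docking_actives(energies, ligand_energy))
```

Under this assumed form, the best-scoring molecule receives ps = 1 and weaker binders receive proportionally smaller values, which puts the docking results on the same 0 to 1 scale as the ligand-based probabilities.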
An approach combining structure-based and ligand-based virtual screening was carried out to identify potentially active molecules as well as their possible mechanism of action. This approach also seeks to minimize the probability of selecting false positive molecules, because it considers the scores of both virtual screening techniques and relates them to the true negative rate [21][22][23]. The calculation is done with Equation 4, where Pc is the combined probability, ps is the structure-based probability, Esp is the specificity rate of the cross-validation, and p is the ligand-based probability. In this equation, the weight of the ligand-based score increases with Esp, that is, as the false positive rate of the model decreases. Thus, the probability of selecting inactive molecules as active is minimized.
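Equation 4 is likewise missing from the text. One weighted average that matches the description, in which the ligand-based probability gains weight as the specificity Esp grows, is Pc = (ps + (1 + Esp) × p) / (2 + Esp). This form is an assumption made for illustration, not the confirmed equation, and the input values below are hypothetical.

```python
def combined_probability(ps, p, esp):
    """Assumed consensus score: Pc = (ps + (1 + Esp) * p) / (2 + Esp).

    ps  : structure-based probability
    p   : ligand-based probability
    esp : specificity (true negative rate) of the cross-validation
    """
    return (ps + (1.0 + esp) * p) / (2.0 + esp)


# Hypothetical values, for illustration only
pc = combined_probability(ps=0.75, p=0.80, esp=0.80)
print(f"Pc = {pc:.3f}")
```

With Esp = 0 the expression reduces to the plain mean of the two probabilities; a higher specificity shifts the weight toward the ligand-based model.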
Table 3 summarizes the results for the best-ranked molecules obtained using the combined approach, and Figure 2 shows the best-rated structures.
Table 3: Summary of the best-ranked structures obtained using an approach combining ligand-based and structure-based virtual screening; p = active probability value in ligand-based VS; ps = active probability value in structure-based VS; Pc = combined probability value.
Molecular docking was validated by redocking and RMSD. Redocking compares the conformation adopted by the ligand after redocking with the conformation of the crystallographic ligand. In this analysis, the conformation adopted by the ligand in the redocking and that of the crystallized ligand were very similar, validating the docking for this enzyme. Figure 3 shows the conformation of the ligand of the 1PZ4 enzyme obtained in the redocking superimposed on the conformation of the ligand in the X-ray crystal structure of the enzyme. The best molecules selected by the approach combining structure-based and ligand-based virtual screening share characteristics with the PDB ligand, palmitic acid. This leads us to believe that these characteristics are important for the activity potential of these molecules.
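The RMSD used in the redocking validation above measures the average displacement between matching atoms of the redocked pose and the crystallographic pose. The sketch below computes it without superposition, since both poses are expressed in the coordinate frame of the protein; it assumes the atoms are listed in the same order in both poses, and the coordinates are hypothetical.

```python
import math


def rmsd(pose_a, pose_b):
    """Root-mean-square deviation between two poses with matching atom order."""
    if len(pose_a) != len(pose_b) or not pose_a:
        raise ValueError("poses must be non-empty and contain the same number of atoms")
    squared = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(pose_a, pose_b)
    )
    return math.sqrt(squared / len(pose_a))


# Hypothetical coordinates in angstroms: the redocked pose is shifted by 1 along z
crystal_pose = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
redocked_pose = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0), (3.0, 0.0, 1.0)]
print(f"RMSD = {rmsd(crystal_pose, redocked_pose):.2f}")
```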

Conclusions
In this study, we selected five secondary metabolites as potential larvicides against Ae. aegypti through rapid approaches using ligand-based and structure-based VS of 1885 secondary metabolites from Annonaceae, obtained from an in-house database. The selected compounds have structural similarities with other secondary metabolites reported in the literature as antiviral compounds. The selected structures are a starting point for further studies aimed at developing new insecticidal compounds based on natural products.

Figure 1: ROC chart with the area under the curve for the Aedes aegypti model obtained with Random Forest. AUC: area under the curve; red line: internal cross-validation; blue line: external test.

Figure 3: Redocking of the ligand in the target protein 1PZ4 of Ae. aegypti. The blue conformation is that of the ligand in the X-ray crystal structure, and the red conformation is the one obtained by redocking.

Table 1: Summary of the parameters corresponding to the results obtained with the model.

Table 2: Docking energy (kJ/mol) of the PDB ligand for the 1PZ4 enzyme of Ae. aegypti: ligand energy from the MolDock score and RMSD value obtained from the redocking procedure.