LINEAR REGRESSION MODELS OF MOULTING ACCELERATING COMPOUNDS WITH INSECTICIDE ACTIVITY AGAINST SILKWORM BOMBYX MORI L.

Dibenzoylhydrazine derivatives are insect growth regulators that act by inducing a lethal larval molting process in insects of the orders Lepidoptera and Coleoptera. This paper presents linear regression models of the ecdysone agonistic activity of dibenzoylhydrazine insecticides measured in cell lines of the silkworm Bombyx mori, a lepidopteran species. The structures were modeled with the PM7 semiempirical quantum chemical method using the MOPAC 2016 software. Several structural descriptors were derived from the energy-optimized structures and related to the insecticidal activity, expressed as pEC50 values, using the multiple linear regression (MLR) and partial least squares (PLS) methods. The dataset was divided into a training set and a randomly chosen test set (approximately 30% of the compounds) to assess the predictive power of the models by several parameters. According to the squared correlation coefficients, 0.827 and 0.78 for the MLR and PLS models, respectively, and to other statistical tests, the MLR model gave better fitting results and good predictive ability compared to the PLS model. The structural features that influence the ecdysone agonistic activity of dibenzoylhydrazine insecticides encode chemical information on molecular flexibility, are related to sigma and pi bonding patterns in the molecules, and include geometrical descriptors invariant to translation and rotation, which contain electronic and topological information.


INTRODUCTION
Dibenzoylhydrazine compounds are insect growth regulators that act by inducing an early and lethal larval molting process in vulnerable insects of the orders Lepidoptera and Coleoptera [1]. These compounds activate the ecdysone-type steroid receptor complex at lower concentrations than the natural hormone. The insect cannot remove them efficiently from its body, and as a consequence a constant state of ecdysteroid signaling is maintained, which prevents the insect from completing the molting process. Because the insect stays permanently trapped in the molting process and is unable to feed, it dies within a few days from desiccation and starvation.
The activity of ecdysteroids is mediated by a heterodimeric protein complex composed of the ecdysone receptor and ultraspiracle, which activates the transcription of the associated genes once the corresponding ligand molecule binds [2].
The molecular mechanism of action of ecdysteroids is still unknown because one of the three interaction sites of the hormone-receptor model is not present in some active compounds [3].
The objective of our study is to estimate the ecdysone agonistic activity of dibenzoylhydrazine insecticides [4], measured in cell lines of the silkworm Bombyx mori, a lepidopteran species, by linear regression techniques: the multiple linear regression (MLR) and the partial least squares (PLS) methods.

Definition of target property and molecular structures
A set of 33 dibenzoylhydrazine ecdysone agonists with known biological activity was analyzed in this study. The ecdysone agonistic activity data [4], expressed as pEC50 values (where EC50 represents the concentration at which 50% of the maximum response is achieved), were used as the dependent variable.
In the first step, the structures of the investigated molecules were pre-optimized using the MMFF94 molecular mechanics force field included in the MarvinSketch package (MarvinSketch 15.2.16.0, ChemAxon Ltd., http://chemaxon.com). In the next step, the minimized structures were refined using the semiempirical PM7 Hamiltonian [5].

The Multiple Linear Regression (MLR) method
Because the number of computed descriptors (1412) is too high compared to the number of compounds (N = 33), a proper variable selection method was mandatory. The genetic algorithm (GA) is a reliable and widely used variable selection method [6]. GA is a stochastic algorithm that solves optimization problems defined by fitness criteria, mimicking Darwinian evolution through genetic operators such as crossover and mutation. The QSARINS v. 2.1 program [7] uses GAs to choose the meaningful descriptors that influence the variation of the biological activity of the compounds. The following parameters were employed: the RQK fitness function [8] with the leave-one-out cross-validation [9] correlation coefficient as the constrained function to be optimized, a crossover/mutation trade-off parameter of T = 0.5, and a model population size of P = 50.
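The idea of GA-based descriptor selection can be illustrated with a minimal sketch. It is not the QSARINS implementation: the function names `q2_loo` and `ga_select`, the plain leave-one-out q2 fitness (instead of the RQK function), and the use of `p_mut` as a loose stand-in for the trade-off parameter T are all assumptions made for illustration.

```python
import numpy as np

def q2_loo(X, y):
    """Leave-one-out q^2 of an OLS model with intercept (closed form via the hat matrix)."""
    A = np.column_stack([np.ones(len(y)), X])
    H = A @ np.linalg.pinv(A)                      # hat matrix
    loo_resid = (y - H @ y) / (1.0 - np.diag(H))   # residual of each left-out compound
    return 1.0 - np.sum(loo_resid ** 2) / np.sum((y - y.mean()) ** 2)

def ga_select(X, y, k=3, pop_size=50, n_gen=30, p_mut=0.5, seed=0):
    """Toy GA: evolve k-descriptor subsets, keeping the fitter half each generation."""
    rng = np.random.default_rng(seed)
    n_desc = X.shape[1]

    def fitness(subset):
        return q2_loo(X[:, list(subset)], y)

    def normalise(cols):
        return tuple(sorted(int(c) for c in cols))

    pop = [normalise(rng.choice(n_desc, size=k, replace=False)) for _ in range(pop_size)]
    history = []                                   # best q2_LOO per generation
    for _ in range(n_gen):
        ranked = sorted(set(pop), key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        parents = ranked[: max(1, len(ranked) // 2)]   # elitism: the best subset survives
        children = []
        while len(parents) + len(children) < pop_size:
            if rng.random() < p_mut:
                # mutation: replace one descriptor of a random parent
                child = list(parents[rng.integers(len(parents))])
                child[rng.integers(k)] = rng.integers(n_desc)
            else:
                # crossover: draw k descriptors from the union of two parents
                a, b = rng.integers(len(parents), size=2)
                pool = sorted(set(parents[a]) | set(parents[b]))
                child = rng.choice(pool, size=k, replace=False)
            if len(set(int(c) for c in child)) == k:   # discard subsets with duplicates
                children.append(normalise(child))
        pop = parents + children
    best = max(pop, key=fitness)
    return best, fitness(best), history
```

Because the fitter half of the population is always retained, the best q2 value cannot decrease from one generation to the next. With 1412 descriptors, a production run would need more generations than this toy default.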

The Partial Least Squares (PLS) method
Projection to latent structures (PLS) is a regression technique that models the relationship between dependent and independent variables through their projections onto latent variables. In this approach, a block (or a column) of response variables is linked to a block of explanatory variables [10]. The relationship between the dependent and independent variables is described by latent variables [11]. The PLS approach yields stable, correct and highly predictive models even for correlated descriptors [12]. In this work, PLS calculations were performed using the SIMCA package (SIMCA P+ 12.0.0.0, May 20 2008, Umetrics, Sweden, http://www.umetrics.com/). The QSAR matrix (of dependent and independent variables) was first analyzed by principal component analysis (PCA) [10] and subsequently by the PLS approach. The squared correlation coefficient, $r^2$, and the squared cross-validated correlation coefficient, $q^2$, are the most informative statistical parameters for measuring the quality and validity of the final PLS model, while the Variable Importance in the Projection (VIP) values and the signs of the variables' coefficients are more relevant in explaining the activity mechanism. The significant principal components were selected using 7 cross-validation groups.
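To make the latent-variable idea concrete, the following sketch fits a single-response PLS model with the classical NIPALS algorithm. It is an illustration only and not the SIMCA implementation: the function name `pls1_fit` is hypothetical, the data are only mean-centred (any autoscaling is assumed to have been done beforehand), and the number of components is fixed by hand rather than chosen by cross-validation.

```python
import numpy as np

def pls1_fit(X, y, n_components=2):
    """Fit PLS1 (one response) by NIPALS; returns (beta, intercept) in the original X space."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean          # centred working copies
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                        # weight vector: direction of max covariance with y
        w = w / np.linalg.norm(w)
        t = Xk @ w                           # scores (latent variable)
        tt = t @ t
        p = Xk.T @ t / tt                    # X loadings
        c = yk @ t / tt                      # y loading
        Xk = Xk - np.outer(t, p)             # deflate X
        yk = yk - c * t                      # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    beta = W @ np.linalg.solve(P.T @ W, q)   # regression coefficients: W (P'W)^-1 q
    intercept = y_mean - x_mean @ beta
    return beta, intercept
```

Predictions are then obtained as `X_new @ beta + intercept`. The fit works even when there are more descriptors than compounds, because only a few latent variables are extracted.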

Model validity
The dibenzoylhydrazine derivatives were divided into training and test sets by random split: 27% of the compounds (no. 3, 4, 9, 10, 11, 16, 18, 28 and 29) were taken out as the test set, while the remaining 73% were used as the training set. The model's predictability was tested using the external validation parameters Q [15] and the concordance correlation coefficient (CCC) [16] (with threshold values higher than 0.85, as rigorously determined by a simulation study [17]).
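The split arithmetic can be checked with a short sketch. The helper `random_split` and its seed are hypothetical and will not reproduce the particular compounds drawn in the study; the list `reported_test` simply restates the compound numbers given in the text.

```python
import random

def random_split(n_compounds, n_test, seed=0):
    """Return (train_ids, test_ids) as sorted 1-based compound numbers."""
    ids = list(range(1, n_compounds + 1))
    test = sorted(random.Random(seed).sample(ids, n_test))
    train = [i for i in ids if i not in test]
    return train, test

# Test-set compound numbers as reported in the text
reported_test = [3, 4, 9, 10, 11, 16, 18, 28, 29]
test_fraction = len(reported_test) / 33      # 9 of 33 compounds, i.e. about 27%

train_ids, test_ids = random_split(33, len(reported_test))
```

With 9 of the 33 compounds held out, 24 compounds remain for training.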
The predictive power of the QSAR models was also evaluated using the predictive parameter $r_m^2$ (with a minimum accepted threshold value of 0.5) [18].
The Y-randomization test is a commonly used technique that assesses the robustness of a QSAR model, being a measure of model overfit. The dependent variable (biological activity) is randomly shuffled and a QSAR model is built using the same X matrix of molecular descriptors. The MLR and PLS models obtained in this way (after 999 randomizations) must have minimal $r^2$ and $q^2$ values [19].
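A minimal sketch of the Y-randomization procedure for an ordinary least-squares model is given below. The function names `r2_ols` and `y_randomization` are hypothetical, and only the fitted $r^2$ is recomputed here; the cross-validated $q^2$ would be scrambled in the same way.

```python
import numpy as np

def r2_ols(X, y):
    """Squared correlation coefficient of an OLS fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def y_randomization(X, y, n_trials=999, seed=0):
    """Refit the model n_trials times on shuffled activities; X stays unchanged."""
    rng = np.random.default_rng(seed)
    return np.array([r2_ols(X, rng.permutation(y)) for _ in range(n_trials)])
```

A robust model shows scrambled $r^2$ values that are clearly lower than the $r^2$ of the model built on the true activities.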
Data overfitting and model applicability were checked by comparing the root-mean-square errors (RMSE) and the mean absolute errors (MAE) of the training and validation sets [20].
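The error measures named in this section can be written compactly as follows. This is a sketch using the usual textbook definitions (Lin's form of the concordance correlation coefficient); the function names are illustrative.

```python
import math

def rmse(y_obs, y_pred):
    """Root-mean-square error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(y_obs, y_pred)) / len(y_obs))

def mae(y_obs, y_pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(y_obs, y_pred)) / len(y_obs)

def ccc(y_obs, y_pred):
    """Concordance correlation coefficient between observed and predicted values."""
    n = len(y_obs)
    mean_o, mean_p = sum(y_obs) / n, sum(y_pred) / n
    s_op = sum((o - mean_o) * (p - mean_p) for o, p in zip(y_obs, y_pred))
    s_oo = sum((o - mean_o) ** 2 for o in y_obs)
    s_pp = sum((p - mean_p) ** 2 for p in y_pred)
    return 2.0 * s_op / (s_oo + s_pp + n * (mean_o - mean_p) ** 2)
```

Similar RMSE and MAE values for the training and test sets suggest that the model is not overfitted, while CCC penalizes both scatter and systematic shift of the predictions.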

MLR analysis
The data were normalized using the autoscaling method:

$$XT_{mj} = \frac{X_{mj} - \bar{X}_m}{S_m}$$

where, for each variable m, $XT_{mj}$ and $X_{mj}$ are the j-th values of variable m after and before scaling, respectively, $\bar{X}_m$ is the mean, and $S_m$ is the standard deviation of the variable.
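The autoscaling step can be sketched as follows. The function name `autoscale` is hypothetical, and the use of the sample standard deviation (`ddof=1`) is an assumption, since the text does not state which estimator was used; constant descriptors (zero standard deviation) are assumed to have been removed beforehand.

```python
import numpy as np

def autoscale(X):
    """Centre each descriptor column to zero mean and scale it to unit standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)      # assumption: sample standard deviation
    return (X - mean) / std, mean, std
```

The returned `mean` and `std` of the training set can be reused to scale test compounds as `(X_test - mean) / std`, so that no information from the test set enters the model.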
Several MLR models were built after variable selection, which was carried out by the genetic algorithm. The fitting and predictivity criteria for these models are presented in Tables 2 and 3.

Abbreviations: $q^2_{LOO}$, leave-one-out correlation coefficient; $q^2_{LMO}$, leave-more-out correlation coefficient; $r^2_{adj}$, adjusted correlation coefficient; $RMSE_{tr}$, root-mean-square error; $MAE_{tr}$, mean absolute error; $CCC_{tr}$, concordance correlation coefficient.

The Williams plot is used to identify compounds with the greatest structural influence ($h_i > h^*$; $h_i$ = leverage of a given chemical; $h^*$ = the warning leverage) in the QSAR model.
The Williams plot for the training set, presented in Figure 2 (for the MLR1 model), establishes the applicability domain of the models within ±2.5σ and a leverage threshold h* of 0.500. Figure 2 suggests that all the compounds in the dataset are within the applicability domain of the models.
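The quantities behind a Williams plot can be sketched as follows. The warning leverage is taken here as h* = 3(p + 1)/n, a common definition and an assumption in this sketch; it is consistent with the reported h* of 0.500 for three descriptors and 24 training compounds. The function names are hypothetical.

```python
import numpy as np

def leverages(X):
    """Diagonal of the hat matrix for a model with intercept."""
    A = np.column_stack([np.ones(X.shape[0]), X])
    H = A @ np.linalg.pinv(A)
    return np.diag(H)

def warning_leverage(n_compounds, n_descriptors):
    """Assumed definition: h* = 3 (p + 1) / n."""
    return 3.0 * (n_descriptors + 1) / n_compounds

def in_domain(h, std_resid, h_star, limit=2.5):
    """True for compounds with leverage <= h* and |standardized residual| <= limit."""
    return (np.asarray(h) <= h_star) & (np.abs(np.asarray(std_resid)) <= limit)

h_star = warning_leverage(24, 3)     # three descriptors, 24 training compounds
```

A compound falls outside the applicability domain if its leverage exceeds h* (structurally influential) or if its standardized residual lies beyond ±2.5σ (response outlier).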
The y-scrambling test indicates the robustness of a QSAR model, being a measure of model overfit. The robustness of the developed models is confirmed by the significantly low scrambled $r^2$ ($r^2_{scr}$) and cross-validated $q^2$ ($q^2_{scr}$) values obtained for 999 trials. Figure 3 suggests that, for all the randomized models, the values of $r^2_{scr}$ and $q^2_{scr}$ were < 0.5.
In the present study, the best model, MLR1, has three descriptors. A higher or lower number of molecular descriptors does not have any significant effect on the model's accuracy.
Additionally, the predictive $r^2$ (leave-one-out, $q^2_{LOO}$, and leave-more-out, [...]
An intercorrelation analysis of the selected molecular descriptors from the MLR1 model is presented in Table 5. The three selected descriptors are not intercorrelated. The statistical results and intercorrelation coefficients presented above confirm that the MLR method, combined with a proper variable selection procedure, generates an efficient QSAR model for predicting the ecdysone agonistic activity of dibenzoylhydrazine insecticides.

PLS analysis
A PCA model was built using the SIMCA-P+ version 12.0 software for the entire X matrix, which includes N = 33 compounds and X = 1412 molecular descriptors. Of the 7 significant principal components resulting from this analysis, the first three already explained 65.5% of the information content of the descriptor matrix. PLS calculations were also performed using the same training and test sets as for the MLR models. The statistical results of the PLS model:

Conclusion
A series of dibenzoylhydrazine insecticides with ecdysone agonistic activity measured in cell lines of the silkworm Bombyx mori, a lepidopteran species, was investigated using linear regression methods. After structure optimization using the semiempirical quantum chemical PM7 approach, the calculated descriptors were related to the insecticidal activity using the multiple linear regression and partial least squares approaches. The final model of dibenzoylhydrazine non-steroidal ecdysone agonists obtained using the MLR method has good statistical parameters. Molecular descriptors related to molecular flexibility, to sigma and pi bonding patterns in the molecules, and to geometrical descriptors invariant to translation and rotation, which contain electronic and topological information, influenced the insecticidal activity. PLS modeling of the same data gave worse statistical results and a less predictive model than the MLR approach.
[...] y-scrambling parameters; SEE, standard error of estimate; F, Fisher test.

Figure 1. Experimental versus predicted pEC50 values for the MLR1 model, as predicted by the model (left) and by the leave-one-out cross-validation approach (right) (yellow circles: training compounds; blue circles: test compounds).
[...] cumulative sum of squares of all the X and Y values, respectively, explained by all extracted principal components; $Q^2_{(CUM)}$ is the fraction of the total variation of the Y values that can be predicted for all the A extracted principal components in the cross-validation procedure (7 rounds) used to establish the number of significant principal components, A. The noise variables were excluded from this model, and a robust PLS-M2 model (N = 24 and X = 27) with two latent variables (Tables 2, 3 and 4) was obtained. Although the PLS-M2 model contains only descriptors with coefficients significantly different from zero, it has poorer statistical results and predictive power than the MLR1 model.

Table 1.
The SMILES notation of the dibenzoylhydrazine structures and their experimental (pEC50) and predicted (pEC50 pred) insecticidal activity values obtained using the MLR/PLS models

Table 2
Fitting and cross-validation parameters of the MLR models (training set)*

Table 3
Predictivity criteria calculated for the MLR models (test set)*

Table 5.
Correlation matrix of the selected descriptors included in the MLR1 model