Intrinsic Dimensionality of Chemical Space: Characterization and Applications

One popular method for representing and characterizing chemical structures is through computed mathematical descriptors. Such descriptors, often called molecular descriptors, quantify different aspects of molecular structure, viz., size, shape, branching, cyclicity, bonding patterns, etc. Applications of discrete mathematics in the development of molecular descriptors began in the middle of the twentieth century, and the trend continues unabated today. While only a few descriptors could be calculated in the 1970s, currently available software can calculate a large number of descriptors for molecules or biomolecules such as DNA/RNA and proteins. When p molecular descriptors are calculated for n molecules, the data set can be viewed as n vectors in p dimensions, each chemical being represented as a point in R^p. Because many of the descriptors are strongly correlated, the n points in R^p will lie on, or close to, a subspace of dimension lower than p. Methods like principal components analysis (PCA) can be used to characterize the intrinsic dimensionality of chemical spaces. Since the early 1980s, Basak et al. have carried out PCA of various congeneric and diverse data sets relevant to new drug discovery and predictive toxicology. Principal components (PCs) derived from mathematical chemodescriptors have been used in the formulation of quantitative structure-activity relationships (QSARs), clustering of large combinatorial libraries, and quantitative molecular similarity analysis (QMSA).
This presentation will review the results of PCA carried out by Basak and coworkers from the early 1980s to the present in the characterization and visualization of chemical spaces, with special reference to five data sets, both congeneric and structurally diverse: 1) A large and structurally diverse set of 3,692 chemicals which was a subset of the Toxic Substances Control Act (TSCA) Inventory maintained by the United States Environmental Protection Agency (USEPA), 2) A set of 74 alkanes, 3) A virtual library of 248,832 psoralen derivatives, 4) A congeneric set of 95 aromatic and heteroaromatic amine mutagens, and 5) A structurally diverse collection of 508 chemical mutagens.

SciForum Mol2Net, 2015, 1(Section B), pages 1-10, Proceedings http://sciforum.net/conference/mol2net-1


Introduction
Mathematical chemistry, or more correctly discrete mathematical chemistry, had its beginning in the middle of the twentieth century, probably with the publication of the seminal paper by Harry Wiener [1] on the calculation of structural indices for the prediction of molecular properties. Although representation of chemical species by graphs was explored by Sylvester [2] as early as 1878, the characterization of molecular structure by graph invariants has made great strides during the past half century or so [3][4][5][6][7][8][9][10][11][12][13][14][15][16] following the seminal work of Wiener [1]. Invariants of graphs associated with molecules and biomolecules quantify certain aspects of their structure and have been used in the characterization and comparison of such structures as well as in the prediction of their properties [4,17,18,19]. Specifically, such invariants and orthogonal factors like principal components (PCs) derived from them have found applications in quantitative structure-activity relationship (QSAR) studies [3][4][5][6][7][8][9][10][11][12][13][14][15][20], quantitative molecular similarity analysis (QMSA) research [21][22][23][24], clustering of large libraries of structures into smaller subsets [23,24], and the discrimination of pathological structures like isospectral graphs [17].
One of the authors of this paper (Basak) has been involved since the early 1970s in the development of novel numerical graph invariants or topological indices (TIs) [6,7,11,25,26,27] as well as biodescriptors derived from DNA/RNA sequences [28] and proteomics maps [29]. Basak's research [20], carried out with colleagues at the University of Calcutta, India, in the 1970s, involved mainly the formulation of QSARs of congeneric sets of chemicals using their own information theoretic indices and topological indices developed by Bonchev & Trinajstić [4,5], Randić [9][10][11][12], and Kier and Hall [3], as well as physical properties like van der Waals volume and calculated or experimental hydrophobicity (log P, octanol-water) [20].
In the early 1980s, after Basak joined the University of Minnesota Duluth, the software POLLY [30] was developed and large-scale calculation of TIs for QSAR and QMSA analyses was initiated. In one of the earliest studies of its kind, Basak et al. [31] used POLLY to calculate ninety TIs for a collection of 3,692 structurally diverse chemicals which was a subset of the Toxic Substances Control Act (TSCA) Inventory of the United States Environmental Protection Agency (USEPA). The authors carried out principal components analysis (PCA) on this data set and asked the question: What is the intrinsic dimensionality of chemical structure measured by the large number of TIs? This line of research, i.e., PCA and the use of principal components (PCs) derived from different collections of TIs calculated by POLLY [30], MolConnZ [32], Triplet [33,34], and APProbe [35] in QSAR and QMSA, has continued to this day. This paper summarizes the results and the lessons learned from a few of these studies using both congeneric and structurally diverse sets of chemicals, viz., 1) The large and structurally diverse set of 3,692 chemicals mentioned above, 2) A data set of 74 alkanes, 3) A virtual library of 248,832 psoralen derivatives, 4) A congeneric set of 95 aromatic and heteroaromatic amine mutagens, and 5) A structurally diverse collection of 508 chemical mutagens.

Results and Discussion
2.1 A large and structurally diverse set of 3,692 chemicals
For this data set, 90 TIs were calculated by the POLLY [30] software and PCA was performed. For the list of the particular TIs calculated for this study, see Basak et al. [21,31]. The results showed that the first ten PCs, those with eigenvalues greater than or equal to 1, explained 92.6% of the variance in the data, and PC1-PC4 explained 78.3% of the variation in the original variables. Table 1 gives the correlation profiles of the original variables (TIs) with the first four PCs. It is evident from Table 1 that PC1 is strongly correlated with those indices which are related to the size of chemicals. It is noteworthy that for the set of 3,692 chemicals PC1 was also highly correlated (r = 0.81) with molecular weight. PC2 may be interpreted as an axis of molecular complexity as encoded by the higher order information theoretic indices [27]. PC3 is most highly related to the cluster/path-cluster type molecular connectivity indices, which quantify information regarding molecular branching. The data in Table 1 clearly show that PC4 is strongly correlated with the cyclicity terms of the connectivity type.
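The PCA workflow described above (retain PCs with eigenvalue of at least 1, report cumulative variance explained, then inspect the correlations of descriptors with each PC) can be sketched as follows. This is a minimal illustration on synthetic data, not the original POLLY-based analysis: the descriptor matrix, the number of latent factors, and the function name `pca_on_correlation` are all assumptions made for the example.

```python
import numpy as np

def pca_on_correlation(X):
    """PCA of the descriptor correlation matrix (rows: chemicals, columns: descriptors)."""
    # Standardize so that every descriptor has zero mean and unit variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    R = np.corrcoef(X, rowvar=False)          # p x p correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)      # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]         # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Z @ eigvecs                      # PC scores of each chemical
    # Correlation of descriptor j with PC k equals eigvecs[j, k] * sqrt(eigvals[k]).
    correlations = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    return eigvals, scores, correlations

# Synthetic stand-in for a descriptor table (hypothetical data):
# 200 "chemicals", 12 correlated descriptors driven by 3 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 12))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 12))

eigvals, scores, correlations = pca_on_correlation(X)
n_retained = int(np.sum(eigvals >= 1.0))          # eigenvalue >= 1 criterion
explained = np.cumsum(eigvals) / eigvals.sum()    # cumulative fraction of variance

print("PCs with eigenvalue >= 1:", n_retained)
print("Cumulative variance explained by first 4 PCs:", np.round(explained[:4], 3))
print("Descriptors most correlated with PC1:", np.argsort(-np.abs(correlations[:, 0]))[:3])
```

Working from the correlation matrix rather than the covariance matrix puts descriptors with very different numerical ranges on the same footing, and it is what makes the eigenvalue threshold of 1 meaningful, since each standardized descriptor contributes exactly one unit of variance.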

2.2 A data set of 74 alkanes
For boiling point estimation of alkanes, twenty-six TIs, total surface area (TSA), and volume (V) were calculated for a set of 74 alkanes [36]. Table 2 gives three different two-parameter regression models for the prediction of boiling point. The data in Table 2 indicate that the individual TIs, the PCs derived from them, and calculated physical properties such as volume and total surface area all give good QSARs for this congeneric set of molecules. The TIs and the PCs derived from them give slightly better models than the properties.
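A two-parameter regression on PCs of the kind summarized in Table 2 can be sketched as below. The data here are synthetic and purely illustrative: the six "descriptors", the two latent factors, the response, and the helper names `fit_two_parameter` and `first_two_pcs` are assumptions, and no boiling points or indices from [36] are reproduced.

```python
import numpy as np

def fit_two_parameter(x1, x2, y):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2; returns (coefficients, R^2, s)."""
    A = np.column_stack([np.ones(len(y)), x1, x2])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    s = np.sqrt((resid @ resid) / (len(y) - 3))   # standard error of the estimate
    return coef, r2, s

def first_two_pcs(X):
    """Scores on the first two PCs of the standardized descriptor matrix."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:2].T

# Hypothetical data: 74 "molecules", 6 correlated descriptors driven by two
# latent factors (loosely, "size" and "branching"), plus a synthetic response.
rng = np.random.default_rng(1)
n = 74
latent = rng.normal(size=(n, 2))
mixing = np.array([[1.0, 0.8, 1.2, 0.5, 0.9, 1.1],
                   [0.2, -0.7, 0.4, 1.0, -0.3, 0.6]])
X = latent @ mixing + 0.05 * rng.normal(size=(n, 6))
y = 100.0 + 20.0 * latent[:, 0] - 5.0 * latent[:, 1] + rng.normal(size=n)

pcs = first_two_pcs(X)
coef_pc, r2_pc, s_pc = fit_two_parameter(pcs[:, 0], pcs[:, 1], y)
coef_raw, r2_raw, s_raw = fit_two_parameter(X[:, 0], X[:, 1], y)

print("Model on PC1, PC2:          R^2 = %.3f, s = %.2f" % (r2_pc, s_pc))
print("Model on two raw descriptors: R^2 = %.3f, s = %.2f" % (r2_raw, s_raw))
```

Because the PCs are mutually orthogonal, the two regressors in the PC model are uncorrelated, which avoids the collinearity that arises when two strongly correlated TIs are entered together.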

2.3 A virtual library of 248,832 psoralen derivatives
A virtual library of 248,832 psoralen derivatives [23] was created and analyzed using PCs derived from TIs. For this study, a set of 92 topological indices was calculated by POLLY [30]. The set of TIs consisted of 37 topostructural and 55 topochemical indices.
We define topostructural indices as those invariants which are derived from simple (unweighted) molecular graphs. Such graphs do not distinguish among different types of bonds or atoms. The Wiener index; cluster, path-cluster, and simple connectivity indices; and path length indices are examples of topostructural parameters. Topochemical indices, in contrast, are indices defined on weighted molecular graphs such that the various types of atoms and bonds are weighted to reflect their nature and contribution to chemical bonding. The SIC, CIC, and IC indices as well as both bonding and valence connectivity indices are all examples of topochemical indices. For this data set, the top 3 PCs explained 89.2% of the variance in the data; the first 6 PCs explained 95.5% of the variance of the original calculated indices. The PCs were used to cluster the large set of chemicals into a smaller subset as an exercise in managing the combinatorial explosion that can occur in drug design when one wants to create a large pool of derivatives of a lead compound. For details of the outcome of clustering of the 248,832 psoralen derivatives, please see [23].
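To make the notion of a topostructural index concrete, the sketch below computes two of the indices named above, the Wiener index and the simple first order connectivity index (1χ), from unweighted hydrogen-suppressed graphs. The adjacency-dictionary representation and the function names are illustrative choices, not the POLLY implementation.

```python
from collections import deque
from math import sqrt

def wiener_index(adj):
    """Sum of shortest-path distances over all unordered pairs of vertices."""
    total = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:                      # breadth-first search from source
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    return total // 2                     # each pair was counted twice

def connectivity_index(adj):
    """First order connectivity index: sum over edges of 1/sqrt(deg(i)*deg(j))."""
    return sum(1.0 / sqrt(len(adj[i]) * len(adj[j]))
               for i in adj for j in adj[i] if i < j)

# Hydrogen-suppressed carbon skeletons of the two C4H10 isomers.
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

for name, graph in [("n-butane", n_butane), ("isobutane", isobutane)]:
    print(name, wiener_index(graph), round(connectivity_index(graph), 3))
```

Both indices take a lower value for the branched isomer (Wiener index 9 versus 10), even though the two molecules have the same number of atoms. Because the graphs carry no atom or bond weights, replacing a carbon by a heteroatom would leave both values unchanged, which is precisely what topochemical indices are designed to capture.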
Thus, even for this large but congeneric set of 248,832 psoralen derivatives, the first 6 PCs explained 95.5% of the variance of the original calculated indices.
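The idea of clustering a library in PC space and keeping one representative per cluster can be sketched as follows. The clustering method used in [23] is not described here, so plain k-means is swapped in as an assumption; the PC scores are synthetic, and `kmeans` and `representatives` are hypothetical helper names.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Plain k-means (Lloyd's algorithm); returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center; keep the old one if a cluster became empty.
        new_centers = np.array([points[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1), centers

def representatives(points, labels, centers):
    """For each non-empty cluster, the index of the member nearest its center."""
    reps = {}
    for j, center in enumerate(centers):
        members = np.flatnonzero(labels == j)
        if members.size:
            d = np.linalg.norm(points[members] - center, axis=1)
            reps[j] = int(members[d.argmin()])
    return reps

# Hypothetical PC scores: 300 "structures" described by their first 6 PCs.
rng = np.random.default_rng(2)
offsets = rng.normal(scale=5.0, size=(3, 6))
pc_scores = np.vstack([offsets[i] + rng.normal(size=(100, 6)) for i in range(3)])

labels, centers = kmeans(pc_scores, k=3)
reps = representatives(pc_scores, labels, centers)
print("Cluster sizes:", np.bincount(labels, minlength=3))
print("Representative structure per cluster:", reps)
```

Clustering on a handful of PCs rather than on all of the original indices keeps the distance calculation cheap and prevents groups of highly correlated indices from dominating the distances.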

2.4 A homogeneous set of aromatic amines
<Data description>
For the 95 aromatic amine set, PC1 is correlated with different original variables including some triplet indices and some indices related to molecular size. PC2 is most strongly correlated with the heat of formation, ∆Hf (r = 0.75); the energy of the highest occupied molecular orbital, EHOMO (r = -0.52), and ∆Hf (r = -0.50) are most highly correlated with PC3, whereas PC4 is strongly correlated with fw (r = -0.76), which is the molecular weight of the chemical species.
For the diverse 508 chemical mutagen set, the energy of the highest occupied molecular orbital, EHOMO (r = 0.96), is most strongly correlated with PC4; PC3 is highly correlated with ∆Hf (r = -0.62), which is also negatively correlated with PC2 (r = -0.77); PC1 is loaded with some triplet indices and those invariants which reflect molecular size.

Conclusions
In this paper, we reviewed more than three decades of our research on the use of topological indices and principal components analysis in the characterization of five data sets: 1) A large and structurally diverse set of 3,692 industrial chemicals which was a subset of the Toxic Substances Control Act (TSCA) Inventory of the United States Environmental Protection Agency (USEPA), 2) A data set of 74 alkanes, 3) A virtual library of 248,832 psoralen derivatives, 4) A congeneric set of 95 aromatic and heteroaromatic amine mutagens, and 5) A structurally diverse collection of 508 chemical mutagens.
The results show that the PCs derived from the TIs can be used for the development of QSARs, as exemplified in Table 2 with 74 alkanes. PCs derived from TIs have been used in the clustering of a large set of psoralens [23]. Basak et al. also used both PCs and individual TIs for analog selection [24] and characterization of isospectral graphs [17]. The data presented here show the usefulness of TIs and PCs derived from them in the clustering and characterization of chemical libraries as well as in QSAR. Details of QMSA analyses using PCs derived from TIs are not given here for brevity.
Basak [37] recently noted: "Mathematical chemistry or more accurately discrete mathematical chemistry had a tremendous growth spurt in the second half of the twentieth century and the same trend is continuing now. This growth was fueled primarily by two major factors: 1) Novel applications of discrete mathematical concepts to chemical and biological systems, and 2) Availability of high speed computers and associated software whereby hypothesis driven as well as discovery oriented research on large data sets could be carried out in a timely manner. This led to the development of not only a plethora of new concepts, but also to various useful applications to such important areas as drug discovery, protection of human as well as ecological health, and chemoinformatics. Following the completion of the Human Genome Project in 2003, discrete mathematical methods were applied to the 'omics' data to develop descriptors relevant to bioinformatics, toxicoinformatics, and computational biology."
Initially, TIs were used for the discrimination of structure and QSAR studies of congeneric and small sets of structures. For example, Randić's [9] first order connectivity index (1χ), the information theoretic indices developed by Bonchev and Trinajstić [38], and those developed by Raychaudhury et al. [7] were used to discriminate the set of alkanes, and they worked well in those cases. In the case of the 18 octanes, the molecules do not vary from each other with respect to size, but primarily in terms of branching patterns. Therefore, the indices [7,9,38] were interpreted based on the data as reflecting molecular branching. But when PCA was carried out with a diverse set of 3,692 chemical structures, the results entered uncharted territory and were counterintuitive, to say the least. As shown by the correlation of the original variables with PC1, 1χ and related indices were now strongly correlated with molecular size in the large and diverse set, not with molecular branching. PC3 emerged as the axis representing

2.5 A diverse set of 508 chemicals
<Data description>

Table 4: Top 10 variables behind each of first 4 PCs (loadings in brackets) for 508

Table 1: Correlation of the first four PCs with the original variables including topological indices.

Table 2: Results of three 2-parameter models in predicting the boiling points of 74 alkanes

Table 3: Top 10 variables behind each of first 4 PCs (loadings in brackets) for 95 compound dataset