Intrinsic dimensionality of chemical space: Characterization and applications
1  University of Minnesota Duluth-Natural Resources Research Institute, 5013 Miller Trunk Highway, Duluth, MN 55811, USA
2  University of Minnesota Duluth-Natural Resources Research Institute & Department of Chemistry and Biochemistry, 5013 Miller Trunk Highway, Duluth, MN 55811, USA
3  University of Minnesota Duluth-Natural Resources Research Institute and Department of Chemistry and Biochemistry, 5013 Miller Trunk Highway, Duluth, MN 55811, USA

Abstract:

One popular method for representing and characterizing chemical structures is the use of computed mathematical descriptors. Such descriptors, often called molecular descriptors, quantify different aspects of molecular structure, viz., size, shape, branching, cyclicity, bonding patterns, etc. Applications of discrete mathematics in the development of molecular descriptors began in the middle of the twentieth century, and the trend continues unabated today. While only a few descriptors could be calculated in the 1970s, currently available software can calculate a large number of descriptors for molecules and for biomolecules such as DNA, RNA, and proteins. When p molecular descriptors are calculated for n molecules, the data set can be viewed as n vectors in p dimensions, each chemical being represented as a point in R^p. Because many of the descriptors are strongly correlated, the n points in R^p lie, at least approximately, in a subspace of dimension lower than p. Methods like principal components analysis (PCA) can be used to characterize the intrinsic dimensionality of chemical spaces. Since the early 1980s, Basak et al. have carried out PCA of various congeneric and diverse data sets relevant to new drug discovery and predictive toxicology. Principal components (PCs) derived from mathematical chemodescriptors have been used in the formulation of quantitative structure-activity relationships (QSARs), the clustering of large combinatorial libraries, and quantitative molecular similarity analysis (QMSA).
This presentation will review the results of PCA carried out by Basak and coworkers from the early 1980s to the present in the characterization and visualization of chemical spaces, with special reference to three data sets: 1) a large and structurally diverse set of 3,692 chemicals, a subset of the Toxic Substances Control Act Inventory of the United States Environmental Protection Agency (USEPA); 2) a data set of 74 alkanes; and 3) a virtual library of 248,832 psoralen derivatives.
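
The idea that n molecules described by p correlated descriptors occupy a lower-dimensional subspace of R^p can be illustrated with a minimal sketch. The Python code below is not the authors' procedure and uses none of the three data sets named above. It assumes a synthetic n x p matrix built from three hypothetical latent factors, standardized descriptors (PCA on the correlation matrix), and an arbitrary 95% cumulative-variance cutoff for counting components. The function names are illustrative only.

```python
import numpy as np

def pca_explained_variance(X):
    """Fraction of total variance explained by each principal component."""
    # Standardize each descriptor (column): PCA then acts on the correlation matrix.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Squared singular values of Z, scaled by (n - 1), are the eigenvalues
    # of the correlation matrix, returned in descending order.
    s = np.linalg.svd(Z, compute_uv=False)
    eigvals = s**2 / (X.shape[0] - 1)
    return eigvals / eigvals.sum()

def intrinsic_dimension(X, threshold=0.95):
    """Smallest number of PCs whose cumulative variance reaches the threshold.

    The 0.95 default is an assumed, arbitrary cutoff chosen for illustration.
    """
    cumulative = np.cumsum(pca_explained_variance(X))
    return int(np.searchsorted(cumulative, threshold) + 1)

# Synthetic stand-in for a descriptor matrix: n "molecules", p "descriptors".
rng = np.random.default_rng(0)
n, p, k = 200, 10, 3

# Each hypothetical latent factor drives one block of correlated descriptors.
loadings = np.zeros((k, p))
loadings[0, 0:4] = [1.0, 2.0, -1.5, 0.5]
loadings[1, 4:7] = [1.0, -1.0, 2.0]
loadings[2, 7:10] = [0.5, 1.5, -2.0]

latent = rng.normal(size=(n, k))
noise = 0.01 * rng.normal(size=(n, p))
X = latent @ loadings + noise  # n points in R^p, close to a k-dimensional subspace

dim = intrinsic_dimension(X)
print(f"{p} descriptors, estimated intrinsic dimension: {dim}")
```

In this synthetic setup the ten descriptors collapse to three principal components, matching the three latent factors used to generate them. With real descriptor data the count depends on the chosen cutoff and on how the descriptors are scaled.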

Keywords: Computed mathematical descriptors; Principal components analysis (PCA); Homogeneous set; Diverse set; Psoralen derivatives; Alkanes