Computational Study of Mycobacterial Promoters with Low Sequence Homology

This communication presents a classification model for predicting mycobacterial promoter sequences (mps), which pose a very low sequence homology problem. The model (mps = −4.664·⁰ξM + 0.991·¹ξM − 2.432) was developed to predict whether a naturally occurring sequence is an mps on the basis of the ᵏξM values calculated for the corresponding RNA secondary structure. The model correctly classified 115/135 mps (85.2%) and 100% of the control sequences (cs). The detailed results have been published in Bioorg Med Chem Lett. 2006 Feb;16(3):547-53; the present text is a short communication.


SciForum http://sciforum.net/conference/mol2net-1

Harshey and Ramakrishnan stated that Mycobacteria have a low transcription rate and a low RNA content per unit DNA, and that their genomes are rich in guanine and cytosine (G + C). Given that the G + C content of a genome affects codon usage and promoter recognition sites, Nakayama et al. and Ohama et al. predicted that the transcription and translation signals in Mycobacteria may differ from those in other bacteria such as E. coli. Therefore, understanding the factors responsible for the low level of transcription and the possible mechanisms of regulation of gene expression in Mycobacteria requires examination of the structure of mycobacterial promoter sequences (mps) and their transcription machinery, including information concerning the RNA macromolecules involved. Unfortunately, mps show very low sequence homology, so mathematical methods that assign biological activity on the basis of sequence alignment are of little practical use in this case.

Different mathematical methods have been used for the analysis of genome information. The group of Professor Grau has reported results on genome algebras. Markov models are also well-known tools for analyzing biological sequence data. However, no advances have been reported on treating this macromolecular structure-activity problem from the point of view of the corresponding RNA structure. One realistic way to address the problem is to analyse structure-activity relationships for naturally occurring RNA macromolecules, synthetic polymers and small molecules in general with Markov molecular descriptors. For this reason, one may expect higher success for classical molecular indices in branched biomacromolecules. It must be remembered, however, that the most commonly known branched biomacromolecule is the RNA secondary structure as described by Mathews and Zuker.
Researchers worldwide have reported increasing interest in the characterization of biomacromolecules, particularly RNA macromolecular structure, by computational techniques. In this context, we propose that 2D-RNA-QSAR is a promising field within biomacromolecular research. New analogues of our stochastic molecular descriptors, which have been widely applied to small molecules and biomacromolecules, are introduced here for the RNA secondary structure. Two preliminary QSAR studies of RNA secondary structure have also been published, but these focus only on local properties of a single RNA molecule. Consequently, the main aim of the present paper is to introduce into RNA-QSAR studies the Markov electrostatic potentials (ᵏξM) previously used for protein QSAR. Specifically, we intend to predict whether a naturally occurring DNA sequence is an mps on the basis of the ᵏξM values calculated for the secondary structure of its putative RNA. A more specific but still important aim of this work is therefore to introduce a novel approach to predicting mps. This work has led to the first 2D-RNA-QSAR that discriminates between two groups comprising several RNA macromolecules, including 135 mycobacterial promoters and 450 control sequences.
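The ᵏξM descriptors are indexed by the topological distance k between nucleotides in the folded RNA. As a minimal sketch of what that index means, the Python below builds a graph from a secondary structure and lists the nucleotide pairs found at a given distance. The dot-bracket input format, the function names and the toy hairpin are assumptions made for illustration; this is not the authors' software, and it does not compute ᵏξM itself.

```python
from collections import deque


def structure_graph(dot_bracket):
    """Return adjacency sets for an RNA secondary structure.

    Assumption: the structure is given in dot-bracket notation, e.g. "((...))".
    Edges are backbone (covalent) bonds (i, i+1) and base pairs (hydrogen bonds).
    """
    n = len(dot_bracket)
    adj = {i: set() for i in range(n)}
    for i in range(n - 1):  # covalent backbone
        adj[i].add(i + 1)
        adj[i + 1].add(i)
    open_positions = []
    for i, symbol in enumerate(dot_bracket):
        if symbol == "(":
            open_positions.append(i)
        elif symbol == ")":
            if not open_positions:
                raise ValueError("unbalanced ')' at position %d" % i)
            j = open_positions.pop()
            adj[i].add(j)
            adj[j].add(i)
        elif symbol != ".":
            raise ValueError("unexpected symbol %r" % symbol)
    if open_positions:
        raise ValueError("unbalanced '(' in structure")
    return adj


def pairs_at_distance(adj, k):
    """List nucleotide pairs (i, j), i < j, whose shortest-path distance is k."""
    pairs = []
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:  # breadth-first search, not expanded beyond depth k
            u = queue.popleft()
            if dist[u] == k:
                continue
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        pairs.extend((start, j) for j, d in dist.items() if d == k and j > start)
    return sorted(pairs)


hairpin = structure_graph("((...))")  # toy 7-nucleotide hairpin, not a real mps
print(pairs_at_distance(hairpin, 1))  # backbone neighbours plus the two base pairs
print(pairs_at_distance(hairpin, 3))
```

Treating covalent and hydrogen bonds as edges of equal length follows the text's description of the first-order term; any weighting of the edges would be a further modelling choice.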

Results and Discussion
Several authors have studied the mycobacterial promoter sequence problem from the point of view of DNA. Linear Discriminant Analysis (LDA) was used to classify RNA macromolecules as mycobacterial promoter sequences (mps) or control group sequences (cs). In the development of the LDA model, the output was a dummy variable, mps, which encodes whether a sequence belongs to the mps class (mps = 1) or to the cs group (mps = 0). The inputs were the Markov electrostatic potentials (ᵏξM) of interaction between nucleotides located at a topological distance k from each other within the 2D-RNA backbone, with k in the range [...]. The model developed was:

mps = −4.664·⁰ξM + 0.991·¹ξM − 2.432

In the statistics reported with the model, λ is Wilks' statistic, N is the number of RNA sequences studied, F is Fisher's statistic and p is the p-level (probability of error), which was <0.001. This p-level means that the hypothesis of overlapping groups can be rejected at the 5% error level. A high Matthews correlation coefficient (C = 0.903) was observed, which indicates a strong linear relationship between the structural descriptors of the biomacromolecules and the classification of the RNA sequences. The significance of the two variables (⁰ξM and ¹ξM) in the model was demonstrated by the stepwise analysis (see original work). Conversely, the second-order potential ²ξM does not have a significant relationship with the mps character of RNA sequences.

In physical terms, these results show that, as in other studies, there is a relationship between the electrostatic potential of the RNA molecule and its biological activity. In this case, however, not all electrostatic interactions affect the activity in the same way. The RNA-QSAR predicts that the model output for a sequence acting as an mps decreases by 4.664 per unit of electrostatic potential of the isolated nucleotides (⁰ξM). Conversely, the output increases by only 0.991 per unit of global electrostatic potential (¹ξM) arising from secondary-structure folding, that is, from direct covalent and/or hydrogen bonds between nucleotides. Finally, long-range electrostatic interaction potentials between nucleotides at distances greater than 1 (²ξM, ³ξM, ⁴ξM) do not correlate with mps activity. The detailed results of the forward stepwise analysis are given in the original work.
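As a minimal sketch, the two-variable model can be evaluated as shown below. The coefficients are those quoted in the text; the function names, the example descriptor values and the cutoff are hypothetical, since the text does not state the decision threshold applied to the LDA output.

```python
def mps_score(xi0, xi1):
    """Linear discriminant output for the quoted two-variable model.

    xi0 and xi1 stand for the zeroth- and first-order Markov electrostatic
    potentials of one RNA secondary structure (computed elsewhere).
    """
    return -4.664 * xi0 + 0.991 * xi1 - 2.432


def classify(xi0, xi1, cutoff=0.0):
    """Label a sequence 'mps' or 'cs'.

    Assumption: cutoff=0.0 is a placeholder; the source does not give the
    threshold, so it should be calibrated on labelled data before real use.
    """
    return "mps" if mps_score(xi0, xi1) > cutoff else "cs"


# Made-up descriptor values, used only to show the call pattern.
for xi0, xi1 in [(0.10, 3.50), (0.90, 1.20)]:
    print(xi0, xi1, round(mps_score(xi0, xi1), 3), classify(xi0, xi1))
```

Keeping the score and the decision rule in separate functions makes it easy to replace the placeholder cutoff without touching the published coefficients.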
Jack-knife cross-validation (cv) experiments were performed by the re-substitution technique, leaving out four different groups selected at random, each containing 25% of the RNA molecules. The cross-validation accuracies were cv1 = 95.9%, cv2 = 96.6%, cv3 = 96.6% and cv4 = 96.5%, with a reported average cross-validation accuracy of cv-average = 85.7 (the arithmetic mean of the four listed values is 96.4%). Testing the model's fit to the data and its robustness, although very important, is not the only requirement for an acceptable QSAR. The mps names, sequences, and training and cross-validation probabilities for all the RNAs used in this work are given in Table 2SM and Table 3SM of the supplementary material of the original work.

Finally, as far as the quality of the model is concerned, the present linear QSAR model compares very favourably in terms of simplicity (two variables: ⁰ξM and ¹ξM) with a previous non-linear model reported by Kalate et al. That non-linear model achieved only slightly higher accuracy (97%) but uses a very large parameter space to describe DNA sequences rather than RNA structure. The success of our RNA-QSAR model, which uses only two variables, can be explained by considering that RNA structure molecular descriptors encode not only sequence (as DNA linear sequence descriptors do) but also molecular branching. The present paper introduces the simplest method reported to date for predicting mycobacterial promoters. To this end, we changed the classical point of view and used RNA 2D-macromolecular descriptors instead of DNA sequence analysis. In this sense, this work opens a new way of applying classical QSAR approaches to biomacromolecules.
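The reported figures can be cross-checked with a few lines of Python. The sketch assumes that the counts given in the abstract (115 of 135 mps and all 450 cs classified correctly) form a single confusion matrix; the function name is illustrative. Under that assumption the Matthews coefficient comes out at about 0.903, in line with the reported C, and the mean of the four listed cross-validation accuracies can be recomputed the same way.

```python
from math import sqrt


def confusion_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy and Matthews coefficient."""
    total = tp + fn + tn + fp
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / total,
        "mcc": (tp * tn - fp * fn) / denom,
    }


# Assumption: counts taken from the abstract and pooled into one matrix.
metrics = confusion_metrics(tp=115, fn=20, tn=450, fp=0)
for name, value in metrics.items():
    print(name, round(value, 3))

cv_accuracies = [95.9, 96.6, 96.6, 96.5]  # cv1..cv4 as listed in the text
cv_mean = sum(cv_accuracies) / len(cv_accuracies)
print("mean cv accuracy", round(cv_mean, 1))
```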

Conclusions
In accordance with the aims of this work, two main conclusions can be drawn from the results and discussion. Firstly, the 2D structure of RNA can be encoded with ᵏξM to develop QSAR studies in the presence of low sequence homology, as in the mps problem. Secondly, there is a very simple linear QSAR model for mps prediction that involves the first two members of the ᵏξM series (⁰ξM, ¹ξM).

40. Kremer, L.; Baulard, A.; Estaquier, J.; Content, J.; Capron, A.; Locht, C. J. Bacteriol. 1995, 177, 642.
© 2015 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions defined by MDPI AG, the publisher of the Sciforum.net platform. Sciforum papers authors retain the copyright to their scholarly works. Hence, by submitting a paper to this conference, you retain the copyright, but you grant MDPI AG the non-exclusive and unrevocable license right to publish this paper online on the Sciforum.net platform. This means you can easily submit your paper to any scientific journal at a later stage and transfer the copyright to its publisher (if required by that publisher) (http://sciforum.net/about).

Figure 1. Circular representation of the folded RNA macromolecule of mps T3 from M. tuberculosis; the main stem is highlighted in red.
Mulder et al. listed the −35 and −10 DNA regions of a few mycobacterial promoters. Mycobacteriophage I3 and M. paratuberculosis promoter sequences, and their similarity to E. coli promoters, have been studied by Ramesh and Gopinathan and by Bannantine et al., respectively. Kremer et al. studied the DNA sequences essential for transcription in promoters such as M. tuberculosis 85A. It is possible that DNA promoters with a high GC content in the −10 region are the true representatives of the mycobacterial type. An analysis of M. smegmatis and M. tuberculosis promoters by Bashyam et al. showed similarities to E. coli σ70 promoters; in this case, however, the −35 regions showed greater sequence variability. Strohl [...] It can therefore be inferred that recognition of mycobacterial promoter sequences requires a powerful technique capable of unravelling hidden patterns in the biomacromolecule structure, patterns that are difficult to identify visually.