Mol 2 Net Solvent Accessible Surface Area Hot-Spot Detection Method

The natural tendency of proteins to bind to each other, as well as to many other molecules, forming stable and specific complexes is fundamental to all biological processes. The structural and functional description of protein-protein and protein-ligand complexes is key not only to advancing basic scientific knowledge but also to applications in the biomedical and pharmaceutical industries. In this work we looked for more accurate ways of predicting the residues crucial for complex binding (hot-spots, HS), which can be used to model protein structure, dynamics and function. We developed an algorithm based on an innovative series of descriptors that had not previously been used in hot-spot determination and that can be applied to HS detection at both protein-protein and protein-nucleic acid interfaces. A web server for public use of the new methodological approaches was built and can be accessed at http://bio-aims.udc.es/MolStructPred.php


Introduction
Protein-protein interactions (PPIs) are fundamental for all life processes and it is vital to understand their dynamics, structural and energetic characteristics in order to find new improved ways to influence these molecular machineries [1].
Traditional mutagenesis approaches, including the use of hybrid receptors and alanine scanning mutagenesis, have led to important insights into the structural basis underlying PPIs. However, experimental mutagenesis scanning of a complete interface is highly costly in both money and time [1][2][3]. Overcoming this problem requires an efficient and fast computational technique that detects the major binding determinants at a protein-protein interface: the Hot-Spots (HS). HS tend to be conserved residues tightly clustered in the central part of protein-protein interfaces, forming a network of specific interactions that are optimized and cooperative [4]. Figure 1 illustrates an example of a protein-protein complex in which HS are highlighted in a red vdW representation and non-HS (called Null-Spots, NS) in a yellow one. HS tend to be surrounded by a region of supposedly "less important" residues, largely hydrophobic, that occludes solvent, lowering the local dielectric constant and enhancing specific electrostatic and hydrogen bond interactions (Figure 1) [5]. According to this theory (the O-ring theory), HS regions have a low number of interfacial waters, implying that water entropy effects provide one of the driving forces for complex formation [6] and that occlusion of bulk solvent slows down dissociation. Building on this knowledge about HS gathered through the years [1][2][3][4][7], we decided to look for a method based on genomic conservation scores and 12 different Solvent Accessible Surface Area features (described in reference [8]).

Figure 1.
Structural representation of a protein-protein complex (PDB ID: 1DX5 [9]) in which the HS and NS are highlighted in red and yellow vdW representations, respectively.
http://sciforum.net/conference/mol2net-1

Results and Discussion
Performance in machine learning (ML) is usually measured using predictive accuracy, which can be misleading if the data are unbalanced [10]. Dataset S1 comprised 71 HS/406 NS, dataset S2 35 HS/56 NS, dataset S3 60 HS/162 NS and dataset S4 20 HS/80 NS, which shows that our datasets (described in reference [8]) are highly unbalanced: HS are the under-represented class, as they are less common in Nature. We therefore evaluated the performance of each model by taking into account Recall (TPR), Precision, Specificity and FPR as well as F1-score and AUROC. We showed that simple Bayes Networks were able to classify HS for protein-protein interactions, but only complex methods such as GA-SVM-Full could be used to classify HS for protein-nucleic acid interactions.
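To make the imbalance argument concrete, the following minimal Python sketch computes the metrics named above from confusion-matrix counts and applies them to a trivial baseline that labels every residue as NS. The HS/NS counts are those quoted in the text; the helper names (`Confusion`, `metrics`, `all_ns_baseline`) are hypothetical and are not part of the published method or of the pySBHD code.

```python
from dataclasses import dataclass

# HS/NS counts for datasets S1 to S4, as reported in the text.
DATASETS = {"S1": (71, 406), "S2": (35, 56), "S3": (60, 162), "S4": (20, 80)}


@dataclass
class Confusion:
    """Confusion-matrix counts with HS as the positive class (illustrative helper)."""
    tp: int
    fp: int
    tn: int
    fn: int


def _ratio(num: float, den: float) -> float:
    # Return 0.0 instead of raising when the denominator is zero.
    return num / den if den else 0.0


def metrics(c: Confusion) -> dict:
    """Compute accuracy, recall (TPR), precision, specificity, FPR and F1."""
    recall = _ratio(c.tp, c.tp + c.fn)
    precision = _ratio(c.tp, c.tp + c.fp)
    return {
        "accuracy": _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn),
        "recall": recall,
        "precision": precision,
        "specificity": _ratio(c.tn, c.tn + c.fp),
        "fpr": _ratio(c.fp, c.fp + c.tn),
        "f1": _ratio(2 * precision * recall, precision + recall),
    }


def all_ns_baseline(n_hs: int, n_ns: int) -> Confusion:
    # Hypothetical trivial classifier: every residue is predicted to be NS,
    # so all NS are true negatives and all HS are false negatives.
    return Confusion(tp=0, fp=0, tn=n_ns, fn=n_hs)


if __name__ == "__main__":
    for name, (n_hs, n_ns) in DATASETS.items():
        m = metrics(all_ns_baseline(n_hs, n_ns))
        print(f"{name}: accuracy={m['accuracy']:.2f} "
              f"recall={m['recall']:.2f} f1={m['f1']:.2f}")
```

On S1, for instance, this baseline reaches an accuracy of 406/477 (about 0.85) while detecting no HS at all, which is why recall, precision and F1-score are more informative than accuracy for these datasets.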
The best classifier for the protein-protein case uses four features: CONSURF score, ΔSASAi, rel/resSASAi and rel/aveSASAi (TPR=0.79, FPR=0.21, Precision=0.87, F1-score=0.83 and AUROC=0.85). Our algorithm was assessed against some of the state-of-the-art methods available as web servers and proved to predict HS at protein-protein interfaces more accurately.
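As a quick consistency check on the figures reported for the best protein-protein classifier, the F1-score and the specificity can be recomputed from the standard definitions. Only those definitions are assumed here; the numeric inputs are the Precision, TPR and FPR values quoted above.

```latex
\begin{align*}
F_1 &= \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{TPR}}{\mathrm{Precision} + \mathrm{TPR}}
     = \frac{2 \times 0.87 \times 0.79}{0.87 + 0.79}
     = \frac{1.3746}{1.66} \approx 0.83, \\
\mathrm{Specificity} &= 1 - \mathrm{FPR} = 1 - 0.21 = 0.79.
\end{align*}
```

The recomputed F1-score agrees with the reported value of 0.83.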

Materials and Methods
Three different datasets were used for the protein-protein interfaces: ASEdb [11], BID [12] and SKEMPI [13] (comprising a total of 790 residues from 58 complexes), and one for protein-nucleic acid interfaces: Pronit [14][15][16] (a total of 117 residues from 28 complexes). The datasets consist of protein complexes for which experimental alanine scanning mutagenesis data, genetic conservation scores and three-dimensional crystallographic structures of the bound complex are all available. These were filtered to ensure that a maximum of 35% sequence identity could be found for at least one protein in each interface [8]. Various machine-learning (ML) techniques were employed for this problem and, in order to improve the performance and reduce the number of features in the input space, we also performed Feature Selection (FS), as the number and relevance of the input variables can affect the performance of the model. Several statistical analyses were performed to ensure the high accuracy of the method.

Conclusions
Our methods are accurate and time efficient. Moreover, our method can be applied not only to protein-protein but also, for the first time, to protein-nucleic acid complexes [8]. Web servers were also constructed and made available to the scientific community at the BioAIMS portal (http://bio-aims.udc.es/MolStructPred.php). The code of the Web tools is available in the pySBHD repository (https://github.com/muntisa/pySBHD).