Pro-ChInt: Machine Learning Methods for Identifying Dual-/Multi-Protein Chains Interactions with Python

Abstract: In nature, protein chain interactions (Pro-ChInt) within single- or multi-protein systems are common but complex. They are physical contacts established between two or more protein chains, and they depend on the amino acid sequences, which contain a large amount of information. Decoding the amino acid sequence information of proteins using complex networks or graphs of the peptides is a useful way to uncover how different protein chains communicate. We first wrote Python code to download the specified protein sequences directly from the RCSB Protein Data Bank (PDB). We then converted the FASTA format to the S2SNet format to calculate the embedded and non-embedded parameters of the protein chains from the star graph topological indices of the peptide sequences. In parallel, we numbered all protein chains and used the chain numbers to draw random combinations for a given number of chains or cases per protein. We then replaced each random number with the corresponding parameters of that protein chain, calculated with the S2SNet application. Finally, a machine learning classification model was built on the combinations of different chains. Combined with machine learning techniques, this new method can be used to identify interactions between two or more protein chains.


Introduction
Proteins are the main components of the biological metabolic pathways in living organisms. In nature, a protein may be a single chain, or two or more chains that together form a functional complex. The communication among different protein chains is generally very complicated, and how to decode this communication "language" is an important research topic in current chemoinformatics, bioinformatics, and pharmaceutics.
Biological systems are very complicated; therefore, many scientists try to address complex biological problems with the techniques of genomics, transcriptomics and proteomics. However, proteomics is more complicated than genomics because the genome is generally constant, whereas the proteome differs from cell to cell and over time. Proteins are subjected to a wide variety of chemical modifications after translation. These are called post-translational modifications and include phosphorylation, ubiquitination, methylation, oxidation, etc.
SciForum http://sciforum.net/conference/mol2net-1
A good understanding of protein molecular information is helpful for disease control and prevention, because the structure of a protein determines its function, and proteins are the "practitioners" that participate directly in the complex biological life cycle. In nature, protein-protein interactions are the physical contacts established between two or more proteins through electrostatic forces and/or biochemical events. Functional domains, however, are generally formed by two or more protein chains rather than by a single chain or a single protein. Decoding the amino acid sequence information of proteins, using complex networks or graphs of the peptides, is a useful way to uncover protein chain-chain interactions (Pro-ChInt).
Some sequence-to-structure graph tools are used to calculate numeric descriptors of molecular structure, for instance MARCH-INSIDE 1 and S2SNet 2, 3. These tools transform character and numeric sequences into Star network graphs and then calculate the Star Graph Topological Indices.

Results and Discussion
In the present work, we first searched for the target PDB-IDs with the properties of interest and saved all PDB-IDs in a text file. We then retrieved the FASTA profiles of all protein chains using the Python module "urllib2". We converted the FASTA files to the S2SNet format; some examples of FASTA and S2SNet profiles are presented in Figure 1. The S2SNet format is easier to use in the later steps. Once the S2SNet format was obtained, the TIs of each protein chain were calculated with the S2SNet application. Other methods could also be used here to calculate molecular descriptors for protein sequences. For instance, we are trying to divide all the amino acids into four types: polar or non-polar, and charged or uncharged. We can then count the number of polar-polar, polar-x-polar, and polar-x…x-polar amino acid patterns, or other types of connection between the different amino acids. However, this part of the work is not yet finished, so here we used S2SNet, a tool previously developed in our group.
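The pattern counting mentioned above (polar-polar, polar-x-polar, polar-x…x-polar) could be sketched as follows. This is only an illustration of the idea, not the unfinished descriptor code itself: the function name `count_gapped_pairs` and the residue set `POLAR` are hypothetical, and the assignment of amino acids to the polar class is an assumption, since classifications differ between textbooks.

```python
# Assumed polar class (uncharged polar plus charged residues); this grouping
# is an illustrative choice, not the classification used by the authors.
POLAR = set("STNQCYDEKRH")


def count_gapped_pairs(sequence, residue_class, gap):
    """Count positions where two residues of the class are separated by
    exactly `gap` arbitrary residues (gap=0: polar-polar,
    gap=1: polar-x-polar, and so on)."""
    step = gap + 1
    return sum(
        1
        for i in range(len(sequence) - step)
        if sequence[i] in residue_class and sequence[i + step] in residue_class
    )


if __name__ == "__main__":
    seq = "SSTAG"
    for gap in range(3):
        print(gap, count_gapped_pairs(seq, POLAR, gap))
```

The same function can be reused for the non-polar, charged and uncharged classes by passing a different residue set.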
In parallel, we numbered all the chains in a given file and used the chain numbers of each protein to run a random selection among the given chain numbers (n). For example, if the first protein has 9 chains, these chains are numbered from 1 to 9. The user enters a multiplier (m-fold), which gives n × m random cases. However, duplicates have to be removed before the final set of cases is obtained. The most important part of the present work is to define how many chains are assigned to each interaction. For example, with our new code, the interaction can be evaluated among two or more chains, depending on the user (Figure 2). In addition, each random number refers to the corresponding chain sequence, and 42 TIs are calculated for each sequence. If all the sequences (numbers) in a combination come from the same protein, we define the case as "positive" or "1"; if not all the numbers come from the same protein, we consider the case "negative" or "0".
After that, we characterized each combination by the average value of each of the 42 TIs over the chains in the combination. These data were then used to build a classification model that identifies whether there are interactions among the two or more chains.

Material and Methods
All code was written in PyCharm 3.5 under Python 2.7. Establishing Pro-ChInt involves several steps: obtaining the target protein chain sequences, converting FASTA to the S2SNet format and calculating the S2SNet star graph topological indices, generating the series of random numbers, etc.

Download FASTA files
As a first step, we wrote Python code to download the FASTA files from the Protein Data Bank (PDB) according to the PDB-ID (serial number). In this part, we used the urllib2 module to access the web page corresponding to the specified protein ID.
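A minimal sketch of this download step is given here for illustration. It is written for Python 3 with `urllib.request` rather than the Python 2.7 `urllib2` module used in this work, and the URL pattern in `FASTA_URL` as well as the function names are assumptions, not the original code.

```python
from urllib.request import urlopen

# Assumed RCSB endpoint; the original scripts may have used a different URL.
FASTA_URL = "https://www.rcsb.org/fasta/entry/{pdb_id}"


def fasta_url(pdb_id):
    """Build the (assumed) download URL for one PDB-ID."""
    return FASTA_URL.format(pdb_id=pdb_id.strip().upper())


def download_fasta(pdb_id, timeout=30):
    """Fetch the FASTA text for one PDB-ID (requires network access)."""
    with urlopen(fasta_url(pdb_id), timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_fasta(text):
    """Split FASTA text into a list of (header, sequence) pairs,
    joining sequences that are wrapped over several lines."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

With a text file of PDB-IDs (one per line), calling `parse_fasta(download_fasta(pdb_id))` for each ID would give one (header, sequence) pair per entry of the FASTA file.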

Calculate the S2SNet topological indices
We obtained 42 topological indices (TIs) for each protein chain, calculated from the S2SNet star graph. There are two types of TIs, embedded and non-embedded indices, with 21 TIs each. They include Shannon entropies, connectivity matrices, the Harary number, the Wiener index, and the Gutman topological index, with different powers 4. S2SNet is widely used to obtain the molecular information of proteins 5.
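To make the idea of a star graph index concrete, the following sketch computes the Wiener index and the Harary number for a simplified, unweighted star graph. It does not reproduce the S2SNet application, which provides 21 indices per graph type. The construction used here is an assumption for illustration: a dummy centre node, one branch per amino acid type with residues added in order of appearance, and, for the embedded variant, additional edges between consecutive residues of the sequence. All function names are hypothetical.

```python
from collections import deque


def star_graph(sequence, embedded=False):
    """Adjacency sets of a simplified star graph. Node 0 is the dummy
    centre; node i (1-based) is the i-th residue of the sequence."""
    adjacency = {i: set() for i in range(len(sequence) + 1)}

    def connect(a, b):
        adjacency[a].add(b)
        adjacency[b].add(a)

    last_on_branch = {}  # residue type -> last node placed on that branch
    for node, residue in enumerate(sequence, start=1):
        connect(last_on_branch.get(residue, 0), node)
        last_on_branch[residue] = node
    if embedded:
        # Embedded graph: also keep the original sequence connectivity.
        for node in range(1, len(sequence)):
            connect(node, node + 1)
    return adjacency


def distances_from(adjacency, source):
    """Shortest-path lengths from `source` by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency[current]:
            if neighbour not in dist:
                dist[neighbour] = dist[current] + 1
                queue.append(neighbour)
    return dist


def wiener_and_harary(adjacency):
    """Wiener index (sum of distances) and Harary number (sum of
    reciprocal distances) over all unordered node pairs."""
    wiener, harary = 0, 0.0
    for source in adjacency:
        for target, d in distances_from(adjacency, source).items():
            if target > source:
                wiener += d
                harary += 1.0 / d
    return wiener, harary


if __name__ == "__main__":
    for embedded in (False, True):
        print(embedded, wiener_and_harary(star_graph("AAB", embedded)))
```

For the toy sequence "AAB", the non-embedded graph is a path through the centre and the embedded graph closes it into a four-node cycle, so the embedded Wiener index is smaller.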

Get the random number matrix
In this part, we first numbered all the protein chains according to the order in which the chains appear in the PDB chain file, so that each protein chain has a unique number. For example, if the first protein has 9 chains (n), these chains are numbered from 1 to 9; if the second protein has 15 chains, its chains are numbered from 10 to 24, and so on. Then, two pieces of code let the user enter the chain number, i.e., the maximum number of chains to be considered for Pro-ChInt. The user also enters a multiplier (m-fold), which gives n × m random cases. Finally, we removed all the duplicated cases.
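The numbering and random selection can be sketched as follows. This is an illustration under stated assumptions, not the original code: the function names are hypothetical, n is taken to be the total number of chains, each case is assumed to contain between 2 and the user-defined maximum number of distinct chains, and a case is treated as a duplicate when it contains the same set of chain numbers.

```python
import random


def number_chains(chains_per_protein):
    """Give every chain a unique running number (starting at 1) and
    return a dict: chain number -> index of the protein it belongs to."""
    owner = {}
    next_number = 1
    for protein_index, n_chains in enumerate(chains_per_protein):
        for _ in range(n_chains):
            owner[next_number] = protein_index
            next_number += 1
    return owner


def random_cases(owner, max_chains, m_fold, seed=None):
    """Draw n x m random combinations of chain numbers and drop duplicates.

    Assumptions: n is the total number of chains, and each combination
    holds between 2 and `max_chains` distinct chain numbers."""
    rng = random.Random(seed)
    numbers = sorted(owner)
    upper = min(max_chains, len(numbers))
    if upper < 2:
        return []
    cases = set()  # a set removes repeated combinations automatically
    for _ in range(len(numbers) * m_fold):
        size = rng.randint(2, upper)
        cases.add(tuple(sorted(rng.sample(numbers, size))))
    return sorted(cases)


if __name__ == "__main__":
    # First protein with 9 chains (1-9), second with 15 chains (10-24).
    owner = number_chains([9, 15])
    cases = random_cases(owner, max_chains=3, m_fold=5, seed=0)
    print(len(cases), cases[:5])
```

Because duplicates are removed after sampling, the final number of cases can be smaller than n × m.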

Classification modeling
In this step, we replaced each random number with the 42 TIs of the corresponding protein chain sequence. If all the chains in a combination come from the same original protein, we consider that the combination has a chain-chain interaction (Pro-ChInt) and label it "1" or "positive". If the chains in the combination come from different proteins, we consider that the combination has no chain-chain interaction and label it "0" or "negative".
For each combination, we calculated the average value of each of the 42 S2SNet Star Graph TIs. After that, Weka can be used to obtain the best classification model for the combinations described above.
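The labelling and averaging rules described in this section can be sketched as follows, together with an export to Weka's ARFF text format. The function names, the attribute names `TI1`, `TI2`, … and the relation name are hypothetical, and the two-value index vectors in the example are placeholder data, not S2SNet output.

```python
def label_case(case, owner):
    """Return 1 if all chains of the case belong to the same protein,
    otherwise 0."""
    return 1 if len({owner[chain] for chain in case}) == 1 else 0


def average_indices(case, indices):
    """Average each topological index over the chains of one case."""
    rows = [indices[chain] for chain in case]
    return [sum(column) / len(rows) for column in zip(*rows)]


def to_arff(cases, owner, indices, relation="pro_chint"):
    """Build ARFF text with one numeric attribute per averaged index
    and a nominal class attribute {0,1}."""
    n_features = len(next(iter(indices.values())))
    lines = ["@relation " + relation]
    lines += ["@attribute TI%d numeric" % (i + 1) for i in range(n_features)]
    lines += ["@attribute class {0,1}", "@data"]
    for case in cases:
        values = ["%.6g" % v for v in average_indices(case, indices)]
        lines.append(",".join(values + [str(label_case(case, owner))]))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder data: chains 1 and 2 belong to protein 0, chain 3 to protein 1.
    owner = {1: 0, 2: 0, 3: 1}
    indices = {1: [1.0, 2.0], 2: [3.0, 4.0], 3: [5.0, 6.0]}
    print(to_arff([(1, 2), (1, 3)], owner, indices))
```

In practice each chain would carry its 42 TIs, so the ARFF file would contain 42 numeric attributes followed by the class attribute.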

Conclusions
This short communication presents original Python code for identifying protein chain-chain interactions based on the S2SNet Star Graph Topological Indices. The work relies on molecular descriptors obtained from Star Graphs, followed by Machine Learning methods run in Weka to search for the best classification model. In this way, protein chain-chain interactions can be explained based on the molecular information in the protein sequences.

Figure 1. The FASTA and S2SNet profiles of some protein chains.

Figure 2. Examples of random numbers for the chain-chain interaction.

The codes are presented as follows:

1 partially supported by the Galician Network for Colorectal Cancer Research (Red Gallega de Cáncer Colorrectal - REGICC, Ref.: CN 2012/217), Institute for Biomedical Informatics of A Coruña (INIBIC), and Center for Research of Information and Communication Technologies (CITIC).