2D Polar Co-ordinate Representation of Amino Acid Sequences With some applications to Ebola virus, SARS and SARS-CoV-2 (COVID-19)

Abstract. We propose a novel graphing method, defined mathematically, for representing the amino acid sequences of proteins in a two-dimensional plane and for characterizing them numerically. Each amino acid is represented according to the relative magnitude of its hydrophobicity.


MOL2NET, 2020, 6, ISSN: 2624-5078, http://sciforum.net/conference/mol2net-06

Introduction
The structure and function of proteins arise from their amino acid (aa) sequences. Several approaches have been proposed to quantify and mathematically characterize protein sequences, ranging from concepts of graphical representation and numerical characterization of DNA/RNA sequences [1] to novel representations of protein sequences in a 2-dimensional plane [2]. In some other approaches, a geometry-based alignment method has also been designed [3]. Our lab also proposed an approach in which a protein sequence is represented in a hypothetical 20D co-ordinate system [4]. Other algorithms, including representations in 3D co-ordinates, also exist [5] [6] [7] [8] [9].
However, these graphical and numerical approaches raise several issues. Plotting amino acid sequences, using suitable algorithms and rules, can be more effective than plotting base sequences. In a base sequence, the deletion or addition of a base may lead to a frameshift mutation. The whole reading frame of the genetic code then changes, but the gene graph changes only slightly, which can be hard to identify in a large sequence. An amino acid sequence is generally much shorter than the corresponding base sequence, so when amino acids are plotted a change of amino acid is easier to notice. In particular, after a frameshift mutation the amino acid graph changes from the point of mutation onwards, which makes it easier to separate from the graph of a sequence from a different time. Besides, owing to the degeneracy of the genetic code, several triplet codons may code for the same amino acid. A change in the base sequence graph may therefore leave the protein graph unchanged, so the protein graph may give a closer estimate of the actual structure of the protein. While many multi-dimensional methods for the numerical characterization of protein sequences have been proposed, a 2D representation of amino acid sequences in a polar co-ordinate plane may better help in visualizing sequence changes. Also, by defining a descriptor or parameter and using it to analyze the differences between two sequences, we can determine phylogenetic relationships.
In a new approach to the graphical representation of protein sequences, we represent the 2-dimensional real co-ordinate plane (ℝ × ℝ) in polar form and assign a particular angle to each amino acid for analyzing its properties. Each amino acid (aa) is represented as a vector and not as a scalar. This new representation enables us to identify visually the separation of one protein sequence from another and also to quantify intra-sequence and inter-sequence differences. We expect more features to be identified from this approach. Here we report primarily the methodology we have adopted.

AMINO ACID PROPERTIES:
Depending on its structure, an amino acid can be polar or non-polar. Polar amino acids tend to lie on the surface most of the time, as the cellular milieu is mostly polar. Some amino acids, in addition to being polar, show acidic or basic character. We give below the classification scheme of the 20 most common amino acids. We also characterize all the amino acids by their hydrophobicity index, using hydrophobicity data available on the World Wide Web.

CO-ORDINATE ASSIGNMENT:
We assigned co-ordinates to each amino acid using its hydrophobicity index. Placing the different types of amino acids in different sectors allows the character of a graph to be determined by simple observation. As there are 20 amino acids, we assign each amino acid a particular angle in the Cartesian plane. As each successive amino acid is read, we move our vector forward. Dividing the full circle of 360° equally among the 20 amino acids gives a gap of 360°/20 = 18° between adjacent amino acids.
The amino acids are assigned in the following manner: the most hydrophobic amino acid is placed on the +y axis and the most hydrophilic amino acid on the -y axis. The other amino acids are then placed accordingly, with hydrophilicity increasing vertically downwards.

Amino acid    Code  Angle (°)
Histidine     H     288
Isoleucine    I     108
Lysine        K     306
Leucine       L     90
Methionine    M     144
Asparagine    N     252
Proline       P     270
Glutamine     Q     216
Arginine      R     234
Serine        S     342
Threonine     T     162
Valine        V     54
Tryptophan    W     126
Tyrosine      Y     180

Table 2. Angle table of amino acids
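As a quick consistency check on the angle assignment, the Python sketch below stores the angles of the residues listed in Table 2 by one-letter code and verifies that each one sits on the 18° grid, with no two residues sharing a slot. The names `ANGLE_DEG`, `STEP_DEG`, `on_grid` and `free_slots` are illustrative and not part of the original method. Only the residues listed in the table are included.

```python
# Angles in degrees for the residues listed in Table 2, keyed by one-letter code.
ANGLE_DEG = {
    "H": 288, "I": 108, "K": 306, "L": 90, "M": 144, "N": 252, "P": 270,
    "Q": 216, "R": 234, "S": 342, "T": 162, "V": 54, "W": 126, "Y": 180,
}

# 20 amino acids share the full circle equally: 360 / 20 = 18 degrees each.
STEP_DEG = 360 // 20


def on_grid(angles):
    """Return True if every angle is a multiple of the 18-degree step."""
    return all(a % STEP_DEG == 0 for a in angles.values())


def free_slots(angles):
    """Return the grid angles not used by any residue, in ascending order."""
    used = set(angles.values())
    return [a for a in range(0, 360, STEP_DEG) if a not in used]


if __name__ == "__main__":
    print("all on grid:", on_grid(ANGLE_DEG))
    print("no shared slots:", len(set(ANGLE_DEG.values())) == len(ANGLE_DEG))
    print("unused grid angles:", free_slots(ANGLE_DEG))
```

The angles reported by `free_slots` are the grid positions left over for residues that are not listed in the table.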

METHOD OF GRAPH INITIATION:
The graph starts from (0,0) and, with each amino acid, moves 1 unit in the direction of that amino acid's angle. Once the first amino acid is drawn, we move the origin of the co-ordinate system to that point and then calculate the position of the next amino acid.
For each amino acid in turn, the move in the graph is represented as

$\vec{r} = r(\cos\theta\,\hat{i} + \sin\theta\,\hat{j})$

Here, $\vec{r}$ is a vector representing the amino acid in the Cartesian plane and $\theta$ is the angle of that amino acid. As our system gives no weight to the count of amino acids, $r = 1$, i.e. each step is 1 unit long in every direction. So, if the starting co-ordinate is (0,0) and an amino acid with angle $\theta$ is found, the final co-ordinate is $(\cos\theta, \sin\theta)$.
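The single step described above can be sketched in Python as follows. This is a minimal illustration only: the function name `unit_step` and its parameters are hypothetical, and the angle is assumed to be given in degrees, as in the angle table.

```python
import math


def unit_step(angle_deg, r=1.0):
    """Displacement (dx, dy) for one amino acid placed at angle_deg degrees.

    r is the step length; the text uses r = 1 for every amino acid.
    """
    theta = math.radians(angle_deg)  # math.cos and math.sin expect radians
    return (r * math.cos(theta), r * math.sin(theta))


if __name__ == "__main__":
    # Leucine (L) is listed at 90 degrees, so its step points along +y.
    dx, dy = unit_step(90)
    print(round(dx, 6), round(dy, 6))
```

Because every step has unit length, the displacement depends only on the angle of the residue.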

GRAPH PROPAGATION:
For each amino acid plotted, we assume the origin of the co-ordinate system to be shifted to the last point reached, and then plot.
Suppose our initial reference frame is S, with origin O at (0,0). Suppose that, after a stretch of amino acids, we have arrived at a frame S' whose origin O' lies at $(x, y)$. By our convention, the co-ordinate system shifts to the point $(x, y)$. As we move forward in reading the sequence, we plot the next amino acid, say with angle $\theta$. The next point with respect to the frame of reference S' is then $(x', y') = (\cos\theta, \sin\theta)$, which in the original frame S is $(x + \cos\theta, y + \sin\theta)$.
If the following amino acid has angle $\phi$, its $\cos\phi$ and $\sin\phi$ components are likewise added to the x and y co-ordinate values. So the final co-ordinate becomes

$(x + \cos\theta + \cos\phi,\; y + \sin\theta + \sin\phi)$

In this way, whenever the next amino acid is read, we add the cosine of its angle to the x co-ordinate and the sine of its angle to the y co-ordinate. This generalizes as follows. Starting from (0,0), each amino acid in the sequence is read and plotted at its respective angle. The final co-ordinate after placing $n$ sequential amino acids with angles $\theta_1, \theta_2, \dots, \theta_n$ is

$(x_n, y_n) = \left(\sum_{i=1}^{n} \cos\theta_i,\; \sum_{i=1}^{n} \sin\theta_i\right)$
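The propagation rule, adding the cosine of each residue's angle to x and its sine to y, can be sketched as a running sum. The code below is an illustrative sketch rather than the authors' implementation. The names `ANGLE_DEG` and `polar_walk` are hypothetical, and the dictionary holds only the angles listed in Table 2, so a sequence containing any other residue raises a `KeyError` instead of using a guessed angle.

```python
import math

# Angles in degrees for the residues listed in Table 2, keyed by one-letter code.
ANGLE_DEG = {
    "H": 288, "I": 108, "K": 306, "L": 90, "M": 144, "N": 252, "P": 270,
    "Q": 216, "R": 234, "S": 342, "T": 162, "V": 54, "W": 126, "Y": 180,
}


def polar_walk(sequence):
    """Return the list of points visited, starting at the origin (0, 0).

    Each residue moves the current point by one unit in the direction of
    its angle, so the k-th point is the sum of the first k unit steps.
    """
    x, y = 0.0, 0.0
    points = [(x, y)]
    for code in sequence:
        theta = math.radians(ANGLE_DEG[code])  # KeyError for unlisted residues
        x += math.cos(theta)
        y += math.sin(theta)
        points.append((x, y))
    return points


if __name__ == "__main__":
    for px, py in polar_walk("MKVLS"):
        print(f"{px:8.4f} {py:8.4f}")
```

The last element of the returned list is the final co-ordinate of the sequence, and joining the points in order gives the plotted curve.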

NUMERICAL CHARACTERIZATION:
Plotting the sequence in the 2D Cartesian plane gives it a pattern and a basis for observation, but characterizing it with a representative number helps in comparison. We compute the weighted centre of mass of the graph and calculate the radius of the graph, which we call the Quotient Radius (qR) of that protein.
The graph will then be: [1]. We plotted the two sequences together, superimposed on each other, and tried to interpret the differences between the two sequences. Comparing the two graphs, we see that:

• There are around 19000 bases in Fig. 3, whereas there are only 2000 amino acids in Fig. 4. Inferences about changes in the sequence can therefore be drawn more clearly from the amino acid plot.
• In Figure 4 the two sequences seem to follow the same pattern throughout, with the 2018 sequence lagging slightly behind, whereas the polar plot of Figure 5 shows many more changes in the sequence, and hence the amino acids differ much more.
In another study we plotted the seven types of coronavirus, namely 229E alpha, HKU1 beta, NL63 alpha, OC43 beta, MERS, SARS and SARS-CoV-2 (COVID-19), using their full genomes (Figure 6). This graph gives us some idea of the similarities between SARS and COVID-19, so we plotted the two of them separately. The parallel course of the two plots shows the abundance of similar amino acids in the two sequences. To highlight their differences, we plotted their spike glycoproteins. This shows the changes that occurred in the surface glycoproteins, as well as the shifting of clusters and the similarities at the ends.
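A numerical descriptor of the kind described under NUMERICAL CHARACTERIZATION can be sketched in Python for comparing two sequences. The text does not spell out the weighting of the centre of mass or the exact definition of the Quotient Radius, so the sketch makes two loudly stated assumptions: the centre of mass is the unweighted mean of the plotted points (origin excluded), and the radius is the distance of that centre from the origin. The names `ANGLE_DEG`, `walk`, `centre_of_mass` and `radius` are hypothetical, and only the angles listed in Table 2 are used.

```python
import math

# Angles in degrees for the residues listed in Table 2, keyed by one-letter code.
ANGLE_DEG = {
    "H": 288, "I": 108, "K": 306, "L": 90, "M": 144, "N": 252, "P": 270,
    "Q": 216, "R": 234, "S": 342, "T": 162, "V": 54, "W": 126, "Y": 180,
}


def walk(sequence):
    """Points reached after each residue (the starting origin is excluded)."""
    x, y = 0.0, 0.0
    points = []
    for code in sequence:
        theta = math.radians(ANGLE_DEG[code])
        x += math.cos(theta)
        y += math.sin(theta)
        points.append((x, y))
    return points


def centre_of_mass(points):
    """ASSUMPTION: unweighted mean of the points; the paper's weights are not given."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def radius(points):
    """ASSUMPTION: distance of the centre of mass from the origin."""
    cx, cy = centre_of_mass(points)
    return math.hypot(cx, cy)


if __name__ == "__main__":
    # Two short made-up sequences, used purely for illustration.
    a, b = "MKVLSL", "MKVLSP"
    print(radius(walk(a)), radius(walk(b)))
```

Under these assumptions, two sequences that differ in even one residue generally give different radii, which is the property a descriptor needs for sequence comparison. The actual qR may be defined differently.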

PROTEIN FAMILIES:
In another study, we collected various proteins as our sample space. Our goal was to see whether proteins of different families have different representations. For this purpose, we collected the following:
We see the comparison of six different protein families and their polar co-ordinate plots. The characteristic graphs are all unambiguous and easily separable from each other. We therefore conclude that this method also provides different characteristic graphs for different protein families.

DISCUSSION:
Here we used our polar plot method to represent graphically various types of proteins, including those of the Ebola virus, MERS, SARS and SARS-CoV-2 (COVID-19). The preliminary results reported here indicate that our new method is a useful tool for the characterization of different types of proteins. We are carrying out further studies to find relationships between the sequence and the surface exposure of different parts of a protein from its graph. Such results will be published subsequently.