Pairwise Ortholog Detection in Related Yeast Species by Using Big Data Supervised Classifications

Orthology detection still requires more effective scaling algorithms. Combinations of alignment, synteny, evolutionary distances and protein interactions have been used in different unsupervised algorithms to improve effectiveness, while many available databases are concerned with the scaling problem. In this paper, a set of gene pair


Introduction
Ortholog detection (OD) algorithms should distinguish orthologous genes from other types of homologs, such as paralogs, which evolve from a common ancestor through a duplication event. Many unsupervised graph-based approaches have been developed to identify orthologs, resulting in corresponding repositories of pre-computed orthology relationships.
When OD is based only on sequence similarity, it is limited by evolutionary processes such as recent paralogy events, horizontal gene transfers, gene fusions and fissions, domain recombinations or other genetic events [1][2]. In fact, the identification of homologs is difficult for short sequences, for sequences that evolved convergently, and for those that share less than 30% amino acid identity (the twilight zone). Algorithm failures have been shown in particular on benchmark datasets from Saccharomycete yeast species that underwent whole genome duplications (WGD), which present rampant paralogy and differential gene losses [3]. To tackle these shortcomings, some OD solutions merge sequence similarity with synteny, genome rearrangements, protein interactions, domain architectures and evolutionary distances.
On the other hand, the integration of different gene or protein information and the massive increase in complete proteomes greatly increase the dimensionality of the OD problem and the total number of proteins to be classified. In a thorough paper from the Quest for Orthologs consortium [4], the authors emphasize that this increase in proteome data calls for OD algorithms that are not only efficient but also effective. As they mention, the increase in computational demands in sequence analyses is not easily met by an increase in computational capacities but rather calls for new approaches or algorithmic implementations [4]. They also summarize some methodological shortcuts implemented by existing orthology databases to deal with the scaling problem.
In this paper, we propose a new supervised approach for pairwise OD (POD) that combines several gene pairwise features (alignment-based and synteny measures, together with others derived from the pairwise comparison of the physicochemical properties of amino acids) to address big data problems [4]. Our big data supervised POD approach allows scaling to related species and manages data imbalance (the low ratio of orthologs found in two or more genomes) for an effective OD. The methodology consists of three steps: (1) the calculation of the gene pair features to be combined, (2) the building of the classification model using machine learning algorithms that deal with big data from a pairwise dataset, and (3) the classification of related gene pairs.
Since traditional supervised classifiers cannot scale to large datasets, supervised classification for the POD problem should be addressed as a big data classification problem according to [5][6][7], and big data solutions for binary classification in imbalanced data, such as the ones presented in [8], should be applied.
http://sciforum.net/conference/mol2net-1
Finally, we evaluate the application of several big data supervised techniques that manage imbalanced datasets [8][9], such as cost-sensitive Random Forest (RF-BDCS), Random Oversampling with Random Forest (ROS+RF-BD), and the Apache Spark Support Vector Machines (SVM-BD) [9] combined with MapReduce ROS (ROS+SVM-BD). The effectiveness of the supervised approach is compared to the RBH, RSD and OMA algorithms, taking data imbalance into account. All the algorithms were evaluated on benchmark datasets derived from the following yeast genome pairs: S. cerevisiae and K. lactis, S. cerevisiae and C. glabrata [3], and S. cerevisiae and S. pombe [10]. The S. cerevisiae and C. glabrata pair is particularly complex for OD since both species have undergone WGD. We found that our supervised approach outperformed traditional methods, mainly when we applied ROS combined with SVM-BD.

Results and Discussion
For the evaluation of POD algorithms, we compare the supervised solutions with the unsupervised ones following the evaluation scheme in Figure 1. The process separates the pairs into train and test sets and calculates pairwise similarity measures (the average of local and global alignment similarity measures, the length of sequences, gene membership to conserved regions (synteny), and physicochemical profiles within window sizes of 3, 5 and 7) for the pairs of both sets. The sequences of the test sets are used to run the unsupervised reference algorithms. The train set is used to build the supervised models, which are tested only with the test set.
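As a rough illustration of one of these features, the sketch below computes a windowed physicochemical profile for two protein sequences and compares the profiles at window sizes 3, 5 and 7. The Kyte-Doolittle hydropathy scale, the moving-average smoothing, the Pearson correlation over the common profile length, and the function names are all assumptions made for this example; the text only states that physicochemical profiles are compared within these window sizes.

```python
from statistics import mean

# Kyte-Doolittle hydropathy scale (an assumed choice of physicochemical property).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_profile(seq, window):
    """Moving average of the hydropathy values over a sliding window."""
    values = [HYDROPATHY[aa] for aa in seq]
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]


def pearson(x, y):
    """Pearson correlation over the common prefix of two profiles."""
    n = min(len(x), len(y))
    if n < 2:
        return 0.0
    x, y = x[:n], y[:n]
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def profile_features(seq_a, seq_b, windows=(3, 5, 7)):
    """One similarity value per window size for a gene pair."""
    return {w: pearson(window_profile(seq_a, w), window_profile(seq_b, w))
            for w in windows}


# Hypothetical sequences, used only to show the call.
features = profile_features("MKVLAAGIVGLLLA", "MKVLSAGIVGLLIA")
```

Each gene pair then contributes one value per window size to its feature vector, alongside the alignment, length and synteny features.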
The performance quality evaluation involves the calculation of the Geometric Mean (G-Mean) [11], which seeks to maximize the accuracy of the two classes (orthologs and non-orthologs) by achieving a good balance between sensitivity and specificity that considers misclassification costs, and the Area Under the ROC Curve (AUC) [12], which shows the classifier performance over a range of data distributions [13].
In Experiment 1, we evaluated the algorithms inside a genome pair by randomly partitioning the complete set of pairs into 75% for training and 25% for testing, while in Experiment 2 we built the model from one genome pair and tested it on two different pairs. Specifically, in Experiment
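A minimal sketch of these two metrics, computed from the counts of a binary confusion matrix, is given below. The G-Mean is the square root of the product of sensitivity and specificity. The AUC shown here is the single-operating-point approximation (1 + TPR - FPR) / 2, which is an assumption of this example; the function names and counts are hypothetical.

```python
from math import sqrt


def rates(tp, fn, tn, fp):
    """Sensitivity (TPR) and specificity (TNR) from confusion-matrix counts."""
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return tpr, tnr


def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity and specificity."""
    tpr, tnr = rates(tp, fn, tn, fp)
    return sqrt(tpr * tnr)


def auc_single_point(tp, fn, tn, fp):
    """AUC approximated from one operating point: (1 + TPR - FPR) / 2."""
    tpr, tnr = rates(tp, fn, tn, fp)
    fpr = 1.0 - tnr
    return (1.0 + tpr - fpr) / 2.0


# Hypothetical imbalanced test set: 100 ortholog pairs, 10,000 non-ortholog pairs.
balanced = g_mean(tp=80, fn=20, tn=9000, fp=1000)
# A classifier that labels every pair as non-ortholog is highly accurate
# overall but has a G-Mean of zero, which is why G-Mean suits imbalanced data.
trivial = g_mean(tp=0, fn=100, tn=10000, fp=0)
```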

Comparison of big data supervised classifiers
The G-Mean values of the supervised algorithms change only slightly with the selection of different alignment parameters (Table 1). These results may be caused either by the aggregation of global and local alignment scores in a single similarity measure or by the appropriate combination of scoring matrices and gap penalties in relation to the sequence diversity between the two yeast genomes [14].
The average results of AUC and G-Mean obtained in experiments 1 and 2 for the supervised algorithms with different parameter values are shown in Table 2. The average true positive rate (TPR) and true negative rate (TNR) are also depicted in Figure 2. SVM-BD has been left out of the table due to its very poor G-Mean performance, caused by the imbalance between its TPR and TNR. Both Table 2 and Figure 2 show that the big data supervised classifiers that manage imbalance outperform their corresponding big data supervised versions.
The ROS pre-processing method for big data makes SVM-BD useful for POD, and it improves the performance of RF-BD even more with a higher value of 130% for the resampling size parameter [15]. In contrast, both experiments show that varying this parameter from 100% to 130% does not significantly influence the performance of the SVM-BD classifier with different regularization values.
Specifically, RF-BDCS shows the best performance in S. cerevisiae - C. glabrata and S. cerevisiae - K. lactis when the classification quality is measured by the G-Mean and AUC metrics, because it enhances the learning of the minority class. The criterion used to select the best tree split is based on weighting the instances according to their misclassification costs, and these costs are also considered when calculating the class associated with a leaf [8]. This cost treatment does not explicitly change the sample distribution and avoids the possible overtraining that is present in the ROS solutions due to replicated cases. The choice of the cost values (C(+|−) =   and C(−|+) = 1) may also determine the success of the algorithm.
In the case of SVM-BD, the fixed regularization parameter defines the trade-off between minimizing the training error (i.e., the loss) and minimizing the model complexity to avoid overfitting. The higher its value, the simpler the model. Nonetheless, setting an intermediate value, or one close to zero, may produce better classification performance [16]. This is the case for the ROS (RS: 100%) + SVM-BD (regParam: 0.5) classifier, which exhibits the best AUC and G-Mean values in S. cerevisiae - S. pombe, and the best balance between TPR and TNR in the three datasets (Figure 2).
Run time is another aspect to keep in mind when comparing big data solutions, since it must be balanced against classification quality. Table 3 contains the run time in seconds for all big data solutions on each dataset, with the fastest algorithms highlighted in bold. These results show that the time required is directly related to the operations needed by each method, as well as to the size of the datasets used to build the model. The fastest algorithm considering the average run time is SVM-BD, followed by SVM-BD combined with ROS. Thus, the fastest algorithms coincide with the ones with better performance. In general, the ROS (RS: 100%) + SVM-BD (regParam: 0.5) classifier can be considered the best supervised solution considering both performance and time.
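To make the aggregation of global and local alignment scores concrete, the following sketch computes a Needleman-Wunsch (global) and a Smith-Waterman (local) score and averages them after normalization. The simple match/mismatch scoring with a linear gap penalty, the normalization by the longer sequence, and the function names are assumptions for illustration; the datasets in the paper rely on substitution matrices such as BLOSUM and PAM with their own gap penalties.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Needleman-Wunsch (global) or Smith-Waterman (local) score, linear gaps."""
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                # Local alignment never drops below zero and keeps the best cell.
                score = max(score, 0)
                best = max(best, score)
            curr.append(score)
        prev = curr
    return best if local else prev[-1]


def combined_similarity(a, b, match=1):
    """Average of the normalized global and local scores (assumed aggregation)."""
    norm = match * max(len(a), len(b))
    global_sim = alignment_score(a, b, match=match) / norm
    local_sim = alignment_score(a, b, match=match, local=True) / norm
    return (global_sim + local_sim) / 2
```

Because the local score only rewards the best matching region while the global score penalizes every unaligned position, their average is less sensitive to any single parameter choice than either score alone, which is consistent with the small variation described above.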
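The effect of the resampling size parameter can be sketched with a small in-memory version of random oversampling. This is not the MapReduce implementation used in the paper; it assumes that a resampling size of 100% means replicating minority instances until they match the majority count, and 130% means reaching 1.3 times that count. The function and parameter names are hypothetical.

```python
import random


def random_oversample(rows, labels, minority=1, resampling_size=1.0, seed=0):
    """Replicate random minority rows until their count reaches
    resampling_size * (number of majority rows)."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_count = len(labels) - len(minority_idx)
    target = int(round(resampling_size * majority_count))
    extra = max(0, target - len(minority_idx))
    picks = [rng.choice(minority_idx) for _ in range(extra)]
    new_rows = list(rows) + [rows[i] for i in picks]
    new_labels = list(labels) + [minority] * extra
    return new_rows, new_labels


# Hypothetical toy dataset: 10 ortholog pairs (label 1), 90 non-ortholog pairs.
rows = list(range(100))
labels = [1] * 10 + [0] * 90
rows_100, labels_100 = random_oversample(rows, labels, resampling_size=1.0)
rows_130, labels_130 = random_oversample(rows, labels, resampling_size=1.3)
```

Since the added rows are exact copies of existing minority instances, the class distribution changes but no new information is introduced.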
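The idea of weighting instances by their misclassification cost can be illustrated with a cost-weighted impurity and a cost-weighted leaf label. The Gini criterion, the cost of 9.0 for the minority class, and all names below are assumptions for this sketch; the text states only that the split criterion and the leaf class take the costs into account.

```python
def class_weights(labels, cost):
    """Total misclassification cost carried by the instances of each class."""
    totals = {}
    for y in labels:
        totals[y] = totals.get(y, 0.0) + cost[y]
    return totals


def weighted_gini(labels, cost):
    """Gini impurity where each instance counts as much as its cost."""
    totals = class_weights(labels, cost)
    total = sum(totals.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((w / total) ** 2 for w in totals.values())


def split_impurity(left, right, cost):
    """Cost-weighted average impurity of the two children of a split."""
    wl = sum(cost[y] for y in left)
    wr = sum(cost[y] for y in right)
    return (wl * weighted_gini(left, cost) + wr * weighted_gini(right, cost)) / (wl + wr)


def leaf_label(labels, cost):
    """Class with the largest total cost among the instances in a leaf."""
    totals = class_weights(labels, cost)
    return max(totals, key=totals.get)


# Hypothetical costs: misclassifying an ortholog pair (1) is 9 times as costly.
cost = {0: 1.0, 1: 9.0}
leaf = [0] * 8 + [1] * 2
label_with_costs = leaf_label(leaf, cost)
label_without_costs = leaf_label(leaf, {0: 1.0, 1: 1.0})
```

In this toy leaf the two ortholog pairs outweigh the eight non-ortholog pairs once costs are applied, so the leaf predicts the minority class without any instance being replicated.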
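The trade-off controlled by the regularization parameter can be seen in a small sketch of a linear SVM objective: the mean hinge loss plus the regularization parameter times half the squared L2 norm of the weights. This form is an assumption chosen to mirror the usual regularized formulation of linear SVMs; the function name and the toy data are hypothetical.

```python
def svm_objective(weights, rows, labels, reg_param):
    """Mean hinge loss plus reg_param * 0.5 * ||w||^2, labels in {-1, +1}."""
    hinge = 0.0
    for x, y in zip(rows, labels):
        margin = y * sum(w * xi for w, xi in zip(weights, x))
        hinge += max(0.0, 1.0 - margin)
    loss = hinge / len(rows)
    penalty = 0.5 * sum(w * w for w in weights)
    return loss + reg_param * penalty


# Toy data: two linearly separable points.
rows = [(1.0, 0.0), (0.0, 1.0)]
labels = [1, -1]
large_w = (2.0, -2.0)   # zero training loss, but a large norm
small_w = (0.5, -0.5)   # some training loss, but a small norm

no_reg = (svm_objective(large_w, rows, labels, 0.0),
          svm_objective(small_w, rows, labels, 0.0))
with_reg = (svm_objective(large_w, rows, labels, 0.5),
            svm_objective(small_w, rows, labels, 0.5))
```

Without regularization the large-weight model has the lower objective; with a regularization parameter of 0.5 the simpler, small-weight model is preferred.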

Comparison of supervised vs. unsupervised classifiers
The average results of AUC and G-Mean obtained for the best supervised algorithms and the unsupervised algorithms with different parameter values are shown in Table 4 for experiments 1 and 2. The supervised classifiers outperform the unsupervised ones. Among the unsupervised algorithms, RSD reaches the highest G-Mean value with E-value = 1e-05 and divergence threshold = 0.8 (recommended values in [17]) in S. cerevisiae - C. glabrata, where similar results can also be seen for AUC and   values. In contrast, OMA was the best among the unsupervised algorithms on the S. cerevisiae - S. pombe datasets (Table 4).
In general, the performance of all classifiers declined on the S. cerevisiae - S. pombe datasets because S. pombe is a distant relative of S. cerevisiae [18]. The performance of the supervised classifiers is affected for the same reason and also by the difference in data distribution between the train and test sets [19]. In contrast, ROS (RS: 100%) + SVM-BD (regParam: 0.5) remained stable on the S. cerevisiae - C. glabrata and S. cerevisiae - S. pombe datasets when considering the balance between TPR and TNR. The superior results in S. cerevisiae - C. glabrata are remarkable, since both genomes underwent a WGD and a subsequent differential loss of gene duplicates, so that algorithms are prone to produce false positives. Thus, this dataset contains "traps" for OD algorithms [3].
The reduced quality shown by RBH, RSD and OMA, mainly in the case of RBH, could be caused by their initial assumption that the sequences of orthologous genes/proteins are more similar to each other than they are to any other genes from the compared organisms. This assumption may produce classification errors [1], even though BLAST parameters can be tuned as recommended in [20]. Conversely, RSD not only compares sequence similarity but also relies on maximum likelihood estimation of evolutionary distances to detect orthologs between two genomes. As a result, it finds many putative orthologs missed by RBH, because it is less likely than RBH to be misled by existing close paralogs.
The OMA algorithm also displays advantages over RBH. It uses evolutionary distances instead of alignment scores. This algorithm allows the inclusion of one-to-many and many-to-many orthologs. It also considers the uncertainty in distance estimations and detects potential differential gene losses.
From the point of view of the intrinsic information managed by the algorithms, the success of the big data supervised classifiers that manage imbalance over RSD and OMA may be explained by the feature combinations calculated for the datasets, together with the learning from curated classifications. With the aggregation of global and local alignment scores we are combining protein structural and functional relationships between sequence pairs, respectively. Besides, we incorporate other gene pair features: (i) the periodicity of the physicochemical properties of amino acids, which allows us to detect similarity among protein pairs in their spectral dimension [21]; (ii) the conserved neighbourhood information, which considers that genes belonging to the same conserved segment in genomes of different species will probably be orthologs; and (iii) the length of sequences
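The best-hit assumption behind RBH, and how a close paralog can make it miss a pair, can be sketched in a few lines. The gene names and similarity scores are hypothetical, and the dictionaries stand in for the output of a similarity search in each direction.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Hypothetical scores: b2 is more similar to a1 than to a2.
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 180, ("a2", "b2"): 90}
b_vs_a = {("b1", "a1"): 200, ("b2", "a1"): 180, ("b2", "a2"): 90}
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

Here only the a1 and b1 pair is reported. The a2 and b2 pair is discarded because the best hit of b2 is a1, which is the kind of error that similarity-only criteria can make in the presence of paralogs.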

Datasets
The characteristics of the datasets are summarized in Table 5, where the label #Atts represents the number of attributes or gene pair features, and #Class (maj; min) the number of pairs in both classes. The S. cerevisiae - S. pombe dataset contains ortholog pairs representing 95.18% of the union of the Inparanoid 7.0 and GeneDB classifications described in [10]. On the other hand, the S. cerevisiae - K. lactis and S. cerevisiae - C. glabrata datasets contain all ortholog pairs in the gold groups reported in [3]. When we built the set of instances with all possible pairs, we excluded some genes because we did not find their genome physical location data in the YGOB database [22], which are required to calculate the conserved membership feature.

Big data supervised classification managing data imbalance
We use the open-source project Hadoop [23] with its highly scalable and fault-tolerant Hadoop Distributed File System (HDFS). We also use the scalable Mahout data mining and machine learning library [24], whose machine learning algorithms are adapted to the MapReduce scheme, such as the MapReduce implementation of the Random Forest (RF) algorithm [25]. Finally, we use the Apache Spark framework [9] interacting with HDFS when the implementation of SVM-BD in the scalable MLlib machine learning library [16] is combined with the MapReduce ROS implementation [8].

Conclusions
The development of effective supervised algorithms for POD in a big data scenario was made possible by: (i) the availability of curated databases (authentic orthologs), (ii) the combination of traditional alignment measures with other gene pair features (sequence length, gene membership to conserved regions and physicochemical profiles) to complement homology detection, and (iii) the treatment of the low ratio of orthologs to the total possible gene pairs between two genomes. By applying evaluation metrics such as G-Mean, AUC and the balance between TPR and TNR, our results show that gene pairwise feature combinations provide excellent POD in a big data supervised scenario that considers data imbalance. The SVM-BD classifier with regularization parameter 0.5, combined with ROS (RS: 100%) pre-processing, outperformed the rest of the big data supervised solutions and the popular unsupervised algorithms (RBH, RSD and OMA), even when the supervised model was extended to datasets containing "traps" for OD algorithms. The classification performance of the supervised algorithms measured by the G-Mean and AUC metrics did not change significantly across the four test sets obtained with different alignment parameter settings. When the balance between time and classification quality is considered, ROS (RS: 100%) + SVM-BD (regParam: 0.5) also proves to be the algorithm of choice. In future research, the introduction of new gene pair features might improve the effectiveness and efficiency of the supervised algorithms for POD.

Figure 1. Workflow of the evaluation of supervised vs. unsupervised POD algorithms.
1 we divided the S. cerevisiae - K. lactis set into 16,986,996 pairs for training and 5,662,332 pairs for testing. The four datasets (BLOSUM50, BLOSUM62_1, BLOSUM62_2 and PAM250) of each genome pair were built from combinations of alignment parameter settings. On the other hand, in Experiment 2, we built the classification model from 22,649,328 pairs of the S. cerevisiae and K. lactis genomes and tested it on 29,887,416 pairs of S. cerevisiae and C. glabrata, and 8,095,907 pairs of S. cerevisiae and S. pombe genomes.
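The random 75%/25% partition can be sketched as follows, together with a check that the reported training and test sizes add up to the full S. cerevisiae - K. lactis set and correspond to exactly three quarters of it. The in-memory shuffle and the function name are assumptions for illustration; the actual partition was performed on datasets of tens of millions of pairs.

```python
import random


def split_pairs(pairs, train_fraction=0.75, seed=0):
    """Randomly partition gene pairs into train and test subsets."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy stand-in for the full set of gene pairs.
train, test = split_pairs(range(1000))

# Sizes reported for Experiment 1 and Experiment 2.
n_train, n_test, n_total = 16_986_996, 5_662_332, 22_649_328
sizes_consistent = (n_train + n_test == n_total) and (4 * n_train == 3 * n_total)
```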

Table 1. Geometric mean results of the best supervised classifiers in each dataset.

Table 2. AUC and G-Mean results of supervised classifiers in experiments 1 and 2.

Table 3. Run time results in seconds of the big data solutions in experiments 1 and 2.

Table 4. AUC and G-Mean of the unsupervised and the best supervised classifiers. S.

Table 5. Characteristics of the datasets.