Ensemble K-means for semi-supervised learning in enzymatic activity classification of GH-70 enzymes
1  Centro de Investigaciones de Informáticas, Universidad Central “Marta Abreu” de Las Villas (UCLV), Santa Clara, 54830, Cuba
2  Departamento de Ciencias de la Computación, Universidad Central “Marta Abreu” de Las Villas (UCLV), Santa Clara, 54830, Cuba
3  Centro de Investigaciones de Informáticas, Universidad Central “Marta Abreu” de Las Villas (UCLV), Santa Clara, 54830, Cuba
4  Centro de Bioactivos Químicos, Universidad Central “Marta Abreu” de Las Villas (UCLV), Santa Clara, 54830, Cuba
5  CIMAR/CIIMAR, Centro Interdisciplinar de Investigação Marinha e Ambiental, Universidade do Porto, Terminal de Cruzeiros do Porto de Leixões, Av. General Norton de Matos s/n, 4450-208 Matosinhos, Porto, Portugal
6  Departamento de Biologia, Faculdade de Ciências, Universidade do Porto, Rua do Campo Alegre, 4169-007 Porto, Portugal

https://doi.org/10.3390/mol2net-06-06891 (registering DOI)
Abstract:

Classifying the enzymatic activity of GH-70 enzymes is a challenge in bioinformatics because of the high diversity of these sequences. Of the 501 sequences reported when we accessed Cazy.org, only 58 were labeled, falling into 6 EC number classes. In this paper we propose a semi-supervised classification algorithm based on k-mer frequency descriptors, with k = 2, 3, 4, 5 and 6, extracted from the sequences as alignment-free measures. The high dimensionality of the k-mer vectors and the growing number of sequences motivate the use of big data classifiers for Spark, such as those in Apache Spark MLlib. Specifically, K-means clustering applied iteratively yields multiple results that can be combined into an ensemble in a semi-supervised second-round clustering step, capable of detecting groups of similar sequences that include both labeled and unlabeled ones. Finally, external measures validate the ensemble clustering on the labeled sequences. Further improvements in the clustering and ensemble steps could raise the quality of the classification.
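
As an illustration of the alignment-free descriptors mentioned above, the following minimal sketch computes relative k-mer frequencies for a single sequence. It is not the authors' implementation: the function names `kmer_frequencies` and `descriptor`, the restriction to the 20 standard amino acids, the normalization by the number of valid windows, and the toy sequence are all assumptions made for the example.

```python
from collections import Counter

# Assumption: sequences are proteins over the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_frequencies(sequence, k):
    """Return a sparse {k-mer: relative frequency} map for one sequence.

    Windows containing a symbol outside AMINO_ACIDS are skipped.
    """
    sequence = sequence.upper()
    counts = Counter(
        sequence[i:i + k]
        for i in range(len(sequence) - k + 1)
        if all(residue in AMINO_ACIDS for residue in sequence[i:i + k])
    )
    total = sum(counts.values())
    if total == 0:
        # Sequence shorter than k, or no valid window.
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


def descriptor(sequence, ks=(2, 3, 4, 5, 6)):
    """Merge the sparse k-mer frequency maps for every k in ks.

    Keys of different length never collide, so one dict can hold them all.
    """
    features = {}
    for k in ks:
        features.update(kmer_frequencies(sequence, k))
    return features


# Hypothetical toy sequence, not a real GH-70 enzyme.
toy = "MKTAYIAKQRMKTAY"
freqs = kmer_frequencies(toy, 2)
features = descriptor(toy)
```

The sketch keeps the descriptors sparse because a dense vector over 20 amino acids would have 20^k entries, which is 64,000,000 for k = 6. Such sparse maps could then be converted to sparse vectors for a clustering library; that conversion is left out here.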

Keywords: alignment-free; k-mers; semi-supervised classification