Improved Taxonomy Re-structuring using Modified K-means Clustering for Efficient Large-scale Text Classification
1  Faculty of Computing, Northwest University, Kano, Nigeria
2  Faculty of Computer Science and Mathematics, Universiti Teknologi Mara, Shah Alam, Selangor, Malaysia
3  Department of Computer Science, Faculty of Computing and Mathematical Science, Aliko Dangote University of Science and Technology, Wudil, Nigeria
4  Department of Software Engineering, Northwest University, Kano, Nigeria
Academic Editor: Lucia Billeci

Abstract:

Classifying text into a hierarchical taxonomy of classes is a common, well-known problem in large-scale text classification (LSTC). Existing approaches re-structure the class hierarchy prior to classification and have achieved better results. However, when there are many classes and an increased number of features, traditional hierarchy re-structuring tends to produce many nodes with similar granularities. This leads to misclassification and is computationally expensive, or not scalable, for many classification models, especially when the hierarchy is longer. In this paper, we propose an improved hierarchy re-structuring algorithm that uses modified k-means clustering. The method uses a k-weight and, where necessary, backtracking to cluster nodes with similar granularities into a few generalized classes, reducing both the number of nodes and the hierarchy length. In addition, the proposed approach can handle overfitting, which usually results from the unbalanced nature of large-scale hierarchical text (LSHT) datasets, where the features in each class vary widely. Experimental results on the 20NG, IPC, and DMOZ-small datasets using TD-LR and TD-SVM show that our approach improves large-scale hierarchical text classification performance over traditional and existing re-structuring approaches. In terms of scalability, our approach increases the number of scalable instances by about 10% and records the fastest running time.
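
To make the idea of grouping similar classes under generalized parent nodes concrete, the following is a minimal sketch only. It uses plain Lloyd's k-means, not the modified algorithm proposed in the paper: the k-weight and backtracking steps are not specified in the abstract and are therefore not implemented. The function names `kmeans` and `restructure`, the `group_` labels, and the two-dimensional toy vectors are all hypothetical and chosen purely for illustration.

```python
from math import dist


def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means (NOT the paper's modified variant).

    Uses a deterministic farthest-point initialisation so the toy
    example below is reproducible.
    """
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from its nearest chosen centroid.
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def restructure(class_vectors, k):
    """Group leaf classes into k generalized parent nodes (hypothetical helper)."""
    names = list(class_vectors)
    labels = kmeans([class_vectors[n] for n in names], k)
    hierarchy = {}
    for name, label in zip(names, labels):
        hierarchy.setdefault(f"group_{label}", []).append(name)
    return hierarchy


# Toy, made-up feature vectors for four 20NG class names (illustrative only).
toy_classes = {
    "comp.graphics": (0.9, 0.1),
    "comp.windows.x": (0.8, 0.2),
    "rec.autos": (0.1, 0.9),
    "rec.motorcycles": (0.2, 0.8),
}
```

Calling `restructure(toy_classes, 2)` places the two `comp.*` classes under one parent and the two `rec.*` classes under another, which shortens a flat list of four leaves into two generalized nodes. In practice the class vectors would come from the documents themselves (for example, averaged term weights), which is again an assumption rather than something stated in the abstract.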

Keywords: Hierarchical Classification; Hierarchy; Large-scale; Re-structuring; TD-SVM; TD-LR