The growth of online communities has produced a large and increasing volume of multilingual digital data on the Internet, making language processing and information retrieval increasingly important fields. Stemming, a crucial preprocessing step in natural language processing and information retrieval, has been extensively explored for high-resource languages such as English, German, and French. However, stemming for Hausa, an international language that is widely spoken in West Africa and one of the fastest-growing languages globally, has received far less attention and requires more extensive study.
This paper presents a rule-based model for stemming Hausa words. The proposed model relies on a set of rules derived from an analysis of Hausa word morphology and from the rules for extracting stem forms. The rules take into account syntactic constraints, e.g., affixation rules, and perform a morphological analysis of properties of the Hausa language such as word formation and distribution.
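To make the idea of rule-based affix stripping concrete, the following is a minimal sketch and not the rule set proposed in this paper. The affix lists, the minimum stem length, and the function name `stem` are all assumptions chosen for illustration; a real Hausa stemmer would need a far larger, linguistically validated rule inventory.

```python
from typing import List

# Illustrative placeholder rules only (assumptions, NOT the paper's rule set).
# Suffixes here model a genitive linker plus a possessive pronoun,
# e.g. "gidansa" ("his house") built on "gida" ("house").
SUFFIX_RULES: List[str] = ["nsa", "rsa", "nta", "rta"]
# A single placeholder prefix, included to show that prefixes are handled too.
PREFIX_RULES: List[str] = ["ma"]
# Assumed constraint: never strip an affix if fewer than 3 letters would remain.
MIN_STEM_LEN = 3


def stem(word: str) -> str:
    """Strip at most one suffix and then at most one prefix from `word`."""
    w = word.lower()

    # Try longer suffixes first so the most specific rule wins.
    for suffix in sorted(SUFFIX_RULES, key=len, reverse=True):
        if w.endswith(suffix) and len(w) - len(suffix) >= MIN_STEM_LEN:
            w = w[: -len(suffix)]
            break

    # Apply the same length constraint to prefixes.
    for prefix in sorted(PREFIX_RULES, key=len, reverse=True):
        if w.startswith(prefix) and len(w) - len(prefix) >= MIN_STEM_LEN:
            w = w[len(prefix):]
            break

    return w


if __name__ == "__main__":
    for token in ["gidansa", "motarsa", "mota"]:
        print(token, "->", stem(token))
```

The length constraint is one simple way to limit over-stemming of short words. Naive rules like these can still over-stem longer words that happen to begin or end with an affix-like string, which is why rule-based stemmers are usually paired with exception lists or additional morphological constraints.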
The proposed model’s performance is evaluated against existing models using standard evaluation metrics. The evaluation follows Sirstat’s approach, with a language expert assessing the system’s results. It is based on manual annotation of a set of 5,077 total words used in the algorithm, including 2,630 unique words and 3,766 correctly stemmed Hausa words. The model achieves an overall accuracy of 98.8%, demonstrating its suitability for use in applications such as natural language processing and information retrieval.
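As a rough illustration of accuracy measured against expert annotation, the sketch below computes the percentage of words whose predicted stem matches a manually annotated stem. It is a generic accuracy computation under stated assumptions, not a reproduction of the evaluation procedure used in this paper; the function name `stemming_accuracy`, the toy stemmer, and the three-word gold sample are hypothetical.

```python
from typing import Callable, Dict


def stemming_accuracy(gold: Dict[str, str], stemmer: Callable[[str], str]) -> float:
    """Return the percentage of words whose predicted stem equals the gold stem."""
    if not gold:
        raise ValueError("gold annotation set must not be empty")
    correct = sum(1 for word, gold_stem in gold.items() if stemmer(word) == gold_stem)
    return 100.0 * correct / len(gold)


def toy_stemmer(word: str) -> str:
    """Hypothetical stemmer that only knows one suffix rule ("nsa")."""
    return word[:-3] if word.endswith("nsa") and len(word) > 5 else word


# Hypothetical expert-annotated sample: word -> expected stem.
gold_sample = {"gidansa": "gida", "motarsa": "mota", "mota": "mota"}

accuracy = stemming_accuracy(gold_sample, toy_stemmer)
print(f"accuracy: {accuracy:.1f}%")
```

In this toy sample the single-rule stemmer misses "motarsa", so two of the three words are stemmed correctly. The same function can be applied to any stemmer and any annotated word list.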