An innovative automatic indexing method for Arabic text

Ramzi A. Haraty, Sanaa Kaddoura, Sultan Al Jahdali, Nour K. Masri


Automatic indexing and text retrieval methods have been studied for a long time. Automatic indexing is the process of extracting words from a document in order to classify documents by subject and to improve information retrieval. Compared to other languages, research on automated Arabic text categorization remains limited. In this work, we present an innovative method that improves the accuracy of automatic indexing of Arabic texts by introducing and integrating a thesaurus. Our model extracts new relevant words by referring to the created thesaurus, which records words, their synonyms, and the correlations between them. The thesaurus is built using a natural language toolkit whose WordNet library lists the synonyms of a given word. Words that have the same meaning and frequently appear together are grouped under one umbrella term in a JavaScript Object Notation (JSON) dictionary, making it easy to identify the topic of a text. Our results show a notable improvement in accuracy and efficiency compared to previous work.


Arabic Text, Automatic Indexing, Building Thesaurus, Frequent Sets, JSON Dictionary, Synonyms
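
As a rough illustration of the idea summarized in the abstract, the following minimal Python sketch groups related words under umbrella terms in a JSON dictionary and uses that dictionary to guess the topic of a short text. It is not the authors' implementation: the thesaurus entries, the function names `build_lookup` and `index_text`, and the sample text are all hypothetical, the thesaurus is written by hand rather than derived from WordNet or frequent sets, and no Arabic stemming or normalization is applied.

```python
import json
from collections import Counter

# Hypothetical toy thesaurus: each umbrella term maps to related words.
# In the paper the groups come from synonyms and frequent sets; here
# they are hard-coded purely for illustration.
THESAURUS_JSON = """
{
  "اقتصاد": ["اقتصاد", "مال", "سوق", "تجارة"],
  "رياضة": ["رياضة", "مباراة", "فريق", "لاعب"]
}
"""


def build_lookup(thesaurus):
    """Invert umbrella -> words into a word -> umbrella mapping."""
    return {
        word: umbrella
        for umbrella, words in thesaurus.items()
        for word in words
    }


def index_text(text, lookup):
    """Count how many whitespace-separated tokens fall under each umbrella."""
    counts = Counter()
    for token in text.split():
        umbrella = lookup.get(token)
        if umbrella is not None:
            counts[umbrella] += 1
    return counts


thesaurus = json.loads(THESAURUS_JSON)
lookup = build_lookup(thesaurus)

# Hypothetical sample: three sports-related words and one economy-related word.
sample = "فريق لاعب سوق مباراة"
scores = index_text(sample, lookup)

# The umbrella term with the most matching tokens is taken as the topic.
topic = scores.most_common(1)[0][0]
print(topic)
```

Inverting the dictionary once keeps each token lookup constant-time. A realistic pipeline would also need tokenization and stemming suited to Arabic, since exact string matching misses inflected and prefixed forms.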

