An innovative automatic indexing method for Arabic text

Ramzi A. Haraty, Sanaa Kaddoura, Sultan Al Jahdali, Nour K. Masri

Abstract


Automatic indexing and text retrieval methods have been studied for many languages over a long period. Automatic indexing is the process of extracting words from a document in order to classify documents by subject and to improve information retrieval. Compared to other languages, research on automated Arabic text categorization remains limited. In this work, we present a method that improves the accuracy of automatic indexing of Arabic texts by introducing and integrating a thesaurus. Our model extracts new relevant words by referring to the created thesaurus, which stores words, their synonyms, and their correlations. The thesaurus is built using a natural language toolkit whose library lists the synonyms of a given word available in WordNet. Words that share a meaning and frequently appear together are grouped under one umbrella term in a JavaScript Object Notation (JSON) dictionary, making it easier to identify the topic of a text. Our results show a notable improvement in accuracy and efficiency compared to previous works.
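
The grouping step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the synonym table `TOY_SYNONYMS`, the helper names `build_thesaurus` and `index_terms`, and the sample tokens are all hypothetical, and the hand-written table stands in for a WordNet synonym lookup. The sketch covers only the synonym grouping and the JSON serialization. It does not model the frequent co-occurrence analysis or any Arabic stemming.

```python
import json
from collections import Counter

# Hypothetical toy data standing in for a WordNet synonym lookup.
# Each umbrella term maps to the words grouped under it.
TOY_SYNONYMS = {
    "اقتصاد": ["مال", "تجارة"],   # economy: money, trade
    "رياضة": ["كرة", "مباراة"],   # sport: ball, match
}


def build_thesaurus(synonyms):
    """Map every word (umbrella term and grouped words) to its umbrella term."""
    thesaurus = {}
    for head, related in synonyms.items():
        thesaurus[head] = head
        for word in related:
            thesaurus[word] = head
    return thesaurus


def index_terms(tokens, thesaurus, top_n=1):
    """Count umbrella terms for tokens found in the thesaurus.

    Returns the top_n most frequent umbrella terms as candidate index terms.
    """
    counts = Counter(thesaurus[t] for t in tokens if t in thesaurus)
    return [term for term, _ in counts.most_common(top_n)]


# Serialize the thesaurus as a JSON dictionary, keeping Arabic text readable.
thesaurus = build_thesaurus(TOY_SYNONYMS)
thesaurus_json = json.dumps(thesaurus, ensure_ascii=False, indent=2)

# Index a toy token list using the thesaurus loaded back from JSON.
tokens = "مباراة كرة مال مباراة".split()
topics = index_terms(tokens, json.loads(thesaurus_json))
```

Here three of the four tokens fall under the sport umbrella term, so that term is selected as the topic even though it never appears in the token list itself. Storing the flattened word-to-umbrella mapping keeps each lookup to a single dictionary access.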


Keywords


Arabic Text, Automatic Indexing, Building Thesaurus, Frequent Sets, JSON Dictionary, Synonyms


Journal of Advances in Computing and Engineering (ACE), Volume 3, Issue 1, June 2023, ISSN 2735-5985




DOI: http://dx.doi.org/10.21622/ACE.2023.03.1.001


Copyright (c) 2023 Ramzi A. Haraty, Sanaa Kaddoura, Sultan Al Jahdali, Nour K. Masri

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.


Advances in Computing and Engineering
E-ISSN: 2735-5985
P-ISSN: 2735-5977

Published by:

Academy Publishing Center (APC)
Arab Academy for Science, Technology and Maritime Transport (AASTMT)
Alexandria, Egypt
ace@aast.edu