Classifying offensive language in Arabic: a novel taxonomy and dataset

Chaya Liebeskind; Ali Afawi; Marina Litvak; Natalia Vanetik

doi:10.1515/lpp-2024-0034



Published/Copyright: December 10, 2024


From the journal Lodz Papers in Pragmatics Volume 20 Issue 2

Abstract

This paper presents a streamlined taxonomy for categorizing offensive language in Arabic, specifically Modern Standard Arabic (MSA) and the Levantine dialect. Addressing a gap in the existing literature, which has mainly focused on Indo-European languages, our taxonomy divides offensive language into seven levels (six explicit and one implicit). We adapted our framework from the simplified offensive language (SOL) taxonomy of Lewandowska-Tomaszczyk et al. (2021a), customizing it to reflect the unique linguistic and cultural nuances of Arabic. To validate this taxonomy, we created a new dataset from various social media platforms, primarily Twitter. This dataset was manually curated by human annotators and is described in detail within the paper, serving as both a validation tool for our taxonomy and a foundation for future research on offensive language detection in Arabic. Initial analysis of the dataset reveals complex patterns of offensive expressions in MSA and Levantine Arabic, underscoring the need to account for linguistic and cultural variations in studying online abuse. Our taxonomy and dataset are vital for advancing research in Arabic sociocultural studies, natural language processing, and linguistic analysis, and contribute to the study of low-resource languages.

Keywords: offensive language; Arabic; taxonomy; dataset

Corresponding author: Chaya Liebeskind, Department of Computer Science, Jerusalem College of Technology, 21 Havaad Haleumi St., P.O.B. 16031, Jerusalem, 9116001, Israel, E-mail: liebchaya@gmail.com

Funding source: Israel Innovation Authority

Award Identifier / Grant number: N/A

About the authors

Chaya Liebeskind

Chaya Liebeskind is a senior lecturer and researcher in the Department of Computer Science at the Jerusalem College of Technology. Her research interests span both Natural Language Processing and data mining. In particular, her research interests include Semantic Similarity, Language Technology for Cultural Heritage, Morphologically Rich Languages (MRL), Multi-word Expressions (MWEs), Information Retrieval (IR), and Text Classification (TC). Much of her recent work has focused on analysing offensive language. She has published a variety of studies, and a few of her articles are under review or in preparation. She is a member of several international research actions funded by the EU.

Ali Afawi

Ali Afawi is a student at Shamoon College of Engineering. He studies in the cyber track of the Department of Software Engineering. His research interests are cyber security and NLP applications for Semitic languages.

Marina Litvak

Marina Litvak is a Senior Lecturer at Shamoon College of Engineering, Department of Software Engineering. Marina’s research focuses mainly on Multilingual Text Analysis, Social Networks, Knowledge Extraction from Text, and Summarization. Marina has published over 90 academic papers, including journal and top-level conference publications. She regularly serves on the program committees and editorial boards of multiple journals and conferences and collaborates on different research projects in Israel and abroad. She is a co-organizer of the MultiLing, FNP, Text2Story, and IACT workshops, co-located with top-level conferences.

Natalia Vanetik

Natalia Vanetik is a senior lecturer and researcher in the Department of Software Engineering at the Shamoon College of Engineering. Her research interests include Natural Language Processing, text mining, and optimization. Specifically, her research covers a diverse range of topics in NLP and machine learning, including social media analysis, job vacancy ranking, and the development of evaluation systems for summarization tasks. Her research also extends to graph theory applications in data mining and cross-lingual transfer learning.

Acknowledgments

The authors express their appreciation to Mohamad Abu Jafar and Samer Abo Hasan for their valuable help with data collection and annotation. We also thank Yossef Haim Shrem for his valuable advice and help with taxonomy translation. The subject of this program is the development of a dataset and a language model for identifying offensive language in Hebrew and Arabic.

Research funding: This work was supported by the Israel Innovation Authority.

References

Abdelhakim, Mohamed, Bingquan Liu & Chengjie Sun. 2023. Ar-Pufi: A short-text dataset to identify the offensive messages towards public figures in the arabian community. Expert Systems with Applications 233. 120888. https://doi.org/10.1016/j.eswa.2023.120888.

Ahmad, Ashraf, Mohammad Azzeh, Eman Elnagi, Qasem Abu Al-Haija, Dana Halabi, Abdullah Aref & Yousef AbuHour. 2024. Hate speech detection in the Arabic language: Corpus design, construction and evaluation. Frontiers in Artificial Intelligence 7. 1345445. https://doi.org/10.3389/frai.2024.1345445.

Al Jazeera. N.d. Egypt news. https://www.aljazeera.com/where/egypt/ (Accessed 16 July 2024).

Alakrot, Azalden, Liam Murray & Nikola S. Nikolov. 2018. Dataset construction for the detection of anti-social behaviour in online communication in Arabic. Procedia Computer Science 142. 174–181. https://doi.org/10.1016/j.procs.2018.10.473.

Albadi, Nuha, Maram Kurdi & Shivakant Mishra. 2018. Are they our brothers? Analysis and detection of religious hate speech in the Arabic Twittersphere. In 2018 IEEE/ACM international conference on advances in social networks analysis and mining (asonam), 69–76. https://doi.org/10.1109/ASONAM.2018.8508247.

Alhazmi, Ali. 2023. Hate speech dataset for the Saudi dialect. Mendeley Data. Version V1.

Aljuhani, Khulood O., Khaled H. Alyoubi & Fahd S. Alotaibi. 2022. Detecting Arabic offensive language in microblogs using domain-specific word embeddings and deep learning. Tehnički glasnik 16(3). 394–400. https://doi.org/10.31803/tg-20220305120018.

Althobaiti, Maha Jarallah. 2022. Bert-based approach to Arabic hate speech and offensive language detection in Twitter: Exploiting emojis and sentiment analysis. International Journal of Advanced Computer Science and Applications 13(5). https://doi.org/10.14569/ijacsa.2022.01305109.

Aref, Abdullah, Rana Husni Al Mahmoud, Khaled Taha & Mahmoud Al-Sharif. 2020. Hate speech detection of Arabic shorttext. In 9th International conference on information technology convergence and services (ITCSE 2020), Vol. 10, 81–94. Computer Science & Information Technology. https://doi.org/10.5121/csit.2020.100507.

Barakat, Halim. 1993. The Arab world: Society, culture, and state. University of California Press. https://doi.org/10.1525/9780520914421.

Belkina, Anna C., Christopher O. Ciccolella, Rina Anno, Richard Halpert, Josef Spidlen & Jennifer E. Snyder-Cappione. 2019. Automated optimized parameters for t-distributed stochastic neighbor embedding improve visualization and analysis of large datasets. Nature Communications 10(1). 5415. https://doi.org/10.1038/s41467-019-13055-y.

Boucherit, Oussama & Kheireddine Abainia. 2021. Offensive language detection in under-resourced algerian dialectal Arabic language. In International conference on big data, machine learning, and applications, 639–647. https://doi.org/10.1007/978-981-99-3481-2_49.

Caselli, Tommaso, Valerio Basile, Jelena Mitrovic, Inga Kartoziya & Michael Granitzer. 2020. I feel offended, don’t be abusive! Implicit/explicit messages in offensive and abusive language. In Proceedings of the twelfth language resources and evaluation conference, 6193–6202. Marseille, France: The European Language Resources Association (ELRA).

Chowdhury, Shammur Absar, Hamdy Mubarak, Ahmed Abdelali, Soon-gyo Jung, Bernard J. Jansen & Joni Salminen. 2020. A multi-platform Arabic news comment dataset for offensive language detection. In Proceedings of the twelfth language resources and evaluation conference, 6203–6212.

Cohen, Jacob. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20(1). 37–46. https://doi.org/10.1177/001316446002000104.

Conneau, Alexis, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer & Veselin Stoyanov. 2020. XLM-RoBERTa. Available at: https://huggingface.co/xlm-roberta.

Grice, H. Paul. 1990a. Logic and conversation. 1975. In A. P. Martinich (ed.), The philosophy of language, 67–87. Oxford: Oxford University Press.

Grice, Herbert Paul. 1990b. Logic and conversation. 1975. In Speech acts, 41–58. Leiden, Netherlands: Brill.

Haddad, Hatem, Hala Mulki & Asma Oueslati. 2019. T-Hsab: A tunisian hate speech and abusive dataset. In International conference on Arabic language processing, 251–263. https://doi.org/10.1007/978-3-030-32959-4_18.

Haddad, Bushr, Zoher Orabe, Anas Al-Abood & Nada Ghneim. 2020. Arabic offensive language detection with attention-based deep neural networks. In Proceedings of the 4th workshop on open-source Arabic corpora and processing tools, with a shared task on offensive language detection, 76–81.

Haugh, Michael & Valeria Sinkeviciute. 2019. Offence and conflict talk. In Matthew Evans, Lesley Jeffries & Jim O’Driscoll (eds.), The Routledge handbook of language in conflict, 196–214. Routledge. https://doi.org/10.4324/9780429058011-12.

Husain, Fatemah. 2020. Arabic offensive language detection using machine learning and ensemble machine-learning approaches. arXiv preprint arXiv:2005.08946.

Husain, Fatemah & Ozlem Uzuner. 2021. Transfer learning approach for Arabic offensive language detection system–bert-based model. arXiv preprint arXiv:2102.05708.

Husain, Fatemah & Ozlem Uzuner. 2022. Transfer learning across Arabic dialects for offensive language detection. In 2022 International conference on asian language processing (IALP), 196–205. https://doi.org/10.1109/IALP57159.2022.9961263.

Husain, Fatemah, Jooyeon Lee, Samuel Henry & Ozlem Uzuner. 2020. Salamnet at semeval-2020 task12: Deep learning approach for Arabic offensive language detection. arXiv preprint arXiv:2007.13974. https://doi.org/10.18653/v1/2020.semeval-1.283.

Inoue, Go, Bashar Alhafni, Nurpeiis Baimukan, Houda Bouamor & Nizar Habash. 2021. The interplay of variant, size, and task type in Arabic pre-trained language models. In Proceedings of the sixth Arabic natural language processing workshop. Kyiv, Ukraine (On-line): Association for Computational Linguistics.

Khairy, Marwa, Tarek M. Mahmoud, Ahmed Omar & Tarek Abd El-Hafeez. 2023. Comparative performance of ensemble machine learning for Arabic cyberbullying and offensive language detection. Language Resources and Evaluation 58. 1–18. https://doi.org/10.1007/s10579-023-09683-y.

Kogilavani, S. V., S. Malliga, K. R. Jaiabinaya, M. Malini & M. Manisha Kokila. 2021. Characterization and mechanical properties of offensive language taxonomy and detection techniques. Materials Today: Proceedings 81. 630–633. https://doi.org/10.1016/j.matpr.2021.04.102.

Lewandowska-Tomaszczyk, Barbara. 2023. A simplified taxonomy of offensive language (sol) for computational applications. Konin Language Studies 10(3). 213–227.

Lewandowska-Tomaszczyk, Barbara, Slavko Žitnik, Anna Bączkowska, Chaya Liebeskind, Jelena Mitrovic & Giedre Valunaite Oleškeviciente. 2021a. Lod-connected offensive language ontology and tagset enrichment. In Shubert R. Carvalho & Renato R. Souza (eds.), Proceedings of the workshops and tutorials held at ldk 2021 co-located with the 3rd language, data and knowledge conference, Vol. 3064, 135–150. CEUR Workshop Proceedings.

Lewandowska-Tomaszczyk, Barbara, Slavko Žitnik, Anna Bączkowska, Chaya Liebeskind, Jelena Mitrović & Giedrė Valūnaitė Oleškevičienė. 2021b. Lod-connected offensive language ontology and tagset enrichment. In CEUR workshop proceedings, Vol. 3064.

Lewandowska-Tomaszczyk, Barbara, Anna Bączkowska, Chaya Liebeskind, Giedre Valunaite Oleskeviciene & Slavko Žitnik. 2023a. An integrated explicit and implicit offensive language taxonomy. Lodz Papers in Pragmatics 19(1). 7–48. https://doi.org/10.1515/lpp-2023-0002.

Lewandowska-Tomaszczyk, Barbara, Slavko Žitnik, Chaya Liebeskind, Giedrė Valūnaitė Oleškevičienė, Anna Bączkowska, Paul A. Wilson, Marcin Trojszczak, Ivana Brač, Lobel Filipić, Ana Ostroški Anić & Olga Dontcheva-Navratilova. 2023b. Annotation scheme and evaluation: The case of offensive language. Rasprave: Časopis Instituta za hrvatski jezik i jezikoslovlje 49(1). 155–175. https://doi.org/10.31724/rihjj.49.1.8.

Liebeskind, Chaya, Natalia Vanetik & Marina Litvak. 2023. Hebrew offensive language taxonomy and dataset. Lodz Papers in Pragmatics 19(2). 325–351. https://doi.org/10.1515/lpp-2023-0017.

Litvak, Marina, Natalia Vanetik, Yaser Nimer & Abdulrhman Skout. 2021. Offensive language detection in Semitic languages. In Multimodal hate speech workshop, Vol. 2021, 7–12. Düsseldorf, Germany: ACL.

Mikolov, Tomas, Kai Chen, Greg Corrado & Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Mubarak, Hamdy, Kareem Darwish & Walid Magdy. 2017. Abusive language detection on Arabic social media. In Proceedings of the first workshop on abusive language online, 52–56. https://doi.org/10.18653/v1/W17-3008.

Mueller, Andreas. 2017. WordCloud: A little word cloud generator in Python. Available at: https://github.com/amueller/word_cloud.

Mulki, Hala & Bilal Ghanem. 2021. Let-mi: An Arabic levantine twitter dataset for misogynistic language. arXiv preprint arXiv:2103.10195.

Mulki, Hala, Hatem Haddad, Chedi Bechikh Ali & Halima Alshabani. 2019. L-Hsab: A levantine twitter dataset for hate speech and abusive language. In Proceedings of the third workshop on abusive language online, 111–118. https://doi.org/10.18653/v1/W19-3512.

OpenAI. 2023. ChatGPT: Generative pre-trained transformer. https://www.openai.com/chatgpt (Accessed 18 July 2024).

Ousidhoum, Nedjma, Zizheng Lin, Hongming Zhang, Yangqiu Song & Dit-Yan Yeung. 2019. Multilingual and multi-aspect hate speech analysis. arXiv preprint arXiv:1908.11049. https://doi.org/10.18653/v1/D19-1474.

Pan-European anti-racism network. 2022. ENAR shadow report 2006. Available at: https://www.enareu.org/shadow-reports-on-racism-in-europe-203/.

Shannaq, Fatima, Bassam Hammo, Hossam Faris & Pedro A. Castillo-Valdivieso. 2022. Offensive language detection in Arabic social networks using evolutionary-based classifiers learned from fine-tuned embeddings. IEEE Access 10. 75018–75039. https://doi.org/10.1109/access.2022.3190960.

Twitter. 2022. Twitter developer policy. https://developer.twitter.com/en/developer-terms/policy (Accessed 1 March 2024).

Wine, Michael. 2016. National monitoring of hate crime in Europe: The case for a European level policy. In Jennifer Schweppe & Mark Austin Walters (eds.), The globalization of hate: Internationalizing hate crime? 213–232. Oxford University Press. https://doi.org/10.1093/acprof:oso/9780198785668.003.0014.

Zampieri, Marcos, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra & Ritesh Kumar. 2019a. Predicting the type and target of offensive posts in social media. arXiv preprint arXiv:1902.09666. https://doi.org/10.18653/v1/N19-1144.

Zampieri, Marcos, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra & Ritesh Kumar. 2019b. Semeval-2019 task 6: Identifying and categorizing offensive language in social media (offenseval). arXiv preprint arXiv:1903.08983. https://doi.org/10.18653/v1/S19-2010.

Zampieri, Marcos, Preslav Nakov, Sara Rosenthal, Pepa Atanasova, Georgi Karadzhov, Hamdy Mubarak, Leon Derczynski, Zeses Pitenis & Çağrı Çöltekin. 2020. Semeval-2020 task 12: Multilingual offensive language identification in social media (offenseval 2020). arXiv preprint arXiv:2006.07235. https://doi.org/10.18653/v1/2020.semeval-1.188.

Zerrouki, Taha. 2023. Arabic stop words. Version 0.4.3. Available at: https://github.com/linuxscout/arabicstopwords.

Received: 2024-08-28

Accepted: 2024-11-18

Published Online: 2024-12-10

Published in Print: 2024-12-17
