Hebrew offensive language taxonomy and dataset

Chaya Liebeskind; Natalia Vanetik; Marina Litvak

doi:10.1515/lpp-2023-0017

Article

Published/Copyright: December 12, 2023

From the journal Lodz Papers in Pragmatics, Volume 19, Issue 2

Abstract

This paper introduces a streamlined taxonomy for categorizing offensive language in Hebrew, addressing a gap in the literature, which has until now focused largely on Indo-European languages. Our taxonomy divides offensive language into seven levels: six explicit and one implicit. It builds on the simplified offensive language (SOL) taxonomy introduced by Lewandowska-Tomaszczyk et al. (2021a), which we adjusted to Hebrew with the aim of reflecting the unique linguistic and cultural nuances of the language. The study goes beyond Natural Language Processing (NLP) to include both linguistic and cultural analysis; in particular, we employed manual linguistic analysis to understand the nuances of offensive language in Hebrew.

An accompanying dataset, gathered on Twitter and manually curated by human annotators, is described in detail. This dataset was constructed to both validate the taxonomy and serve as a foundation for future research on offensive language detection and analysis in Hebrew. Preliminary analysis of the dataset reveals intriguing patterns and distributions, underscoring the complexity and specificity of offensive expressions in the Hebrew language.

The aim of our work is to capture the complexity and specificity of offensive expressions in Hebrew beyond what automated NLP methods alone can provide. Our findings highlight the significance of considering linguistic and cultural variations when researching and correcting abusive language online. We believe that our streamlined taxonomy and associated dataset will be crucial in improving research in Hebrew language sociocultural studies, natural language processing, and offensive language detection. Our study also makes a substantial contribution to the study of low-resource languages and can be used as a model for future research on other languages.

Keywords: offensive language; low-resource languages; taxonomy; Hebrew offensive language dataset

About the authors

Chaya Liebeskind

Chaya Liebeskind is a lecturer and researcher in the Department of Computer Science at the Jerusalem College of Technology. Her research interests span both Natural Language Processing and data mining. In particular, her scientific interests include Semantic Similarity, Language Technology for Cultural Heritage, Morphologically rich languages (MRL), Multi-word Expressions (MWEs), Information Retrieval (IR), and Text Classification (TC). Much of her recent work has focused on analysing offensive language. She has published a variety of studies, and a few of her articles are under review or in preparation. She is a member of several international research actions funded by the EU.

Natalia Vanetik

Natalia Vanetik is a senior lecturer and researcher in the Department of Software Engineering at the Shamoon College of Engineering. Her research interests include Natural Language Processing, text mining, and optimization. Specifically, her research covers a diverse range of topics in NLP and machine learning, including social media analysis, job vacancy ranking, and the development of evaluation systems for summarization tasks. Her research also extends to graph theory applications in data mining and cross-lingual transfer learning.

Marina Litvak

Marina Litvak is a Senior Lecturer at Shamoon College of Engineering, Department of Software Engineering. Marina’s research focuses mainly on Multilingual Text Analysis, Social Networks, Knowledge Extraction from Text, and Summarization. Marina has published over 90 academic papers, including journal and top-level conference publications. She regularly serves on the program committees and editorial boards of multiple journals and conferences and collaborates on different research projects in Israel and abroad. She is a co-organizer of the MultiLing, FNP, Text2Story, and IACT workshops, co-located with top-level conferences.

References

Belkina, Anna C., Christopher O. Ciccolella, Rina Anno, Richard Halpert, Josef Spidlen & Jennifer E. Snyder-Cappione. 2019. Automated optimized parameters for t-distributed stochastic neighbor embedding improve visualization and analysis of large datasets. Nature Communications 10(1). 5415. https://doi.org/10.1038/s41467-019-13055-y

Bojanowski, Piotr, Edouard Grave, Armand Joulin & Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5. 135–146. https://doi.org/10.1162/tacl_a_00051

Bright, J. 2022. History under attack: Holocaust denial and distortion on social media. Supporting Data. United Nations Educational, Scientific and Cultural Organization (UNESCO), Paris, France, and the United Nations Department of Global Communications, United Nations, New York, USA.

Caselli, Tommaso, Valerio Basile, Jelena Mitrovic, Inga Kartoziya & Michael Granitzer. 2020. I feel offended, don’t be abusive! implicit/explicit messages in offensive and abusive language. In Proceedings of the twelfth language resources and evaluation conference, 6193–6202. The European Language Resources Association (ELRA), Marseille, France.

Chiril, Patricia, Farah Benamara, Véronique Moriceau, Marlene Coulomb-Gully & Abhishek Kumar. 2019. Multilingual and multitarget hate speech detection in tweets. In Conférence sur le traitement automatique des langues naturelles (TALN-PFIA 2019), 351–360. Toulouse, France, ATALA. https://doi.org/10.18653/v1/S19-2087

Cohen, Jacob. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20(1). 37–46. https://doi.org/10.1177/001316446002000104

Çöltekin, Çagrı. 2020. A corpus of Turkish offensive language on social media. In Proceedings of the twelfth language resources and evaluation conference, 6174–6184. The European Language Resources Association (ELRA), Marseille, France.

Davidson, Thomas, Dana Warmsley, Michael Macy & Ingmar Weber. 2017. Automated hate speech detection and the problem of offensive language. In Proceedings of the international AAAI conference on web and social media, vol. 11, 512–515. San Francisco, California USA, AAAI Press. https://doi.org/10.1609/icwsm.v11i1.14955

Fišer, Darja, Tomaž Erjavec & Nikola Ljubešic. 2017. Legal framework, dataset and annotation schema for socially unacceptable online discourse practices in Slovene. In Proceedings of the first workshop on abusive language online, 46–51. Long Beach, California, USA, Curran Associates, Inc. https://doi.org/10.18653/v1/W17-3007

Fortuna, Paula, Joao Rocha da Silva, Leo Wanner, Sérgio Nunes, et al. 2019. A hierarchically labeled Portuguese hate speech dataset. In Proceedings of the third workshop on abusive language online, 94–104. Florence, Italy, ACL. https://doi.org/10.18653/v1/W19-3510

Grice, Herbert Paul. 1990 [1975]. Logic and conversation. In Peter Cole and Jerry L. Morgan (eds.), Syntax and Semantics, Vol. 3, Speech acts, 41–58. New York: Academic Press. https://doi.org/10.1163/9789004368811_003

Hamad, Nagham, Mustafa Jarrar, Mohammad Khalilia & Nadim Nashif. 2023. Offensive Hebrew corpus and detection using BERT. arXiv preprint arXiv:2309.02724. https://doi.org/10.1109/AICCSA59173.2023.10479258

Haugh, Michael & Valeria Sinkeviciute. 2019. Offence and conflict talk. In Matthew Evans, Lesley Jeffries & Jim O'Driscoll (eds.), The Routledge handbook of language in conflict, 196–214. London: Routledge. https://doi.org/10.4324/9780429058011-12

Klie, Jan-Christoph, Michael Bugert, Beto Boullosa, Richard Eckart de Castilho & Iryna Gurevych. 2018. The inception platform: machine-assisted and knowledge-oriented interactive annotation. In Proceedings of the 27th international conference on computational linguistics: system demonstrations, 5–9. Santa Fe, New Mexico, USA, ACL.

Kogilavani, SV, S Malliga, KR Jaiabinaya, M. Malini & M. Manisha Kokila. 2023. Characterization and mechanical properties of offensive language taxonomy and detection techniques. Materials Today: Proceedings, vol. 81, part 2, 630–633. Elsevier. https://doi.org/10.1016/j.matpr.2021.04.102

Lakoff, George & Mark Johnson. 1980. Metaphors We Live By. Chicago: Chicago University Press.

Lewandowska-Tomaszczyk, Barbara. 2023. A simplified taxonomy of offensive language (SOL) for computational applications. Konin Language Studies 10(3). 213–227.

Lewandowska-Tomaszczyk, Barbara, Anna Bączkowska, Chaya Liebeskind, Giedre Valunaite Oleskeviciene & Slavko Žitnik. 2023. An integrated explicit and implicit offensive language taxonomy. Lodz Papers in Pragmatics 19(1). 7–48. https://doi.org/10.1515/lpp-2023-0002

Lewandowska-Tomaszczyk, Barbara, Slavko Žitnik, Anna Bączkowska, Chaya Liebeskind, Jelena Mitrovic & Giedre Valunaite Oleškeviciente. 2021a. Lod-connected offensive language ontology and tagset enrichment. In Shubert R. Carvalho and Renato R. Souza (eds.), Proceedings of the workshops and tutorials held at LDK 2021 co-located with the 3rd language, data and knowledge conference, vol. 3064, 135–150. CEUR Workshop Proceedings.

Lewandowska-Tomaszczyk, Barbara, Slavko Žitnik, Anna Bączkowska, Chaya Liebeskind, Jelena Mitrović & Giedrė Valūnaitė Oleškevičienė. 2021b. Lod-connected offensive language ontology and tagset enrichment. In CEUR workshop proceedings, vol. 3064.

Lewandowska-Tomaszczyk, Barbara, Slavko Žitnik, Chaya Liebeskind, Giedre Valunaite Oleskevicienė, Anna Bączkowska, Paul A. Wilson, Marcin Trojszczak et al. 2023. Annotation Scheme and Evaluation: The Case of OFFENSIVE Language. Rasprave: Časopis Instituta za hrvatski jezik i jezikoslovlje 49(1). 155–175. https://doi.org/10.31724/rihjj.49.1.8

Liebeskind, Chaya & Shmuel Liebeskind. 2018. Identifying abusive comments in Hebrew Facebook. In 2018 IEEE international conference on the science of electrical engineering in Israel (ICSEEL), 1–5. IEEE, Eilat, Israel. https://doi.org/10.1109/ICSEE.2018.8646190

Litvak, Marina, Natalia Vanetik, Chaya Liebeskind, Omar Hmdia & Rizek Abu Madeghem. 2022. Offensive language detection in Hebrew: can other languages help? In Proceedings of the thirteenth language resources and evaluation conference, 3715–3723. Marseille, France: The European Language Resources Association (ELRA).

Litvak, Marina, Natalia Vanetik, Yaser Nimer, Abdulrhman Skout & Israel Beer-Sheba. 2021. Offensive language detection in Semitic languages. In Multimodal hate speech workshop, vol. 2021, 7–12. Düsseldorf, Germany: ACL.

Liu, Ping, Wen Li & Liang Zou. 2019. NULI at SemEval-2019 task 6: Transfer learning for offensive language detection using bidirectional transformers. In Proceedings of the 13th international workshop on semantic evaluation, 87–91. Minneapolis, Minnesota, USA: ACL. https://doi.org/10.18653/v1/S19-2011

Mandl, Thomas, Sandip Modha, Anand Kumar M & Bharathi Raja Chakravarthi. 2020. Overview of the HASOC track at FIRE 2020: hate speech and offensive language identification in Tamil, Malayalam, Hindi, English and German. In Proceedings of the 12th annual meeting of the forum for information retrieval evaluation, 29–32. Hyderabad, India: Association for Computing Machinery (ACM). https://doi.org/10.1145/3441501.3441517

Mandl, Thomas, Sandip Modha, Prasenjit Majumder, Daksh Patel, Mohana Dave, Chintak Mandlia & Aditya Patel. 2019. Overview of the HASOC track at FIRE 2019: hate speech and offensive content identification in Indo-European languages. In Proceedings of the 11th annual meeting of the forum for information retrieval evaluation, 14–17. Hyderabad, India: Association for Computing Machinery (ACM). https://doi.org/10.1145/3368567.3368584

Mikolov, Tomas, Kai Chen, Greg Corrado & Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Mohaouchane, Hanane, Asmaa Mourhir & Nikola S Nikolov. 2019. Detecting offensive language on Arabic social media using deep learning. In 2019 sixth international conference on social networks analysis, management and security (SNAMS), 466–471. Granada, Spain: IEEE. https://doi.org/10.1109/SNAMS.2019.8931839

Pan-European anti-racism network. 2022. ENAR Shadow Report 2006. https://www.enareu.org/shadow-reports-on-racism-in-europe-203/.

Pitenis, Zeses, Marcos Zampieri & Tharindu Ranasinghe. 2020. Offensive language identification in Greek. arXiv preprint arXiv:2003.07459.

Poletto, Fabio, Marco Stranisci, Manuela Sanguinetti, Viviana Patti, Cristina Bosco, et al. 2017. Hate speech annotation: analysis of an Italian Twitter corpus. In CEUR workshop proceedings, vol. 2006, 1–6. Rome, Italy: CEUR-WS. https://doi.org/10.4000/books.aaccademia.2448

Ranasinghe, Tharindu, Marcos Zampieri & Hansi Hettiarachchi. 2019. Brums at HASOC 2019: deep learning models for multilingual hate speech and offensive language identification. In FIRE 2019 proceedings, 199–207. Kolkata, India: CEUR-WS.

Schütze, Hinrich & Jan O Pedersen. 1997. A cooccurrence-based thesaurus and two applications to information retrieval. Information Processing & Management 33(3). 307–318. https://doi.org/10.1016/S0306-4573(96)00068-4

Shlens, Jonathon. 2014. A tutorial on principal component analysis. arXiv preprint arXiv:1404.1100.

Sigurbergsson, Gudbjartur Ingi & Leon Derczynski. 2019. Offensive language and hate speech detection for Danish. arXiv preprint arXiv:1908.04531.

Smadja, Frank, Kathleen R McKeown & Vasileios Hatzivassiloglou. 1996. Translating collocations for bilingual lexicons: a statistical approach. Computational Linguistics 22(1). 1–38.

Technologies, Mindpool. 2023. Mindpool Technologies. Available at: http://www.mindpool.com (accessed 6 September 2023).

Tova Hartman. 2022. The challenges of multiculturalism in Israel’s shared society – opinion. Jerusalem Post. Available at: https://www.jpost.com/opinion/article-705192 (accessed 10 September 2023).

Tulkens, Stéphan, Lisa Hilte, Elise Lodewyckx, Ben Verhoeven & Walter Daelemans. 2016. A dictionary-based approach to racism detection in Dutch social media. arXiv preprint arXiv:1608.08738.

WALLA! TECH. 2022. Social media plays large role in fomenting online hate - report. Jerusalem Post. Available at: https://www.jpost.com/international/article712070 (accessed 6 September 2023).

Wine, M. 2016. National monitoring of hate crime in Europe: the case for a European level policy. In Jennifer Schweppe and Mark Austin Walters (eds.), The Globalization of Hate: Internationalizing Hate Crime?, 213–32. New York: Oxford University Press. https://doi.org/10.1093/acprof:oso/9780198785668.003.0014

Yasaswini, Konthala, Karthik Puranik, Adeep Hande, Ruba Priyadharshini, Sajeetha Thavareesan & Bharathi Raja Chakravarthi. 2021. IIITT@ DravidianLangTech-EACL2021: Transfer learning for offensive language detection in Dravidian languages. In Proceedings of the first workshop on speech and language technologies for Dravidian languages, 187–194. Online, ACL.

Zampieri, Marcos, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra & Ritesh Kumar. 2019a. Predicting the type and target of offensive posts in social media. arXiv preprint arXiv:1902.09666. https://doi.org/10.18653/v1/N19-1144

Zampieri, Marcos, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra & Ritesh Kumar. 2019b. SemEval-2019 task 6: identifying and categorizing offensive language in social media (OffensEval). arXiv preprint arXiv:1903.08983. https://doi.org/10.18653/v1/S19-2010

Published Online: 2023-12-12

Published in Print: 2023-12-15

https://doi.org/10.1515/lpp-2023-0017