Building natural language processing tools for Runyakitara

  • Fridah Katushemererwe, Andrew Caines and Paula Buttery
Published/Copyright: 2 December 2021

Abstract

This paper describes an endeavour to build natural language processing (NLP) tools for Runyakitara, a group of four closely related Bantu languages spoken in western Uganda. In contrast with major world languages such as English, for which corpora are comparatively abundant and NLP tools are well developed, computational linguistic resources for Runyakitara are in short supply. First, therefore, we need to collect corpora for these languages before we can proceed to the design of a spell-checker, a grammar-checker and applications for computer-assisted language learning (CALL). We explain how we are collecting primary data for a new Runya Corpus of speech and writing, outline the design of a morphological analyser, and discuss how we can use these new resources to build NLP tools. We are initially working with Runyankore–Rukiga, a closely related pair of Runyakitara languages, and we frame our project in the context of NLP for low-resource languages, as well as CALL for the preservation of endangered languages. We put our project forward as a test case for the revitalization of endangered languages through education and technology.


Corresponding author: Fridah Katushemererwe, Department of Linguistics, Makerere University, Kampala, Uganda, E-mail:

Funding source: The research that resulted in this article was funded by the Cambridge-Africa Programme for Research Excellence (CAPREx) and the ALBORADA Fund, UK.

Appendix A: Expanded graded intergenerational disruption scale (EGIDS), from Lewis and Simons (2010)

Level | Label | Description | UNESCO
0 | International | The language is used internationally for a broad range of functions. | Safe
1 | National | The language is used in education, work, mass media, government at the nationwide level. | Safe
2 | Regional | The language is used for local and regional mass media and governmental services. | Safe
3 | Trade | The language is used for local and regional work by both insiders and outsiders. | Safe
4 | Educational | Literacy in the language is being transmitted through a system of public education. | Safe
5 | Written | The language is used orally by all generations and is effectively used in written form in parts of the community. | Safe
6a | Vigorous | The language is used orally by all generations and is being learned by children as their first language. | Safe
6b | Threatened | The language is used orally by all generations but only some of the child-bearing generation are transmitting it to their children. | Vulnerable
7 | Shifting | The child-bearing generation knows the language well enough to use it among themselves but none are transmitting it to their children. | Definitely endangered
8a | Moribund | The only remaining active speakers of the language are members of the grandparent generation. | Severely endangered
8b | Nearly extinct | The only remaining speakers of the language are members of the grandparent generation or older who have little opportunity to use the language. | Critically endangered
9 | Dormant | The language serves as a reminder of heritage identity for an ethnic community. No one has more than symbolic proficiency. | Extinct
10 | Extinct | No one retains a sense of ethnic identity associated with the language, even for symbolic purposes. | Extinct

References

Abidi, Syed. 1989. Modern communication and national identity: An issue in East African context. In Jude J. Ongong’a & Kenneth R. Gray (eds.), Bottlenecks to national identity: Ethnic cooperation towards nation building. Nairobi: Professor World Peace Academy of Kenya.

Abney, Steven & Steven Bird. 2010. The human language project: Building a universal corpus of the world’s languages. In Proceedings of the 48th annual meeting of the association for computational linguistics, 88–97. Uppsala, Sweden.

Agic, Zeljko, Dirk Hovy & Anders Søgaard. 2015. If all you have is a bit of the Bible: Learning POS taggers for truly low-resource languages. In Proceedings of the 53rd annual meeting of the association for computational linguistics, 268–272. Beijing, China. https://doi.org/10.3115/v1/P15-2044.

Allwood, Jens, Harald Hammarström, Andries Hendrikse, Mtholeni N. Ngcobo, Nozibele Nomdebevana, Laurette Pretorius & Mac van der Merwe. 2010. Work on spoken (multimodal) language corpora in South Africa. In Proceedings of the seventh international conference on language resources and evaluation, 885–889. Valletta, Malta.

Barlow, Michael. 1996. Corpora for theory and practice. International Journal of Corpus Linguistics 1(1). 1–37. https://doi.org/10.1075/ijcl.1.1.03bar.

Bernsten, Jan. 1998. Runyakitara: Uganda’s ‘new’ language. Journal of Multilingual and Multicultural Development 19(2). 93–107. https://doi.org/10.1080/01434639808666345.

Crystal, David. 2003. English as a global language, 2nd edn. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511486999.

Emerson, Guy, Liling Tan, Susanne Fertmann, Alexis Palmer & Michaela Regneri. 2014. Seedling: Building and using a seed corpus for the human language project. In Proceedings of the 2014 workshop on the use of computational methods in the study of endangered languages, 77–85. Baltimore, MD. https://doi.org/10.3115/v1/W14-2211.

Fishman, Joshua. 2000. The status agenda in corpus planning. In Richard D. Lambert & Elana Shohamy (eds.), Language policy and pedagogy, 43–52. Philadelphia: John Benjamins. https://doi.org/10.1075/z.96.03fis.

Hale, John. 2001. A probabilistic Earley parser as a psycholinguistic model. In Proceedings of the second meeting of the North American chapter of the association for computational linguistics (NAACL). https://doi.org/10.3115/1073336.1073357.

Hugo, Russell. 2015. Endangered languages, technology and learning: Immediate applications and long-term considerations. In Mari Jones (ed.), Endangered languages and new technologies, 95–112. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9781107279063.009.

Jurafsky, Daniel & James H. Martin. 2009. Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition, 2nd edn. Upper Saddle River, NJ: Prentice Hall.

Kateeba, Connie. 2009. The thematic curriculum: Implications for mother tongue education in Uganda. Kyambogo, Uganda: National Curriculum Development Centre.

Katushemererwe, Fridah. 2013. Computational morphology and Bantu language learning: An implementation for Runyakitara. University of Groningen PhD thesis, Groningen, The Netherlands.

Katushemererwe, Fridah & Thomas Hanneforth. 2010. Fsm2 and the morphological analysis of Bantu nouns – first experiences from Runyakitara. International Journal of Computing and ICT Research 4(1). 58–69.

Katushemererwe, Fridah & John Nerbonne. 2015. Computer-assisted language learning (CALL) in support of (re-)learning native languages: The case of Runyakitara. Computer Assisted Language Learning 28(2). 112–129. https://doi.org/10.1080/09588221.2013.792842.

Katushemererwe, Fridah & Rehema Baguma. 2012. RUMORPH: The morphological analyzer of Runyakitara: Approach, results and issues. In Proceedings of the 8th annual international conference on computing & ICT research, 269–294. Kampala, Uganda. http://cit.mak.ac.ug/iccir/?p=iccir_12 (accessed 15 December 2016).

Katushemererwe, Fridah & Arvi Hurskainen. 2011. Intelligent language learning model: Implementation on Runyakitara. In Proceedings of the 7th annual international conference on computing & ICT research, 426–444. Kampala, Uganda. http://cit.mak.ac.ug/iccir/?p=iccir_11 (accessed 15 December 2016).

Kornai, Andras (ed.). 1999. Extended finite state models of language. Cambridge: Cambridge University Press.

Krauss, Michael. 1992. The world’s languages in crisis. Language 68(1). 4–10. https://doi.org/10.1353/lan.1992.0075.

Leech, Geoffrey. 2000. Grammars of spoken English: New outcomes of corpus-oriented research. Language Learning 50(4). 675–724. https://doi.org/10.1111/0023-8333.00143.

Levy, Roger. 2008. Expectation-based syntactic comprehension. Cognition 106(3). 1126–1177. https://doi.org/10.1016/j.cognition.2007.05.006.

Lewis, M. Paul & Gary F. Simons. 2010. Assessing endangerment: Expanding Fishman’s GIDS. Revue Roumaine de Linguistique 55(2). 103–120. https://doi.org/10.7202/602504ar.

Lewis, M. Paul, Gary F. Simons & Charles D. Fennig (eds.). 2015. Ethnologue: Languages of the world, 18th edn. Dallas: SIL International. http://www.ethnologue.com (accessed 15 December 2016).

Manning, Christopher & Hinrich Schütze. 1999. Foundations of statistical natural language processing. Cambridge, MA: MIT Press.

McCarthy, Michael (ed.). 2016. The Cambridge guide to blended learning for language teaching. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781009024754.

Meurers, Detmar. 2012. Natural language processing and language learning. In Carol A. Chapelle (ed.), Encyclopedia of applied linguistics. Oxford: Blackwell. https://doi.org/10.1002/9781405198431.wbeal0858.

Moseley, Christopher (ed.). 2010. Atlas of the world’s languages in danger, 3rd edn. Paris: UNESCO Publishing. http://www.unesco.org/culture/en/endangeredlanguages/atlas (accessed 15 December 2016).

Muranga, Manuel. 2009. ‘What about our mother tongues? Linguistic patriotism and non-patriotism in Uganda: Some observations, reflections and recommendations’. Inaugural professorial lecture. Kampala: Makerere University.

Nagata, Noriko. 2009. Robo-Sensei. CALICO Journal 26(3). 562–579. https://doi.org/10.1558/cj.v26i3.562-579.

Namyalo, Saudah & Judith Nakayiza. 2014. Dilemmas in implementing language rights in multilingual Uganda. Current Issues in Language Planning 16(4). 409–424. https://doi.org/10.1080/14664208.2014.987425.

Ndoleriire, Oswald & Celestino Oriikiriza. 1990. Runyakitara studies. Unpublished manuscript. Uganda: Makerere University.

Ng, Hwee Tou, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy Susanto & Christopher Bryant. 2014. The CoNLL-2014 shared task on grammatical error detection. In Proceedings of the 18th conference on computational natural language learning, 1–14. Baltimore, MD. https://doi.org/10.3115/v1/W14-1701.

Nicholls, Diane. 2003. The Cambridge learner corpus: Error coding and analysis for lexicography and ELT. In Proceedings of the corpus linguistics 2003 conference, 572–581. UCREL technical paper number 16: Lancaster University.

O’Keeffe, Anne, Michael McCarthy & Ronald Carter. 2007. From corpus to classroom: Language use and language teaching. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511497650.

Ojijo, Pascal. 2012. Review of education policy in Uganda. Paper submitted to young leaders’ think tank on policy alternatives in Uganda. http://www.slideshare.net/ojijop/review-of-education-policy-in-uganda (accessed 15 December 2016).

Paulston, Christina. 1994. Linguistic minorities in multilingual settings: Implications for language policies. Amsterdam: John Benjamins. https://doi.org/10.1075/sibil.4.

Rice, Andrew, Paula Buttery, Idris A. Rai & Alastair Beresford. 2009. Language learning on a next-generation service platform for Africa. Africa perspective on the role of mobile technologies in fostering social and economic development. Maputo, Mozambique: Worldwide Web Consortium Workshop.

Rubongoya, L. T. 1999. A modern Runyoro-Rutooro grammar. Cologne: Rüdiger Köppe Verlag.

Shaalan, Khaled. 2005. An intelligent computer-assisted language learning system for Arabic learners. Computer Assisted Language Learning 18(1-2). 81–108. https://doi.org/10.1080/09588220500132399.

Smith, Ray. 2007. An overview of the Tesseract OCR engine. In Proceedings of the 9th IEEE international conference on document analysis and recognition, 629–633. Parana, Brazil. https://doi.org/10.1109/ICDAR.2007.4376991.

Taylor, Charles. 1985. Nkore-Kiga. London: Croom Helm.

Ward, Monica. 2004. The additional uses of CALL in the endangered language context. ReCALL 16(2). 345–359. https://doi.org/10.1017/s0958344004000722.

Ward, Monica & Joseph van Genabith. 2003. CALL for endangered languages: Challenges and rewards. CALL Journal 16(2-3). 233–258. https://doi.org/10.1076/call.16.2.233.15885.

Yannakoudakis, Helen, Ted Briscoe & Ben Medlock. 2011. A new dataset and method for automatically grading ESOL text. In Proceedings of the 49th annual meeting of the association for computational linguistics, 180–189. Portland, Oregon.

Yngve, Victor. 1960. A model and an hypothesis for language structure. Proceedings of the American Philosophical Society 104(5). 444–466.

Published Online: 2021-12-02
Published in Print: 2021-11-25

© 2020 Walter de Gruyter GmbH, Berlin/Boston

Downloaded on 24 September 2025 from https://www.degruyterbrill.com/document/doi/10.1515/applirev-2020-2004/html