Building natural language processing tools for Runyakitara

Fridah Katushemererwe; Andrew Caines; Paula Buttery

doi:10.1515/applirev-2020-2004

Published/Copyright: December 2, 2021

From the journal Applied Linguistics Review Volume 12 Issue 4

Abstract

This paper describes an endeavour to build natural language processing (NLP) tools for Runyakitara, a group of four closely related Bantu languages spoken in western Uganda. In contrast with major world languages such as English, for which corpora are comparatively abundant and NLP tools are well developed, computational linguistic resources for Runyakitara are in short supply. We therefore first need to collect corpora for these languages before we can proceed to the design of a spell-checker, a grammar-checker and applications for computer-assisted language learning (CALL). We explain how we are collecting primary data for a new Runya Corpus of speech and writing, outline the design of a morphological analyser, and discuss how we can use these new resources to build NLP tools. We are initially working with Runyankore–Rukiga, a closely related pair of Runyakitara languages, and we frame our project in the context of NLP for low-resource languages, as well as CALL for the preservation of endangered languages. We put our project forward as a test case for the revitalization of endangered languages through education and technology.

Keywords: natural language processing; endangered languages; language corpus; morphological analyser; CALL

Corresponding author: Fridah Katushemererwe, Department of Linguistics, Makerere University, Kampala, Uganda, E-mail: katu@chuss.mak.ac.ug

Funding source: The research that resulted in this article was funded by the Cambridge-Africa Programme for Research Excellence (CAPREx) and the ALBORADA Fund, UK.

Appendix A: Expanded graded intergenerational disruption scale (EGIDS), from Lewis and Simons (2010)

Level	Label	Description	UNESCO
0	International	The language is used internationally for a broad range of functions.	Safe
1	National	The language is used in education, work, mass media, government at the nationwide level.	Safe
2	Regional	The language is used for local and regional mass media and governmental services.	Safe
3	Trade	The language is used for local and regional work by both insiders and outsiders.	Safe
4	Educational	Literacy in the language is being transmitted through a system of public education.	Safe
5	Written	The language is used orally by all generations and is effectively used in written form in parts of the community.	Safe
6a	Vigorous	The language is used orally by all generations and is being learned by children as their first language.	Safe
6b	Threatened	The language is used orally by all generations but only some of the child-bearing generation are transmitting it to their children.	Vulnerable
7	Shifting	The child-bearing generation knows the language well enough to use it among themselves but none are transmitting it to their children.	Definitely endangered
8a	Moribund	The only remaining active speakers of the language are members of the grandparent generation.	Severely endangered
8b	Nearly extinct	The only remaining speakers of the language are members of the grandparent generation or older who have little opportunity to use the language.	Critically endangered
9	Dormant	The language serves as a reminder of heritage identity for an ethnic community. No one has more than symbolic proficiency.	Extinct
10	Extinct	No one retains a sense of ethnic identity associated with the language, even for symbolic purposes.	Extinct

References

Abidi, Syed. 1989. Modern communication and national identity: An issue in East African context. In Jude J. Ongong’a & Kenneth R. Gray (eds.), Bottlenecks to national identity: Ethnic cooperation towards nation building. Nairobi: Professor World Peace Academy of Kenya.

Abney, Steven & Steven Bird. 2010. The human language project: Building a universal corpus of the world’s languages. In Proceedings of the 48th annual meeting of the association for computational linguistics, 88–97. Uppsala, Sweden.

Agic, Zeljko, Dirk Hovy & Anders Søgaard. 2015. If all you have is a bit of the Bible: Learning POS taggers for truly low-resource languages. In Proceedings of the 53rd annual meeting of the association for computational linguistics, 268–272. Beijing, China. https://doi.org/10.3115/v1/P15-2044.

Allwood, Jens, Harald Hammarström, Andries Hendrikse, Mtholeni N. Ngcobo, Nozibele Nomdebevana, Laurette Pretorius & Mac van der Merwe. 2010. Work on spoken (multimodal) language corpora in South Africa. In Proceedings of the seventh international conference on language resources and evaluation, 885–889. Valletta, Malta.

Barlow, Michael. 1996. Corpora for theory and practice. International Journal of Corpus Linguistics 1(1). 1–37. https://doi.org/10.1075/ijcl.1.1.03bar.

Bernsten, Jan. 1998. Runyakitara: Uganda’s ‘new’ language. Journal of Multilingual and Multicultural Development 19(2). 93–107. https://doi.org/10.1080/01434639808666345.

Crystal, David. 2003. English as a global language, 2nd edn. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511486999.

Emerson, Guy, Liling Tan, Susanne Fertmann, Alexis Palmer & Michaela Regneri. 2014. Seedling: Building and using a seed corpus for the human language project. In Proceedings of the 2014 workshop on the use of computational methods in the study of endangered languages, 77–85. Baltimore, MD. https://doi.org/10.3115/v1/W14-2211.

Fishman, Joshua. 2000. The status agenda in corpus planning. In Richard D. Lambert & Elana Shohamy (eds.), Language policy and pedagogy, 43–52. Philadelphia: John Benjamins. https://doi.org/10.1075/z.96.03fis.

Hale, John. 2001. A probabilistic Earley parser as a psycholinguistic model. In Proceedings of the second meeting of the North American chapter of the association for computational linguistics (NAACL). https://doi.org/10.3115/1073336.1073357.

Hugo, Russell. 2015. Endangered languages, technology and learning: Immediate applications and long-term considerations. In Mari Jones (ed.), Endangered languages and new technologies, 95–112. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9781107279063.009.

Jurafsky, Daniel & James H. Martin. 2009. Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition, 2nd edn. Upper Saddle River, NJ: Prentice Hall.

Kateeba, Connie. 2009. The thematic curriculum: Implications for mother tongue education in Uganda. Kyambogo, Uganda: National Curriculum Development Centre.

Katushemererwe, Fridah. 2013. Computational morphology and Bantu language learning: An implementation for Runyakitara. Groningen, The Netherlands: University of Groningen PhD thesis.

Katushemererwe, Fridah & Thomas Hanneforth. 2010. Fsm2 and the morphological analysis of Bantu nouns – first experiences from Runyakitara. International Journal of Computing and ICT Research 4(1). 58–69.

Katushemererwe, Fridah & John Nerbonne. 2015. Computer-assisted language learning (CALL) in support of (re-)learning native languages: The case of Runyakitara. Computer Assisted Language Learning 28(2). 112–129. https://doi.org/10.1080/09588221.2013.792842.

Katushemererwe, Fridah & Rehema Baguma. 2012. RUMORPH: The morphological analyzer of Runyakitara: Approach, results and issues. In Proceedings of the 8th annual international conference on computing & ICT research, 269–294. Kampala, Uganda. http://cit.mak.ac.ug/iccir/?p=iccir_12 (accessed 15 December 2016).

Katushemererwe, Fridah & Arvi Hurskainen. 2011. Intelligent language learning model: Implementation on Runyakitara. In Proceedings of the 7th annual international conference on computing & ICT research, 426–444. Kampala, Uganda. http://cit.mak.ac.ug/iccir/?p=iccir_11 (accessed 15 December 2016).

Kornai, Andras (ed.). 1999. Extended finite state models of language. Cambridge: Cambridge University Press.

Krauss, Michael. 1992. The world’s languages in crisis. Language 68(1). 4–10. https://doi.org/10.1353/lan.1992.0075.

Leech, Geoffrey. 2000. Grammars of spoken English: New outcomes of corpus-oriented research. Language Learning 50(4). 675–724. https://doi.org/10.1111/0023-8333.00143.

Levy, Roger. 2008. Expectation-based syntactic comprehension. Cognition 106(3). 1126–1177. https://doi.org/10.1016/j.cognition.2007.05.006.

Lewis, M. Paul & Gary F. Simons. 2010. Assessing endangerment: Expanding Fishman’s GIDS. Revue Roumaine de Linguistique 55(2). 103–120. https://doi.org/10.7202/602504ar.

Lewis, M. Paul, Gary F. Simons & Charles D. Fennig (eds.). 2015. Ethnologue: Languages of the world, 18th edn. Dallas: SIL International. http://www.ethnologue.com (accessed 15 December 2016).

Manning, Christopher & Hinrich Schütze. 1999. Foundations of statistical natural language processing. Cambridge, MA: MIT Press.

McCarthy, Michael (ed.). 2016. The Cambridge guide to blended learning for language teaching. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781009024754.

Meurers, Detmar. 2012. Natural language processing and language learning. In Carol A. Chapelle (ed.), Encyclopedia of applied linguistics. Oxford: Blackwell. https://doi.org/10.1002/9781405198431.wbeal0858.

Moseley, Christopher (ed.). 2010. Atlas of the world’s languages in danger, 3rd edn. Paris: UNESCO Publishing. http://www.unesco.org/culture/en/endangeredlanguages/atlas (accessed 15 December 2016).

Muranga, Manuel. 2009. ‘What about our mother tongues? Linguistic patriotism and non-patriotism in Uganda: Some observations, reflections and recommendations’. Inaugural professorial lecture. Kampala: Makerere University.

Nagata, Noriko. 2009. Robo-Sensei. CALICO Journal 26(3). 562–579. https://doi.org/10.1558/cj.v26i3.562-579.

Namyalo, Saudah & Judith Nakayiza. 2014. Dilemmas in implementing language rights in multilingual Uganda. Current Issues in Language Planning 16(4). 409–424. https://doi.org/10.1080/14664208.2014.987425.

Ndoleriire, Oswald & Celestino Oriikiriza. 1990. Runyakitara studies. Unpublished manuscript. Uganda: Makerere University.

Ng, Hwee Tou, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy Susanto & Christopher Bryant. 2014. The CoNLL-2014 shared task on grammatical error correction. In Proceedings of the 18th conference on computational natural language learning, 1–14. Baltimore, MD. https://doi.org/10.3115/v1/W14-1701.

Nicholls, Diane. 2003. The Cambridge learner corpus: Error coding and analysis for lexicography and ELT. In Proceedings of the corpus linguistics 2003 conference, 572–581. UCREL technical paper number 16: Lancaster University.

O’Keeffe, Anne, Michael McCarthy & Ronald Carter. 2007. From corpus to classroom: Language use and language teaching. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511497650.

Ojijo, Pascal. 2012. Review of education policy in Uganda. Paper submitted to young leaders’ think tank on policy alternatives in Uganda. http://www.slideshare.net/ojijop/review-of-education-policy-in-uganda (accessed 15 December 2016).

Paulston, Christina. 1994. Linguistic minorities in multilingual settings: Implications for language policies. Amsterdam: John Benjamins. https://doi.org/10.1075/sibil.4.

Rice, Andrew, Paula Buttery, Idris A. Rai & Alastair Beresford. 2009. Language learning on a next-generation service platform for Africa. Africa perspective on the role of mobile technologies in fostering social and economic development. Maputo, Mozambique: World Wide Web Consortium Workshop.

Rubongoya, L. T. 1999. A modern Runyoro-Rutooro grammar. Cologne: Rüdiger Köppe Verlag.

Shaalan, Khaled. 2005. An intelligent computer-assisted language learning system for Arabic learners. Computer Assisted Language Learning 18(1-2). 81–108. https://doi.org/10.1080/09588220500132399.

Smith, Ray. 2007. An overview of the Tesseract OCR engine. In Proceedings of the 9th IEEE international conference on document analysis and recognition, 629–633. Parana, Brazil. https://doi.org/10.1109/ICDAR.2007.4376991.

Taylor, Charles. 1985. Nkore-Kiga. London: Croom Helm.

Ward, Monica. 2004. The additional uses of CALL in the endangered language context. ReCALL 16(2). 345–359. https://doi.org/10.1017/s0958344004000722.

Ward, Monica & Joseph van Genabith. 2003. CALL for endangered languages: Challenges and rewards. CALL Journal 16(2-3). 233–258. https://doi.org/10.1076/call.16.2.233.15885.

Yannakoudakis, Helen, Ted Briscoe & Ben Medlock. 2011. A new dataset and method for automatically grading ESOL text. In Proceedings of the 49th annual meeting of the association for computational linguistics, 180–189. Portland, Oregon.

Yngve, Victor. 1960. A model and an hypothesis for language structure. Proceedings of the American Philosophical Society 104(5). 444–466.

Published Online: 2021-12-02

Published in Print: 2021-11-25
