Abstract
This paper describes an endeavour to build natural language processing (NLP) tools for Runyakitara, a group of four closely related Bantu languages spoken in western Uganda. In contrast with major world languages such as English, for which corpora are comparatively abundant and NLP tools are well developed, computational linguistic resources for Runyakitara are in short supply. We therefore first need to collect corpora for these languages before we can proceed to the design of a spell-checker, a grammar-checker and applications for computer-assisted language learning (CALL). We explain how we are collecting primary data for a new Runya Corpus of speech and writing, outline the design of a morphological analyser, and discuss how we can use these new resources to build NLP tools. We are initially working with Runyankore–Rukiga, a closely related pair of Runyakitara languages, and we frame our project in the context of NLP for low-resource languages, as well as CALL for the preservation of endangered languages. We put our project forward as a test case for the revitalization of endangered languages through education and technology.
Funding source: The research that resulted in this article was funded by the Cambridge-Africa Programme for Research Excellence (CAPREx) and the ALBORADA Fund, UK.
Appendix A: Expanded graded intergenerational disruption scale (EGIDS), from Lewis and Simons (2010)
Level | Label | Description | UNESCO category |
---|---|---|---|
0 | International | The language is used internationally for a broad range of functions. | Safe |
1 | National | The language is used in education, work, mass media, government at the nationwide level. | Safe |
2 | Regional | The language is used for local and regional mass media and governmental services. | Safe |
3 | Trade | The language is used for local and regional work by both insiders and outsiders. | Safe |
4 | Educational | Literacy in the language is being transmitted through a system of public education. | Safe |
5 | Written | The language is used orally by all generations and is effectively used in written form in parts of the community. | Safe |
6a | Vigorous | The language is used orally by all generations and is being learned by children as their first language. | Safe |
6b | Threatened | The language is used orally by all generations but only some of the child-bearing generation are transmitting it to their children. | Vulnerable |
7 | Shifting | The child-bearing generation knows the language well enough to use it among themselves but none are transmitting it to their children. | Definitely endangered |
8a | Moribund | The only remaining active speakers of the language are members of the grandparent generation. | Severely endangered |
8b | Nearly extinct | The only remaining speakers of the language are members of the grandparent generation or older who have little opportunity to use the language. | Critically endangered |
9 | Dormant | The language serves as a reminder of heritage identity for an ethnic community. No one has more than symbolic proficiency. | Extinct |
10 | Extinct | No one retains a sense of ethnic identity associated with the language, even for symbolic purposes. | Extinct |
References
Abidi, Syed. 1989. Modern communication and national identity: An issue in East African context. In Jude J. Ongong’a & Kenneth R. Gray (eds.), Bottlenecks to national identity: Ethnic cooperation towards nation building. Nairobi: Professor World Peace Academy of Kenya.
Abney, Steven & Steven Bird. 2010. The human language project: Building a universal corpus of the world’s languages. In Proceedings of the 48th annual meeting of the association for computational linguistics, 88–97. Uppsala, Sweden.
Agic, Zeljko, Dirk Hovy & Anders Søgaard. 2015. If all you have is a bit of the Bible: Learning POS taggers for truly low-resource languages. In Proceedings of the 53rd annual meeting of the association for computational linguistics, 268–272. Beijing, China. https://doi.org/10.3115/v1/P15-2044.
Allwood, Jens, Harald Hammarström, Andries Hendrikse, Mtholeni N. Ngcobo, Nozibele Nomdebevana, Laurette Pretorius & Mac van der Merwe. 2010. Work on spoken (multimodal) language corpora in South Africa. In Proceedings of the seventh international conference on language resources and evaluation, 885–889. Valletta, Malta.
Barlow, Michael. 1996. Corpora for theory and practice. International Journal of Corpus Linguistics 1(1). 1–37. https://doi.org/10.1075/ijcl.1.1.03bar.
Bernsten, Jan. 1998. Runyakitara: Uganda’s ‘new’ language. Journal of Multilingual and Multicultural Development 19(2). 93–107. https://doi.org/10.1080/01434639808666345.
Crystal, David. 2003. English as a global language, 2nd edn. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511486999.
Emerson, Guy, Liling Tan, Susanne Fertmann, Alexis Palmer & Michaela Regneri. 2014. Seedling: Building and using a seed corpus for the human language project. In Proceedings of the 2014 workshop on the use of computational methods in the study of endangered languages, 77–85. Baltimore, MD. https://doi.org/10.3115/v1/W14-2211.
Fishman, Joshua. 2000. The status agenda in corpus planning. In Richard D. Lambert & Elana Shohamy (eds.), Language policy and pedagogy, 43–52. Philadelphia: John Benjamins. https://doi.org/10.1075/z.96.03fis.
Hale, John. 2001. A probabilistic Earley parser as a psycholinguistic model. In Proceedings of the second meeting of the North American chapter of the association for computational linguistics (NAACL). https://doi.org/10.3115/1073336.1073357.
Hugo, Russell. 2015. Endangered languages, technology and learning: Immediate applications and long-term considerations. In Mari Jones (ed.), Endangered languages and new technologies, 95–112. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9781107279063.009.
Jurafsky, Daniel & James H. Martin. 2009. Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition, 2nd edn. Upper Saddle River, NJ: Prentice Hall.
Kateeba, Connie. 2009. The thematic curriculum: Implications for mother tongue education in Uganda. Kyambogo, Uganda: National Curriculum Development Centre.
Katushemererwe, Fridah. 2013. Computational morphology and Bantu language learning: An implementation for Runyakitara. University of Groningen PhD thesis, Groningen, The Netherlands.
Katushemererwe, Fridah & Thomas Hanneforth. 2010. Fsm2 and the morphological analysis of Bantu nouns – first experiences from Runyakitara. International Journal of Computing and ICT Research 4(1). 58–69.
Katushemererwe, Fridah & John Nerbonne. 2015. Computer-assisted language learning (CALL) in support of (re-)learning native languages: The case of Runyakitara. Computer Assisted Language Learning 28(2). 112–129. https://doi.org/10.1080/09588221.2013.792842.
Katushemererwe, Fridah & Rehema Baguma. 2012. RUMORPH: The morphological analyzer of Runyakitara: Approach, results and issues. In Proceedings of the 8th annual international conference on computing & ICT research, 269–294. Kampala, Uganda. http://cit.mak.ac.ug/iccir/?p=iccir_12 (accessed 15 December 2016).
Katushemererwe, Fridah & Arvi Hurskainen. 2011. Intelligent language learning model: Implementation on Runyakitara. In Proceedings of the 7th annual international conference on computing & ICT research, 426–444. Kampala, Uganda. http://cit.mak.ac.ug/iccir/?p=iccir_11 (accessed 15 December 2016).
Kornai, Andras (ed.). 1999. Extended finite state models of language. Cambridge: Cambridge University Press.
Krauss, Michael. 1992. The world’s languages in crisis. Language 68(1). 4–10. https://doi.org/10.1353/lan.1992.0075.
Leech, Geoffrey. 2000. Grammars of spoken English: New outcomes of corpus-oriented research. Language Learning 50(4). 675–724. https://doi.org/10.1111/0023-8333.00143.
Levy, Roger. 2008. Expectation-based syntactic comprehension. Cognition 106(3). 1126–1177. https://doi.org/10.1016/j.cognition.2007.05.006.
Lewis, M. Paul & Gary F. Simons. 2010. Assessing endangerment: Expanding Fishman’s GIDS. Revue Roumaine de Linguistique 55(2). 103–120. https://doi.org/10.7202/602504ar.
Lewis, M. Paul, Gary F. Simons & Charles D. Fennig (eds.). 2015. Ethnologue: Languages of the world, 18th edn. Dallas: SIL International. http://www.ethnologue.com (accessed 15 December 2016).
Manning, Christopher & Hinrich Schütze. 1999. Foundations of statistical natural language processing. Cambridge, MA: MIT Press.
McCarthy, Michael (ed.). 2016. The Cambridge guide to blended learning for language teaching. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781009024754.
Meurers, Detmar. 2012. Natural language processing and language learning. In Carol A. Chapelle (ed.), Encyclopedia of applied linguistics. Oxford: Blackwell. https://doi.org/10.1002/9781405198431.wbeal0858.
Moseley, Christopher (ed.). 2010. Atlas of the world’s languages in danger, 3rd edn. Paris: UNESCO Publishing. http://www.unesco.org/culture/en/endangeredlanguages/atlas (accessed 15 December 2016).
Muranga, Manuel. 2009. ‘What about our mother tongues? Linguistic patriotism and non-patriotism in Uganda: Some observations, reflections and recommendations’. Inaugural professorial lecture. Kampala: Makerere University.
Nagata, Noriko. 2009. Robo-Sensei. CALICO Journal 26(3). 562–579. https://doi.org/10.1558/cj.v26i3.562-579.
Namyalo, Saudah & Judith Nakayiza. 2014. Dilemmas in implementing language rights in multilingual Uganda. Current Issues in Language Planning 16(4). 409–424. https://doi.org/10.1080/14664208.2014.987425.
Ndoleriire, Oswald & Celestino Oriikiriza. 1990. Runyakitara studies. Unpublished manuscript. Uganda: Makerere University.
Ng, Hwee Tou, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy Susanto & Christopher Bryant. 2014. The CoNLL-2014 shared task on grammatical error correction. In Proceedings of the 18th conference on computational natural language learning, 1–14. Baltimore, MD. https://doi.org/10.3115/v1/W14-1701.
Nicholls, Diane. 2003. The Cambridge learner corpus: Error coding and analysis for lexicography and ELT. In Proceedings of the corpus linguistics 2003 conference, 572–581. UCREL technical paper number 16: Lancaster University.
O’Keeffe, Anne, Michael McCarthy & Ronald Carter. 2007. From corpus to classroom: Language use and language teaching. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511497650.
Ojijo, Pascal. 2012. Review of education policy in Uganda. Paper submitted to young leaders’ think tank on policy alternatives in Uganda. http://www.slideshare.net/ojijop/review-of-education-policy-in-uganda (accessed 15 December 2016).
Paulston, Christina. 1994. Linguistic minorities in multilingual settings: Implications for language policies. Amsterdam: John Benjamins. https://doi.org/10.1075/sibil.4.
Rice, Andrew, Paula Buttery, Idris A. Rai & Alastair Beresford. 2009. Language learning on a next-generation service platform for Africa. Africa perspective on the role of mobile technologies in fostering social and economic development. Maputo, Mozambique: Worldwide Web Consortium Workshop.
Rubongoya, L. T. 1999. A modern Runyoro-Rutooro grammar. Cologne: Rüdiger Köppe Verlag.
Shaalan, Khaled. 2005. An intelligent computer-assisted language learning system for Arabic learners. Computer Assisted Language Learning 18(1–2). 81–108. https://doi.org/10.1080/09588220500132399.
Smith, Ray. 2007. An overview of the Tesseract OCR engine. In Proceedings of the 9th IEEE international conference on document analysis and recognition, 629–633. Parana, Brazil. https://doi.org/10.1109/ICDAR.2007.4376991.
Taylor, Charles. 1985. Nkore-Kiga. London: Croom Helm.
Ward, Monica. 2004. The additional uses of CALL in the endangered language context. ReCALL 16(2). 345–359. https://doi.org/10.1017/s0958344004000722.
Ward, Monica & Joseph van Genabith. 2003. CALL for endangered languages: Challenges and rewards. CALL Journal 16(2–3). 233–258. https://doi.org/10.1076/call.16.2.233.15885.
Yannakoudakis, Helen, Ted Briscoe & Ben Medlock. 2011. A new dataset and method for automatically grading ESOL text. In Proceedings of the 49th annual meeting of the association for computational linguistics, 180–189. Portland, Oregon.
Yngve, Victor. 1960. A model and an hypothesis for language structure. Proceedings of the American Philosophical Society 104(5). 444–466.
© 2020 Walter de Gruyter GmbH, Berlin/Boston
Articles in this issue
- Frontmatter
- Editorial
- Introduction to the special issue “African languages and development”
- Research Articles
- Educational language policy in an African country: Making a place for code-switching/translanguaging
- Voices of ignorance versus voices of knowledge: Debates on English as medium of instruction in Malawian primary schools
- The role of communities in Uganda’s mother tongue-based education: Perspectives from a literacy learning enhancement project in Arua district
- Teaching multilingual literacy in Ugandan classrooms: The promise of the African Storybook
- Building natural language processing tools for Runyakitara
- Tributes
- Gregory Hankoni Kamwendo (1965–2018): a tribute
- Dr. Juliet Tembe (1954–2016)