The issues of developing the historical subcorpus of the National Corpus of the Kazakh Language

Anar Fazylzhanova; Ainur Seitbekova; Gulzhihan Kobdenova; Assel Seidamat; Galymzhan Ayazbayev

doi:10.1515/lpp-2024-0038


Published/Copyright: April 21, 2025


From the journal Lodz Papers in Pragmatics Volume 21 Issue 1

Abstract

This study investigates the main issues in the next stage of development of the historical subcorpus of the National Corpus of the Kazakh Language. It examines metatext markup, software improvement, and transcription. Analysis of the historical subcorpus showed the following: it contains texts from the 12th and 14th–20th centuries; searches can be filtered by author, text style, text graphics (script), text title, text genre, and century; the Arabic, Cyrillic, and Latin scripts are represented; and the genres covered include poems, prose, heroic songs, articles, religious epics, and novels. The study of the first phase of the historical subcorpus showed a need to draw on the experience of other national corpora and to develop mechanisms for actively including texts from different periods, in particular from the fifth to nineteenth centuries, the tenth to fifteenth centuries, and the sixteenth to nineteenth centuries. Improving metatextual markup, providing more information about the texts, and solving transcription problems are also important tasks.

Keywords: meta-markup; transcription; software; graphics; genre

Corresponding author: Anar Fazylzhanova, Institute of Linguistics named after A. Baitursynov, 050010, 29 Kurmangazy Str., Almaty, Republic of Kazakhstan, E-mail: fazylzhan.anar@gmail.com

Conflict of interest: The authors declare that there is no conflict of interest.
Data availability: The data that support the findings of this study are available on request from the corresponding author.

Received: 2024-08-31

Accepted: 2025-04-01

Published Online: 2025-04-21

Published in Print: 2025-05-26
