Abstract
This study investigates the main issues involved in the next stage of development of the historical subcorpus of the National Corpus of the Kazakh Language. It examines metatextual markup, software improvement, and transcription. Analysis of the subcorpus showed that it contains texts from the 12th and 14th–20th centuries; that searches can be run by author, text style, script, title, genre, and century; that texts in Arabic, Cyrillic, and Latin scripts are represented; and that the genres covered include poems, prose, heroic songs, articles, religious epics, and novels. The review of the first phase of the subcorpus indicated a need to draw on the experience of other national corpora and to develop mechanisms for actively including texts from different periods, in particular the fifth to nineteenth centuries, the tenth to fifteenth centuries, and the sixteenth to nineteenth centuries. Improving metatextual markup, providing more information about the texts, and resolving transcription problems are also important tasks.
Conflict of interest: The authors declare that there is no conflict of interest.
Data availability: The data that support the findings of this study are available on request from the corresponding author.
© 2025 Walter de Gruyter GmbH, Berlin/Boston