
More similar than needed: Czech exam texts from the perspective of quantitative linguistics

Michal Místecký, Lucie Radková, Žaneta Stiborská, and Darina Hrubá
Published/Copyright: October 31, 2025
From the journal Glottotheory

Abstract

The paper compares texts from Czech language didactic tests, the examinations that serve either as an entrance requirement for secondary schools (eight-year [E8], six-year [E6], and four-year [E4] ones) or as the secondary school final examination (called “maturita” [M] in Czech), which checks pupils’ knowledge of their mother tongue. The corpus comprises 334 texts from 56 tests (more than 66,000 tokens) and covers the period 2019–2023. The indexes employed in the comparison are average token length (ATL), activity (Q), moving-average type–token ratio (MATTR), moving-average morphological richness (MAMR), and verb distances (VD). Statistically significant differences were sought using the Kruskal–Wallis test and the Dwass–Steel–Critchlow–Fligner test. The texts were found to differ in ATL, Q, and VD, but the subsequent pairwise testing identified mostly the E8–M difference as statistically significant. The correlation analysis confirmed that the three indexes are correlated, which implies that the difference stems from a single verbalization-to-nominalization tendency. Finally, we performed a k-means cluster analysis (preceded by t-SNE, used to reduce the number of dimensions), which divided the texts into two groups: “A” (easy) and “B” (difficult). These two groups are distributed roughly equally in E8, E6, and E4, but in M, the B-texts considerably prevail.
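To make two of the indexes named above concrete, the following is a minimal Python sketch of ATL and MATTR. It is an illustration under stated assumptions, not the authors' implementation: the whitespace tokenization, the window size of 100 tokens, the fallback to a plain type–token ratio for texts shorter than the window, and the function names `average_token_length` and `mattr` are all hypothetical choices made for this example.

```python
from collections import Counter


def average_token_length(tokens):
    """Mean number of characters per token (ATL)."""
    if not tokens:
        raise ValueError("token list must not be empty")
    return sum(len(token) for token in tokens) / len(tokens)


def mattr(tokens, window=100):
    """Moving-average type-token ratio.

    The TTR is computed for every window of `window` consecutive tokens,
    shifting by one token at a time, and the window TTRs are averaged.
    Assumption: a text shorter than the window falls back to its plain TTR.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    n = len(tokens)
    if n == 0:
        raise ValueError("token list must not be empty")
    if n <= window:
        return len(set(tokens)) / n

    # Frequencies of the types inside the current window.
    counts = Counter(tokens[:window])
    type_total = len(counts)  # number of types in the first window

    for i in range(window, n):
        outgoing = tokens[i - window]
        counts[outgoing] -= 1
        if counts[outgoing] == 0:
            del counts[outgoing]
        counts[tokens[i]] += 1
        type_total += len(counts)  # number of types in the shifted window

    window_count = n - window + 1
    return type_total / (window * window_count)


if __name__ == "__main__":
    # Hypothetical sample sentence; real input would be a tokenized exam text.
    sample = "pes honí kočku a kočka honí myš".split()
    print(average_token_length(sample))
    print(mattr(sample, window=3))
```

Because every window has the same length, MATTR is not biased by text length in the way a plain type–token ratio is, which is what makes it suitable for comparing texts of different sizes.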


Corresponding author: Michal Místecký, Department of Czech Language, Faculty of Arts, University of Ostrava, Ostrava, Czechia, E-mail:
Michal Místecký, Lucie Radková, Žaneta Stiborská, and Darina Hrubá contributed equally to this work.
  1. Conflict of interest: The authors report there are no competing interests to declare.

  2. Research funding: This work was supported by the University of Ostrava under Grant SGS13/FF/2024 (Reading Literacy through the Lens of Current Linguistic Methods).

Abbreviations

ATL: average token length
E4: entrance exam for four-year secondary schools
E6: entrance exam for six-year secondary schools
E8: entrance exam for eight-year secondary schools
M: secondary school final examination (called “maturita” in Czech)
MAMR: moving-average morphological richness
MATTR: moving-average type–token ratio
Q: activity
VD: verb distances

Published Online: 2025-10-31

© 2025 Walter de Gruyter GmbH, Berlin/Boston

Downloaded on 1.11.2025 from https://www.degruyterbrill.com/document/doi/10.1515/glot-2025-2014/html