An information-theoretic view on language complexity and register variation: Compressing naturalistic corpus data

  • Katharina Ehret

Published/Copyright: October 25, 2018

Abstract

This article uses an innovative, information-theoretic metric to assess complexity variation across written and spoken registers of British English. This is novel because previous research on language complexity has mainly analysed complexity variation in typological data, single-language case studies, or geographical varieties of the same language. The measure boils down to Kolmogorov complexity, which can be conveniently approximated with off-the-shelf compression programs. Essentially, text samples that can be compressed more efficiently count as linguistically simpler. The dataset covers a wide range of traditional written and spoken registers (e.g. broadsheet newspapers, courtroom debate, or face-to-face conversation), as sampled in the British National Corpus. It turns out that Kolmogorov-based register variation coincides with register formality: informal registers are less complex than more formal registers both overall and morphologically, but more complex with regard to syntax (defined here as rigid word order). Generally, the results show that written and spoken registers vary along a continuum and significantly trade off morphological against syntactic complexity (and vice versa). Finally, the findings support proposals to view language as a complex adaptive system and demonstrate how language adapts to the situational context of language production and the functional-communicative needs of its users.
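The compression idea in the abstract can be illustrated with a minimal sketch. It assumes gzip (via Python's standard library) as the off-the-shelf compressor; the helper names `compressed_size` and `compression_ratio` and the two toy texts are hypothetical and are not taken from the article.

```python
import gzip
import random


def compressed_size(text: str) -> int:
    """Return the size in bytes of the gzip-compressed UTF-8 text."""
    return len(gzip.compress(text.encode("utf-8"), compresslevel=9))


def compression_ratio(text: str) -> float:
    """Compressed size divided by uncompressed size (lower = more compressible)."""
    return compressed_size(text) / len(text.encode("utf-8"))


# Toy inputs (illustrative only): a highly repetitive string and a
# pseudo-random string of the same length drawn from lowercase letters.
repetitive = "the cat sat on the mat . " * 40
rng = random.Random(0)  # fixed seed so the example is reproducible
varied = "".join(rng.choice("abcdefghijklmnopqrstuvwxyz ") for _ in range(len(repetitive)))

for label, sample in (("repetitive", repetitive), ("varied", varied)):
    print(label, len(sample), compressed_size(sample), round(compression_ratio(sample), 3))
```

Under this approximation, the repetitive string compresses to far fewer bytes than the varied one and would therefore count as the simpler of the two samples.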

About the author

Katharina Ehret

Katharina Ehret studied English Language and Literature, Religious Studies and Catholic Theology at the Universities of Basel (Switzerland), and Freiburg (Germany). She holds BA, MA, and PhD degrees from the University of Freiburg and currently works as a Humboldt Fellow affiliated with Simon Fraser University and with the University of Freiburg. Her research interests include language variation, language complexity and its intersections with information-theory, cross-language typology, second language acquisition research, as well as discourse analysis.

Appendix

Table 4: Average adjusted overall complexity scores (regression residuals in bytes) by BNC register.

Register Overall complexity score
Academic writing 373.6080
Administrative writing 359.9150
Advertisement 320.3299
Biography 314.0586
Broadcast 292.2712
Classroom 515.2487
Commerce 202.1456
Consulting 602.3361
Conversation 552.9336
Courtroom 284.9615
Demonstration 404.1665
Email 105.0490
Essay 198.3486
Fiction 335.0127
Hansard 161.2273
Institutional documents 126.7962
Instructional writing 329.5085
Interview 304.7352
Lecture 129.5889
Letters 92.2422
Meeting 107.2040
Miscellaneous 469.6133
News scripts 242.9513
Broadsheet newspapers 540.0866
Other newspapers 437.4495
Tabloid newspapers 184.1163
Non-academic 484.7919
Parliament 290.6986
Popular lore 473.5847
Public debate 386.4912
Religion 211.1975
Sermon 357.7019
Speech 15.6051
Sportslive 85.7084
Tutorial 265.2109
Unclassified 288.4832

Table 5: Average original uncompressed and compressed file sizes by BNC register. These file sizes were used to calculate the average adjusted overall complexity scores.

Register Uncompressed file size Compressed file size
Academic writing 17,293.5190 7,499.4120
Administrative writing 17,754.8770 6,922.6460
Advertisement 13,370.7960 6,113.2980
Biography 12,760.9250 5,899.8090
Broadcast 17,478.0770 7,480.7830
Classroom 8,592.8240 3,654.2930
Commerce 17,272.7240 7,320.8840
Consulting 6,980.5170 3,019.3870
Conversation 7,551.7320 3,262.8730
Courtroom 25,601.2850 9,663.5980
Demonstration 15,380.0260 6,071.4840
Email 11,022.1960 4,889.9280
Essay 13,595.3290 6,067.6070
Fiction 7,107.2650 3,329.7760
Hansard 14,090.3060 5,876.2110
Institutional documents 18,661.1930 7,717.2990
Instructional writing 15,704.2690 6,915.3280
Interview 12,987.5550 5,358.0180
Lecture 23,125.6670 9,237.0000
Letters 15,535.5160 6,620.7240
Meeting 15,458.7270 6,395.1870
Miscellaneous 14,720.5190 6,721.1810
News scripts 14,676.9870 6,479.7280
Broadsheet newspapers 14,652.3430 6,768.4900
Other newspapers 13,629.9140 6,318.4590
Tabloid newspapers 10,605.6290 5,037.5550
Non-academic 15,786.0810 7,098.4090
Parliament 36,463.3100 13,348.4850
Popular lore 13,283.7530 6,236.9780
Public debate 21,424.0230 8,142.7470
Religion 14,148.9780 6,268.5710
Sermon 21,587.1780 8,226.9720
Speech 15,341.0430 6,446.8000
Sportslive 16,542.9960 6,785.0880
Tutorial 18,864.7290 7,394.4480
Unclassified 13,384.8760 5,509.2690
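
The relationship between Tables 4 and 5 can be sketched as follows, assuming that "regression residuals" refers to a simple linear regression of compressed file size on uncompressed file size; the source does not spell out the model, so this is an assumption. The function name `ols_residuals` is hypothetical, and the five data points are register means copied from Table 5. Because the published scores are averages presumably computed over individual text samples, this sketch is not expected to reproduce the Table 4 values.

```python
from statistics import mean


def ols_residuals(x, y):
    """Residuals of a least-squares line y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Register means from Table 5: (uncompressed size, compressed size).
# Using means instead of per-text sizes is a simplification for illustration.
sizes = {
    "Academic writing": (17293.5190, 7499.4120),
    "Conversation": (7551.7320, 3262.8730),
    "Courtroom": (25601.2850, 9663.5980),
    "Fiction": (7107.2650, 3329.7760),
    "Parliament": (36463.3100, 13348.4850),
}

uncompressed = [u for u, _ in sizes.values()]
compressed = [c for _, c in sizes.values()]
residuals = dict(zip(sizes, ols_residuals(uncompressed, compressed)))

# A positive residual means the text compresses worse than its length
# predicts, which the compression approach reads as higher complexity.
for register, residual in residuals.items():
    print(f"{register}: {residual:.1f}")
```

Taking residuals rather than raw compressed sizes adjusts for the fact that longer files produce larger compressed files regardless of their linguistic content.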

Table 6: Mean morphological complexity scores and their standard deviation across BNC registers.

Register Morphological score Standard deviation
Academic writing 1.0022 0.0051
Administrative writing 1.0468 0.0076
Advertisement 0.9852 0.0056
Biography 0.9757 0.0045
Broadcast 1.0043 0.0057
Broadsheet newspapers 0.9801 0.0047
Classroom 0.9970 0.0070
Commerce 1.0125 0.0056
Consulting 0.9873 0.0067
Conversation 0.9852 0.0093
Courtroom 1.0543 0.0113
Demonstration 1.0277 0.0088
Email 0.9913 0.0059
Essay 0.9887 0.0051
Fiction 0.9660 0.0053
Hansard 1.0163 0.0062
Institutional documents 1.0255 0.0060
Instructional writing 0.9955 0.0055
Interview 1.0063 0.0062
Lecture 1.0337 0.0062
Letters 1.0139 0.0074
Meeting 1.0140 0.0071
Miscellaneous 0.9819 0.0049
News scripts 0.9976 0.0058
Non-academic 0.9893 0.0050
Other newspapers 0.9791 0.0046
Parliament 1.0744 0.0053
Popular lore 0.9723 0.0043
Public debate 1.0550 0.0080
Religion 0.9904 0.0049
Sermon 1.0430 0.0136
Speech 1.0091 0.0066
Sportslive 1.0202 0.0057
Tabloid newspapers 0.9701 0.0047
Tutorial 1.0383 0.0070
Unclassified 1.0142 0.0078

Table 7: Mean syntactic complexity scores and their standard deviation across BNC registers.

Register Syntactic score Standard deviation
Academic writing 0.9125 0.0039
Administrative writing 0.9191 0.0043
Advertisement 0.9115 0.0042
Biography 0.9103 0.0043
Broadcast 0.9145 0.0037
Broadsheet newspapers 0.9100 0.0042
Classroom 0.9169 0.0049
Commerce 0.9136 0.0039
Consulting 0.9158 0.0060
Conversation 0.9155 0.0055
Courtroom 0.9216 0.0038
Demonstration 0.9203 0.0043
Email 0.9132 0.0044
Essay 0.9119 0.0042
Fiction 0.9114 0.0059
Hansard 0.9170 0.0047
Institutional documents 0.9151 0.0039
Instructional writing 0.9133 0.0039
Interview 0.9170 0.0044
Lecture 0.9179 0.0036
Letters 0.9178 0.0047
Meeting 0.9161 0.0041
Miscellaneous 0.9108 0.0040
News scripts 0.9145 0.0042
Non-academic 0.9116 0.0040
Other newspapers 0.9104 0.0040
Parliament 0.9218 0.0030
Popular lore 0.9094 0.0041
Public debate 0.9210 0.0037
Religion 0.9120 0.0042
Sermon 0.9223 0.0044
Speech 0.9153 0.0041
Sportslive 0.9167 0.0037
Tabloid newspapers 0.9099 0.0043
Tutorial 0.9189 0.0039
Unclassified 0.9176 0.0045

Acknowledgements

My thanks go to Benedikt Szmrecsanyi for helpful comments and suggestions, and to two anonymous reviewers for critical feedback. Funding by the Cusanuswerk (Bonn, Germany) and by the Alexander von Humboldt Foundation (Funder Id: 10.13039/100005156, Bonn, Germany) is gratefully acknowledged. The usual disclaimers apply.

Published Online: 2018-10-25
Published in Print: 2021-10-26

© 2018 Walter de Gruyter GmbH, Berlin/Boston

Downloaded on 26.1.2026 from https://www.degruyterbrill.com/document/doi/10.1515/cllt-2018-0033/html