Register variation remains stable across 60 languages

Haipeng Li; Jonathan Dunn; Andrea Nini

doi:10.1515/cllt-2021-0090

Abstract

This paper measures the stability of cross-linguistic register variation. A register is a variety of a language that is associated with extra-linguistic context. The relationship between a register and its context is functional: the linguistic features that make up a register are motivated by the needs and constraints of the communicative situation. This view hypothesizes that register should be universal, so that we expect a stable relationship between the extra-linguistic context that defines a register and the sets of linguistic features which the register contains. In this paper, the universality and robustness of register variation are tested by comparing variation within versus between register-specific corpora in 60 languages using corpora produced in comparable communicative situations: tweets and Wikipedia articles. Our findings confirm the prediction that register variation is, in fact, universal.

Keywords: communicative situation; cross-linguistic variation; homogeneity; register similarity; register variation

Corresponding author: Jonathan Dunn, Department of Linguistics, University of Canterbury, Private Bag 4800, 8041, Christchurch, New Zealand, E-mail: jonathan.dunn@canterbury.ac.nz

Funding source: Science for Technological Innovation

Award Identifier / Grant number: E7222

Research funding: This study was funded by Science for Technological Innovation, grant number: E7222.

Appendix 1: Validating corpus similarity measures

This appendix describes validation experiments used to ensure that the corpus similarity measures provide robust measurements across the 60 languages discussed in the main paper. To evaluate the measures, we quantify the degree to which they make accurate predictions about the boundaries between corpora using a simple threshold. In other words, can corpus similarity measures be used to predict whether two sub-corpora come from the same or from different sources? This task (introduced by Kilgarriff 2001) provides a ground-truth validation for both the corpus similarity measures and the linguistic features they depend on.

The first step is to determine the best feature type for each language, using the independent background corpora described in the main paper for feature selection. We evaluate word 1-grams, word 2-grams, character 2-grams, character 3-grams, and character 4-grams for each language. To ensure robustness, we employ a five-fold cross-validation framework: the corpora are divided into training and testing sets five times, so that each subset of a corpus appears in the test set exactly once. We average the accuracy of predictions across these five folds and choose, for each language, the feature type that achieves the highest accuracy.
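The cross-validation scheme described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: the function names `k_fold_splits` and `mean_accuracy_by_feature`, the chunk labels, and the per-fold accuracies are all hypothetical, and the way subsets are assigned to folds is an assumption.

```python
from statistics import mean


def k_fold_splits(subsets, k=5):
    """Yield (train, test) lists so that every subset is tested exactly once.

    Assumption: subsets are dealt into folds round-robin; the source does
    not specify how the folds are formed.
    """
    folds = [subsets[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test


def mean_accuracy_by_feature(fold_accuracies):
    """Average the per-fold accuracies for each feature type."""
    return {ft: mean(accs) for ft, accs in fold_accuracies.items()}


# Hypothetical corpus subsets and hypothetical per-fold accuracies.
subsets = [f"chunk_{i}" for i in range(10)]
splits = list(k_fold_splits(subsets))
scores = mean_accuracy_by_feature({
    "W1": [1.0, 0.5, 1.0, 0.5, 1.0],
    "C4": [1.0, 1.0, 1.0, 1.0, 1.0],
})
```

In this sketch the threshold would be estimated on each `train` list and accuracy measured on the matching `test` list, once per feature type.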

The similarity measure based on Spearman’s rho returns a continuous value. To convert this into an accuracy evaluation, we set a threshold for making predictions about whether two input samples come from the same corpus or from different corpora. The more often this threshold leads to correct predictions, the more accurate the measure is. In other words, we draw samples from three distinct corpora (TW, WK, CC). We then use the similarity measures, together with a threshold, to predict whether two samples came from the same corpus. Measures with a high prediction accuracy are able to distinguish between same-corpus and cross-corpus pairs. We draw on previous methods for estimating the optimum thresholds, methods which have been demonstrated to work well in related problems (Leban et al. 2016; Nanayakkara and Ranathunga 2018).
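To make the continuous measure concrete, the sketch below computes Spearman's rho over the frequencies of a fixed feature list in two samples. It is an illustration under stated assumptions, not the authors' code: the helper names (`char_ngrams`, `average_ranks`, `spearman_rho`, `similarity`) are hypothetical, the feature list is passed in directly rather than selected from a background corpus, and only character n-grams are shown.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count the character n-grams in a text sample."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def average_ranks(values):
    """Rank values (1 = smallest); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed values."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rho is undefined when one sample has constant ranks")
    return cov / (vx * vy) ** 0.5


def similarity(sample_a, sample_b, features, n=4):
    """Compare two samples by the rank order of their feature frequencies."""
    fa, fb = char_ngrams(sample_a, n), char_ngrams(sample_b, n)
    return spearman_rho([fa[f] for f in features], [fb[f] for f in features])
```

Values near 1 indicate that the two samples rank the shared features in nearly the same order of frequency; lower values indicate less similar samples.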

The threshold calculation is shown below. We take the lowest average similarity for same-register pairs (for example, CC-CC might be the least homogeneous register). Then we take the highest average similarity for different-register pairs (for example, CC and WK might be the most similar registers). The threshold is set halfway between these minimum and maximum values. This threshold is calculated on the training data for each fold.

$$T = \frac{1}{2}\Big(\min\big(\mathit{Similarity}_{CC-CC},\ \mathit{Similarity}_{TW-TW},\ \mathit{Similarity}_{WK-WK}\big) + \max\big(\mathit{Similarity}_{CC-WK},\ \mathit{Similarity}_{TW-WK},\ \mathit{Similarity}_{CC-TW}\big)\Big)$$
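A short worked sketch of this threshold and its use for prediction follows. The similarity values are invented purely for illustration, the function names are hypothetical, and the choice to predict "same corpus" when a similarity is greater than or equal to the threshold (rather than strictly greater) is an assumption that the text does not settle.

```python
def threshold(same_register, cross_register):
    """Midpoint between the lowest same-register and the highest
    cross-register average similarity, estimated on training data."""
    return 0.5 * (min(same_register.values()) + max(cross_register.values()))


def predict_same(sim, t):
    """Assumption: similarities at or above the threshold count as same-corpus."""
    return sim >= t


def accuracy(pairs, t):
    """pairs: list of (similarity, is_same_corpus) tuples from the test fold."""
    correct = sum(predict_same(sim, t) == truth for sim, truth in pairs)
    return correct / len(pairs)


# Hypothetical average similarities for one training fold.
same = {"CC-CC": 0.75, "TW-TW": 0.875, "WK-WK": 0.8125}
cross = {"CC-WK": 0.5, "TW-WK": 0.25, "CC-TW": 0.375}
t = threshold(same, cross)  # 0.5 * (0.75 + 0.5) = 0.625

# Hypothetical test pairs: (similarity, drawn from the same corpus?).
test_pairs = [(0.8, True), (0.7, True), (0.6, False), (0.65, False)]
acc = accuracy(test_pairs, t)  # the last pair is misclassified
```

Here the least homogeneous register (CC-CC) and the most similar pair of registers (CC-WK) jointly fix the threshold, so a cross-corpus pair scoring above 0.625 is counted as an error.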

The main experiments in the paper do not require a threshold for calculating accuracy because we are concerned with continuous relationships within and between register-specific corpora. However, here we evaluate accuracy because this allows us to determine how meaningful these measures are for the underlying task. For example, if corpus similarity measures for Mongolian make poor predictions about register boundaries, this tells us that our measure is not suitable for the comparison of register-specific corpora in Mongolian. Thus, the accuracy evaluation based on cross-fold validation ensures the robustness of the experiments in the main paper. This provides a cross-linguistic ground-truth to support our analysis.

We start by verifying the accuracy of these corpus similarity measures using the cross-fold validation experiment described above. The results are shown in Table A, together with the best feature type for each language. The accuracy value here is the average accuracy across training-testing folds for the corresponding feature type: W1 represents word 1-grams, C2 represents character 2-grams, and so on. For some languages, more than one feature type produces the same or similar accuracy. For example, Bulgarian has similar accuracies with both W1 and C4 (98% vs. 97%) and Amharic has four types (C2, C3, C4, W1) that all achieve 100% accuracy. In the case of ties, we prefer character features over word features. In the case of a further tie, we prefer a higher n-gram (e.g., 4 over 3).
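The selection rule with its two tie-breaks can be expressed compactly, as in the sketch below. The function name `select_feature_type` is hypothetical, and the sketch treats only exactly equal accuracies as ties, which is an assumption; the Bulgarian and Amharic inputs restate the accuracies quoted above for the feature types the text names.

```python
def select_feature_type(mean_accuracy):
    """Pick the best feature type from a dict such as {"W1": 0.98, "C4": 0.97}.

    Ordering: higher accuracy first; on a tie, character features ("C")
    before word features ("W"); on a further tie, the higher n.
    """
    def sort_key(feature_type):
        kind, n = feature_type[0], int(feature_type[1:])
        return (mean_accuracy[feature_type], kind == "C", n)

    return max(mean_accuracy, key=sort_key)


# Accuracies as quoted in the text for the feature types it mentions.
bulgarian = {"W1": 0.98, "C4": 0.97}
amharic = {"C2": 1.0, "C3": 1.0, "C4": 1.0, "W1": 1.0}

best_bul = select_feature_type(bulgarian)  # accuracy decides: W1
best_amh = select_feature_type(amharic)    # tie-breaks decide: C4
```

Both outcomes agree with the entries for these two languages in Table A.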

Table A: Accuracy and best feature type by language.

Language	Features	Accuracy	Language	Features	Accuracy
amh	C4	100%	lit	C4	99%
ara	C4	99%	mal	C4	100%
aze	C4	96%	mar	C4	94%
ben	C4	100%	mkd	C4	99%
bul	W1	98%	mlg	C4	100%
cat	W1	100%	mon	W2	94%
ces	W1	98%	nld	W1	100%
dan	W1	99%	nor	W1	98%
deu	C4	98%	pan	C4	99%
ell	W1	97%	pol	W1	99%
eng	C4	98%	por	C4	98%
est	W1	98%	ron	W1	99%
eus	W1	100%	rus	C4	100%
fas	W1	96%	sin	C4	100%
fin	C4	94%	slk	C4	94%
fra	W1	100%	slv	C4	96%
gle	W1	90%	som	C4	100%
glg	C4	100%	spa	C4	99%
guj	C4	95%	sqi	W1	96%
hat	C4	100%	swe	C4	96%
hin	C4	88%	tam	C4	96%
hun	C4	95%	tel	C4	100%
ind	C4	99%	tgl	C4	100%
isl	W1	93%	tha	C3	90%
ita	W1	94%	tur	C4	100%
jpn	C2	88%	ukr	C4	99%
kan	C4	98%	urd	W1	100%
kat	W2	96%	uzb	W2	99%
kor	C4	99%	vie	C4	100%
lav	C4	99%	zho	C2	96%

This selection procedure gives a single best measure for each language. The accuracies range from 88% (Japanese and Hindi) to 100% (among others, Amharic and Bengali). Overall, 49 of 60 languages achieve 95% accuracy or higher, and every language reaches at least 88% accuracy. When a language has lower accuracy, this means that the boundary between two of the registers is not distinct under the similarity measure. For example, if the CC and TW corpora are very similar, then some samples of each will be misidentified. For our purposes, then, an accuracy of 88% is not problematic; rather, it indicates that the relationship between registers in this language is not as distinct as in other languages.

This accuracy-based evaluation tells us that the similarity measures make robust distinctions between register-specific corpora across all 60 languages, with some languages being 100% accurate and others retaining a small number of misclassifications. This prediction-based validation gives us confidence in the ability of these measures to capture variation within these languages.

References

Biber, Douglas. 1988. Variation across speech and writing. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511621024.

Biber, Douglas. 1994. An analytical framework for register studies. In Douglas Biber & Edward Finegan (eds.), Sociolinguistic perspectives on register, 31–56. New York: Oxford University Press. https://doi.org/10.1093/oso/9780195083644.003.0003.

Biber, Douglas. 1995. Dimensions of register variation: A cross-linguistic comparison. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511519871.

Biber, Douglas & Susan Conrad. 2009. Register, genre, and style. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511814358.

Biber, Douglas, Jesse Egbert & Daniel Keller. 2020. Reconceptualizing register in a continuous situational space. Corpus Linguistics and Linguistic Theory 16(3). 581–616. https://doi.org/10.1515/cllt-2018-0086.

Christodoulopoulos, Christos & Mark Steedman. 2015. A massively parallel corpus: The Bible in 100 languages. Language Resources and Evaluation 49. 375–395. https://doi.org/10.1007/s10579-014-9287-y.

Cook, Paul & Laurel Brinton. 2017. Building and evaluating web corpora representing national varieties of English. Language Resources and Evaluation 51. 643–662. https://doi.org/10.1007/s10579-016-9378-z.

Cook, Paul & Graeme Hirst. 2012. Do Web corpora from top-level domains represent national varieties of English? In Proceedings of the 11th International Conference on Textual Data Statistical Analysis, 281–293. Liege, Belgium: Analyse statistique des données textuelles.

Cvrček, Václav, Zuzana Komrsková, David Lukeš, Petra Poukarová, Anna Řehořková, Adrian Jan Zasina & Vladimír Benko. 2020. Comparing web-crawled and traditional corpora. Language Resources and Evaluation 54. 713–745. https://doi.org/10.1007/s10579-020-09487-4.

Dunn, Jonathan. 2020. Mapping languages: The corpus of global language use. Language Resources and Evaluation 54. 999–1018. https://doi.org/10.1007/s10579-020-09489-2.

Dunn, Jonathan. 2021. Representations of language varieties are reliable given corpus similarity measures. In Proceedings of the Eighth Workshop on NLP for similar languages, varieties and dialects (EACL 21), 28–38. Association for Computational Linguistics. https://aclanthology.org/2021.vardial-1.4. Online.

Egbert, Jesse & Douglas Biber. 2018. Do all roads lead to Rome? Modeling register variation with factor analysis and discriminant analysis. Corpus Linguistics and Linguistic Theory 14(2). 233–273. https://doi.org/10.1515/cllt-2016-0016.

Egbert, Jesse, Douglas Biber & Mark Davies. 2015. Developing a bottom-up, user-based method of web register classification. Journal of the Association for Information Science and Technology 66(9). 1817–1831. https://doi.org/10.1002/asi.23308.

Fothergill, Richard, Paul Cook & Timothy Baldwin. 2016. Evaluating a topic modelling approach to measuring corpus similarity. In Proceedings of the 10th international conference on language resources and evaluation, 273–279. Portorož, Slovenia: European Language Resources Association. https://aclanthology.org/L16-1042.

Kučera, Henry & W. Nelson Francis. 1967. Computational Analysis of present-day American English. Providence, RI: Brown University Press.

Kilgarriff, Adam. 2001. Comparing corpora. International Journal of Corpus Linguistics 6(1). 97–133. https://doi.org/10.1075/ijcl.6.1.05kil.

Kouwenhoven, Huib, Mirjam Ernestus & Margot van Mulken. 2018. Register variation by Spanish users of English: The Nijmegen corpus of Spanish English. Corpus Linguistics and Linguistic Theory 14(1). 35–63. https://doi.org/10.1515/cllt-2013-0054.

Leban, Gregor, Blaž Fortuna & Marko Grobelnik. 2016. Using news articles for realtime cross-lingual event detection and filtering. In Proceedings of the recent trends in news information retrieval workshop, 33–38. Padua, Italy: European Conference on Information Retrieval. http://ceur-ws.org/Vol-1568/paper6.pdf.

Li, Haipeng & Jonathan Dunn. 2022. Corpus similarity measures remain robust across diverse languages. Lingua 275. 103377. https://doi.org/10.1016/j.lingua.2022.103377.

Nanayakkara, Purnima & Surangika Ranathunga. 2018. Clustering Sinhala news articles using corpus-based similarity measures. In Proceedings of the Moratuwa engineering research conference, 437–442. Moratuwa, Sri Lanka: Institute of Electrical and Electronics Engineers. https://doi.org/10.1109/MERCon.2018.8421890.

Nini, Andrea. 2019. The multi-dimensional analysis tagger. In Tony Berber Sardinha & Marcia Veirano Pinto (eds.), Multi-dimensional analysis: Research methods and current issues, 67–94. London & New York: Bloomsbury Publishing PLC. https://doi.org/10.5040/9781350023857.0012.

Sardinha, Tony Berber. 2018. Dimensions of variation across Internet registers. International Journal of Corpus Linguistics 23(2). 125–157. https://doi.org/10.1075/ijcl.15026.ber.

Sardinha, Tony Berber, Carlos Kauffmann & Cristina Mayer Acunzo. 2014. A multi-dimensional analysis of register variation in Brazilian Portuguese. Corpora 9(2). 239–271. https://doi.org/10.3366/cor.2014.0059.

Tiedemann, Jörg. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the international conference on language resources and evaluation, 2214–2218. Istanbul, Turkey: European Language Resources Association. http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf.

Supplementary Material

The online version of this article offers supplementary material (https://doi.org/10.1515/cllt-2021-0090).

Received: 2021-09-05

Accepted: 2022-09-01

Published Online: 2022-09-20

Published in Print: 2023-10-26
