Accounting for the relationship between lexical prevalence and acquisition with Bayesian networks and population dynamics

Andreas Baumann; Katharina Sekanina

doi:10.1515/lingvan-2021-0038

Open Access Article

Published/Copyright: November 24, 2022

From the journal Linguistics Vanguard, Volume 8, Issue 1

Abstract

Lexical dispersion and acquisition are evidently linked to each other. In one direction, the acquisition of a word is promoted by it being used frequently and in diverse contexts. Conversely, words that are acquired early might have higher chances of being produced frequently and diversely. In this study, we analyze various measures of lexical dispersion and assess the extent to which they are linked to age of acquisition by means of a Bayesian network model. We find that lexical prevalence, that is, the fraction of individuals knowing a word, is most closely linked to acquisition and argue that this can be partially explained by the population dynamics of lexical spread. We also highlight related cognitive mechanisms in language acquisition.

Keywords: age of acquisition; Bayesian network; lexical dispersion; population dynamics; prevalence

1 Introduction

Lexical dispersion – that is, the extent to which words are used within a population of speakers and distributed in linguistic usage – is evidently connected with lexical acquisition. After all, words need to be acquired first by either children or adults before they are used. Conversely, it is evident that words which are used frequently and in many different contexts (like food or walk) are also acquired earlier than comparably rare lexical items (like sonogram or hyrax; e.g., Tomasello 2003; Kuperman et al. 2012).

Lexical dispersion can be measured in various different ways, such as through usage frequency (Bybee 2007); in terms of the number of texts, contexts, or utterances a word surfaces in (i.e., lexical dispersion in the narrow sense; Gries 2016); or by determining the area in which a word is established or the variety of social strata in which a word is known (Trudgill 2001; Mufwene 2001). More recently, Keuleers et al. (2015) highlighted the concept of “word prevalence”, that is, the percentage of people that know a word; a concept also referred to as “abundance” earlier by Nowak (2000) in the context of population dynamic models of lexical spread. It is expected that all these variables are negatively related to age of acquisition (i.e., they promote or are promoted by early acquisition). The question is, which of these variables affects or is affected most strongly by acquisition?

In this paper, we compare different measures of lexical dispersion and empirically assess the extent to which each of these measures relates to lexical acquisition. We show that lexical prevalence is most closely related to the acquisition of words and argue that this close relationship is to be expected given (a) cognitive mechanisms in language acquisition and usage and (b) that lexical proliferation represents a population dynamic spreading phenomenon.

Why is this relevant? Lexical dispersion measures differ as to how they are distributed. As is well known, token frequency typically follows a Zipfian distribution: a few words are highly frequent, while the majority of words display a relatively low token frequency (Zipf 1949; Piantadosi 2014; Corominas-Murtra and Solé 2010). Because of this, the relationship between frequency rank and token frequency follows an inverse power law (with an exponent of about −1, depending on the actual language). This relationship is displayed in Figure 1 (dark orange line). Lexical prevalence, in contrast, follows a substantially different distribution. Here, most words are known by (almost) everyone while relatively few words are known only by a smaller share of the speaker population. When plotting prevalence against (logged) frequency rank (Figure 1, light orange line), the relationship is concave, rather than convex as is the case for Zipf’s law, and almost linear up to a knee at about rank 10,000. That is, the two dispersion measures seem to contain different amounts of information about lexical core and periphery, as can be clearly seen in Figure 1. While frequency accounts for much of the variability in the core lexicon (maybe up to 5,000 words), prevalence can be exploited to differentiate between lexical periphery items (about rank 5,000–10,000 and upward).

Figure 1:

Normalized token frequency (dark; diamonds) and prevalence (light; circles) depending on frequency rank, for about 30,000 words (cf. the description of Data Set 2 below). In the case of token frequency, the relationship is convex and follows Zipf’s frequency law (dark orange line), i.e., an inverse power law with an exponent of about −0.73. In contrast, the relationship between rank and prevalence is concave and relatively flat in the core-lexicon regime, a result formally derived by Nowak (2000: Figure 3 and Section 4.2).
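The power-law exponent mentioned for the rank-frequency relationship can be estimated with a straight-line fit on log-log axes. The following is a minimal sketch of that idea, not the procedure used for Figure 1; the function name and the toy frequencies (generated from an exact power law) are assumptions made for illustration.

```python
import math

def fit_power_law_exponent(frequencies):
    """Estimate b in freq ~ c * rank**b by least squares on log-log axes."""
    ranked = sorted(frequencies, reverse=True)  # rank 1 = most frequent word
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(freq) for freq in ranked]    # frequencies must be positive
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical toy data: 500 frequencies that follow a power law exactly.
toy = [1000.0 * rank ** -0.73 for rank in range(1, 501)]
exponent = fit_power_law_exponent(toy)
```

On real corpus counts the fit is noisier, and the estimate depends on which rank range is included.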

The question remains as to whether these distributional differences influence how strongly the respective dispersion measures (in this case, frequency and prevalence) relate to age of acquisition. Indeed, Brysbaert et al. (2016) reveal a processing advantage of words with high prevalence, which is robust even if additional variables like word frequency or phonological properties are considered. Thus, prevalence seems to provide information about linguistic processing that is not already contained in token frequency. Consequently, we argue that if prevalence and age of acquisition turn out to be closely linked, lexical prevalence should not be neglected as a relevant factor in studies of lexical acquisition. In this contribution, we provide arguments for this link by means of empirical data analysis and theoretical considerations.

In Section 2, we describe the data sets that we used and combined for the purpose of our analysis, as well as the crowdsourcing study that we conducted to gather information in addition to that provided by extant resources. More specifically, we collected estimates for passive as well as active prevalence (i.e., the difference between having only heard or also actively used a given word), detailed geographic information, and social information. In Section 3, we present our data analysis involving statistical modeling and Bayesian networks to learn about dependencies among dispersion measures. This is complemented with a follow-up analysis of the functional relationship between age of acquisition and prevalence based on a larger data set (Brysbaert et al. 2019). In Section 4, we first discuss cognitive factors involved in the relationship between prevalence and acquisition. Second, we relate our empirical findings to population dynamic models of linguistic spread. We conclude our paper in Section 5.

2 Data

Our empirical analysis is partially based on existing data resources and partially on data that we collected for this study. Age-of-acquisition (aoa) ratings were taken from a data set compiled by Kuperman et al. (2012). It consists of subjective age-of-acquisition estimates for about 30,000 words collected through crowdsourcing efforts. These estimates were shown to correlate robustly with age-of-acquisition norms obtained under laboratory conditions (Kuperman et al. 2012). Word frequencies (log freq; normalized) were also taken from this data set and subsequently log-transformed.

The data set compiled by Kuperman et al. (2012) also contains information about the fraction of participants who know a lexical item (“fraction known”), which already provides an estimate of lexical prevalence. However, estimates in that study are only based on roughly 20 participants per word. Brysbaert et al. (2019) collected prevalence data for more than 60,000 words with high accuracy (more than 300 data points per word); however, we wanted to add socio-geographic information, and to differentiate between passive and active knowledge. Thus, we extracted a stratified sample of 500 words from the list of Kuperman et al. (2012) in such a way that 50 words were randomly chosen for every decile of “fraction known”, thus covering rare as well as commonly known items. In addition, we sampled 25 high-frequency core items that we assumed to be known by all participants (e.g., rabbit). Together with a list of 25 nonce words,^[1] these core lexemes were added to the 500 words as control items. The distributions of age of acquisition in the data compiled by Kuperman et al. (2012) and that of this study are shown in Figure 2.
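The stratified sampling step can be sketched as follows. This is an illustration under stated assumptions, not the original sampling script: deciles are formed here by rank (ten equal-sized bins of the sorted word list), although binning by value ranges of “fraction known” is an equally plausible reading, and the toy lexicon and all names are hypothetical.

```python
import random

def stratified_sample(fraction_known, per_decile=50, seed=0):
    """Draw `per_decile` words at random from each decile of fraction known."""
    rng = random.Random(seed)
    # Sort words by fraction known and cut the list into ten equal-sized bins.
    words = sorted(fraction_known, key=fraction_known.get)
    n = len(words)
    sample = []
    for d in range(10):
        decile = words[(d * n) // 10:((d + 1) * n) // 10]
        sample.extend(rng.sample(decile, per_decile))  # without replacement
    return sample

# Hypothetical toy lexicon: 2,000 made-up words with random "fraction known".
toy_rng = random.Random(1)
toy_lexicon = {f"word{i}": toy_rng.random() for i in range(2000)}
selected = stratified_sample(toy_lexicon)
```

Sampling the same number of words per bin is what lets rarely known items appear as often in the sample as commonly known ones.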

Figure 2:

Densities of age of acquisition (aoa) in the data collected by Kuperman et al. (2012; roughly 30,000 words; cf. Data Set 2 below; light orange) and in this study (500 words; cf. Data Set 1 below; dark orange). Mean aoa is µ = 11 and µ = 13.9, respectively. The present study largely covers relatively late acquired items.

Subsequently, we conducted a crowdsourcing survey to collect relevant information for each word. Participants were asked to decide whether each word had been (1) heard but not used, (2) heard and used already, or (3) neither used nor heard at all. This enables a more differentiated assessment of lexical prevalence. Each participant had to rate all 550 items in this way. In addition, we collected geographic information (central coordinates of each participant’s hometown, i.e., longitude and latitude) and educational status (10 different options).

In total, we collected data from 100 participants. All data were collected with Prolific Academic (https://www.prolific.co/) and SoSci Survey (https://www.soscisurvey.de/), filtering for first language (English), nationality, and country of residence beforehand, and randomizing items. Participants received GBP 3.09 for their efforts. All answers were given anonymously, and no personal data was collected in the survey. After excluding unreliable responses based on the control items (threshold: at least 80% correct) and including only answers from Great Britain, 88 ratings per word remained in our data. This geographic limitation was employed to facilitate measuring spatial dispersion (see description of the variable area below).

Based on these data, we computed the fraction of speakers that have already heard (prev heard) and used (prev used) a word, giving us two different measures of lexical prevalence, corresponding to passive and active lexical knowledge, respectively. Given 88 ratings per word, the margin of error of these estimates is at most about ±10% (at the conventional 95% confidence level). This roughly corresponds to the accuracy for British participants in Brysbaert et al. (2019) and essentially doubles accuracy compared with the data of Kuperman et al. (2012). Note that lexical prevalence was assessed subjectively, very much like age-of-acquisition estimates in Kuperman et al. (2012) and scores based on vocabulary tests in Brysbaert et al. (2019).^[2]
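The two prevalence measures and the precision of their estimates can be checked with a short sketch. It rests on assumptions: that prev heard counts both “heard but not used” and “heard and used” responses, and that precision is judged by the normal approximation to a binomial proportion at its worst case (p = 0.5). The response labels and example ratings are hypothetical.

```python
import math

def prevalence(responses, accepted):
    """Fraction of ratings whose response category is in `accepted`."""
    return sum(r in accepted for r in responses) / len(responses)

def worst_case_standard_error(n):
    """Standard error of a proportion at p = 0.5, where it is largest."""
    return math.sqrt(0.25 / n)

def worst_case_margin_95(n):
    """Half-width of the normal-approximation 95% interval at p = 0.5."""
    return 1.96 * worst_case_standard_error(n)

# Hypothetical ratings for one word from 88 participants.
ratings = ["used"] * 30 + ["heard"] * 36 + ["unknown"] * 22
prev_heard = prevalence(ratings, {"heard", "used"})  # passive knowledge
prev_used = prevalence(ratings, {"used"})            # active knowledge

margin_88 = worst_case_margin_95(88)  # this study
margin_20 = worst_case_margin_95(20)  # roughly 20 raters per word
```

Because the error shrinks with the square root of the number of ratings, going from about 20 to 88 ratings roughly halves it.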

We derived a proxy for the area in which the word is known (area) by means of the rectangle defined by the most extreme coordinates for each word. Thus, this proxy is monotonically related to the maximum geographic distance between speakers that know a word. In order to obtain a proxy for social dispersion (social disp), we computed the entropy of educational status per word. Thus, high entropy relates to usage in diverse social strata while low entropy indicates a socially restricted usage (e.g., academic language).
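Both proxies are simple to compute. The sketch below assumes that area is the bounding rectangle in raw longitude/latitude degrees and that entropy is Shannon entropy with the natural logarithm; the source states neither the unit nor the log base, and the coordinates and education labels are hypothetical.

```python
import math
from collections import Counter

def bounding_rectangle_area(coordinates):
    """Area (squared degrees) of the rectangle spanned by the extreme coordinates."""
    if not coordinates:
        return 0.0
    lons = [lon for lon, lat in coordinates]
    lats = [lat for lon, lat in coordinates]
    return (max(lons) - min(lons)) * (max(lats) - min(lats))

def shannon_entropy(categories):
    """Shannon entropy (natural log) of the empirical category distribution."""
    counts = Counter(categories)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Hypothetical hometowns (longitude, latitude) of speakers who know a word.
hometowns = [(-0.1, 51.5), (-3.2, 55.9), (-2.2, 53.5)]
area = bounding_rectangle_area(hometowns)

# Hypothetical educational status of the same kind of speaker group.
education = ["secondary", "secondary", "bachelor", "bachelor"]
social_disp = shannon_entropy(education)
```

The same entropy function applies to a word's token counts across genres, after expanding the counts into one label per token or adapting the function to accept counts directly.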

Dispersion across texts (range) was taken from the Corpus of Contemporary American English (COCA, Davies 2008; SOAP sub-corpus; see Appendix A1). This score measures the number of texts in which a word is used (it was shown to be immediately related to lexical processing; cf. Gries 2008). Due to its skewed distribution, this variable was log-transformed as well (after uniformly adding 1 to all counts to avoid negative infinity values). Finally, dispersion across genres (genre disp) was determined by extracting token frequency distributions across all genres in the British National Corpus (BNC) for each word, and subsequently computing entropy for each distribution. Again, high entropy indicates usage in diverse genres whereas low entropy indicates a restriction to few genres (or just one). In case of a word not surfacing in BNC, entropy was set to zero.^[3]

After eliminating the control items, this leaves our data set at 500 instances with estimates for eight properties each (Data Set 1; see supplementary materials). Table 1 shows a sample of 10 entries in that data set. Table 2 shows summary measures for all variables.

Table 1:

Sample of 10 words in Data Set 1 with their dispersion measures and age of acquisition.

Word	aoa	log freq	prev heard	prev used	area	log range	social disp	genre disp
furze	13.86	−3.93	0.16	0.03	35.34	0	1.04	0.68
counterespionage	12.79	−1.85	0.68	0.32	58.41	0	1.4	1.39
midden	13.25	−2.32	0.4	0.15	50.73	0	1.48	0
ergonomic	14.29	−2.32	0.9	0.69	60.38	0	1.49	1.5
frankincense	10.14	−1.45	0.9	0.77	52.25	1.39	1.53	1.76
freshet	13.71	−3.93	0.08	0	38.93	0	1.75	0
comp	16.47	0.51	0.69	0.41	50.71	4.14	1.63	0
pratfall	13.5	−2.83	0.31	0.08	44.59	1.61	1.25	1.53
kaput	11.35	0	0.76	0.52	52.25	4.09	1.5	1.19
bolus	15.6	−0.41	0.33	0.17	50.15	2.2	1.2	1.04

Table 2:

Summary measures for all variables in Data Set 1 (after log-transformation, if applicable; before z-transformation, see Section 3).

	min	median	max
aoa	5.78	13.93	21
log freq	−3.93	−2.83	1.05
log range	0	0	5.6
prev heard	0	0.42	1
prev used	0	0.14	0.99
area	0	49.91	61.33
social disp	0	1.44	1.78
genre disp	0	0.73	1.89

For a follow-up analysis of the relationship between age of acquisition and prevalence based on a larger number of words, we combined the data sets from Kuperman et al. (2012) and Brysbaert et al. (2019). This resulted in a set of age-of-acquisition norms and prevalence estimates (again, not probit-transformed) for 30,794 words (Data Set 2).

3 Empirical data analysis

The statistical analysis unfolds in three steps. First, we computed straightforward linear regression models, one for each dispersion measure (as dependent variable) in Data Set 1, in which aoa figures as predictor. All variables were z-transformed to obtain standardized regression coefficients. Results are shown in Figure 3b. It can be seen, first, that as expected lexical dispersion is inversely correlated with aoa in all cases. More importantly, it is both measures of prevalence which are most strongly correlated with acquisition (effect of β = −0.42 and −0.4 for prev heard and prev used, respectively).
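The first step can be illustrated with a small sketch of a standardized simple regression. With a single predictor and z-transformed variables, the least-squares slope equals the Pearson correlation. The example uses only the ten sample rows of Table 1, so its result is not expected to reproduce the coefficients reported for the full 500 words; the function names are assumptions.

```python
import math

def z_transform(values):
    """Center values and scale them to unit (sample) standard deviation."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
    return [(v - mean) / sd for v in values]

def standardized_slope(predictor, outcome):
    """Least-squares slope of z(outcome) on z(predictor); the intercept is zero."""
    x = z_transform(predictor)
    y = z_transform(outcome)
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

# aoa and prev heard for the ten sample words listed in Table 1.
aoa = [13.86, 12.79, 13.25, 14.29, 10.14, 13.71, 16.47, 13.5, 11.35, 15.6]
prev_heard = [0.16, 0.68, 0.4, 0.9, 0.9, 0.08, 0.69, 0.31, 0.76, 0.33]
beta = standardized_slope(aoa, prev_heard)
```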

Figure 3:

(a) Bayesian network of aoa (dark orange) and all lexical dispersion measures (light orange). Edge weights correspond to absolute Gaussian regression coefficients. (b) Separate linear models of lexical dispersion depending on aoa together with standardized regression coefficients.

In the second step of the statistical analysis, we were interested in identifying dependencies between all dispersion measures and acquisition. To this end, we fitted a Bayesian network to our data (Pearl 2004). Bayesian networks are probabilistic directed acyclic graph models which have been developed to identify conditional relationships between variables (although it is debated whether this is linked to the concept of causality). In these graphs, variables figure as nodes. A pair of nodes is linked through a directed edge if one node conditionally depends on the other. For example, an edge aoa → prev heard means that passive prevalence conditionally depends on age of acquisition. In the continuous case, under the assumption of normally distributed variables, the distribution of each node is given by a normal distribution determined by the nodes pointing to it. That is, the computation of a Bayesian network boils down to fitting linear multivariate Gaussian models for sets of nodes, one of them being the dependent variable in that model, and selecting models based on some learning algorithm (here, greedy hill-climbing; Scutari and Denis 2014). Consequently, in our analysis both the structure of the network and the weights on its edges are derived automatically from the data and not determined a priori. The Bayesian network fitted to our data is visualized in Figure 3a, where the weights of the edges correspond to absolute regression coefficients in the respective Gaussian models. Coefficients with signs are shown in Table A2.
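To make the model-selection idea concrete, the sketch below scores a linear Gaussian node with and without a single parent and performs one greedy step. It is a deliberately reduced illustration, not the fitting procedure used for Figure 3a: BIC is assumed as the score (the source does not name the score), each node is limited to at most one parent, acyclicity checks are omitted, and the toy data and all names are hypothetical.

```python
import math

def gaussian_bic(child, parent=None):
    """BIC score (higher is better) of a linear Gaussian node with 0 or 1 parent."""
    n = len(child)
    mean_y = sum(child) / n
    if parent is None:
        residuals = [y - mean_y for y in child]
        n_params = 2  # intercept and variance
    else:
        mean_x = sum(parent) / n
        sxx = sum((x - mean_x) ** 2 for x in parent)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(parent, child))
        slope = sxy / sxx
        residuals = [(y - mean_y) - slope * (x - mean_x)
                     for x, y in zip(parent, child)]
        n_params = 3  # intercept, slope, and variance
    variance = sum(r * r for r in residuals) / n  # maximum-likelihood estimate
    log_likelihood = -0.5 * n * (math.log(2 * math.pi * variance) + 1)
    return log_likelihood - 0.5 * n_params * math.log(n)

def best_single_parent(child, candidates):
    """One greedy step: the candidate parent that most improves BIC, or None."""
    best_name, best_score = None, gaussian_bic(child)
    for name, values in candidates.items():
        score = gaussian_bic(child, values)
        if score > best_score:
            best_name, best_score = name, score
    return best_name

# Hypothetical toy data: prevalence falls with aoa, plus small alternating noise.
aoa_toy = [float(i) for i in range(20)]
prev_toy = [1.0 - 0.04 * a + (0.02 if i % 2 else -0.02)
            for i, a in enumerate(aoa_toy)]
unrelated_toy = [float((7 * i) % 5) for i in range(20)]
chosen = best_single_parent(prev_toy, {"aoa": aoa_toy, "unrelated": unrelated_toy})
```

A full hill-climbing search repeats such steps over edge additions, deletions, and reversals across all nodes until no single change improves the total score.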

In Figure 3a, prev heard is most closely linked with acquisition, followed by genre disp and social disp. All other dispersion measures are only indirectly connected to aoa. In particular, the often-investigated measure of frequency is least closely correlated with acquisition. This suggests that acquisition is more strongly connected with, and possibly also driven by, social interactions rather than simply frequency of exposure.^[4]

The variable genre disp is interesting in that it features a weak relationship with aoa in the linear regression analysis; in the Bayesian network, however, genre disp seems to directly depend on many variables, one of them being aoa. A closer look at the sign of the coefficient (of the impact of aoa on genre disp) in Table A2 reveals that, if log range, log freq, and prev heard are also considered, then genre disp in fact increases with aoa. That is, compared to the univariate linear model (in Figure 3b), the sign of the relationship is reversed in the multivariate model. Dispersion across genres seems to interact nontrivially with acquisition, albeit weakly so.

In a third step, we assessed the shape of the functional relationship between age of acquisition and prevalence based on Data Set 2, that is, the data set of 30,794 words. To do so, we fitted a generalized additive model (GAM, Wood 2006) to our data in which age of acquisition and prevalence figure as dependent and independent variable, respectively. Crucially, prevalence entered the model as a smooth term. The initial number of knots (k = 30) in this term was determined through an AIC-based grid search over k (Burnham and Anderson 2002; see Figure A1). The resulting model is visualized in Figure 4a.

Figure 4:

(a) GAM of age of acquisition depending on prevalence based on Data Set 2 (gray area indicates a 95% confidence band; adjusted R² = 0.384). The functional relationship is roughly linear up to a knee close to p = 1.0. A small set of lexical examples was sampled randomly for each decile. (b) Illustration of the theoretical approach described in Section 4. Left: Schematic representation of the relationship between a word’s age of acquisition and prevalence at equilibrium under the assumption of a rectangular age distribution. Right, first row: Logistic model of word prevalence in a speaker population. Right, second row: Age of acquisition is linearly related to equilibrium prevalence.

4 Discussion and theoretical considerations

Our analysis corroborates the close connection between acquisition and lexical prevalence, that is, the dispersion of words in speaker populations. This relationship could go in two directions: on the one hand, early acquisition could be a reflex of high lexical prevalence; on the other hand, early lexical acquisition might drive lexical prevalence.

Let us discuss the former direction first. To begin with, we can argue that acquisition is promoted by a high abundance of linguistic informants. A word that is known (and subsequently also used) by many speakers has a higher chance to be picked up by someone who does not yet know that word than a word known only by a small set of individuals.

In addition to this essentially probabilistic argument, which we will come back to below, cognitive factors and generalization across situations and communication events are likely to play a role (Kemler Nelson et al. 2000). For example, Schwartz and Terrell (1983) show that in infants, distributed exposure promotes lexical acquisition more strongly than massed exposure (i.e., many instances at the same time). The same was found by Childers and Tomasello (2002), who show that exposure across several events facilitates word learning over single massed exposure even if overall frequency is kept constant. Since the present study focuses on the acquisition of words that are acquired relatively late (cf. Figure 2), insights drawn from the study of early lexical acquisition (Goldfield and Reznick 1990) must be treated with caution in this context. However, the promoting effect of repetition in language learning is also visible more generally in adolescents and adults in various linguistic domains (Dempster 1996). Arguably, lexical prevalence, social dispersion, or areal dispersion are more closely related to distributed exposure than to frequency, as operationalized in our analysis. In particular, high prevalence implies that a word can be used to communicate with many different speakers rather than being employed idiosyncratically.

Our results may also be a reflex of word semantics. There is robust evidence that polysemous words are more frequent (Casas et al. 2019) and processed faster than words with only a single meaning (at least if they are abstract; Jager and Cleland 2016). If polysemy reflects dispersion across different contexts and if contextual versatility is correlated with the number of interlocutors, this might explain the relationship between lexical prevalence and acquisition, at least partially. However, when looking at the roles of range and dispersion across genres in our analysis (Figure 3), the situation becomes less clear. Both variables may relate to polysemy, but range is only weakly related to age of acquisition and high genre dispersion even seems to be associated with late acquisition if other covariates are also considered.^[5] Potentially, this mismatch can be explained by semantic bleaching effects, which are typically associated with late acquisition and impeded learning (Brown 1973; De Groot and Keijzer 2000; Hopper and Traugott 2003). We conclude that the role of semantics in the relationship between acquisition and prevalence needs to be investigated in further detail.

Let us now consider the other direction, that is, prevalence being a reflex of acquisition. One argument is that words which are acquired early are highly entrenched, which has an immediate effect on the chances of stably using (and not forgetting) those words. This might be grounded in increased cognitive plasticity at early ages (cf. Monaghan 2014) or cumulative frequency (Ghyselinck et al. 2004). Arguably, if an individual acquires a word earlier, they have more time to produce that word, which increases the chances of passing it on to other speakers.

This probabilistic argument is closely related to population dynamic accounts of language. Linguistic constituents such as words are transmitted through speaker populations and across generations (Croft 2000; Ritt 2004), so that the proliferation of words represents a spreading phenomenon (Barabási 2016). Consequently, the spread of linguistic constituents has been modeled in terms of population dynamics (Cavalli-Sforza and Feldman 1981; Nowak 2000; Solé et al. 2010; cf. Figure 4b, right). In these models, linguistic spread is always a function of population structure and the number of interactions among individuals. In particular, we have shown elsewhere in terms of a simple logistic model of learner-user interactions that age of acquisition and prevalence (next to diachronic growth) are both related to the reproduction rate of linguistic constituents (Baumann and Ritt 2018).

From this model, it can be derived that age of acquisition and prevalence are linearly related to each other (Figure 4b, right). The argument is straightforward and analogous to considerations about age of first infection in epidemiology (Dietz 1993; Heffernan et al. 2005). Let us assume that a particular word is close to its population dynamic equilibrium (i.e., its prevalence does not grow or decline strongly). Let us furthermore assume a roughly rectangular (i.e., uniform) age distribution. Then the ratio of the number of people not knowing the word to the total population size must be the same as the ratio of age of acquisition to life expectancy. This is illustrated in Figure 4b (left). From this it follows that age of acquisition is a linearly decreasing function of prevalence. Thus, the close relationship between acquisition and lexical prevalence that we see in our data can be interpreted as a reflex of straightforward population dynamics of the interactions among users and learners. This mathematical relationship does not hold well for words that are known by almost everyone, but it serves as a good model for the relationship between acquisition and prevalence in relatively rare words. That this holds true also empirically can be seen in Figure 4a, which shows the generalized additive model fitted to the data collected by Kuperman et al. (2012) and Brysbaert et al. (2019). Up to a knee close to a prevalence of 100%, the relationship between age of acquisition and prevalence seems to be fairly linear.
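The equilibrium argument can be written out in a few lines. Writing p for prevalence, A for age of acquisition, and L for life expectancy, the stated equality of ratios is 1 − p = A / L, which rearranges to A = L · (1 − p). The sketch below encodes only this idealized relationship; the symbols, the function name, and the default value of L are assumptions for illustration, and the slope is not fitted to any data.

```python
def expected_aoa(prevalence, life_expectancy=80.0):
    """Age of acquisition implied by A / L = 1 - p at equilibrium.

    Assumes a rectangular age distribution; `life_expectancy` (L) is a
    hypothetical illustrative value, not an estimate from the study.
    """
    if not 0.0 <= prevalence <= 1.0:
        raise ValueError("prevalence must lie between 0 and 1")
    return life_expectancy * (1.0 - prevalence)

# Equal steps in prevalence produce equal steps in age of acquisition.
step_low = expected_aoa(0.25) - expected_aoa(0.5)
step_high = expected_aoa(0.5) - expected_aoa(0.75)
```

The constant step size is what linearity means here. As the text notes, the idealization breaks down for words known by almost everyone, where the function would predict an age of acquisition near zero.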

The directionality of the Bayesian network in Figure 3a hints at effects of acquisition on prevalence (and in turn other dispersion measures). However, it is important to point out that Bayesian networks are models of joint probability distributions defined by conditional probabilities among variables. This does not necessarily entail causal relationships among them (see Pearl 2004 for some discussion). Nevertheless, the network in Figure 3a does reveal plausible chains of dependencies, such as aoa → prev heard → prev used → log freq.

It is evident that our study has many shortcomings. The number of words analyzed (500) is low, in particular in comparison with the large-scale studies by Brysbaert and colleagues (Brysbaert et al. 2019; Kuperman et al. 2012). Also, the focus of our study is on analyzing relatively late-acquired periphery items. Figures 2 and 4a show that words which are not known by all speakers are acquired roughly at the age of 12 or later. For that reason, we can say little about how prevalence and age of acquisition hang together in early stages of lexical acquisition.

Furthermore, our proxies for spatial and social dispersion are rather simplistic. With more participants per word, one could define more accurate measures of geographic dispersion (e.g., by also taking areas of regions and more fine-grained diversity measures into account) rather than just assessing the most extreme area implied by the data. Likewise, basing social dispersion on educational status alone is simplistic. Information about income class, for example, strikes us as a reasonable extension. Regarding social dispersion, another issue arises in our study. Since we rely on a relatively small set of words, chances are high that words with low prevalence are those that are related to scientific or academic contexts, so that the entropy of social status for a word (as computed here) is in fact correlated with the average social status of individuals knowing that word. We suggest that future studies with limited sample sizes should feature stratified sampling of words with respect to speech registers as well to account for such a bias (e.g., by including slang words like shyster).

Finally, the population dynamic model discussed above is based on simplistic assumptions. It assumes homogeneous mixing rather than a more plausible social network structure (Barabási 2016; Blythe and Croft 2012). In particular, Barabási (2016) shows that the reproduction rate of spreading phenomena is crucially influenced by social network structure in that the threshold for successful spread is decreased, which may in turn influence the relationship between age of acquisition and prevalence (Heffernan et al. 2005; Pastor-Satorras and Vespignani 2001). Since the acquisition of lexical periphery largely takes place during adolescence (cf. Figure 2), the setup and change of social networks in this age group should be considered as well. First, there is robust evidence that social networks grow most rapidly between the ages of 10 and 20 (Wrzus et al. 2013). Thus, ideally, network size should be controlled for. Second, and directly related to the latter consideration, there is evidence that adolescents tend to adapt to older peers, which violates the assumption of homogeneous mixing (Kerswill 1996). Moreover, the model is based on an age-independent learning rate, and it does not account for interactions among linguistic constituents (e.g., via phonological or semantic relatedness).^[6]

5 Conclusion

In this paper, we have analyzed various measures of lexical dispersion based on different data sources. We have shown that lexical prevalence seems to be relatively closely linked to the age of acquisition of words. We have argued that this relationship may be partially accounted for by cognitive mechanisms related to word learning, but that it may also be construed probabilistically. Here, we have argued that, if the proliferation of words is modeled as a spreading phenomenon, then, under certain assumptions, the close relationship between age of acquisition and prevalence is a logical consequence of such a model. Despite the limitations of our study laid out in the previous section, we consider our findings interesting enough to motivate further research in this direction. In particular, we would like to highlight the relevance of collecting and analyzing word-prevalence data next to word-frequency measures. This is in line with Brysbaert et al. (2016), who stress the relevance of prevalence for psycholinguistic research. Our research, however, approaches this matter from a different point of view in that we argue prevalence to be also relevant to research on linguistic dynamics, which is often restricted to studying diachronic frequency developments, as far as empirical analysis is concerned. Finally, we stress that studying lexical periphery in addition to core lexical items is important to get a more complete picture of lexical acquisition, usage, and change.

Corresponding author: Andreas Baumann, Department of European and Comparative Literature and Language Studies, University of Vienna, Universitätsring 1, 1010 Vienna, Austria, E-mail: andreas.baumann@univie.ac.at

Acknowledgments

We would like to thank Theresa Matzinger, Magdalena Schwarz, Vanja Vukovic, and the editors of this journal, as well as two anonymous reviewers, for valuable comments and feedback. The crowdsourcing survey in this study was funded by the Faculty of Philological and Cultural Studies, University of Vienna.

Appendix

A1 Methods

Corpus selection and range robustness checks: British corpus data, such as the British National Corpus (BNC), would be in principle more suitable for assessing range in our study, since the participants of our crowdsourcing task were British as well. SOAP, although representing American English, was preferred over BNC, COCA, and their respective spoken sub-corpora for multiple reasons. First, it features spoken and colloquial data, which we consider more relevant to our argument. Second, with about 22,000 soap episodes, SOAP collects a larger number of texts than is the case for BNC (spoken), with only about 900 texts. This makes our estimates of range more robust. Third, SOAP does not feature a large gap in terms of range between words that surface in only few texts and words that do not occur at all in the corpus, so that the distribution of range is a bit less bimodal in SOAP than in BNC or COCA. Finally, range is highly correlated across the abovementioned corpora in any case, and altering the underlying corpus for range did not affect the qualitative nature of our findings. Pairwise correlations were computed for SOAP, BNC, COCA, and the spoken sub-corpora of BNC and COCA (see Table A1). All coefficients are high and significantly different from zero (95% confidence level).

Table A1: Pairwise Pearson correlation coefficients for range in a selection of spoken and mixed corpora (both British and American English).

	BNC (spoken)	COCA (spoken)	SOAP	BNC	COCA
BNC (spoken)	1.00	0.90	0.87	0.88	0.90
COCA (spoken)	0.90	1.00	0.89	0.77	0.97
SOAP	0.87	0.89	1.00	0.70	0.84
BNC	0.88	0.77	0.70	1.00	0.83
COCA	0.90	0.97	0.84	0.83	1.00
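
The robustness check described above can be sketched as follows. This is a minimal illustration and not the authors' code (their analysis is available in the R project referenced in A2): the data are simulated, the function names `pearson` and `fisher_ci` are hypothetical, and the Fisher z-transformation is only one common way of obtaining an approximate 95% confidence interval for a correlation coefficient. A coefficient counts as significantly different from zero at that level if the interval excludes zero.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% confidence interval for r (Fisher z-transformation)."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# Simulated (hypothetical) range values for the same 200 words in two corpora.
random.seed(1)
range_a = [random.gauss(0, 1) for _ in range(200)]
range_b = [a + random.gauss(0, 0.5) for a in range_a]

r = pearson(range_a, range_b)
low, high = fisher_ci(r, len(range_a))
excludes_zero = low > 0 or high < 0
print(f"r = {r:.2f}, 95% CI = [{low:.2f}, {high:.2f}], excludes zero: {excludes_zero}")
```

Applying the same two functions to every pair of corpora would yield a symmetric matrix of the kind shown in Table A1.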

Bayesian network coefficients: The Bayesian network computed in Section 3 effectively consists of multiple (univariate or multivariate) Gaussian regression models that determine the conditional probabilities. In Figure 3a, weights in the network are displayed as absolute values to denote the strength of association between the respective linked variables. Table A2 below shows the corresponding slopes (i.e., regression coefficients) with signs.

Table A2: Slopes of the Gaussian models in the fitted Bayesian network.

from (independent)	to (dependent)	coefficient
prev heard	prev used	0.91
prev heard	area	1.28
area	social disp	0.60
log freq	log range	0.45
prev heard	genre disp	0.41
prev used	log freq	0.65
aoa	prev heard	−0.42
prev used	area	−0.55
prev used	log range	0.29
log range	genre disp	0.23
aoa	genre disp	0.15
aoa	social disp	−0.12
log freq	genre disp	0.14
prev heard	log freq	−0.24
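
To illustrate how such slopes define the conditional distributions in a linear Gaussian network, the sketch below encodes the coefficients from Table A2 and computes the conditional mean of a node given its parents. It rests on assumptions not stated in the table: variables are treated as centered, so intercepts are set to zero, and the helper names `parents`, `conditional_mean`, and `path_effect` are hypothetical. The effect transmitted along a directed path in such a model is the product of the slopes on that path.

```python
# Signed slopes copied from Table A2: (parent, child) -> coefficient.
SLOPES = {
    ("prev heard", "prev used"): 0.91,
    ("prev heard", "area"): 1.28,
    ("area", "social disp"): 0.60,
    ("log freq", "log range"): 0.45,
    ("prev heard", "genre disp"): 0.41,
    ("prev used", "log freq"): 0.65,
    ("aoa", "prev heard"): -0.42,
    ("prev used", "area"): -0.55,
    ("prev used", "log range"): 0.29,
    ("log range", "genre disp"): 0.23,
    ("aoa", "genre disp"): 0.15,
    ("aoa", "social disp"): -0.12,
    ("log freq", "genre disp"): 0.14,
    ("prev heard", "log freq"): -0.24,
}

def parents(child):
    """Names of all nodes with an edge pointing into `child`."""
    return sorted(p for (p, c) in SLOPES if c == child)

def conditional_mean(child, parent_values, intercept=0.0):
    """E[child | parents] for a linear Gaussian node.

    The zero intercept is an assumption (centered variables);
    Table A2 reports slopes only.
    """
    return intercept + sum(
        SLOPES[(p, child)] * parent_values[p] for p in parents(child)
    )

def path_effect(path):
    """Product of the slopes along a directed path of node names."""
    effect = 1.0
    for parent, child in zip(path, path[1:]):
        effect *= SLOPES[(parent, child)]
    return effect

# Expected log frequency when both prevalence measures are one unit above their mean.
print(conditional_mean("log freq", {"prev used": 1.0, "prev heard": 1.0}))
# Effect of age of acquisition on prevalence (used), mediated by prevalence (heard).
print(path_effect(["aoa", "prev heard", "prev used"]))
```

Under these assumptions, the negative product along aoa → prev heard → prev used reflects that later-acquired words tend to have lower prevalence.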

Grid search: To determine the initial number of basis functions (knots; k) in the GAM of age of acquisition depending on prevalence, AIC was computed for several initial values of k (from 10 to 100 in steps of 5). Figure A1 shows the AIC of each fitted GAM. According to Burnham and Anderson (2002), all models whose AIC differs from that of the AIC-minimal model by no more than about 7 can be considered plausible candidates; this is denoted by the orange region in Figure A1. The first model in that region is the one with k = 30, which was subsequently employed in Section 3 (Figure 4a).

Figure A1: AIC for GAMs with different initial k values. Orange region denotes plausible models.
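
The selection rule used in the grid search can be sketched as follows. The AIC values below are invented for illustration and do not reproduce the study's results (which led to k = 30); the function name `plausible_models` is hypothetical, and fitting the GAMs themselves is omitted. The sketch only shows the rule: keep every model whose AIC lies within about 7 of the minimum, then take the first one, here understood as the one with the smallest k.

```python
def plausible_models(aic_by_k, max_delta=7.0):
    """Return all k whose AIC is within `max_delta` of the minimal AIC, sorted."""
    best = min(aic_by_k.values())
    return sorted(k for k, aic in aic_by_k.items() if aic - best <= max_delta)

# Hypothetical AIC values for k = 10, 15, ..., 100 (not the values of the study).
aic_by_k = {k: 5000.0 + 0.02 * (k - 60) ** 2 for k in range(10, 101, 5)}

candidates = plausible_models(aic_by_k)
chosen_k = candidates[0]  # first model in the plausible region
print("plausible k:", candidates)
print("chosen k:", chosen_k)
```

Choosing the smallest plausible k favors the simplest smooth among models that are statistically hard to distinguish.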

A2 Code

The data and code in this study can also be accessed in the following RStudio Cloud project: https://rstudio.cloud/project/2225135.

References

Barabási, Albert-László. 2016. Network science. Cambridge: Cambridge University Press.

Baumann, Andreas & Nikolaus Ritt. 2018. The basic reproductive ratio as a link between acquisition and change in phonotactics. Cognition 176. 174–183. https://doi.org/10.1016/j.cognition.2018.03.005.

Blythe, Richard A. & William Croft. 2012. S-curves and the mechanism of propagation in language change. Language 88(2). 269–304. https://doi.org/10.1353/lan.2012.0027.

Brown, Roger. 1973. A first language: The early stages. Harvard: Harvard University Press. https://doi.org/10.4159/harvard.9780674732469.

Brysbaert, Marc, Paweł Mandera, Samantha F. McCormick & Emmanuel Keuleers. 2019. Word prevalence norms for 62,000 English lemmas. Behavior Research Methods 51. 467–479. https://doi.org/10.3758/s13428-018-1077-9.

Brysbaert, Marc, Michaël Stevens, Paweł Mandera & Emmanuel Keuleers. 2016. The impact of word prevalence on lexical decision times: Evidence from the Dutch lexicon project 2. Journal of Experimental Psychology: Human Perception and Performance 42(3). 441–458. https://doi.org/10.1037/xhp0000159.

Burnham, Kenneth P. & David R. Anderson. 2002. Model selection and multimodel inference: A practical information-theoretic approach. New York: Springer.

Bybee, Joan. 2007. Frequency of use and the organization of language. Oxford: Oxford University Press. https://doi.org/10.1093/acprof:oso/9780195301571.001.0001.

Casas, Bernardino, Antoni Hernández-Fernández, Neus Català, Ramon Ferrer-i-Cancho & Jaume Baixeries. 2019. Polysemy and brevity versus frequency in language. Computer Speech & Language 58. 19–50. https://doi.org/10.1016/j.csl.2019.03.007.

Cavalli-Sforza, Luigi L. & Marcus W. Feldman. 1981. Cultural transmission and evolution: A quantitative approach. Princeton: Princeton University Press. https://doi.org/10.1515/9780691209357.

Childers, Jane B. & Michael Tomasello. 2002. Two-year-olds learn novel nouns, verbs, and conventional actions from massed or distributed exposures. Developmental Psychology 38(6). 967–978. https://doi.org/10.1037/0012-1649.38.6.967.

Corominas-Murtra, Bernat & Ricard V. Solé. 2010. Universality of Zipf’s law. Physical Review E 82. 1–9. https://doi.org/10.1103/PhysRevE.82.011102.

Croft, William. 2000. Explaining language change: An evolutionary approach. Harlow, UK: Longman.

Davies, Mark. 2008. The corpus of contemporary American English (COCA). Update covering 1990–2012, with 450 million words. Birmingham: Brigham Young University. Available at: https://www.english-corpora.org/coca/.

Dempster, Frank N. 1996. Distributing and managing the conditions of encoding and practice. In Elizabeth Ligon Bjork & Robert A. Bjork (eds.), Memory, 317–344. San Diego: Academic Press. https://doi.org/10.1016/B978-012102570-0/50011-2.

Dietz, Klaus. 1993. The estimation of the basic reproduction number for infectious diseases. Statistical Methods in Medical Research 2. 23–41. https://doi.org/10.1177/096228029300200103.

De Groot, Annette M. B. & Rineke Keijzer. 2000. What is hard to learn is easy to forget: The roles of word concreteness, cognate status, and word frequency in foreign-language vocabulary learning and forgetting. Language Learning 50. 1–56. https://doi.org/10.1111/0023-8333.00110.

Ghyselinck, Mandy, Michael B. Lewis & Marc Brysbaert. 2004. Age of acquisition and the cumulative-frequency hypothesis: A review of the literature and a new multi-task investigation. Acta Psychologica 115. 43–67. https://doi.org/10.1016/j.actpsy.2003.11.002.

Goldfield, Beverly A. & J. Steven Reznick. 1990. Early lexical acquisition: Rate, content, and the vocabulary spurt. Journal of Child Language 17. 171–183. https://doi.org/10.1017/S0305000900013167.

Gries, Stefan T. 2008. Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics 13. 403–437. https://doi.org/10.1075/ijcl.13.4.02gri.

Gries, Stefan T. 2016. Dispersions and adjusted frequencies in corpora: Further explorations. In Stefan T. Gries, Stefanie Wulff & Mark Davies (eds.), Corpus-linguistic applications, 197–212. Amsterdam: Rodopi. https://doi.org/10.1163/9789042028012_014.

Heffernan, Jane, Robert Smith & Lindi Wahl. 2005. Perspectives on the basic reproductive ratio. Journal of The Royal Society Interface 2(4). 281–293. https://doi.org/10.1098/rsif.2005.0042.

Hopper, Paul J. & Elizabeth Closs Traugott. 2003. Grammaticalization. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9781139165525.

Jager, Bernadet & Alexandra A. Cleland. 2016. Polysemy advantage with abstract but not concrete words. Journal of Psycholinguistic Research 45. 143–156. https://doi.org/10.1007/s10936-014-9337-z.

Kemler Nelson, Deborah G., Rachel Russell, Nell Duke & Kate Jones. 2000. Two-year-olds will name artifacts by their functions. Child Development 71. 1271–1288. https://doi.org/10.1111/1467-8624.00228.

Kerswill, Paul. 1996. Children, adolescents, and language change. Language Variation and Change 8. 177–202. https://doi.org/10.1017/S0954394500001137.

Keuleers, Emmanuel, Michaël Stevens, Paweł Mandera & Marc Brysbaert. 2015. Word knowledge in the crowd: Measuring vocabulary size and word prevalence in a massive online experiment. Quarterly Journal of Experimental Psychology 68. 1665–1692. https://doi.org/10.1080/17470218.2015.1022560.

Kuperman, Victor, Hans Stadthagen-Gonzalez & Marc Brysbaert. 2012. Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods 44. 978–990. https://doi.org/10.3758/s13428-012-0210-4.

Monaghan, Padraic. 2014. Age of acquisition predicts rate of lexical evolution. Cognition 133. 530–534. https://doi.org/10.1016/j.cognition.2014.08.007.

Mufwene, Salikoko S. 2001. The ecology of language evolution. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511612862.

Nowak, Martin A. 2000. The basic reproductive ratio of a word, the maximum size of a lexicon. Journal of Theoretical Biology 204(2). 179–189. https://doi.org/10.1006/jtbi.2000.1085.

Pastor-Satorras, Romualdo & Alessandro Vespignani. 2001. Epidemic spreading in scale-free networks. Physical Review Letters 86. 3200–3203. https://doi.org/10.1515/9781400841356.493.

Pearl, Judea. 2004. Graphical models for probabilistic and causal reasoning. In Tucker Allen, Teofilo Gonzalez, Heikki Topi & Jorge Diaz-Herrera (eds.), Computer science handbook, 2nd edn., 1676–1693. London: Chapman and Hall/CRC.

Piantadosi, Steven T. 2014. Zipf’s word frequency law in natural language: A critical review and future directions. Psychonomic Bulletin & Review 21. 1112–1130. https://doi.org/10.3758/s13423-014-0585-6.

Ritt, Nikolaus. 2004. Selfish sounds and linguistic evolution: A Darwinian approach to language change. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511486449.

Scutari, Marco & Jean-Baptiste Denis. 2014. Bayesian networks. New York: Chapman and Hall/CRC. https://doi.org/10.1201/b17065.

Schwartz, Richard & Brenda Y. Terrell. 1983. The role of input frequency in lexical acquisition. Journal of Child Language 10. 57–64. https://doi.org/10.1017/S0305000900005134.

Solé, Ricard V., Bernat Corominas-Murtra & Jordi Fortuny. 2010. Diversity, competition, extinction: The ecophysics of language change. Journal of The Royal Society Interface 7(53). 1647–1664. https://doi.org/10.1098/rsif.2010.0110.

Tomasello, Michael. 2003. Constructing a language: A usage-based theory of language acquisition. Cambridge, MA: Harvard University Press.

Trudgill, Peter. 2001. Sociolinguistic variation and change. Edinburgh: Edinburgh University Press.

Wood, Simon. 2006. Generalized additive models: An introduction with R. Boca Raton, Florida: Chapman & Hall/CRC.

Wrzus, Cornelia, Martha Hänel, Jenny Wagner & Franz J. Neyer. 2013. Social network changes and life events across the life span: A meta-analysis. Psychological Bulletin 139(1). 53–80. https://doi.org/10.1037/a0028601.

Zipf, George Kingsley. 1949. Human behavior and the principle of least effort: An introduction to human ecology. Cambridge, MA: Addison-Wesley Press.

Received: 2021-03-08

Accepted: 2021-11-17

Published Online: 2022-11-24

This work is licensed under the Creative Commons Attribution 4.0 International License.


https://doi.org/10.1515/lingvan-2021-0038
