Article Open Access

Geospatial effects on phonological complexity in the world’s languages

  • Frederik Hartmann and Johanna Nichols
Published/Copyright: July 18, 2025

Abstract

Linguistic complexity has generally been seen as influenced by ecological, demographic, and sociolinguistic factors and has been approached by seeking correlations of increased complexity along one linguistic dimension with one or another extralinguistic factor. Here we use a multidimensional definition of phonological complexity and analyze its global patterning quantitatively across predefined continents or sets of continents. We use linear and nonlinear regression models to estimate the gradient of geographical distributions of various complexity measures. We found significantly lower phonological complexity levels in South America and Australasia, suggesting isolation by distance (from centers of demographic and economic influence in the Old World). An unexpected finding is that our models show a broad transitional zone between the different levels of complexity in Africa and Europe. This Euro-African Transition Zone (ETZ) has its core between the Equator and the African coastline of the Mediterranean Sea. It has not been regarded as a language area before and is not a convergence area, but it has a robust geospatial complexity profile and we propose it as an area. We also find reasons to regard phonological complexity as a coherent property in itself, with its several dimensions behaving as a whole, and not as adaptive but as geospatially patterned. This paper is an exercise in using synchronic analysis of geolinguistic patterns of complexity to draw inferences about long-term change and evolutionary trends in language. On a higher level, the paper employs methods of geospatial analysis novel to this particular problem with the goal of raising new questions and laying groundwork for wider research into the distribution of linguistic complexity.

1 Introduction

This paper is intended to draw attention to a largely unnoticed side of linguistic complexity. In recent work, linguistic complexity has been shown to be influenced by ecological, demographic, and sociolinguistic factors, i.e. it appears to be adaptive. Most such work seeks correlations of increased complexity along one linguistic dimension (such as phoneme inventory size) with one or another extralinguistic factor, whose geographical distribution vis-à-vis complexity serves as evidence in support of their correlation (e.g. Atkinson 2011). Positive correlations have been found with altitude (Nichols 2013; Nichols and Bentz 2018) and latitude (Bentz 2016), though such cases are generally thought to have ultimately ecological rather than direct geophysical causes (but see Everett 2013; Everett et al. 2016). Findings differ on whether complexity covaries with population size (pro: Hay and Bauer 2007; Lupyan and Dale 2010; Moran and Blasí 2014; Sinnemäki and Di Garbo 2018; con: Donohue and Nichols 2011; Nichols 2009; Wichmann et al. 2011 find an indirect correlation mediated by word length). It can covary with sociolinguistic parameters such as the number of adult L2 learners in the speech community (Bentz and Winter 2013; Bentz et al. 2015; Dahl 2004; Trudgill 2011).

Another line of research investigates correlations between complexity levels in different domains of grammar, such as size of vowel inventory versus consonant inventory (Maddieson 2013b); syllable complexity and consonant inventory size (Easterday 2019:127–149; Maddieson 2013a); and tone and syllable structure (Easterday 2019:193; Maddieson 2013b) within the domain of phonology; studies of different parts of grammar include Shosted (2006), Fenk-Oczlon and Fenk (2014) and Sinnemäki (2014a, 2014b). Such work is often presented as supporting or not supporting what is known as the equi-complexity hypothesis, the assumption that all languages are equally complex overall (Hockett 1958:180–181, with many later citations) and that therefore high complexity in one domain should be balanced against low complexity in another. Negative correlations support the hypothesis; positive ones counterindicate it, suggesting that overall complexity of individual languages can be greater or lesser. Of course, any one demonstration involving only two of the many domains of grammar tells us little about the overall complexity of a language, a vast matter that we have no way of measuring and probably never will. Rather, demonstrations that complexity covaries with external factors serve as powerful evidence against equi-complexity, and perhaps explain the recent emphasis on investigating possible adaptivity of language to external factors. Nonetheless, negative correlations continue to feature prominently in discussions of findings.

Here we inquire into a very different aspect of the distribution of complexity: a systematic study of whether and how it is geospatially patterned. This is an exploratory study in which we design a grid for surveying the worldwide distribution of complexity, define types and domains for a thorough survey of complexity in phonology, and track the different kinds of complexity across the grid in a sample of languages distributed over large areas based on those of the Autotyp database (Bickel et al. 2022). Therefore, we do not test explicit hypotheses with the computational model.

We investigate phonology since there is a body of existing work on its worldwide distribution. To describe these patterns we worked out a set of metrics expressly configured to measure phonological complexity. The present paper thus aims to describe the worldwide patterns of complexity with fine-grained configured variables, using analysis of synchronic patterns of complexity to draw inferences about long-term language evolution and change. We use “configured” here to mean measures that have been selected specifically to reflect phonological complexity as comprehensively as possible, in contrast to raw feature or segment counts.

Like all previous work on phonological complexity we are aware of, this study deals with what is known as enumerative complexity, also known as taxonomic complexity, system complexity, inventory complexity, and other terms; e.g. Dahl (2004), Miestamo (2008), Nichols (2009), Sinnemäki (2011), Ackerman and Malouf (2013), Baerman et al. (2015), Anderson (2015), Cotterell et al. (2019). It measures complexity as the number of items or distinctions in a system. It is relatively straightforward to survey and sample, and there is little controversy about the units to survey in phonology (chiefly contrastive sounds, i.e. phonemes as shown in phoneme charts of grammars).[1] It has been used in all the survey work cited in the previous paragraph.

Phonological complexity in the works just mentioned is often limited to a single factor such as phoneme inventory size or just consonant inventory size; complexity of syllable structure is measurable (Easterday 2019; Maddieson 2013a) but has had even less work on geospatial distribution. In general the literature on the distribution of complexity has sought primarily linear correlations with factors such as latitude, altitude, or population size. Here we attempt to improve the record by surveying a larger set of parameters including consonant inventory, vowel inventory, tones, and syllable structure; and due to the setup of certain variables, we could compare binned and non-binned measures. We propose modeling techniques to reveal non-linear distributions and suggest aspects of an analytic framework to account for the distributions. We drew up a database on phonological complexity designed specifically to measure complexity and with continuous (non-binned) formulations of variables.

This paper employs computational methods to determine the effects of geographical parameters such as latitude or longitude on the distribution of phonological complexities in languages. These methods have two main advantages that are of major importance to this endeavor: (1) They can show trends and effects that are undetected by the human eye and (2) they provide an empirical basis for determining certain trends. In this paper, we build upon previous investigations with computational methods previously seldom employed in typological and complexity research. We intend to raise awareness of the geospatial side of linguistic complexity and lay groundwork to spur further research into the distribution and patterning of linguistic complexity.

2 Data

As the basis of this analysis we use a small but precise dataset of 336 languages similar to those used in other work on enumerative phonological complexity and drawn from published grammars supplemented by consultation with experts and our own knowledge. It seeks genealogical and geographical diversity while additionally surveying the Northern Hemisphere in enough detail to support closer family and areal comparisons. It is designed to capture complexity measures other than the phonological systems we survey in this paper. The main advantage of our dataset for this type of analysis is that the variables were specifically chosen to reflect phonological complexity in the broadest possible way. The measures are configured (as discussed above) to this research problem, an advantage over large-scale general databases which do not feature the properties needed to conduct this analysis.

Furthermore, our survey of languages was critical in the sense that we surveyed not only reported phoneme inventories but the entire phonology and phonetics sections of grammars, to make sure that we have captured the best possible contrast-based phoneme system. That is, we aimed to replicate published phoneme inventories following our criteria, and if there were differences we used our own analysis. The languages surveyed were chosen in the first place for quality of description and geographical and phylogenetic coverage. Appendix 1 shows the actual data for each language, along with metadata and the primary source(s) cited (not the full list of sources reviewed en route to our analysis) (Table 1).

Table 1: List of variables used in this study.

Variable | Type | Description
Grades | Count | Stop and affricate distinctions other than place of articulation: voicing, aspiration, pulmonic, ejective, click, fortis/lenis, etc.
Places | Count | Places of articulation: labial, dental, velar, etc.
Pitch | Count | Coarticulations (of other consonants) such as palatalization, labialization, velarization, pharyngealization.
Closure | Count | Additional closure(s). E.g. ejectives and plain clicks have an additional glottal closure; labial-velars have both labial and velar closures.
Geminate | Count | Presence of contrastive gemination or length in single consonants.
Total consonants | Count | Total consonants, minus those distinguished by a contrastive pitch coarticulation from counterparts at the same place.[a]
Qualities | Count | Vowel phones as defined by basic vowel qualities (front/back, round, high/low), excluding ancillary contrasts.
Ancillary | Count | Further vowel distinctions such as length, nasalization, and phonation types.
Total vowels | Count | Number of vowels (qualities plus ancillary contrasts).
Diphthongs | Count | Number of true diphthongs (behaving as a single syllable nucleus).
Total tones | Count | Number of contrastive tones.
Contrastive stress | Binary | Presence vs. absence of contrastive stress.
Syllable max. onset | Count | Maximum number of consonants permitted syllable-initially (only those permitted word-initially).
Syllable max. coda | Count | Maximum number of consonants permitted syllable-finally (only those permitted word-finally).

[a] This exclusion is justified in the Supplement (Section S2).

The definitions are given in more detail in the Supplement.

3 Methods

3.1 Computational model

The computational model for this analysis needs to be capable of detecting and modeling several important aspects of geospatial and linguistic patterns. The model type which can fulfill these requirements is a custom state-of-the-art Bayesian nonlinear multi-level Gaussian Process model. The software we used to create this highly complex model is the Stan programming language.[2]

Below, we discuss the individual aspects and requirements for the model and how we integrate those as elements in the Gaussian Process model. The specific implementation is outlined below in this section.

3.1.1 Geospatial complexity

At the core of this analysis is the question of how the complexity variables in question behave vis-à-vis geography. This means that the model needs to be capable of modeling how languages differ geospatially with regard to the individual complexity variables. This requires there to be the capability to detect incremental areality, in other words the gradual changes in complexity from one language to the next. If we have, for example, an area with high complexity next to an area with lower complexity, the transition is gradual. Likewise, if we have an area with high or low complexity, we expect a language in this area to be likewise higher or lower in complexity. We want to be able to determine for every point on the globe what the expected complexity value (of the variable in question) is, given the languages that surround it. Taken together, this means that we need to model geospatial complexity as a smooth surface where we have a value for the average expected complexity at any point on the globe. Note here that “smooth” is the quantitative term to mean that a function over data is continuous (i.e., non-discrete), even if there are finite data points used to infer the function. The model therefore has an expectation for every point on the globe, even though we do not have a language at every point. This does not mean that the data or the variable in question (complexity in this case) is assumed to have smooth geographical transitions.

Computationally, this requires using a two-dimensional surface (the globe, excluding bodies of water) for the analysis on which the individual languages in our dataset are located. A Gaussian Process can then take this surface along with the languages on it and estimate the expected complexity value for each point on this surface. Gaussian Processes are a type of model architecture which adaptively pools model estimates based on the distance between the data points.[3] We therefore determine for every point on the globe surface: given its geographical distance to all other languages in the data set, what is the expected complexity value at this point? The closer languages are to this point, the more weight they will have in determining the complexity value at this point. If, for example, we select a point on the geographical surface that lies 500 km from a language with four tones and, in the opposite direction, 1,500 km from a language with no tones, then, in simplified terms, we would expect its number of tones to be closer, but not necessarily equal, to that of the language with four tones. How much pooling is applied for any given point on the surface (i.e., how much weight is given to far-away languages) is determined by the kernel function. This kernel function takes in the matrix of geographical distances between any point on the surface and all languages and estimates the pairwise covariance between each of the points or languages. The higher the covariance between two points is, the more the model estimates those points to mutually influence each other’s complexity values. Because the covariance is computed from geographical distance, nearby points covary more strongly than distant ones. The kernel function we chose is an exponentiated quadratic kernel of the form

$$K_{ij} = \alpha\, e^{-\rho D_{ij}^{2}}$$

where $\alpha$ is the maximum covariance between any two points on the surface, $D_{ij}$ is the geographical distance between two points ($i$ and $j$) on the surface, and $\rho$ is the shape parameter determining the covariance decay with increasing $D_{ij}$.

Here, $\lim K_{ij} \to \alpha$ for $\lim \rho, D_{ij} \to 0$ and $\lim K_{ij} \to 0$ for $\lim \rho, D_{ij}, \alpha \to \infty, \infty, 0$.

This means that the smaller ρ becomes, the more all other points on the surface have equal weight in influencing one specific point. Conversely, the larger ρ becomes, the steeper the decay in the distance influence on covariance and farther-away points have less influence. We chose the exponentiated quadratic kernel as one of the standard kernels in Gaussian Processes and because its parameterization allows for easy interpretability of the relationship between geographical distance and covariance between languages.
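The behavior of the kernel can be illustrated with a minimal Python sketch. It assumes the kernel has the form α·exp(−ρ·D²), which is how the limits described above read. The function name, the distance unit (thousands of kilometers), and the values of α and ρ are all hypothetical choices for illustration, not values estimated by the model.

```python
import math


def exp_quad_kernel(distance, alpha, rho):
    """Covariance between two points separated by `distance`.

    Assumed form: alpha * exp(-rho * distance**2).
    """
    return alpha * math.exp(-rho * distance ** 2)


# Hypothetical settings: distances in thousands of km, alpha fixed at 1.
alpha = 1.0
for rho in (0.1, 1.0):
    near = exp_quad_kernel(0.5, alpha, rho)  # a language 500 km away
    far = exp_quad_kernel(1.5, alpha, rho)   # a language 1,500 km away
    print(f"rho={rho}: K(500 km)={near:.3f}, K(1,500 km)={far:.3f}")
```

With the smaller ρ, the two covariances stay close to each other (both languages are pooled almost equally); with the larger ρ, the covariance with the farther language drops off much more sharply.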

Note that this modeling approach assumes language distance to be equal to the distance between two points that represent these languages. Although this is an assumption necessary for the model, it needs to be kept in mind that this does not capture the actual extent of languages accurately.

In the model, we feed a large matrix comprising the geographical distances (on a spheroid globe) between the individual languages and an additional 578 land surface grid points and infer the complexity value for each point. We do this for all complexity variables separately. The grid points were calculated by first obtaining a grid of points on the globe surface (more precisely: between [−180, 180] longitude and [−56, 72] latitude) spaced every five degrees latitude/longitude which resulted in 1,898 total grid points. Then we removed all grid points that did not fall on land.
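The grid construction can be sketched as follows. This is an illustrative reading rather than the original procedure: counting both the −180 and the 180 meridian and starting the latitude steps at −56 reproduces the reported 1,898 points. The land filter is left as a hypothetical predicate `is_land`, since the land mask used for the analysis is not specified here.

```python
def build_grid(lon_min=-180, lon_max=180, lat_min=-56, lat_max=72, step=5):
    """Return (lon, lat) pairs spaced every `step` degrees within the bounds."""
    lons = range(lon_min, lon_max + 1, step)
    lats = range(lat_min, lat_max + 1, step)
    return [(lon, lat) for lat in lats for lon in lons]


def keep_land(grid, is_land):
    """Keep only the points for which the caller-supplied (hypothetical)
    predicate is_land(lon, lat) returns True."""
    return [point for point in grid if is_land(*point)]


grid = build_grid()
print(len(grid))  # 73 longitudes x 26 latitudes = 1898
```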

3.1.2 Complexity autocorrelation

If we ran the procedure described above for each complexity variable, we would have one separate model for each variable. However, based on previous literature, we hypothesize that complexity in each individual variable is not entirely detached from other complexity variables. For example, vocalic complexity may correlate with tonal complexity in that languages with higher vocalic complexity may have higher or lower tonal complexity. Running separate geospatial models would therefore deprive the analysis of being able to infer the higher-order correlations between the variables. Moreover, by adding this additional higher dimension to the model, we are able to model the complexity estimates on the language level as part of a large, all-encompassing process where relationships on the language level but also on the variable level are considered. Doing this enables the model not only to estimate the geographical complexity relationships between the languages in the dataset for each complexity variable, but also to estimate the correlational relationships between the variables themselves. In the model, this is implemented by adding a multi-level component to the language complexity estimates. Here, the per-point mean expected complexity for each complexity variable inside the Gaussian Process function is drawn from a multivariate normal distribution of all variables.

3.1.3 Language family effects

As mentioned above, we are aware of the fact that language family might be a severe confound obliterating the geospatial effects under investigation because inheritance is a major causal link between languages of the same family. This means that if two languages are genealogically related, there is a higher probability of these languages exhibiting similar levels of complexity which they inherited from their common ancestor. It is hypothetically plausible that there are no geospatial effects on complexity at all and that geographic complexity patterns are a secondary artifact that arises from language families being unequally distributed across the globe. We therefore controlled for this potential confound in the computational model by adding a varying intercept for the variable language family within each complexity variable (i.e., each complexity variable has its own family-level term). By doing this, the models account for different families exhibiting different levels of complexity: varying intercepts are a modeling technique in which each grouping variable in the data (language family in this case) is assigned its own intercept. This allows the model to account for different group-dependent base levels in the data. Thus, if an effect in the regression model is solely due to variation between language families or idiosyncratic complexity levels of certain families, the model is able to account for this.

4 Model summary

Figure 1 shows the full model formula.

Figure 1: 
Full model formula.

In this model, $C_{vi}$ is the data point $i$ in complexity variable $v$, where $C$ is the data matrix with dimensions $n_v \times n$, $n$ is the total number of data points (in this case, languages), and $n_v$ is the total number of complexity variables. $C_{vi}$ is Poisson-distributed with rate $\lambda^{\mathrm{Obs}}_{v}$. The rate is the outcome of an additive function with an intercept varying by language family within each $v$ ($\alpha^{\mathrm{Family}}_{v}$) and the Gaussian Process outcome $\phi_v$ for each $v$. $\alpha^{\mathrm{Family}}_{v}$ is the outcome of a non-centered parameterization of a hierarchical Normal prior with the by-variable group mean and variance $\alpha_v$ and $\sigma_v$. The latter parameters have weakly informative priors. This varying intercept ensures that every language family within each complexity variable has its own intercept value. The Gaussian Process factor $\phi_v$, which determines the family-independent geospatial variation, is the outcome of a Cholesky-decomposed geospatial variance matrix $K_v$ (i.e., $L_v$) as a kernel matrix, and a vector of z-scores $\eta_v$, which determine the z-transformed mean expected family-independent complexity values of variable $v$ for each data point. Note that $\eta_v$ is a vector of length $n$. For each pair of points $ij$ on the geographical surface, the kernel matrix $K_v$ is defined by the kernel function $\beta e^{-\rho D_{ij}^{2}}$ (see discussion above, Section 3.1), where $D_{ij}$ is the geographical distance of $ij$, stored in distance matrix $D$. $\eta_v$ is a vector of z-scores for the means of the per-language geographical effects. However, it is not simply drawn from a single Standard Normal distribution. As discussed above, we want to capture complexity autocorrelation, meaning the covariance of the geographical effect on individual variables on a language level. For this, we draw the z-scores from a (Cholesky-decomposed) Multivariate-Normal distribution where the covariance matrix $M$ contains the correlations between each of the complexity variables.
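To make the roles of the Cholesky factor, the z-scores, and the non-centered family intercept more concrete, the following Python sketch assembles a rate for three toy locations of a single variable. It is not the Stan implementation. The exponential (log) link on the Poisson rate, the small diagonal jitter, and every numeric value are assumptions made for illustration, and the correlation of the z-scores across variables is left out.

```python
import math


def cholesky(matrix):
    """Lower-triangular factor L of a positive-definite matrix (L L^T = matrix)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def gp_effect(kernel_matrix, eta):
    """Geospatial effect phi = L @ eta for a vector of z-scores eta."""
    lower = cholesky(kernel_matrix)
    n = len(eta)
    return [sum(lower[i][k] * eta[k] for k in range(n)) for i in range(n)]


def family_intercept(alpha_v, sigma_v, z_family):
    """Non-centered parameterization: group mean plus scaled z-score."""
    return alpha_v + sigma_v * z_family


# Toy example: three locations on a line, distances in arbitrary units.
beta, rho = 1.0, 0.5  # hypothetical kernel parameters
dist = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]
K = [[beta * math.exp(-rho * d ** 2) + (1e-9 if i == j else 0.0)  # jitter
      for j, d in enumerate(row)] for i, row in enumerate(dist)]

phi = gp_effect(K, [0.3, -1.2, 0.8])            # hypothetical z-scores
intercept = family_intercept(1.0, 0.5, 0.2)     # hypothetical family term
rate = [math.exp(intercept + p) for p in phi]   # assumed log link
print(rate)
```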

Due to the specific setup, the model has both languages and grid points on the geographic surface (see discussion on geospatial complexity above). The latter do not have associated complexity values. However, due to how Gaussian Processes analyze the data, they only infer the geographical effects at the locations of every data point (in this case, every language). All locations in between those languages are not covered. Therefore, we included a grid of 578 evenly-spaced points on the map along with the languages to make sure we obtain complexity values for all points on the globe.[4]

Those grid points are processed in the model jointly with all languages, with the exception that for the grid points, there are no observed complexity values $C_v$. Therefore, we split the model into two outcomes, $\lambda^{\mathrm{Sur}}_{v}$ and $\lambda^{\mathrm{Obs}}_{v}$. The latter is fed as the rate parameter to the Poisson distribution to be evaluated against the observed data, whereas the former, the surface point estimates, remain as point predictions for the geographical complexity estimate at the surface point in question. However, since not every grid point belongs to a language family, we used the family-level mean $\alpha_v$ for prediction instead. This means that $\lambda^{\mathrm{Sur}}_{v}$ are the geographical estimates including the inferred average across all language families. Note that all but one variable are counts. For the binary variable, we use a Bernoulli outcome distribution with a log-link on $\lambda^{\mathrm{Obs}}_{vi}$. We use the family-level mean for the evaluation of the nonlinear function at the grid points to obtain the expected complexity value at each point excluding any effects from individual languages. At the observed points, we include language family as a factor such that family-level effects can be captured that would otherwise have biased the complexity estimate at the geographical level. Without this, the inferences at the grid points would be skewed towards the language family in that area.

Since some of the surveys have incomplete variables, the missing entries were imputed in the model by treating them as parameters to be inferred. This requires the entries to be missing completely at random (MCAR), i.e. with no pattern to the missingness. This requirement is fulfilled, since a language has a missing entry only when the literature lacks the corresponding information, which is often variable-dependent (only three out of 14 variables have more than 4 % of values missing, and eight variables have no missing values).

In summary, the model infers the geographical estimates from a hierarchical model with intercepts varying by language family and geographical effects which are drawn from a Gaussian Process with an exponentiated quadratic kernel function informed by the pointwise geographical distances. At the same time, the Gaussian Process estimates are drawn from a Multivariate-Normal distribution modeling the correlations between the individual variables concerning the geographic effects.

In short, the Bayesian model infers the geospatial effects on phonological complexity while accounting for geospatial distance, family-specific effects, and correlations between the individual complexity variables.

4.1 Geographical distances

The geographic distances were derived from each language’s latitudinal and longitudinal values projected on a spheroid earth (Haversine distance) using the R-package geosphere (Hijmans 2022). However, calculating distances between languages on a globe as part of a geospatial or language contact study is problematic. With raw distances, a language in Newfoundland would be equally distant from a language in central Europe and a language in central Asia. However, contact and language spread between central Asia and Europe has been far greater than between North America and Europe (in the latter case, barely any). Yet in the model, both language pairs would be treated as equally mutually influential.[5] We mitigate the issue through a map transformation in which we artificially inflated the globe by a scale factor of 1.5 and then projected a Pacific-centered map back onto it. This way, we inflated the Atlantic to more than 1.5 times its actual size. We calculated that the distortion of distances along the West-East axis from Europe/Africa to the Americas is small, while distances along the East-West axis from Europe/Africa to the Americas are greatly inflated. This allows the model to run as intended while strongly discounting any geospatial or contact influence across the Atlantic. Inflating the Atlantic region on the map is a necessary procedure, since the model requires the pairwise distance matrices to be positive definite, which makes editing any individual distances impossible. In other words, the model requires the input data to be distributed on a contiguous surface (a globe in this case), which disallows manual changes to pairwise distances. In preliminary tests, we ran the model without the inflated Atlantic and observed a noticeable influence of (West) African language complexity on (Eastern) South American languages. While this would be a genuine model result, we know that since there was no actual historical contact in this case, the influence was likely an overestimation.
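For readers unfamiliar with the Haversine distance, a minimal Python sketch of the standard formula is given here. The analysis itself used the R-package geosphere. The function name and the Earth radius constant below are assumptions for illustration, and the Atlantic inflation step is not reproduced.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, for illustration only


def haversine_km(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * radius * math.asin(min(1.0, math.sqrt(a)))


# A quarter of the equator: from (0°, 0°) to (0°, 90°E).
print(haversine_km(0.0, 0.0, 0.0, 90.0))
```

Applying such a function to every pair of languages and grid points yields the symmetric pairwise distance matrix that the kernel function takes as input.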

For the computational analysis, we used the model of a globe earth described above, but the maps used for examining the model results are Pacific-centered maps.

4.2 Mode of result analysis

The output of the model is, as discussed above, a grid of geographic points on the globe with inferred complexity values for each complexity variable.[6] This gives us a series of 2-D maps with the inferred complexity values. However, while 2-D maps (one per complexity variable) are useful for a general overview of the global complexity patterns, they are difficult to compare with one another. Additionally, smaller geospatial variations are easily missed. For this reason, we defined seven geographical macroareas which we use to isolate geographical patterns in order to analyze them in more detail.[7]

The macroareas are drawn from the continent-sized areas of the Autotyp database (Bickel et al. 2022) but differ from those and other large language areas in the literature in that each has an axis of orientation along which we often find that variable values are aligned. The grid points which fall in each area (defined in Figure 2a) can be isolated and plotted as a one-dimensional nonlinear curve against one geographic coordinate. For example, we could isolate the inferred complexity data points that fall in one continent and then plot those values against, for example, latitude. As a result, we can observe the changes in complexity along the latitudinal axis. This is akin to taking a slice out of the world map and looking at the grid points from the side.
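The slicing step can be sketched in Python as follows. This is a simplified illustration: the macroareas in the study are polygons, whereas the sketch uses a rectangular bounding box, and the function name and toy values are hypothetical.

```python
from collections import defaultdict
from statistics import median


def slice_along_latitude(points, lon_range, lat_range):
    """Median inferred value per latitude for grid points inside a box.

    `points` is an iterable of (lon, lat, value) triples; the box is a
    simplification of the polygonal macroareas.
    """
    by_lat = defaultdict(list)
    for lon, lat, value in points:
        in_lon = lon_range[0] <= lon <= lon_range[1]
        in_lat = lat_range[0] <= lat <= lat_range[1]
        if in_lon and in_lat:
            by_lat[lat].append(value)
    return [(lat, median(values)) for lat, values in sorted(by_lat.items())]


# Toy inferred values (hypothetical), three inside the box and one outside.
toy_points = [(-100, 30, 2.0), (-90, 30, 4.0), (-95, 40, 3.0), (10, 30, 9.0)]
curve = slice_along_latitude(toy_points, lon_range=(-130, -60), lat_range=(20, 50))
print(curve)
```

Plotting the resulting pairs with latitude on the x-axis gives the side view of the map slice described above; swapping the roles of longitude and latitude gives a longitudinal slice.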

Figure 2: 
Map of the selected geographical areas. (a) Every polygon corresponds to an area while the blue dots correspond to the locations of the surveyed languages. (b) The grid points used in the model (red crosses) overlaid with the surveyed languages (blue dots). It is important to keep in mind that the model uses and outputs the estimates of this set of grid points instead of complexity values for the individual languages.

The axes of orientation for the most part follow the predominant directionalities of migration and expansion as the human range expanded from Africa to fill the globe: from Africa to Europe and the Near East (for expansions and fallbacks along this axis see Section 6.3); from Southeast Asia to Northeast Asia, populating Siberia; from Southeast Asia to New Guinea–Australia, populating the Pacific; from North America to South America, populating the Americas.

The macroareas, in summary, are used in the analysis of the results, but are not input to the model and calculations. The map in Figure 2 shows the geographical ranges of the macroareas.[8]

The isolated areas are in detail:

  1. American Pacific rim (latitudinal): an area consisting only of languages along the coastal and near-coastal land up to the far side of the major coast range, along the entire coastal length from Alaska to Tierra del Fuego.

  2. Americas: all languages located in the Americas.

  3. East Asia (latitudinal): an area comprising languages in northern and central East Asia, including the Asian Pacific rim, stretching from northern Siberia to the northern border region of the Indochinese Peninsula.

  4. Euro-Africa (latitudinal): all of Africa and Europe, along with some of the Middle East.

  5. Northern Eurasia (longitudinal): from western Europe to northeast Asia, south to the Caucasus and Mongolia, and including the entire Eurasian steppe.

  6. Southeast Asia and Australasia (diagonal, northwest to southeast): starting in mainland Southeast Asia and ending in southeastern Australia.[9]

  7. North America (latitudinal): all of North America from the Arctic to southern Mesoamerica.

In addition to the nonlinear curves, we give small cropped maps with the mean complexity measures superimposed.

Note also that we did not assess the fit of the model, as this is a purely inferential model rather than a predictive one and we did not have a competing model to which we could compare it. This means that a single-fit metric for model predictions for the existing languages is only useful to assess the predictive power of the model and leave-one-out cross-validation measures can only be interpreted in the context of other competing models of the same data. We calculated that the median absolute difference between the inferred values and the observed languages equals 0.35 standard deviations across all variables: Grades (ranging from 1 to 9), for example, has a standard deviation of 1.29, which means that the median difference between predicted and observed values for Grades is 0.45. As discussed above, it is not possible to put this value into perspective, unless there are other models to which this can be compared or if there is a specific accuracy score that needs to be met for predictions. In this inferential model, neither of these is the case.
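The conversion between standard-deviation units and raw units can be checked with a short Python sketch. The helper function shows one plausible way to compute such a summary. Whether the population or the sample standard deviation was used is not stated, so the use of `pstdev` is an assumption, and the toy data are hypothetical.

```python
from statistics import median, pstdev


def median_abs_diff_in_sd(observed, inferred):
    """Median absolute difference between observed and inferred values,
    expressed in standard deviations of the observed values (population SD
    is an assumption here)."""
    sd = pstdev(observed)
    return median(abs(o - p) for o, p in zip(observed, inferred)) / sd


# Converting the reported figure for Grades back to raw units:
grades_sd = 1.29
reported_in_sd = 0.35
print(round(reported_in_sd * grades_sd, 2))  # 0.45

# Toy data (hypothetical), just to show the helper in use.
toy_observed = [1, 2, 3, 4, 5]
toy_inferred = [1.5, 2.5, 2.5, 4.5, 5.5]
print(median_abs_diff_in_sd(toy_observed, toy_inferred))
```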

5 Results

5.1 Areal results

In the following, we can examine the area plots to get the model estimates for complexity gradients in those areas along a geographical variable (latitude, longitude, or diagonal). All figures show the mean posterior estimates for the distribution of grid points on the surface by complexity variable with a black median line. Every blue vertical line represents the 95 % credible interval over the grid points at one point on the x-axis. It thus corresponds to the range of complexity at a certain point on the x-axis in the selected area. This means that the larger the vertical line segments are, the more variance in complexity there is at the corresponding point.

Concretely, these graphs are longitudinal or latitudinal slices from the inference of the grid points on the world map. They are oriented in the direction of ascending geographical latitude or longitude, so the graphs run from south to north or from west to east. As discussed above, the model itself makes inferences in the global context. The closest to this output is what we can see in the heatmaps in Appendix 2. However, since heatmaps of the globe can be difficult to analyze, especially for many complexity features, we consider selected areas (a subset of the heatmaps) and plot the distribution of complexity (inferred from the grid points) in that area (landmass only) along a geographical axis (e.g., latitude or longitude). As a practical example, the complexity distribution for Closure in the North America area (Figure 3) can be read as follows. The graph is ordered by west-east longitudinal placement. The blue vertical lines are the ranges within which 95 % of complexity values fall. The black curve is the median of all grid points and can be taken as the approximate median complexity at that longitude. This is akin to looking from the south and seeing the relief of the complexity distribution in North America: the left part of the graph is approximately located in western North America, the part on the right in eastern North America. For this reason, the complexity peak in the left third of, e.g., the Closure plot corresponds approximately to the Pacific Rim area.
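The construction of such a slice can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the plotting code behind the figures: the input format (pairs of axis coordinate and mean posterior estimate per grid point), the bin width `step`, and the linear-interpolation quantile are all assumptions made here for illustration.

```python
from collections import defaultdict
from statistics import median


def quantile(sorted_values, q):
    """Quantile of an already sorted list, using linear interpolation."""
    if len(sorted_values) == 1:
        return sorted_values[0]
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction


def slice_summary(grid_points, step=1.0):
    """Summarize grid-point estimates along one geographical axis.

    grid_points: iterable of (axis_coordinate, estimate) pairs for one area.
    Returns (bin_start, lower_95, median, upper_95) tuples in ascending order
    of the axis, i.e. south to north or west to east.
    """
    bins = defaultdict(list)
    for coordinate, estimate in grid_points:
        bins[(coordinate // step) * step].append(estimate)
    summary = []
    for bin_start in sorted(bins):
        values = sorted(bins[bin_start])
        summary.append((bin_start,
                        quantile(values, 0.025),   # lower end of the vertical segment
                        median(values),            # point on the black median line
                        quantile(values, 0.975)))  # upper end of the vertical segment
    return summary


# Hypothetical grid points: (longitude, estimated complexity).
toy_points = [(225.2, 1.0), (225.7, 3.0), (226.1, 2.0)]
toy_summary = slice_summary(toy_points, step=1.0)
```

Each tuple corresponds to one vertical line segment and one point on the median line; longer segments indicate more variance at that position on the axis.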

Figure 3: 
Mean posterior estimates for the grid points in the North America area by variable (blue line segments). X-axis is the longitude west to east. The black line indicates the median of the estimates at every increment of the geographical variable.

5.1.1 North America

The North America area shows one dominant pattern: an overall complexity decrease from west to east in all variables except Geminates, where complexity instead increases from west to east. Most of these variables show a distinct peak at around 225 degrees, which is the result of higher phonological complexity along the North American west coast (Figure 3). The dip at the very left is due to lower complexity in Alaska, which is further west than the west coast.

We can further see that the higher complexity is mostly confined to the Pacific Northwest, with Ancillary having a second peak in the eastern part of the United States, while Total Tones and Total Vowels are both higher in the far northwest and in the central parts of North America (Figure 4).

Figure 4: 
Heatmap of the mean posterior estimates for the grid points in the North America area by variable superimposed on map outlines. Brighter color (yellow) indicates higher complexity.

5.1.2 The Americas

The latitudinal view of the Americas area shows a steep rise in complexity in the syllable variables (syllable onset, contrastive stress, and syllable coda) from the Equator northwards. Other variables show smaller or less clearly observable complexity increases (e.g. Pitch, Geminates, Places). Notably, geographical variance in this area is generally higher in the north, with the exception of Qualities and Total Vowels. Higher complexity in South America is mostly found in the eastern part, closer to the Atlantic coast (see Figures 5 and 6).

Figure 5: 
Mean posterior estimates for the grid points in The Americas area by variable (blue line segments). X-axis is the latitude south to north. The black line indicates the median of the estimates at every increment of the geographical variable.

Figure 6: 
Heatmap of the mean posterior estimates for the grid points in the Americas area by variable superimposed on map outlines. Brighter color (yellow) indicates higher complexity.

5.1.3 American Pacific Rim

The latitudinal view of the American Pacific Rim area suggests a pattern similar to, yet distinct from, that of the Americas area (recall that the American Pacific Rim is a subarea of the Americas as a whole) (Figures 7 and 8). The complexity increases at and north of the Equator are more pronounced, though some of these variables fall off in the far north. South America is generally of lower complexity. The lower complexity in the south may reflect the settlement history of the Americas: immigrations were very few until the end of glaciation opened up inhabitable land near enough to Siberia that a more varied set of immigrants could enter (see also Section 6.2 below). Geographical variance is smaller than in the Americas overall, with the exception of Total Vowels (cf. Figure 5), as can be seen from the shorter blue line segments around the median line.

Figure 7: 
Mean posterior estimates for the grid points in the American Pacific Rim area by variable (blue line segments). X-axis is the latitude South to North. The black line indicates the median of the estimates at every increment of the geographical variable.

Figure 8: 
Heatmap of the mean posterior estimates for the grid points in the American Pacific Rim area by variable superimposed on map outlines. Brighter color (yellow) indicates higher complexity.

5.1.4 Euro-Africa

The Euro-Africa area shows a varied set of complexity patterns (Figures 9 and 10). Consonantal, vocalic, and tonal complexity are higher in southern Africa, while in the stress and syllable structure variables this pattern is reversed. Because of this strong opposing distributional behavior, which is not found in this many variables in any other region, we can identify a geographical area which is consistently in between the two complexity levels. This area is found between 0° and 25° N and marks the approximate range within which each individual complexity trend starts, ends, or reverses. For example, this area marks the end of the complexity decline in the variables Closure, Grades, Pitch, Places, and Total Tones, while marking the approximate beginning of an ascending trend in the syllable variables and Places. In the syllable variables, its northern end corresponds to the area with the steepest increase in complexity. It can therefore be seen as a transitional area between opposing gradients of phonological complexity. We call this the Euro-African Transition Zone (ETZ) and discuss it further in Section 6.3 below. This pattern is unlikely to be an artifact of the model, for two reasons. First, there is ample support from various languages in this region (see Figure 2b). Second, if there were a sudden break in complexity rather than a transition between these areas, we would see a sudden jump in complexity in the model inferences in this region, similar to what we can observe in the ContrastiveStress variable in Southeast Asia and Australasia (Figure 15) or the Americas (Figure 5). In those cases, the patterns indicate a sudden step in the complexity gradient, which is not found in the ETZ pattern.

Figure 9: 
Mean posterior estimates for the grid points in the Euro-Africa area by variable (blue line segments). X-axis is the latitude south to north. The black line indicates the median of the estimates at every increment of the geographical variable. Red vertical lines indicate the extent of the ETZ, to which we now turn.

Figure 10: 
Heatmap of the mean posterior estimates for the grid points in the Euro-Africa area by variable superimposed on map outlines. Brighter color (yellow) indicates higher complexity.

The complexity gradient in the Euro-Africa area is mostly latitudinal, with the exception of Places which is higher in the Caucasus region than in western Eurasia.

5.1.5 North Eurasia

The longitudinal view of North Eurasia is marked by very strong patterns in stress and syllable complexity, as well as in Geminates, all of which are very high in Europe and decline towards eastern Asia. Notable is the peak or trough in complexity around 50° E, which corresponds approximately to a strong dip or increase in complexity in the Caucasus that breaks the overall pattern (see Figures 11 and 12).

Figure 11: 
Mean posterior estimates for the grid points in the North Eurasia area by variable (blue line segments). X-axis is the longitude west to east. The black line indicates the median of the estimates at every increment of the geographical variable.

Figure 12: 
Heatmap of the mean posterior estimates for the grid points in the Northern Eurasia area by variable superimposed on map outlines. Brighter color (yellow) indicates higher complexity.

5.1.6 East Asia

The most apparent property of this area is the high variance which dominates the graphs. This indicates that, unlike in the other regions, there are strong differences between the complexity levels throughout the area. The reason for the high variance is probably the sharp interfaces between the three linguistic populations known as Paleosiberian and Altaic in the north, and Southeast Asian in the south. Structurally, all three are internally diverse and different from each other. We discuss this issue more in Section 6.1. Moreover, a consonantal higher-complexity zone can be found in central East Asia (Figures 13 and 14).

Figure 13: 
Mean posterior estimates for the grid points in the East Asia area by variable (blue line segments). X-axis is the latitude south to north. The black line indicates the median of the estimates at every increment of the geographical variable.

Figure 14: 
Heatmap of the mean posterior estimates for the grid points in the East Asia area by variable superimposed on map outlines. Brighter color (yellow) indicates higher complexity.

Note that the blue line segments here are more widely spaced because the scale is smaller.

5.1.7 SEA and Australasia

Generally, complexity is higher in Southeast Asia and declines towards southeastern Australasia. Places is an exception: it is high in the southeast because Australian languages tend to have multiple contrastive series among dental, alveolar, palatal, and retroflex consonants. There is also comparatively more variance, especially in the variables Geminates, Total Consonants, and Ancillary (see Figures 15 and 16). There is a strikingly smooth decline in syllable codas, with very little variance, chiefly because Australian languages near-unanimously have very simple syllable structures.

Figure 15: 
Mean posterior estimates for the grid points in the SEA and Australasia area by variable (blue line segments). X-axis is the diagonal northwest to southeast. The black line indicates the median of the estimates at every increment of the geographical variable.

Figure 16: 
Heatmap of the mean posterior estimates for the grid points in the Southeast Asia and Australasia area by variable superimposed on map outlines. Brighter color (yellow) indicates higher complexity.

5.2 Parameter covariance

Recall that the Bayesian inference model treats each variable effect as an outcome of a correlated process on the level of each grid point. This means the geographic effects are not modeled independently of one another but jointly come from a common underlying Multivariate Normal distribution. If variables are correlated in their effects, they will more often yield similar outcomes on the geographical level. For example, if high complexity in variable 1 often occurs together in the same geographic location with high complexity in variable 2, both are positively correlated. Because of this property of the model, we can extract the correlations from the covariance matrix. Figure 17 shows the mean posterior correlation between the complexity variables (plotted with the R-package corrplot; Wei and Simko 2017).
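The step from a covariance matrix to the correlations reported here can be sketched as follows. This is a minimal illustration rather than the model code: it assumes the posterior is available as a list of covariance matrices (one per posterior draw, as nested lists), and the function names and toy values are hypothetical.

```python
import math


def cov_to_corr(cov):
    """Convert a covariance matrix to a correlation matrix:
    corr[i][j] = cov[i][j] / (sd[i] * sd[j])."""
    size = len(cov)
    sds = [math.sqrt(cov[i][i]) for i in range(size)]
    return [[cov[i][j] / (sds[i] * sds[j]) for j in range(size)]
            for i in range(size)]


def mean_posterior_corr(cov_draws):
    """Average the correlation matrices implied by each posterior draw."""
    corrs = [cov_to_corr(cov) for cov in cov_draws]
    size = len(corrs[0])
    return [[sum(c[i][j] for c in corrs) / len(corrs) for j in range(size)]
            for i in range(size)]


# Two hypothetical posterior draws for two complexity variables.
toy_draws = [
    [[4.0, 1.2], [1.2, 1.0]],  # implies a correlation of 0.6
    [[1.0, 0.4], [0.4, 1.0]],  # implies a correlation of 0.4
]
toy_mean_corr = mean_posterior_corr(toy_draws)
```

Converting each draw before averaging, rather than converting the averaged covariance matrix, is a design choice made here; the two approaches generally give slightly different values.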

Figure 17: 
Posterior estimates of the pairwise correlations between the complexity variables. Larger dots with more intense colors indicate higher correlation while the color indicates the direction of the correlation.

Here, we can see that most variables are slightly to strongly positively correlated with each other. The highest correlations can be found between Total Consonants and Grades, Places, Pitch, and Closure; Ancillary, Total Vowels, and Qualities; and Closure and Grades. The only exception, albeit a very slight one, is the negative correlation between Qualities and Places/Closure.

This means that in the system overall, many variables seem to be positively correlated to some degree. As a result, they show similar complexity trends in the geospatial context.

As a second check on this predominant positive correlation between the individual variables, we ran a Bayesian Latent Variable Factor Analysis model using the R-package blavaan (Merkle et al. 2021). We found that no complexity measure can be securely determined to be influenced negatively by the latent variable: the compatibility intervals of Qualities, Diphthongs, ContrastiveStress, Initial Stress, and Final Stress overlap with zero, and all others are positive. This means that all complexity variables are either positively or neutrally associated with a latent complexity measure. However, this basic Factor Analysis is not sufficient to demonstrate any such relationship definitively, because it does not account for the spatial component.[10] We therefore take this as corroborating our above findings from the more complex spatial model and the results of previous studies (see Hartmann forthc.).

5.3 Geographical covariance

Lastly, we can inspect the geographical covariance within each variable. Since the Bayesian model contains a Gaussian Process, we model each point on the geographic surface in relation with all other points. Sometimes, far-away points will influence a point more strongly and sometimes only localized effects are influential. All of this is expressed in the shape of the kernel function (see discussion in Section 3.1) where, when covariance is high in a variable for a large geographical distance, it means that this variable shows more long-distance effects. Figure 18 shows the posterior kernel functions for each variable.

Figure 18: 
Posterior kernel functions by variable. The black line is the posterior mean while the colored area represents the 95 % credible interval.

Overall, we can see that five variables show clearly discernible localized effects (represented by a steeper drop-off in covariance): Ancillary, Total Consonants, Grades, Total Tones, and Total Vowels. Most notably, the covariance for Total Consonants and Total Vowels decreases steeply and approaches zero by about 2,000 km distance. This means there is close to no long-distance influence between the points on the surface beyond 2,000 km.

All other variables show considerably more long-distance influence. Concretely, this means that their complexity is more homogeneous over large distances, with smoother curves and less steep changes.
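The contrast between localized and long-distance covariance can be illustrated with a short Python sketch. Note the assumptions: the kernel used in the model is defined in Section 3.1 and is not reproduced here, so a generic squared-exponential kernel stands in for it, and the parameter names and values (`eta_sq`, `length_scale_km`) are hypothetical and chosen only to show the two qualitative shapes.

```python
import math


def squared_exponential(distance_km, eta_sq, length_scale_km):
    """Covariance between two points separated by distance_km.
    Assumption: a generic squared-exponential kernel, used for illustration only."""
    return eta_sq * math.exp(-0.5 * (distance_km / length_scale_km) ** 2)


def kernel_curve(eta_sq, length_scale_km, max_km=5000, step_km=500):
    """Covariance evaluated on a grid of distances, as (distance, covariance) pairs."""
    return [(d, squared_exponential(d, eta_sq, length_scale_km))
            for d in range(0, max_km + 1, step_km)]


# A short length scale gives a steep drop-off (localized effects);
# a long length scale keeps covariance high over large distances.
localized = kernel_curve(eta_sq=1.0, length_scale_km=500)
long_range = kernel_curve(eta_sq=1.0, length_scale_km=3000)
```

In this toy setting the short-length-scale curve is close to zero at 2,000 km, while the long-length-scale curve still retains most of its covariance at that distance.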

6 Discussion

6.1 Distribution of variables across areas

Clear geospatial profiles in total vowel complexity among the areas are generally lacking. This does not necessarily mean that there are no vowel complexity differences associated with any area, but only that the association strengths are smaller or sparse. Vowel complexity (Total Vowels and Ancillary) shows varied complexity levels such as in North Eurasia or less varied patterns as in SEA and Australasia or North America where there is one single peak. An example where we did find clear differences for vowels was the decrease in complexity from Southeast Asia to Australasia (see Figures 15 and 16).

Consonant complexity, stress, and syllable complexity, on the other hand, prove to be good indicators for gradients in geospatial complexity distributions. The more high-level complexity measures (e.g., Total Tones, Total Vowels, and Total Consonants) show clearer geographical profiles, since they are more locally determined (see Figure 18). This means that they are distributed in such a way that there is less inferred similarity with farther-away languages when it comes to these measures. This gives these variables clearer, regionally determined patterns.

As noted in Section 5.1 above (Figure 5), the settlement prehistory of the Americas would seem to predict some of the patterns we see in the distribution of complexity and variance in the Americas. Continental glaciation blocked overland entry to North America north of about the Oregon–Washington border until about 14,000 years ago, when what is known as the Ice-Free Corridor between the Cordilleran and Laurentide ice sheets formed and gradually widened. The first evidence of humans north of the ice appears just before the opening, but there is little or no evidence of actual north-to-south entries via this route (see Dalton et al. 2023, especially Fig. 9; Potter et al. 2014, Willerslev and Meltzer 2021). At that point immigrants from East Asia seeded a diverse North American linguistic population (for a sense of the population density see Anderson et al. 2019, 2010). Coastal movements by watercraft undoubtedly brought earlier settlements south of the ice sheets, probably beginning 25,000 or more years ago (see e.g. Dalton et al. 2023; Gowan et al. 2021 on ice sheet reconstruction over time, Davis and Madsen 2020 on paleodemographic and migration implications, Praetorius et al. 2023 on possible entry times based on sea ice conditions, and Nichols 2024 on linguistic entries). The earlier, glacial-age immigrants, however, were far fewer and must have formed a sparse and highly mobile population.[11] We know little about the impacts of that population structure and its sociolinguistics on language structure, but we suppose linguistic boundaries may have been fluid and languages may have been affected by generations of language mixture and L2 learning, and of course that population had few entering ancestors to start with. The result could well have been reduced diversity, variance, and complexity compared to languages of the north. 
We raise this possibility for purposes of hypothesis formation; if it has any reality, it gives us a general identity and age for a linguistic population much older than the Comparative Method can usually trace for descent groups.

More recent population movements may be responsible for the high variance in East Asia (Figures 13 and 14). Here we find a sharp discontinuity between the diverse and often complex languages collectively known as Paleosiberian (Chukotka-Kamchatkan, Eskimo-Aleut, Yukagir, Nivkh, and sometimes also Korean and Japanese) and the three families known collectively as Altaic (Turkic, Mongolic,[12] and Tungusic). The latter three are known for their structural regularity (e.g. consistent head-final morphology and syntax, simple syllable structure, consistent causativization in the verbal lexicon) which strikes us as well adapted to contact, calquing, and symbiotic bilingualism. For these language groups and an overview see chapters in Vajda (2024). Still different, and complex in different ways, are the languages of mainland Southeast Asia (for a structural overview see Enfield 2014; the two major language families, both old and much diversified, are Sino-Tibetan and Austroasiatic). The meeting of those three very different structural profiles appears to be showing up in the high variance observable in Figure 13.

6.2 Broader and global distribution of variables

Some previous work (Shagal et al. 2019, nd; Nichols 2017; Nichols et al. 2004; Janhunen 2000; and others) has revealed the existence of large-scale gradient distributions of a number of typological properties, running east-west across Eurasia or the entire northern hemisphere. The explanation has been that there is a long-standing westward trajectory of language spreads from northern Asia to Europe via the Eurasian steppe and northern forest, and a similar trajectory of language movements from northern Asia into North America. In such studies, gradient distributions or east-west asymmetries more generally have been observed opportunistically, and other typological features were then examined in searches for possibly confirming evidence. Some of our areas are based on those earlier findings, but here we have made the areas – subsets of the global complexity distributions calculated by the model – central to the interpretation of the results. The most striking feature of the North Eurasia area is that, along the entire West-East axis, we find the most varying and seemingly uncorrelated complexity patterns. Vocalic and tonal complexity measures show many larger and smaller peaks along the extent of the area, while syllable and stress complexity decline from west to east. Consonantal complexity (with the exception of pitch coarticulation), on the other hand, shows no clear patterning, yet has increased variance between the languages in the area around 50° E.

The overall picture of the complexity distributions shows that syllable and consonant complexity are the most clearly geographically patterned. Australasia has very low complexity overall: most complexity variables decrease in the south(east) direction. We also find that South America and Australasia are quite similar in that complexity decreases to the south in both areas. Perhaps more accurately: the closer languages lie to the southern periphery of these areas, the lower their complexity. Euro-Africa is very different, as the same south-north complexity increase does not hold there (see discussion in Section 5.1 and Figures 9 and 10). Instead, we find many opposing complexity directions in different variables.

Concerning South America and Australasia, both are at the far ends of early human migration chains and the far edges of the former human species range, which followed trajectories from Africa across southern Eurasia via Southeast Asia to New Guinea and Australia, and from both Southeast Asia and Europe north to Beringia and then to the Americas (HUGO 2009; Rootsi et al. 2007; Raghavan et al. 2014, and others). That is, the lower complexity in South America and Australasia is an isolation by distance effect: these places are at the far ends of migration trajectories, relatively isolated from large-scale effects, and shielded by natural barriers such as oceans and bottlenecks. It might have been the case that in both areas a dominating earlier complexity level was easily maintained due to the relative isolation. In this view, whether the complexity level is low or high is irrelevant as it is merely the level which first spread into these areas and was trapped in the formation of low-complexity pockets. An alternative hypothesis, which is able to account for the fact that it is low complexity that is prevalent in these areas, is that low complexity is an ancient feature, higher complexity has diffused outward from population centers in Africa and Eurasia, and the later developments have not yet fully penetrated to the ends of the trajectories. Innovations are expected to arise and be propagated in denser networks with unequal distributions of prestige and connections (Fagyal et al. 2010), conditions that have been present longer and more strongly in the Old World than at the younger eastern edge of the human range. An isolation by distance model predicts that some innovations made near the center will fail to reach to the edges, especially innovations dependent on or favored by other features in the same or neighboring languages (such as clicks dependent on or favored by ingressive articulations). 
Such innovations are more readily lost than gained, and when the innovation is an additive one the consequence will be feature loss in outlying regions.

When comparing the results from the American Pacific Rim to the geographical effects found across the Americas as a whole, we see that the American Pacific Rim subarea exhibits the same gradients and effects as the larger Americas area. On the whole, the gradients in the American Pacific Rim are considerably clearer. This suggests that the American Pacific Rim is a true subset of the Americas as a whole (cf. also Jacobsen 1989). Further, the American Pacific Rim and the western region of the North America area show overlapping complexity peaks compared to the Americas as a whole. This means high phonological complexity is concentrated in the North American Pacific Rim area. The only variable where this trend is broken is Geminates, which strongly increases toward the eastern part of North America.

Two instances where vocalic and tonal complexity are defining properties of areas and their axes involve Southeast Asia, where the two areas SEA–AUS and SEA–NEA overlap. The region of intersection is a vowel complexity peak, with complexity dropoffs in both directions away from it, albeit with much variance in the northern direction. This phenomenon can be seen as a complexity fork.

6.3 The Euro-African transition zone

In Figures 9 and 10, we see the Euro-African Transition Zone (ETZ), a complexity transition zone between the Equator or the southern Sahel in Africa (about 0° to 10° N) and the northern edge of the Sahara Desert (about 25° N), with the northern edge of this area extending into the African Mediterranean coast. This transition zone represents an intermediate area between two opposing complexity regions, where we see high consonantal and tonal complexity, and low complexity of syllable structure, stress, and overall vocalic complexity, in Africa, but roughly the opposite in the Levant, Anatolia, and southern Europe. The main feature of this region is that inside this transition zone, several complexity variables reach a low point or turning point (the point at which a decreasing complexity trajectory stabilizes or even reverses) or an inflection point (the point at which the slope of the curve stops steepening and begins to flatten, or vice versa). For some variables, the region contains languages lower in complexity, while there is higher complexity on both sides. For variables where complexity is continuously rising or falling, the languages here constitute the point at which the speed of the change measured from south to north starts to flatten. For the prehistory and linguistic geography of the ETZ see Hartmann and Nichols (forthc.).
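The distinction between turning points and inflection points can be made concrete with a small Python sketch that locates both on a discretized median line using first and second differences. This is an illustration only, not a procedure used in the paper: the function names and the toy curve are hypothetical, and exact zeros in the differences are simply skipped.

```python
def sign_change_indices(values):
    """Indices at which the sign differs from the previous nonzero sign."""
    changes = []
    previous_sign = 0
    for index, value in enumerate(values):
        sign = (value > 0) - (value < 0)
        if sign == 0:
            continue  # exact zeros (plateaus) are skipped in this simple sketch
        if previous_sign != 0 and sign != previous_sign:
            changes.append(index)
        previous_sign = sign
    return changes


def turning_and_inflection(curve):
    """Locate turning points and inflection segments on an evenly spaced curve.

    A turning index i means curve[i] is a local minimum or maximum.
    An inflection index i means the slope is locally steepest (or flattest)
    on the segment from curve[i] to curve[i + 1].
    """
    slopes = [b - a for a, b in zip(curve, curve[1:])]        # first differences
    curvature = [b - a for a, b in zip(slopes, slopes[1:])]   # second differences
    return sign_change_indices(slopes), sign_change_indices(curvature)


# Hypothetical median line, south to north: decline, low point, rise, flattening.
toy_median_line = [50, 40, 32, 30, 31, 36, 46, 52, 55]
toy_turning, toy_inflection = turning_and_inflection(toy_median_line)
```

For this toy curve, the sketch reports a turning point at index 3 (the low point, value 30) and an inflection on the segment starting at index 5, after which the rise begins to flatten.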

An exact geographic definition of the ETZ proves difficult as the patterns, where present, do not all align. The ETZ is unlike anything we see in the other areas; there are various perturbations elsewhere, but the ETZ is the only case where there is a clear mirror-image pattern between the variables. This calls for an area-theoretical description and an explanation or hypothesis, to which we turn in the following subsections.

6.3.1 Geographical and economic issues

Why is the ETZ located where it is? Possible contributing factors include the Mediterranean Sea, which was a barrier or at least an inhibitor to European-African linguistic contact until the development of sailing technology in the Bronze Age; and the Sahara, a spread zone where human populations are naturally sparse and mobile, contacts are long-distance and not always stable, and patterns of linguistic diffusion and language spread may naturally have become quite different from those of Europe and sub-Saharan Africa. Probably the major factor, however, was the North African Humid Period (NAHP), lasting from about 10,000 to about 7,500 years ago, during which the monsoon belt moved north and the Sahara became a well-watered grassland with large lakes and rivers; it was followed by desiccation which continues today (see Williams 2021 and Hartmann and Nichols forthc. for the linguistic consequences, and sources there for the paleoclimatology and the impact on language family ranges). During the NAHP, human populations moved into the Sahara, probably from all sides, and post-NAHP there has been retreat, competition for decreasing resources, and likely language extinction. Linguistic diversity in today’s Sahel is extremely high, possibly reflecting crowding (over and above the high diversity expected at that latitude). It seems likely that the drastic expansion and retraction of the NAHP, together with the development of the Neolithic and then the Bronze and Iron Ages post-NAHP, must have affected rates and directionalities of language spread and diffusion (diffusion, and language boundaries, are chiefly east-west in Africa; Güldemann 2018, 2019; Cysouw and Comrie 2009; Güldemann and Hammarström 2020, i.e. north-south spreads and diffusion are inhibited). All of these factors must have impacted the distribution of linguistic complexity in the ETZ.

6.3.2 Theoretical issues concerning the ETZ

The variable boundaries and differently located complexity peaks or troughs in the ETZ indicate that the different complexity measures are independent to a great extent. If they were logically linked to one another or stemmed from the same group of languages, we would see a more homogeneous picture. However, they do seem to adhere to a common trend insofar as in this common transition zone most complexity measures are declining or increasing simultaneously. The congruence is not perfect, as just discussed, since diffusion processes involve different centers, trajectories, and rates of change.

Using a different method and a somewhat different inventory of variables, Hartmann and Nichols (forthc.) find an additional unique ETZ pattern: a low count of consonants and a high number of features, which suggests that, in the languages of the ETZ, more features are distributed over fewer consonant phonemes. Put differently, phoneme inventories are less economical and symmetrical in the ETZ than farther south in Africa, possibly reflecting the shorter post-NAHP time frame for the interaction and development of phoneme systems, compared to the very long time that was available for the development of the large and symmetrical consonant systems of southern Africa.

We find the ETZ effects only in Africa, though similar effects of climate changes might also be expected in similar latitudinal and climate zones in Australia and North and South America. We suspect the difference is primarily due to the greater width of Africa in the desert zone, affording room for geolinguistic patterns to be visible and a great variety of phonological complexity levels to develop. It could also be that a denser survey would reveal finer-grained patterns. Probably most important, the pattern of advance and fallback in response to climate and other changes in Africa has run north-south during the time frame recoverable from language family ages, while that of Australia is multidirectional, with movement between center and peripheries, as well as circular, with large spreads beginning in the northeast and moving both clockwise and counterclockwise around the continent. Capturing this would require a different and much more complex model (see Güldemann 2010; Güldemann and Hammarström 2020; Hartmann and Nichols forthc.; Lourandos 1997; McConvell 2020; Evans and McConvell 1998; Nichols 2020:27–30).

6.4 Is complexity adaptive?

Work showing correlations of complexity with other factors, both linguistic and non-linguistic, generally assumes that complexity is adaptive, over time tending to acquire the level that suits the sociolinguistic or extralinguistic context best or is most easily learned (Bentz 2018; Sinnemäki and Di Garbo 2018). Our findings suggest that, at the scale we are investigating, complexity levels are not responsive to any evident contextual factor.

Along the south-north areas (Africa-Europe, Australasia-Southeast Asia, South America-North America), the distribution of complexity is not consistent with what has been said about adaptation. Patterning along these axes, if adaptive, should at least reveal the one kind of geospatially structured complexity that has been previously described: latitudinally conditioned complexity with the extreme levels near the equator and a fall-off polewards in both directions (the cause has to do with increasing density of resources, length of growing season, density of languages, density and diversity of language families, density and diversity of species, population density, etc. toward the equator, vs. greater sparseness toward the poles). This is in fact what is found by Bentz (2016, 2018:125–127) and Maddieson (2013a). It is not what we see in our areas, however; they show increased complexity from south to north, i.e. a gradient extending from pole to pole.
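The contrast between an equator-centred profile and a south-north gradient can be illustrated with a minimal sketch that compares mean complexity across three latitude bands. This is not the regression modelling used in the paper; the band boundary at 23.5 degrees, the function names, and the toy values are all assumptions made for illustration only.

```python
from statistics import mean

def band_means(records, tropic=23.5):
    """Average complexity in three latitude bands.

    records: iterable of (latitude, complexity) pairs.
    tropic:  assumed band boundary in degrees (an illustrative choice).
    """
    bands = {"south": [], "equatorial": [], "north": []}
    for latitude, complexity in records:
        if latitude < -tropic:
            bands["south"].append(complexity)
        elif latitude > tropic:
            bands["north"].append(complexity)
        else:
            bands["equatorial"].append(complexity)
    # Only bands that actually contain data are returned.
    return {name: mean(values) for name, values in bands.items() if values}

def classify(means):
    """Label the profile; expects all three bands to be present."""
    south, equatorial, north = means["south"], means["equatorial"], means["north"]
    if south < equatorial < north or south > equatorial > north:
        return "monotonic gradient"
    if (equatorial > south and equatorial > north) or (
        equatorial < south and equatorial < north
    ):
        return "equator-centred"
    return "no clear pattern"

# Invented toy values, NOT the paper's data.
toy_gradient = [(-40, 2.0), (-30, 3.0), (0, 5.0), (10, 5.5), (40, 8.0), (55, 9.0)]
toy_peak = [(-40, 3.0), (-30, 2.0), (0, 8.0), (10, 9.0), (40, 3.0), (55, 2.0)]

gradient_label = classify(band_means(toy_gradient))
peak_label = classify(band_means(toy_peak))
```

In this toy setting, a latitudinally conditioned pattern of the kind described in the earlier literature would be labelled equator-centred, whereas the south-to-north increase reported here corresponds to the monotonic case.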

We conclude that, at the global scale of this study, complexity does not appear to be adaptive to either environmental or sociolinguistic contexts. Instead, it has a geospatial distribution that is too large and appears too stable to be the result of local sociolinguistic, ecological, or similar factors. Rather, on the face of it the global distribution would suggest that lower complexity is the ancestral state and higher complexity has evolved piecemeal as individual additive innovations have spread out from long-standing centers of population growth and expansion: Africa-Europe and Southeast Asia. There does appear to be evidence of adaptive complexity at lower levels of linguistic population structure, such as language families or local areas (Nichols and Bentz 2018), indicating, consistent with most of the literature, that complexity levels are the result of universal kinds of adaptation to historical contingencies such as sociolinguistic isolation. But the global scale of the present study and the large-scale or global scope of some of our complexity areas suggest that another, equally strong, factor is distance from the western Old World, which we interpreted (Section 3.3 and just above) as meaning distance from early centers of population growth.

6.5 Complexity as a meta-property

We found that the different complexity measures tend to correlate positively or neutrally; there are no strong negative correlations (see Figure 17). The positive correlations are strongest for measures that are inherently related, such as Total Consonants and Grades, Closure and Grades, Tones and Vowels, and Ancillary and Vowels. In general, we find mostly positively covarying variables which behave in similar ways vis-à-vis their geospatial patterning. One of the more surprising covariances is that between total tones and grades.
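A check for tradeoffs of this kind can be sketched as a pairwise scan over the measures that flags any strongly negative correlation. The sketch below uses plain Pearson correlations rather than the model-based covariances reported in Figure 17; the measure names, the threshold of -0.5, and the toy values are assumptions for illustration and do not come from the paper's data.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def pairwise_correlations(measures):
    """Map each unordered pair of measure names to its Pearson r."""
    return {
        (a, b): pearson(measures[a], measures[b])
        for a, b in combinations(sorted(measures), 2)
    }

def strongly_negative(correlations, threshold=-0.5):
    """Return the pairs whose correlation falls below the threshold."""
    return [pair for pair, r in correlations.items() if r < threshold]

# Invented toy values for six hypothetical languages, NOT the paper's data.
toy_measures = {
    "consonants": [12, 18, 22, 25, 30, 41],
    "vowels": [3, 5, 5, 6, 7, 9],
    "tones": [0, 0, 2, 1, 3, 4],
}

correlations = pairwise_correlations(toy_measures)
tradeoffs = strongly_negative(correlations)
```

An empty `tradeoffs` list corresponds to the situation described above, in which no pair of measures shows a strong negative correlation.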

We have seen that complexity can pattern similarly with coinciding geospatial trajectories. We see lower complexity in Australasia, South America, and eastern North America in several complexity domains. Higher complexity is found in southern Africa and the northern American Pacific Rim area. This suggests that individual regions not only display a unique complexity profile, but that this profile appears in many variables at the same time. In short, the findings show that the complexity variables considered here are cumulative. This is further supported by the variable covariances detected in the model (see Figure 17), where the covariances of notable strength are exclusively positive. This is true for variables of the same complexity type (e.g., all affecting consonants or all affecting syllables) but also across the types. Given that this is a stable global pattern, it can be taken as an indication that diachronic complexifications and decomplexifications are both geographically sensitive (i.e., they do not occur randomly scattered across all languages) and appear in unison with other complexifications or decomplexifications.

Thus, like many previous studies (e.g. Maddieson 2009, 2011), we find that complexity in general seems to be cumulative. More specifically, different complexity measures tend to be predictive of one another, so that complexity levels rise and fall in unison. The covariance of individual complexity measures found in Section 5.2 is furthermore relevant to the debate on equi-complexity, which concerns whether complexity differences on the micro-level (i.e. individual complexity measures) compensate one another so that languages are equally complex on the linguistic macro-level. Building on our findings in this paper, we conclude that there is little evidence for compensatory or tradeoff behavior between phonological measures, at least for larger geospatial trends (see also Maddieson 2005/2013a, 2005/2013b, 2011; Sinnemäki 2014a). We follow Hartmann (forthc.) in suggesting that the actual differences between the complexity measures, though salient, are constrained by the bounds of what is required for languages to maintain a stable phonological system (the observed complexity levels were never so low as to result in a collapse of discriminative power in lexical and sublexical units in any one language). In other words, it is conceptually difficult to determine compensatory effects in a system that is stable even when the variation in complexity within and across complexity measures is large (see e.g. Sinnemäki 2014a).

The co-occurrence of complexity patterns, both in the level of complexity and in its geographical extent, suggests that phonological complexity can be seen as a single property that manifests in individual observable complexity traits. It is influenced by geography in that historical developments such as migration and contact lines can yield different levels of complexity in different areas. This notion of an abstract property of complexity is further supported by the observation that the individual complexity features are mostly positively correlated. Complexity thus seems to be generally cumulative, and as a cumulative feature it can be well explained by a common underlying (i.e. latent) property that propagates outward into individual observable complexity measures at the surface (see discussion in Hartmann forthc.). A theory of complexity has to account for this apparent unity on the underlying level.

7 Conclusions

The global geographical distribution of phonological complexity shows distinctive patterns that are strongly dependent on the region or the geographical areas under investigation. We used methods of linear and nonlinear regression to examine these patterns, which made visible a number of geographical effects in different geographical areas. Among the key geographical patterns we identified is that phonological complexity is generally lower in Australasia and South America, coinciding with the far ends of human spread and migration. An isolation-by-distance model would predict that these areas, farthest from world centers of innovation and diversification and known to have evolved in a significant degree of genetic and archaeological isolation, have received fewer linguistic innovations and immigrations than more centrally located areas. On such a model it could be either that low complexity is the archaic or ancestral situation for human language, or that the shared low complexity level of both is accidental and has not been disrupted by cross-areal additive innovations. Both of these are different from founder effects, which are essentially random individual effects of founding populations and not the average type of the whole world population at the time.
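The prediction of an isolation-by-distance model can be sketched as a regression of a complexity score on great-circle distance from a reference point. The following is a minimal illustration and not the authors' analysis: the reference location, the use of straight great-circle distance rather than distance along plausible migration routes, the function names, and all toy values are assumptions.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    # min() guards against rounding pushing the argument just above 1.
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(a)))

def ols_slope(xs, ys):
    """Slope of the ordinary least-squares line of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical reference point (latitude, longitude); an assumption, not from the paper.
REFERENCE = (0.0, 20.0)

# Invented toy records: (latitude, longitude, complexity score). NOT the paper's data.
toy_languages = [
    (5.0, 25.0, 9.0),
    (30.0, 60.0, 7.5),
    (45.0, 120.0, 6.0),
    (-25.0, 135.0, 4.0),
    (-15.0, -60.0, 3.5),
]

distances = [
    haversine_km(REFERENCE[0], REFERENCE[1], lat, lon)
    for lat, lon, _ in toy_languages
]
scores = [score for _, _, score in toy_languages]
slope = ols_slope(distances, scores)  # change in complexity per kilometre
```

A negative slope in such a setup would be consistent with complexity falling off with distance from the chosen center; a linear fit of this kind cannot capture nonlinear patterns such as those described for the ETZ.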

Another key finding is the ETZ, a transition zone between different levels of complexity in southern Africa and Europe. The ETZ is characterized by an opposing complexity gradient between the two levels or between an ETZ level and another level outside of the ETZ. These complexity effects are highly nonlinear and are unlike any other pattern in the considered areas. An expert area-wide inquiry into typological history in the southern Mediterranean population over the entire ETZ might reveal contacts and types of interactions and subsequent evolutionary histories of contact effects that we are not equipped to detect.

Relevant to complexity theory in the more general view is that there are consistencies of complexity measures across areas: we see, for example, that different complexity measures are in most cases positively correlated, meaning that there seem to be no large-scale effects that are negatively correlated with one another. This suggests that complexity is coherent as an entity itself, a latent property motivating the various surface realizations. This observation parallels the notion of complexity as a cumulative trait identified in previous literature (e.g. Maddieson 2005/2013a, 2005/2013b, 2011). In addition, measures are more strongly correlated with other measures to which they are broadly similar in structure.

Overall we find no distributional support for complexity as adaptive. Rather, its distribution seems to be purely geospatial, which suggests that complexity itself, as well as the specific measures that make it up in our survey, has expanded or retracted in gradual diffusion and spread along contact networks, with spreading patterns originating in major centers of influence and moving slowly outward. The interaction of these large-scale spreads with more fine-grained adaptive patterns (like those plausibly due to sociolinguistic and sociohistorical factors) must be intricate and interesting, but does not show up at the scale of our inquiry. This also suggests that, even when salient enough to appear in a quantitative analysis, adaptive patterns are chiefly local or language-specific, and non-uniform in the direction of their effect. This intra- and supra-regional heterogeneity of adaptive patterns would yield data structures that are indistinguishable from random noise in quantitative analyses like ours.

This implies that linguistic complexity, at least concerning phonological properties, can be viewed as an abstract construct – cumulative and non-adaptive – underlying the observable phonological traits. Moreover, this underlying complexity level does not affect phonological properties equally and is influenced by areally-confined historical developments.

This paper was designed to have broad scope because the investigation aims to be a stepping stone for future research regarding theory, methodology, and results: the findings obtained here can be utilized by a wide range of scholarly inquiry into specific aspects of linguistic complexity and phonology. We hope that further examinations of certain geographical areas regarding contact, migration, and linguistic patterns will build on our findings to reveal connections between complexity distributions and historical processes.

Appendixes and Supplement

Appendix 1. ComplexityData_Final

Appendix 2. FullSurface_Heatmaps

Supplement. Coding practices


Corresponding author: Frederik Hartmann [ˈfʁɛdəʁɪk ˈhaʁtman], University of North Texas, Denton, TX, USA, E-mail:

References

Ackerman, Farrell & Robert Malouf. 2013. Morphological organization: The low conditional entropy conjecture. Language 89(3). 429–464. https://doi.org/10.1353/lan.2013.0054.

Anderson, Stephen R. 2015. Dimensions of morphological complexity. In Matthew Baerman, Dunstan Brown & Greville G. Corbett (eds.), Understanding and measuring morphological complexity, 11–27. Oxford: Oxford University Press.

Anderson, David G., David Echeverry, D. Shane Miller, Stephen J. Yerka, Eric Kansa, Sarah Whitcher Kansa, Christopher R. Moore, et al. 2019. Paleoindian settlement in the southeastern United States: The role of large databases. In David Thulman & Irv Garrison (eds.), Early Floridians: New directions in the search for and interpretation of Florida’s earliest inhabitants, 241–275. Gainesville: University Press of Florida. https://doi.org/10.5744/florida/9781683400738.003.0014.

Anderson, David G., D. Shane Miller, Stephen J. Yerka, J. Christopher Gillam, Erik N. Johanson, Derek T. Anderson, Albert C. Goodyear & Ashley M. Smallwood. 2010. PIDBA (Paleoindian Database of the Americas) 2010: Current status and findings. Archaeology of Eastern North America 38. 63–90.

Atkinson, Quentin D. 2011. Phonemic diversity supports a serial founder effect model of language expansion from Africa. Science 332. 346–349. https://doi.org/10.1126/science.1199295.

Baerman, Matthew, Dunstan Brown & Greville G. Corbett. 2015. Understanding and measuring morphological complexity: An introduction. In Matthew Baerman, Dunstan Brown & Greville G. Corbett (eds.), Understanding and measuring morphological complexity, 2–10. Oxford: Oxford University Press. https://doi.org/10.1093/acprof:oso/9780198723769.003.0001.

Bentz, Christian. 2016. The low-complexity belt: Evidence for large-scale language contact in human prehistory? In S. G. Roberts, C. Cuskley, L. McCrohon, L. Barcelo-Coblijn, Ol Fehér & T. Verhoef (eds.), The evolution of language: Proceedings of the 11th international conference (EVOLANG 11). New Orleans: EVOLANG. Available at: http://evolang.org/neworleans/papers/93.html.

Bentz, Christian. 2018. Adaptive languages: An information-theoretic account of linguistic diversity. Berlin: de Gruyter. https://doi.org/10.1515/9783110560107.

Bentz, Christian, Annemarie Verkerk, Douwe Kiela & Paula Buttery. 2015. Adaptive communication: Languages with more non-native speakers tend to have fewer word forms. PLOS One 10(6). https://doi.org/10.1371/journal.pone.0128254.

Bentz, Christian & Bodo Winter. 2013. Languages with more second language speakers tend to lose nominal case. Language Dynamics and Change 3. 1–27. https://doi.org/10.1163/22105832-13030105.

Bickel, Balthasar, Johanna Nichols, Taras Zakharko, Alena Witzlack-Makarevich, Kristine Hildebrandt, Michael Riessler, Lennart Bierkandt, et al. 2022. The Autotyp database, Release version 1.0.0. Zenodo: https://doi.org/10.5281/zenodo.5931509. GitHub: https://github.com/autotyp/autotyp-data/tree/v1.0.0.

Cotterell, Ryan, Christo Kirov, Mans Hulden & Jason Eisner. 2019. On the complexity and typology of inflectional morphological systems. Transactions of the Association for Computational Linguistics 7. 327–342. https://doi.org/10.1162/tacl_a_00271.

Cysouw, Michael & Bernard Comrie. 2009. How varied typologically are the languages of Africa? In Rudolf Botha & Chris Knight (eds.), The Cradle of language, 189–203. Oxford, UK: Oxford University Press. https://doi.org/10.1093/oso/9780199545858.003.0010.

Dahl, Östen. 2004. The growth and maintenance of linguistic complexity. Amsterdam: Benjamins. https://doi.org/10.1075/slcs.71.

Dalton, April S., Helen E. Dulfer, Martin Margold, Jakob Heyman, John J. Clague, Benjamin Stoker, Michelle S. Gauthier, et al. 2023. Deglaciation of the North American ice sheet complex in calendar years based on a comprehensive database of chronological data: NADI-1. Quaternary Science Reviews 321. 108345. https://doi.org/10.1016/j.quascirev.2023.108345.

Davis, Loren G. & David B. Madsen. 2020. The coastal migration theory: Formulation and testable hypotheses. Quaternary Science Reviews 249. 106605. https://doi.org/10.1016/j.quascirev.2020.106605.

Dillehay, T. D. 1997. Monte Verde: A late Pleistocene settlement in Chile. Washington, DC: Smithsonian.

Donohue, Mark & Johanna Nichols. 2011. Does phoneme inventory size correlate with population size? Linguistic Typology 15(2). 161–170. https://doi.org/10.1515/lity.2011.011.

Easterday, Shelece. 2019. Highly complex syllable structure: A typological and diachronic study. Berlin: Language Science Press.

Enfield, Nicholas J. 2014. The handbook of Austroasiatic languages. Leiden: Brill.

Evans, Nicholas & Patrick McConvell. 1998. The enigma of Pama-Nyungan expansion in Australia. In Roger Blench & Matthew Spriggs (eds.), Archaeology and language II: Archaeological data and linguistic hypotheses, 174–191. London & New York: Routledge. https://doi.org/10.4324/9780203202913_chapter_7.

Everett, Caleb. 2013. Evidence for direct geographic influences on linguistic sounds: The case of ejectives. PLOS One 8(6). e65275. https://doi.org/10.1371/journal.pone.0065275.

Everett, Caleb, Damián E. Blasí & Seán G. Roberts. 2016. Language evolution and climate: The case of desiccation and tone. Journal of Language Evolution 2016. 33–46. https://doi.org/10.1093/jole/lzv004.

Fagyal, Zsuzsanna, Samarth Swarup, Anna María Escobar, Les Gasser & Kiran Lakkaraju. 2010. Centers and peripheries: Network role in language change. Lingua 120(8). 2061–2079. https://doi.org/10.1016/j.lingua.2010.02.001.

Fenk-Oczlon, Gertraud & August Fenk. 2014. Complexity trade-offs do not prove the equal complexity hypothesis. Poznan Studies in Contemporary Linguistics 50(2). 145–155. https://doi.org/10.1515/psicl-2014-0010.

Gowan, Evann J., Xu Zhang, Sara Khosravi, Alessio Rovere, Paolo Stocchi, Anna L. C. Hughes, Richard Gyllencreutz, et al. 2021. A new global ice sheet reconstruction for the past 80,000 years. Nature Communications 12. 1199. https://doi.org/10.1038/s41467-021-21469-w.

Güldemann, Tom. 2010. Sprachraum and geography: Linguistic macroareas in Africa. In Alfred Lameli, Roland Kehrein & Stefan Rabanus (eds.), Language and space: International handbook of language variation, vol. 2: Language mapping, 561–585. Berlin: Mouton de Gruyter. https://doi.org/10.1515/9783110219166.1.561.

Güldemann, Tom. 2018. Historical linguistics and genealogical language classification in Africa. In Tom Güldemann (ed.), The languages and linguistics of Africa, 58–444. Berlin: De Gruyter Mouton. https://doi.org/10.1515/9783110421668-002.

Güldemann, Tom. 2019. The linguistics of Holocene High Africa. In Yonatan Sahle, Hugo Reyes-Centeno & Christian Bentz (eds.), Modern human origins and dispersal, 285. Tübingen: Kerns Verlag.

Güldemann, Tom & Harald Hammarström. 2020. Geographical effects in large-scale linguistic distributions. In Mily Crevels & Pieter Muysken (eds.), Language dispersal, diversification, and contact: A global perspective, 58–77. Oxford: Oxford University Press. https://doi.org/10.1093/oso/9780198723813.003.0004.

Hartmann, Frederik. Forthcoming. A computational approach to investigating phonological complexity with latent variables and dimensionality reduction. Phonology.

Hartmann, Frederik & Gerhard Jäger. 2024. Gaussian process models for geographic controls in phylogenetic trees. Open Research Europe 3(57). https://doi.org/10.12688/openreseurope.15490.2.

Hartmann, Frederik & Johanna Nichols. Forthcoming. The Euro-African transition zone: Findings from geospatial phonology.

Hay, Jennifer & Laurie Bauer. 2007. Phoneme inventory size and population size. Language 83(2). 388–400. https://doi.org/10.1353/lan.2007.0071.

Hijmans, Robert J. 2022. Geosphere: Spherical Trigonometry. Available at: https://CRAN.R-project.org/package=geosphere.

Hockett, Charles F. 1958. A course in modern linguistics. New York: Macmillan.

HUGO Pan-Asian SNP Consortium, Ahmed, Ikhlak, Anunchai Assawamakin, Jong Bhak, Samir K. Brahmachari, Gayvelline C. Calacal, Amit Chaurasia, et al. 2009. Mapping human genetic diversity in Asia. Science 326. 1541–1545. https://doi.org/10.1126/science.1177074.

Jacobsen, William H., Jr. 1989. The Pacific orientation of western North American languages. Paper presented at First Circum-Pacific Prehistory Conference, Seattle.

Janhunen, Juha. 2000. Grammatical gender from east to west. In Barbara Unterbeck & Matti Rissanen (eds.), Gender in grammar and cognition, 689–708. Berlin: Mouton de Gruyter. https://doi.org/10.1515/9783110802603.689.

Jombart, Thibaut. 2008. adegenet: A R package for the multivariate analysis of genetic markers. Bioinformatics 24. 1403–1405. https://doi.org/10.1093/bioinformatics/btn129.

Lourandos, Harry. 1997. Continent of hunter-gatherers. Cambridge: Cambridge University Press.

Lupyan, Gary & Rick Dale. 2010. Language structure is partly determined by social structure. PLOS One 5(1). e8559. https://doi.org/10.1371/journal.pone.0008559.

Maddieson, Ian. 2005/2013a. Syllable structure. In Martin Haspelmath, Matthew Dryer, David Gil & Bernard Comrie (eds.), World atlas of language structures, 54–57. Oxford: Oxford University Press. Available at: http://wals.info/chapter/12.

Maddieson, Ian. 2005/2013b. Vowel quality inventories. In Martin Haspelmath, Matthew Dryer, David Gil & Bernard Comrie (eds.), World atlas of language structures, 14–17. Oxford: Oxford University Press. Available at: http://wals.info/chapter/2.

Maddieson, Ian. 2009. Calculating phonological complexity. In François Pellegrino, Egidio Marsico, Ioana Chitoran & Christophe Coupé (eds.), Approaches to phonological complexity, 83–110. Berlin: De Gruyter Mouton. https://doi.org/10.1515/9783110223958.83.

Maddieson, Ian. 2011. Phonological complexity in linguistic patterning. Paper presented at International Congress of Phonetic Sciences, Vol. 17, 28–34. Hong Kong. Available at: https://www.internationalphoneticassociation.org/icphs-proceedings/ICPhS2011/index.htm.

McConvell, Patrick. 2020. The spread of Pama-Nyungan in Australia. In Tom Güldemann, Patrick McConvell & Richard A. Rhodes (eds.), The language of hunter-gatherers, 422–462. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781139026208.017.

Merkle, Edgar C., Ella Fitzsimmons, James Uanhoro & Ben Goodrich. 2021. Efficient Bayesian structural equation modeling in stan. Journal of Statistical Software 100(6). 1–22. https://doi.org/10.18637/jss.v100.i06.

Miestamo, Matti. 2008. Grammatical complexity in cross-linguistic perspective. In Matti Miestamo, Kaius Sinnemäki & Fred Karlsson (eds.), Language complexity: Typology, contact, change, 23–41. Amsterdam: Benjamins. https://doi.org/10.1075/slcs.94.04mie.

Moran, Steven & Damián Blasí. 2014. Cross-linguistic comparison of complexity measures in phonological systems. In Frederick J. Newmeyer & Laurel B. Preston (eds.), Measuring grammatical complexity, 217–240. Oxford: Oxford University Press. https://doi.org/10.1093/acprof:oso/9780199685301.003.0011.

Nichols, Johanna. 2009. Linguistic complexity: A comprehensive definition and survey. In Geoffrey Sampson, David Gil & Peter Trudgill (eds.), Language complexity as an evolving variable, 110–125. Oxford: Oxford University Press. https://doi.org/10.1093/oso/9780199545216.003.0008.

Nichols, Johanna. 2013. The vertical archipelago: Adding the third dimension to linguistic geography. In Peter Auer, Martin Hilpert, Anja Stukenbrock & Benedikt Szmrecsanyi (eds.), Space in language and linguistics, 38–60. Berlin: Mouton de Gruyter. https://doi.org/10.1515/9783110312027.38.

Nichols, Johanna. 2017. Person as an inflectional category. Linguistic Typology 21(3). 387–456. https://doi.org/10.1515/lingty-2017-0010.

Nichols, Johanna. 2020. Dispersal patterns shape areal typology. In Mily Crevels & Pieter Muysken (eds.), Language dispersal, diversification, and contact: A global perspective, 25–43. Oxford: Oxford University Press. https://doi.org/10.1093/oso/9780198723813.003.0002.

Nichols, Johanna. 2024. Founder effects identify languages of the earliest Americans. American Journal of Biological Anthropology. e24923. https://doi.org/10.1002/ajpa.24923.

Nichols, Johanna & Christian Bentz. 2018. Morphological complexity of languages reflects the settlement history of the Americas. In Katerina Harvati, Gerhard Jäger & Hugo Reyes-Centeno (eds.), New perspectives on the peopling of the Americas, 13–26. Tübingen: Kerns.

Nichols, Johanna, David A. Peterson & Jonathan Barnes. 2004. Transitivizing and detransitivizing languages. Linguistic Typology 8(2). 149–211. https://doi.org/10.1515/lity.2004.005.

Potter, Ben A., Charles E. Holmes & David R. Yesner. 2014. Technology and economy among the earliest prehistoric foragers in interior eastern Beringia. In Kelly E. Graf, Caroline V. Ketron & Michael R. Waters (eds.), Paleoamerican odyssey, 81–103. College Station, TX: Texas A&M University Press.

Praetorius, Summer K., Jay R. Alder, Alan Condron, Alan C. Mix, Jon M. Erlandson & Beth E. Caissie. 2023. Ice and ocean constraints on early human migrations into North America along the Pacific coast. PNAS 120(7). https://doi.org/10.1073/pnas.2208738120.

Raghavan, Maanasa, Pontus Skoglund, Kelly E. Graf, Mait Metspalu, Eske Willerslev, Ida Moltke, Simon Rasmussen, et al. 2014. Upper Paleolithic Siberian genome reveals dual ancestry of Native Americans. Nature 505. 87–91. https://doi.org/10.1038/nature12736.

Rootsi, Siiri, Lev A. Zhivotovsky, Marian Baldovič, Manfred Kayser, Willerslev Eske, Rita Khusainova, Marina A. Bermisheva, Marina Gubina, Sardana A. Fedorova, Anne-Mai Ilumäe, et al. 2007. A counter-clockwise northern route of the Y-chromosome haplogroup N from Southeast Asia toward Europe. European Journal of Human Genetics 15(2). 204–211. https://doi.org/10.1038/sj.ejhg.5201748.

Shagal, Ksenia, Max Wahlström & Johanna Nichols. 2019. (Non)finiteness in clause combining: A typological survey. Paper presented at ALT 13, University of Pavia.

Shagal, Ksenia, Max Wahlström & Johanna Nichols. n.d. (Non)finiteness in clause combining in northern Eurasia. Unpublished manuscript.

Shimunek, Andrew. 2017. Languages of ancient Southern Mongolia and North China. Wiesbaden: Harrassowitz Verlag. https://doi.org/10.2307/j.ctvckq4f7.

Shosted, Ryan K. 2006. Correlating complexity: A typological approach. Linguistic Typology 10. 1–40. https://doi.org/10.1515/lingty.2006.001.

Sinnemäki, Kaius. 2011. Language universals and linguistic complexity: Three case studies in core argument marking. University of Helsinki. PhD dissertation.

Sinnemäki, Kaius. 2014a. Global optimization and complexity tradeoffs. Poznań Studies in Contemporary Linguistics 50(2). 179–195. https://doi.org/10.1515/psicl-2014-0013.

Sinnemäki, Kaius. 2014b. Complexity trade-offs: A case study. In Frederick J. Newmeyer & Laurel B. Preston (eds.), Measuring grammatical complexity, 179–201. Oxford: Oxford University Press. https://doi.org/10.1093/acprof:oso/9780199685301.003.0009.

Sinnemäki, Kaius & Francesca Di Garbo. 2018. Language structures may adapt to the sociolinguistic environment, but it matters what and how you count: A typological study of verbal and nominal complexity. Frontiers in Psychology 9. 1141. https://doi.org/10.3389/fpsyg.2018.01141.

Trudgill, Peter. 2011. Sociolinguistic typology: Social determinants of linguistic structure and complexity. Oxford: Oxford University Press.

Vajda, Edward (ed.). 2024. The languages and linguistics of Northern Asia. Berlin: Mouton de Gruyter. https://doi.org/10.1515/9783110556216.

Wei, Taiyun & Viliam Simko. 2017. R package “corrplot”: Visualization of a correlation matrix (Version 0.84). Available at: https://github.com/taiyun/corrplot.

Wichmann, Søren, Taraka Rama & Eric W. Holman. 2011. Phonological diversity, word length, and population sizes across languages: The ASJP evidence. Linguistic Typology 15(2). 177–197. https://doi.org/10.1515/lity.2011.013.

Willerslev, Eske & David J. Meltzer. 2021. Peopling of the Americas as inferred from ancient genomics. Nature 594. 356–364. https://doi.org/10.1038/s41586-021-03499-y.

Williams, Martin. 2021. When the Sahara was green. Princeton: Princeton University Press.

Received: 2023-09-20
Accepted: 2025-03-05
Published Online: 2025-07-18
Published in Print: 2025-10-27

© 2025 the author(s), published by De Gruyter, Berlin/Boston

This work is licensed under the Creative Commons Attribution 4.0 International License.

Downloaded on 20.12.2025 from https://www.degruyterbrill.com/document/doi/10.1515/lingty-2023-0077/html