Why modelling space is hard: no evidence for a serial founder effect in Polynesian phoneme inventories

Matías Guzmán Naranjo; Laura Becker; Miriam L. Schiele; I-Ying Lin

doi:10.1515/ling-2024-0016

Article Open Access

Published/Copyright: May 23, 2025

From the journal Linguistics

Abstract

In recent years there has been an increased interest in computational modeling of spatial phenomena in typology. While the main focus of most work so far has been on direct language contact, there are two different types of spatial dynamics of interest to typologists and areal linguists: language expansion, and asymmetric contact effects. In this paper we present five statistical techniques to model phylogenetic and spatial relations between languages. We illustrate these techniques with a case study on Polynesian phoneme inventory sizes, and show that there is robust evidence for asymmetric contact effects from Non-Polynesian on Polynesian, but little evidence for expansion effects within Polynesian. In other words, we do not find convincing evidence for a serial founder effect in Polynesian phoneme inventory sizes. We also argue that we are still far from having a complete understanding of how to model all spatial dynamics that can affect language contact, and that more attention should be paid to these issues.

Keywords: language contact; migration; Polynesia; Gaussian Process; Bayesian statistics

1 Introduction

In recent years there has been an increased interest in computational modeling of spatial phenomena in typology (Guzmán Naranjo and Becker 2022; Guzmán Naranjo and Mertner 2023; Hartmann 2022; Hartmann and Jäger 2023; Ranacher et al. 2021; Urban and Moran 2021). While the main focus of most work so far has been on direct language contact, there are two different types of spatial dynamics of interest to typologists and areal linguists: language expansion, and asymmetric contact effects. Especially language expansion in the form of serial founder effects has received attention in typology in that serial founder effects have been argued to account for variation in phoneme inventory sizes (Atkinson 2011b; Trudgill 2004). A serial founder effect refers to a scenario in which populations expand from some point of origin in sequential migrations of small speaker communities. Such a scenario has been argued to lead to less complexity of certain linguistic structures, e.g., to smaller phoneme inventory sizes with increasing distance between a speech community and its origin.

Previous quantitative approaches to modeling spatial phenomena in general, and to serial founder effects in particular, have mostly not taken into account the complexities of spatial relations between languages. Spatial modeling has sometimes been presented as a sort of solved problem, or at least as relatively easy to account for. We argue in this paper that it is neither, and that we are only beginning to understand how to take space into account in our models. Taking phoneme inventory sizes in Polynesian languages as a test case, this study proposes five statistical techniques: (i) to control for the phylogenetic relations between the languages in the dataset, (ii) to model symmetric contact between languages, (iii) to model asymmetric contact effects from one set of source languages on another set of potential target languages, (iv) to estimate the probability of such asymmetric contact effects for a given target language, and (v) to model the spatial expansion from a common origin. The first aim of this paper is thus to highlight the complexities in modeling spatial relations between languages and to offer new statistical solutions that can also be used in other studies dealing with spatial relations between languages in quantitative typology.

Our choice of phoneme inventory sizes in Polynesian as a test case is a mostly practical one. On the one hand, we have fairly detailed knowledge about the Polynesian expansion from a common origin. This allows us to model spatial expansion in a relatively detailed way. On the other hand, serial founder effects in expansion scenarios have been mainly discussed in relation to phoneme inventory sizes. Phoneme inventory size is also a variable that can be easily affected by contact through the borrowing of individual phonemes (cf. Matras 2009: 222–224; Moran et al. 2024). In addition, Polynesian languages are known to have had sustained contact with Non-Polynesian languages in Melanesia and, to a lesser extent, Micronesia. In other words, Polynesian phoneme inventories present a suitable test case for exploring statistical methods to capture different types of spatial relations between languages. We have comparatively detailed knowledge about their historical spatial expansion and about contact situations that have led to phoneme borrowing, and phoneme inventories have been investigated for serial founder effects from a quantitative typological perspective before. Therefore, the second aim of this paper is to provide a more comprehensive quantitative approach to exploring the evidence for or against a serial founder effect on phoneme inventory size. Foreshadowing our results, we find no convincing evidence in support of a serial founder effect in Polynesian phoneme inventory sizes.

This paper is structured as follows. Section 2 provides background information on Polynesian languages and their expansion, as well as on previous claims regarding phoneme inventory sizes and the serial founder effect. Section 3 describes the dataset used for the present study and our annotation choices. In Section 4, we give an overview of the general patterns concerning phoneme inventories in the Polynesian languages from our dataset in relation to evidence about diachronic developments and language contact from the literature. We then describe the five statistical methods that capture factors with a potential effect on phoneme inventory sizes in Polynesian: phylogenetic relations between the languages (Section 5.1), symmetric contact effects between Polynesian languages (Section 5.2), asymmetric contact effects from Non-Polynesian on Polynesian (Section 5.3), the probability of asymmetric contact affecting a given Polynesian language (Section 5.4), and a potential serial founder effect in the form of directed expansion (Section 5.5). Section 6 presents the model results. We discuss their implications in Section 7, and Section 8 concludes.

2 Background

2.1 Polynesian languages

Polynesian languages belong to the Oceanic branch of the Austronesian language family. They are mainly situated in the Polynesian Triangle, a geographical region with Hawai’i as the northern apex, New Zealand as the southernmost corner, and Rapanui (Easter Island) as the easternmost point (Krupa 1973: 1). In addition, about 20 so-called Polynesian Outlier languages are located on the periphery of eastern Micronesia and Melanesia (Pawley 1967: 260). Glottolog distinguishes 38 Polynesian varieties; our dataset includes 36 of them. Their location can be seen in Figure 1, with the core Polynesian languages plotted in dark blue and the Outlier languages in light blue.

Figure 1:

Location of the 36 Polynesian languages in the dataset.

Despite the fact that Polynesian has long been known to constitute a phylogenetic group within Oceanic (cf. Blust 2013: 118), the study of the internal phylogenetic structures between Polynesian languages only emerged in the 20th century. Elbert (1953) proposed an internal structure of Polynesian languages based on phonology, distinguishing between the two main branches of Nuclear Polynesian and Tongic (Elbert 1953: 169). Later, Pawley (1966: 39–40) proposed a more fine-grained subgrouping based on morphology and divided Nuclear Polynesian further into Samoic-Outliers and Eastern Polynesian. This division into Tongic, Nuclear Polynesian, several Outlier subbranches, and Eastern Polynesian mainly corresponds to the phylogenetic relations distinguished today. Without going into further detail, we follow the internal phylogenetic structure of Polynesian as proposed by Glottolog.^[1] The full phylogenetic tree of all 36 Polynesian languages in our dataset can be seen in Figure 2. As major groups, we thus distinguish Tongic versus Nuclear Polynesian, the latter of which consists of several Outlier groups and East Polynesian (cf. also Biggs 1971).

Figure 2:

Phylogenetic structure of the Polynesian languages in the dataset.

It is widely accepted that the Polynesian homeland lies in an area around Tonga, Samoa, Futuna, and ’Uvea (cf. Bellwood 1979; Geraghty 1983; Green 1981; Jennings 1979; Kirch 1984a, 1996; Kirch and Green 1992, 2001; Pawley and Green 1973, 1984). Archeological evidence ties the origin of Polynesians to the earlier Lapita culture in Melanesia, mostly characterized by complex ceramic pottery (cf. Kirch 1997; Pawley 2007). Based on archeological findings, we can assume that the Lapita culture spread from the Bismarck archipelago via the Solomon islands, Vanuatu, the Loyalty Islands, and Fiji to Tonga and Samoa. The beginning of the Lapita culture in Western Melanesia dates back to around 1500 BC, and it persisted for about 1,000 years, until around 500 BC (Kirch 1996: 61–63). Early Polynesian culture gradually developed from the Lapita culture, which consisted of a chain of populations from Fiji to Samoa and Tonga. Over the course of time, Lapita populations in central Polynesia had less and less contact with Lapita populations in the West, which led to their isolation and thus to the development of a distinct, early Polynesian culture. Cultural artifacts suggest that this separation happened some time in the first millennium BC (Kirch 2017: 187–188).

Many details of the ensuing Polynesian expansion remain debated and unknown, especially regarding dates and time frames. There is, however, more consensus on the general spatial patterns of the expansion. We will give a brief overview here; see the map in Figure 4 in Section 3.3 for our operationalization of Polynesian migration paths.

The Polynesian expansion started in the Western Polynesian center around Tonga, Samoa, Futuna, and ’Uvea. From there, early Polynesians sailed east, reaching the Southern Cook islands as the first area of expansion in Eastern Polynesia (Irwin 1994: 98). The Polynesians then traveled further to the Austral islands, Rapa island, the Society islands, the Tuamotu islands, the Marquesas islands, the Gambier islands, and the Pitcairn islands. According to modern radiocarbon dating, Polynesians populated central and southern East Polynesia between 900 and 1200 AD (Kahn and Sinoto 2017; Kennett et al. 2012; Wilmshurst et al. 2011; cf. Kirch 2017: 198–200). The most remote islands of Hawai’i, Rapanui, and New Zealand were colonized somewhat later. The Hawai’i islands were likely settled from the Southern Cook islands along a chain of contact between the Society islands, the Tuamotu islands, and the Marquesas islands (Green and Weisler 2002: 234–235; Kirch 2017: 200). The Polynesians reached Rapanui by moving eastward from the Society islands via Mangareva island (Gambier) and Henderson island (Pitcairn) (Green and Weisler 2002: 234–235; Kieviet 2017: 2; Martinsson-Wallin and Crockford 2001). New Zealand was populated last, from the Southern Cook islands, most likely in the 13th century (Goodwin et al. 2014; Kirch 2017: 200; Wilmshurst et al. 2011).

In addition to their eastward expansion, Polynesians also traveled west from central Polynesia (Kirch 1984b; Ward et al. 1973) to Melanesia and Micronesia. There is evidence that some of this east-to-west expansion happened very early in the first millennium BC, while other islands were settled only around 1500 AD (cf. Carson 2012; Kirch 1984b). By now, evidence points to two main origins of the Polynesian Outlier populations. Tuvalu is the most likely origin of the northern Outliers in Micronesia (Nukuoro and Kapingamarangi) as well as of Melanesian Outlier populations in Papua New Guinea (Takuu) and the Solomon Islands (Ontong Java, Sikaiana, Reef Islands, Duff Islands) (Carson 2012: 28; Kirch 2017: 161). ’Uvea most likely corresponds to the second point of origin for other Polynesian Outlier populations in Melanesia further south in the Solomon Islands (Anuta, Tikopia, Rennell, Bellona), Vanuatu (Futuna, Aniwa, Emae, Efate) and the Loyalty Islands (Ouvéa) (Carson 2012: 28; Kirch 1984b: 230).

2.2 Phoneme inventory sizes and the serial founder effect

It has been proposed that language expansion leaves clear signatures on the grammar of the languages involved. Among these, the best known are probably serial founder effects (Atkinson 2011b; Deshpande et al. 2009; Fort and Pérez-Losada 2016; Pérez-Losada and Fort 2018). Originally, serial founder effects were proposed for population genetics (Betti et al. 2009; Manica et al. 2007; Pierce et al. 2014). The main idea is that when populations expand from a homeland in sequential migrations of small numbers of individuals, genetic diversity decreases as the populations expand further and further. Sequential small migrations lead to genetic bottlenecks, which in turn decrease genetic diversity within groups.

Adapting this concept to linguistics, it has been proposed that serial founder effects can impact the structure of languages. This is due to a combination of socio-linguistic factors, namely the interaction between population size and linguistic complexity. Trudgill (2004) argues that if a linguistic community is isolated, small, or has minimal contact with other languages, there is a high probability that linguistic complexity in that language will persist and be transmitted to subsequent generations (Trudgill 2004: 306). He argues that, at the same time, languages with a low degree of contact and small populations can afford to develop smaller inventories, as opposed to languages with larger populations and more contact and therefore potential adult learners. The explanation given is that smaller inventories provide fewer possible segmental contrasts, which in turn lead to higher memory load. According to Trudgill (2004: 315–317), this is not a problem for small tight-knit communities with few L2 learners, while it does pose problems for looser-knit communities with more contact and L2 learners.^[2]

In this vein, Trudgill (2004: 312), building on earlier ideas by Haudricourt (1961), suggests that “the two factors of isolation and small community size” may account for phoneme inventories of Polynesian languages, which are relatively small when compared to other Austronesian languages closer to Taiwan (cf. Trudgill 2004, 2011: 155–156).^[3] The association between phoneme inventory sizes and population sizes has since been tested by a number of other studies, with mixed results. Some studies found effects (e.g., Hay and Bauer 2007; Wichmann et al. 2011), while others did not (e.g., Donohue and Nichols 2011; Moran et al. 2012; Pericliev 2004).

If, for the sake of argument, we assume that there is a real association between population size and phoneme inventory size, then the serial founder effect would lead to languages having gradually smaller phoneme inventories the further away from the homeland they are located, if the expansion happens in sequential migrations. This is what Atkinson (2011a) claims. Based on phoneme inventory size, he argues that we can observe a serial founder effect in that languages have gradually smaller phoneme inventories with increasing distance to Africa, which is taken as a global point of origin for the spread of humans. Although this claim has received methodological and theoretical criticism (e.g., Cysouw et al. 2012a; Wang et al. 2012), other recent works have reported similar findings (e.g., Fenk-Oczlon and Pilz 2021; Fort and Pérez-Losada 2016; Pérez-Losada and Fort 2018).

The above-mentioned studies all take a global approach to the idea of a serial founder effect observable in phoneme inventory sizes. Given the large scale and the variety of language-internal and language-external factors that can impact the phoneme inventory (size) of a language, it is unclear whether the observed effects on a global scale can really be attributed to expansion. It is possible that the effects are rather an artifact of different confounding factors.

Because of this, if we want to take the hypothesis of a serial founder effect seriously, it is important to test it in a more controlled setting and to control for other confounding factors. As far as we are aware, however, no study has looked at effects of geographical expansion on linguistic structure in detail for a small number of languages in a clearly defined region yet. Although some of the more recent studies like Pérez-Losada and Fort (2018) and Fenk-Oczlon and Pilz (2021) make use of quantitative techniques, they do not try to fully control for genetic and other spatial confounds. Moreover, several other studies found no evidence that would support a potential serial founder effect with phoneme inventories (Creanza et al. 2015; Cysouw et al. 2012a). This calls for a smaller-scale quantitative approach to testing for a potential serial founder effect on phoneme inventory size and exploring its interaction with the effects of other spatial factors in a more comprehensive way, which is one of the aims of this paper.

3 Dataset and annotation

3.1 Dataset

Our dataset includes 36 Polynesian languages whose spatial distribution was shown on the map in Figure 1.^[4] As mentioned in Section 2.1, we can distinguish between core Polynesian languages (spoken within the Polynesian triangle) and Polynesian Outlier languages (spoken in Melanesia and Micronesia). In addition to the Polynesian languages, which are the main focus of the present study, we included 124 Non-Polynesian languages in our dataset. Those are languages from other Austronesian branches or other language families spoken in Melanesia and Micronesia in the regions where the Polynesian Outlier languages are spoken. Figure 3 shows their geographic locations (red triangles) together with the 36 Polynesian languages from the dataset.

Figure 3:

Map of the Polynesian and Non-Polynesian languages in the dataset.

3.2 Phoneme inventory annotation

All languages in the dataset (both Polynesian and Non-Polynesian) were annotated for their phoneme inventories. We did this using the information provided in reference grammars and descriptions. Of course, determining which phones have phoneme status, let alone determining the total number of phonemes in a language, is not trivial and can be subject to biases on various levels. Therefore, we generally rely on the decisions in the grammatical descriptions made by experts on the respective languages.

The only systematic exception to this is length, which is treated differently across languages as well as descriptions and analyses. In some cases, segments that only differ in length with no other qualitative difference are treated as separate phonemes, whereas this is not done in other cases. Although there may be valid theoretical and/or language-internal reasons for either analysis, we systematically counted phonological length as an additional phonological feature. In other words, we treated /a/ and /aː/ or /l/ and /lː/ as separate phonemes. In most cases, either classification did not affect the final count of phonemes to a great extent, but in some cases it did. To give one example, Clark (1995: 949–950) describes the phoneme inventory of Mele-Fila as having 15 consonants and five vowels. Yet, he notes that the language has distinctive length contrasts in all vowels and consonants. We therefore annotated the phoneme inventory size of Mele-Fila as 30 + 10 = 40.
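The counting rule described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the annotation procedure actually used for the dataset; the function name `inventory_size` and its parameters are hypothetical.

```python
def inventory_size(n_consonants, n_vowels, long_consonants=0, long_vowels=0):
    """Count phonemes, treating every long segment as a separate phoneme.

    Hypothetical helper: `long_consonants` and `long_vowels` are the numbers
    of consonants and vowels that have a distinctive long counterpart.
    """
    return n_consonants + long_consonants + n_vowels + long_vowels


# Mele-Fila (Clark 1995): 15 consonants and 5 vowels, with distinctive
# length contrasts in all of them, i.e. 30 + 10 = 40.
mele_fila = inventory_size(15, 5, long_consonants=15, long_vowels=5)
print(mele_fila)
```

Without counting length, the same description would yield 20 phonemes, which shows how strongly this annotation choice can affect individual languages.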

Treating length distinctions this way in our dataset ensures systematicity in the annotation across languages, which follows the proposal of comparative concepts for typology (cf. Croft 2016; Haspelmath 2010, 2018). The main idea behind comparative concepts is to define a linguistic category that can be applied and used across languages for typological comparisons. Therefore, it must not rely on language-specific criteria and can differ from established definitions and classifications in linguistic traditions of single languages. We are aware of the fact that it may not always be useful or theoretically motivated to treat short and long phonological segments as separate phonemes.^[5] For the purposes of this study, however, it is warranted for crosslinguistic comparability and systematicity. In principle, we could have also decided never to treat such segments as separate phonemes. This, however, would have led to fewer distinctions and the loss of important variation, especially when comparing the Outlier languages to core and Eastern Polynesian (cf. Section 4).

3.3 Polynesian expansion and distance to origin

To take the migration paths into account when modeling the Polynesian expansion, we require precise coordinates in order to calculate the distances between the current location of a language and the point of origin. Although there is a solid body of work from linguistics, archaeology, and genetics related to the pathways of the Polynesian expansion, there is no straightforward model available that would meet our criteria. Therefore, we built a graph of migration paths based on what is known from the literature, assuming parsimony and simplifying certain aspects of the expansion for modeling purposes.^[6]

As mentioned in Section 2.1, the most likely origin of the Polynesian expansion corresponds to the area between Samoa and Tonga. Given that we have to select single points in order to calculate distances, we calculated two sets of distances for each of the languages, taking Samoa or Tonga as the point of origin. This is a necessary simplification for modeling purposes; the early Polynesians most likely maintained a travel network between Samoa and Tonga rather than expanding from only one of the islands. Using both Samoa and Tonga as points of origin for two alternative sets of distances, however, ensures that we can gauge the impact of these choices and check for potential differences in the results. For reasons of space, we only report results with Samoa as the point of origin; there were no noticeable differences in the models using Tonga as the point of origin.^[7] Figure 4 shows our final expansion graph.

Figure 4:

Graph of the Polynesian expansion with Samoa as the point of origin.

For the purposes of this study, we assume a unidirectional expansion from one point to another. Despite the fact that the expansion involved more complex travel networks and maintained bidirectional contact between different pairs of islands (cf. Irwin 1994), this is a necessary simplification in order to estimate the distances between current language locations and their point of origin.

For islands that are very close to the Tonga-Samoa area with no specific migration paths mentioned in the literature, we assume a direct path from Samoa (or Tonga) by parsimony. This is the case for Niue, Pukapuka, Niuafo’ou, Tuvalu, Tokelau, and Tonga (or Samoa, when Tonga is taken as the point of origin).

Another simplification is that we use the coordinates provided by Glottolog as a proxy for islands/atolls and island/atoll groups. Often, the literature can only point to island groups (e.g., the Society islands) that have acted as important hubs and intermediate steps in the expansion. As mentioned before, we need precise coordinates for the calculation of distances, meaning that we have to select a point in that island group. This is not a trivial choice, especially given that such island groups can span a territory of a thousand square kilometers. We can narrow down such island groups to those islands with archaeological findings which support long-term presence of Polynesians. This still leaves us with a number of potential islands per island group in various cases. As a heuristic, we therefore selected islands that are also featured as the location of other Polynesian languages in our dataset. For instance, we use the Glottolog coordinates for Tahitian, spoken on Tahiti, to represent the Society islands as an intermediate step in the expansion from, e.g., the Cook islands (Rarotonga Island) to the Gambier islands (Mangareva).

We included additional points that do not correspond to any language in our dataset in only two cases. The first is Henderson island as an intermediate step between Mangareva and Rapanui (cf. Green and Weisler 2002: 233). The second additional point with no associated Polynesian language from our sample is the island of Taumako. Because Glottolog locates Vaeakau-Taumako on the Reef islands, we added Taumako as an intermediate step between Tuvalu and the Reef islands (cf. Næss and Hovdhaugen 2011: 11).

For the migration to the northern Outliers, there are two possible paths, namely via either Tuvalu or Tokelau (cf. Carson 2012; Kirch 2017), that the methods described above could not resolve. In this case, we selected Tuvalu as the intermediate step due to its shorter absolute distance to the northern Outlier islands.
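To make the idea of a distance to origin along migration paths concrete, the following Python sketch sums leg-by-leg distances along a unidirectional path from Samoa to Rapanui. It rests on several assumptions that are not taken from the study: the coordinates are rough illustrative values rather than the Glottolog coordinates, each leg is measured as a great-circle (haversine) distance, and the names `COORDS`, `PARENT`, `haversine_km`, and `distance_to_origin` are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0

# Rough (latitude, longitude) values in degrees, for illustration only;
# these are NOT the Glottolog coordinates used in the study.
COORDS = {
    "Samoa": (-13.9, -171.8),
    "Rarotonga": (-21.2, -159.8),
    "Tahiti": (-17.6, -149.4),
    "Mangareva": (-23.1, -135.0),
    "Henderson": (-24.4, -128.3),
    "Rapanui": (-27.1, -109.4),
}

# Unidirectional expansion: each point maps to the point it was settled from.
PARENT = {
    "Rarotonga": "Samoa",
    "Tahiti": "Rarotonga",
    "Mangareva": "Tahiti",
    "Henderson": "Mangareva",
    "Rapanui": "Henderson",
}


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def distance_to_origin(node):
    """Sum the leg distances from `node` back to the origin of the graph."""
    total = 0.0
    while node in PARENT:
        parent = PARENT[node]
        total += haversine_km(COORDS[node], COORDS[parent])
        node = parent
    return total


for name in COORDS:
    print(f"{name}: {distance_to_origin(name):.0f} km along the path")
```

Because every point has exactly one predecessor in this simplified graph, the distance along the path is unambiguous, and it is never shorter than the direct distance between a language and the origin.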

4 General trends and patterns

4.1 Consonants

The most common consonants in Polynesian are shown in Table 1. They all belong to the set of consonants reconstructed for Proto-Polynesian (cf. Biggs 1978: 708). Those in black in Table 1 correspond to phonemes in at least 30 out of 35 languages (80 %).

Table 1:

Common consonants (>80 % in black, 50–80 % in gray).

	labial	alveolar	velar	glottal
plosive	/p/	/t/	/k/	/ʔ/
fricative	/f/	/s/		/h/
nasal	/m/	/n/	/ŋ/
liquid		/r/, /l/
lateral
glide

The only consonant that all 35 languages in the sample have is the plosive /p/; the other most common consonants are /k/ (32 languages), as well as the two nasals /m/ and /n/ (33 and 32 languages, respectively). This reflects earlier findings from historical linguistics, showing that the consonants /p/, /m/, and /n/ are the most diachronically stable ones in Polynesian with no or very few changes from Proto-Polynesian to the Polynesian languages spoken today (Marck 2000: 24). All other phonemes have undergone changes to varying degrees in a number of languages. This is reflected in Table 1, where the phonemes given in gray are present in 18–28 languages (50–80 %). Out of all Proto-Polynesian phonemes, /ʔ/ has been shown to be the least stable in that it has been lost in many Polynesian languages (Marck 2000: 24–25).

Proto-Polynesian is reconstructed to have had both liquids /l/ and /r/, with /l/ developing into /r/ and vice versa in a number of Polynesian languages (Marck 2000: 52–57). This is why many Polynesian languages only feature one of the two liquids: /r/ is found in 17 languages of our dataset, 21 have /l/, and only eight languages have both. Six out of the eight languages with phonemic /l/ and /r/ are Outlier languages where /l/ is an innovation rather than a continuation from Proto-Polynesian. According to Elbert (1965: 435), Tikopia and Takuu split /r/ into two distinct phonemes, whereas /l/ became a new phoneme in Mele-Fila, Emae, Futuna-Aniwa, and Rennell-Bellona mainly through borrowing from Non-Polynesian languages.

Another unstable consonant in the history of Polynesian is the glide /w/, reconstructed for Proto-Polynesian (Marck 2000: 49). It developed into /v/ in a number of languages, which is why it is only part of the phoneme inventory in six languages from our dataset.
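The frequency bands used in Table 1 can be illustrated with a short Python sketch that classifies some of the counts reported above. It assumes that all counts refer to the same 35 languages; the function name `frequency_band` and the band labels are hypothetical, and integer arithmetic is used so that the 80 % boundary (28 of 35 languages) is handled exactly.

```python
def frequency_band(n_with_phoneme, n_languages=35):
    """Assign a phoneme to a frequency band based on how many languages have it.

    Hypothetical helper; the default of 35 languages is an assumption.
    """
    if 5 * n_with_phoneme > 4 * n_languages:   # strictly more than 80 %
        return "common (>80 %)"
    if 2 * n_with_phoneme >= n_languages:      # between 50 % and 80 %
        return "intermediate (50-80 %)"
    return "rare (<50 %)"


# Counts of languages reported in the text for individual consonants.
counts = {"p": 35, "m": 33, "k": 32, "n": 32, "l": 21, "w": 6}

for phoneme, n in counts.items():
    print(f"/{phoneme}/: {n} languages, {frequency_band(n)}")
```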

Voiced plosives are very rare in Polynesian languages. Out of all Polynesian languages in our dataset, only Rennell-Bellona and West Uvean have phonemic voiced plosives. Rennell-Bellona features /b/ and /g/, while West Uvean has /b, g, d/ as well as the retroflex and palatal voiced plosives /ɖ/ and /ɟ/. In both languages, the phonemic status of these consonants is likely the result of language contact with speakers of Melanesian languages.

For West Uvean, there is clear evidence for contact with Non-Polynesian languages. The language is spoken in the north and south of the Ouvéa island (Loyalty Islands). It is still in direct contact with the Melanesian language Iaai, spoken in the center of the island (Ozanne-Rivierre 1994: 524). As will be discussed in more detail in Section 4.3, voiced plosives (among other phonemes) in West Uvean have been borrowed from Iaai. In addition, voiced plosives in word-final position in West Uvean are the result of English and French loans rather than loans from Iaai (Ozanne-Rivierre 1994: 534–535).

Rennell-Bellona is currently not in contact with any Non-Polynesian language, but there is evidence for language contact in the (distant) past. Linguistic evidence for past language contact comes from a number of loan words of Melanesian origin in Rennell-Bellona (Carson 2012: 34). Long-term contact with Non-Polynesian Melanesian speaking communities (called hiti) is also plausible based on cultural and archaeological evidence (Elbert and Schütz 1988: 277–278). It is thus likely that when Polynesians settled in Rennell and Bellona some time after 1000 AD, the islands were already inhabited by the hiti people, who then likely co-inhabited those islands with the Polynesians for some time (Carson 2012: 38).

Another group of consonants that are found in only some of the Polynesian languages are long consonants or geminates. Table 2 lists those long consonants together with the languages in our dataset that they occur in.

Table 2:

Phonemic long consonants and aspirated consonants.

	/pː/	/tː/	/kː/	/fː/	/vː/	/sː/	/hː/	/mː/	/nː/	/ŋː/	/lː/
Tuvalu	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓
Mele-Fila	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓
Sikaiana	✓	✓	✓	✓	✓	✓	✓	✓	✓		✓
Takuu	✓	✓	✓	✓	✓	✓	✓	✓	✓		✓
Nukuoro	✓	✓	✓		✓	✓	✓	✓	✓	✓	✓
Kapingamarangi	(pʰ)	(tʰ)	(kʰ)				✓	(mʰ)	(nʰ)	(ŋʰ)	(lʰ)
West Uvean	(pʰ)	(tʰ)	(kʰ)					(mʰ)	(nʰ)	(ŋʰ)	(lʰ)

Except for Tuvalu, all languages with long consonants belong to the group of Outlier Polynesian languages, i.e., they are spoken in Melanesia and Micronesia to the West of the Polynesian triangle. Although Tuvalu is not classified as an Outlier language as such, it has been identified as one of the most likely origins or closest relatives of some Outlier languages in the North (e.g., Carson 2012; Kirch 1984b; Pawley 1966). Among those Outliers are Nukuoro, Kapingamarangi, Takuu, and Sikaiana, which also feature long consonants as shown in Table 2. Such long or geminated consonants have been argued to be a shared innovation in a number of Outlier languages (cf. Biggs 1978: 700–701; Pawley 1967: 286–287). It is likely that they originate from earlier reduplicated C₁V₁C₂V₂ sequences, in which V₁ was often unstressed and identical with V₂, leading to its loss. For instance, Proto-Polynesian *lelei ‘good’ developed into Tuvaluan llei ‘good’ (Pawley 1967: 286–287).

In addition to long consonants, Table 2 shows aspirated consonants in Kapingamarangi and West Uvean equivalent to the geminated ones in the other languages. It has been argued that aspirated consonants in Polynesian originate from the same C₁V₁C₂V₂ sequences as the geminated consonants. This development was explored in more detail by Milner (1958) for Tuvalu, providing convincing evidence that aspirated consonants are the result of reduplication with vowel elision.^[8] Additionally, Milner (1958) argued that a similar process could have taken place in Kapingamarangi as well. Næss and Hovdhaugen (2011) give a similar account of aspirated consonants for West Uvean, which we return to in Section 4.3.

According to Pawley (1967: 267), long consonants are the most important shared innovation for postulating a phylogenetic grouping that these languages all belong to. Still, he acknowledges that this shared feature may be the result of convergence, meaning that it was a parallel development facilitated by contact rather than an inherited property (Pawley 1967: 287). The remaining language with long consonants in Table 2 is Mele-Fila, another Outlier language spoken on the island of Efate and likely linked to East Uvean and East Futuna (Carson 2012) rather than Tuvalu. This does not fit straightforwardly with the account of an inherited property. A shared development due to contact, on the other hand, would be more apt to account for long consonants in Mele-Fila. Interestingly, we find a related phenomenon in Rennell-Bellona, another Outlier language from the southern group. Rennell-Bellona has been described as having consonant clusters due to vowel loss in fast speech (Elbert and Schütz 1988: 17–19). It could well be that this reflects a similar development to the one leading to geminated (and aspirated) consonants, only that it stayed on the phonetic level without becoming integrated into the phonological system of the language.

4.2 Vowels

Polynesian vowel systems commonly distinguish five vowel qualities. This can be seen in Table 3, showing the vowels that occur in >80 % of the Polynesian languages in the dataset. These vowel qualities also correspond to the five reconstructed ones for Proto-Polynesian (Biggs 1978: 701). Reconstructing vowel length has proven to be difficult, as vowel length has long not been marked orthographically (Dempwolff 1929; Elbert 1953). While we do not have vowel length information for all reconstructed words, it is generally assumed that length distinctions for vowels were already part of Proto-Polynesian (Biggs 1971: 483).

Table 3:

Common vowels (>80 %).

	front		central		back
high	/i/	/iː/			/u/	/uː/
mid	/e/	/eː/			/o/	/oː/
low			/a/	/aː/

Compared to consonants, there is little variation or innovation in terms of vowel phonemes in Polynesian languages. There is one Outlier language in our dataset, West Uvean, that has a noteworthy innovation of four additional vowels: /æ, ə, œ, y/.^[9] According to Ozanne-Rivierre (1994: 534), those were borrowed from the Melanesian language Iaai due to long and intense language contact. Interestingly, in West Uvean all three vowels /œ/, /ə/, and /e/ are used for /ə/ in Iaai loans, reflecting different stages of borrowing (Ozanne-Rivierre 1994: 540–541).

In relation to situations of language contact, Matras (2008: 37) notes that “contact-related change is more likely to affect consonants than vowels”. Thus, the patterns described here and in the previous section, showing more contact-induced, innovative consonants than vowels in Polynesian, fit in well with what we know about phonological borrowing from the literature.

4.3 Inventory sizes

Figure 5 shows the geographic distribution of phoneme inventory sizes in Polynesian. The inventory sizes range from 16 in Anuta to 40 phonemes in Mele-Fila, with a median of 21 phonemes. As can be seen in Figure 5, larger inventories tend to be located in the West in Melanesia and Micronesia, i.e., they tend to be found in Polynesian Outlier languages. Smaller inventories are more common in the Polynesian triangle, but they are certainly not confined to that area. Still, Figure 5 already suggests that language contact with Non-Polynesian languages in Melanesia may be involved in explaining larger consonant inventories, especially for the eight largest inventories which are all situated in the west: Mele-Fila (40), West Uvean (36), Tuvalu (32), Takuu (32), Sikaiana (30), Nukuoro (30), Vaeakau-Taumako (30) and Kapingamarangi (28).

Figure 5:

Distribution of phoneme inventory sizes.

As was shown in Section 4.1, larger phoneme inventories are in part the consequence of the phonemic long consonants that have developed in a number of Outlier languages and in Tuvalu. Thus, except for West Uvean and Vaeakau-Taumako, the other languages with the largest phoneme inventories all have the phonemic opposition between long and short consonants (cf. Table 2). As was mentioned above, the development of long consonants is the result of vowel deletion between identical consonants. Yet, it is unclear to what extent this is necessarily a shared innovation in some common ancestor language or a parallel development in the single languages later on. The latter is not unlikely, given that we can assume extensive contact between the different Outlier languages (Clark 1994; Pawley 1967). Although this is not explicitly discussed in the literature, we cannot exclude that contact with Non-Polynesian languages favored certain stress patterns that could have in turn facilitated the loss of unstressed vowels and thus the development of long consonants.

Related to that, Clark (1994: 117) describes a shift in word stress for Mele-Fila from the penultimate to the antepenultimate syllable. This shift is analyzed as the result of so-called intimate borrowing from North and South Efate, which “requires prolonged intimacy between the two communities (such as frequent intermarriage over generations), affects all parts of linguistic structure, and in particular its lexical effects will not be localised but should pervade the lexicon as a whole” (Clark 1994: 113).

Geminated consonants are not the only extension that has led to larger phoneme inventory sizes in some of the Outliers, and it has long been noted that their phonological properties differ from Triangle Polynesian. Elbert (1965: 440–441) notes that West Uvean, Vaeakau-Taumako, and Kapingamarangi show the greatest expansion in terms of phoneme inventories. He notes that both West Uvean and Vaeakau-Taumako are spoken in close proximity with other Non-Polynesian languages, whereas Kapingamarangi is fairly isolated linguistically. We will briefly discuss the phoneme inventory of these three languages here and relate them to their contact situation alluded to by Elbert (1965).

As mentioned, Kapingamarangi is fairly isolated, with the closest Non-Polynesian languages located at a distance of about 450 km (Clark 1994: 110). It is spoken on the Kapingamarangi atoll, which is the southernmost point of Micronesia. The next closest atoll is Nukuoro (where another Polynesian Outlier language is spoken), at a distance of about 300 km, which makes it one of the more isolated Polynesian languages. Its comparatively large phoneme inventory stems from the opposition between unaspirated and aspirated consonants as shown in Table 2 in Section 4.1. It is unclear whether this reflects a purely language-internal development or contact with other Polynesian languages.

One of the other languages that Elbert (1965) mentioned as having greatly increased their phoneme inventory is Vaeakau-Taumako. Vaeakau-Taumako is an Outlier language spoken in the Reef Islands and Duff Islands in the eastern part of the Solomon Islands. There are many other Oceanic, Polynesian as well as Papuan languages spoken in the greater area of the Solomon Islands; the language spoken in closest vicinity to Vaeakau-Taumako is the Oceanic language Äiwoo.^[10] Vaeakau-Taumako has indeed one of the largest phoneme inventories in our dataset, and similarly to Kapingamarangi, this is not due to phonemic long consonants either. One innovation is the phonemic status of the voiced plosives /b/ and /d/. Næss and Hovdhaugen (2011: 11) note that “it is possible that the presence of voiced oral stops in Vaeakau-Taumako, unusual for a Polynesian language, may reflect the presence of an earlier language community which shifted to the newly arrived language […] However, as we have no knowledge of what such an earlier language may have been like, this cannot be verified; though it may be noted that the Main Reefs language Äiwoo, which might share an ancestor with this hypothetical original language if their speakers were part of the same Lapita expansion, has a full set of voiced oral stops”.

Besides that, the large phoneme inventory of Vaeakau-Taumako is mainly due to a consistent distinction between unaspirated (/p, t, k, m, n, ŋ, l/) and aspirated (devoiced) consonants (/p^h, t^h, k^h, m^h, n^h, ŋ^h, l^h/). Although it is not entirely clear what the source of the aspirated consonants is in all cases, Næss and Hovdhaugen (2011: 37) mention two possible sources. For some cases, they show evidence that an aspirated nasal resulted from vowel deletion between an unvoiced fricative and an unaspirated nasal. For other cases, Næss and Hovdhaugen (2011: 37) show that aspirated consonants have an origin in reduplication, similar to long or geminated consonants in a number of other Polynesian Outliers. They propose that in words with reduplicated material, the vowel of the reduplicated syllable was lost, and the resulting geminated consonant developed further into an aspirated consonant, as in kai (SG) > kakai (PL) > *kkai > khai ‘eat’ (Næss and Hovdhaugen 2011: 37). In other words, aspirated, devoiced consonants in Vaeakau-Taumako may, at least in certain cases, represent a further development from geminated consonants. In Section 4.1, we showed that Kapingamarangi has aspirated variants of the same consonants that are commonly geminated in the other Outlier languages (cf. Table 2). Thus, while these developments are usually treated as a language-internal process, it is very likely that early contact between Outliers and other Polynesian as well as Non-Polynesian languages has spread or facilitated both developments. An opposition between unaspirated and aspirated plosives is not uncommon in the area and is found in other Melanesian languages. Tryon and Hackman (1983: 78) list a number of varieties spoken on Santa Isabel to “have developed a phonemic aspirated voiceless stop series”. Other work on Oceanic languages of New Caledonia has shown correspondences between aspirated consonants in the north and high tones in the south, arguing for the same origin in reduplicated forms with vowel elision (Haudricourt 1968; Ozanne-Rivierre 1995; Rivierre 1993). To conclude, the development of long and aspirated consonants has likely been facilitated by contact between Polynesian and other Non-Polynesian languages in Melanesia and west Polynesia.

The third language cited by Elbert (1965) for its large expansion of the phoneme inventory is West Uvean. As shown in Figure 5, West Uvean also has one of the largest phoneme inventories (36) in our dataset despite the lack of long consonants. Instead, West Uvean features a number of additional phonemes that are partially the result of phoneme splits, long-term contact with the Melanesian language Iaai and more recent contact with English and French. Although phoneme splits are traditionally seen as language-internal as opposed to language-external changes due to contact, certain language-internal developments related to voicing likely facilitated the borrowing of new phonemes from Iaai and their integration into West Uvean (Ozanne-Rivierre 1994: 538). Two important splits that occurred involve voicing. The two original voiceless plosives /k/ and /t/ became voiced /g/ and /d/, respectively, in word-initial and intervocalic positions. This contrast in voicing then also spread to bilabial plosives, resulting in the opposition between /p/ and /b/. On the one hand, the availability of a voicing opposition with plosives has allowed for word-final voiced plosives in later French and English loan words (Ozanne-Rivierre 1994: 534). On the other hand, it also likely facilitated maintaining a voicing contrast in the retroflex (/ʈ, ɖ/) and palatal plosives (/c, ɟ/) borrowed from Iaai. In addition, the Heo variety of West Uvean also retained some of the voicing opposition in nasals, laterals and glides from Iaai, which led to the additional voiceless /m̥, n̥, l̥, w̥/ in the Heo variety of West Uvean. However, the Muli variety of West Uvean only uses their voiced counterparts /m, n, l, w/. Ozanne-Rivierre (1994: 540) relates this difference between varieties to differences in the degree of bilingualism due to their respective geographical locations. While the Heo variety is located in the north of the island of Ouvéa, the Muli variety is mostly spoken on Muli, a separate small island in the south-west of Ouvéa. Ozanne-Rivierre (1994: 537) remarks that “[t]he Fagauvea [West Uvean] spoken in Heo (WUH), which is more influenced by Iaai because it is located on Ouvéa island itself, seems to have less of a tendency to ‘nativise’ than does the Fagauvea [West Uvean] spoken on the small island of Muli (WUM)”. Thus, the Heo variety tends to conserve voiceless nasals borrowed from Iaai better than the Muli variety. Other West Uvean phonemes that are the result of language contact with Iaai are the consonants /θ, ʃ, w, ɲ/ and the vowels /æ, ə, œ, y/ (cf. Section 4.2). Interestingly, Ozanne-Rivierre (1994: 530–531) shows that /h/ in West Uvean is not simply a continuation of the reconstructed Proto-Polynesian *h, but integrated into the language due to later loans from other Polynesian languages that West Uvean has been in contact with, possibly East Uvean or Tongan.

5 How to model complex spatial interactions between languages

In order to examine the Polynesian phoneme inventory data for a potential serial founder effect, this section addresses several relevant extra-linguistic factors and proposes five statistical techniques to capture them in modeling. The first factor that needs to be accounted for is the phylogenetic relation between the languages in the dataset, which we control for with phylogenetic regression (Section 5.1). We then turn to the space-related factors that can impact linguistic structures, namely symmetric contact, asymmetric contact, and the directed expansion of the speaker populations through the course of time. In Section 5.2, we show how a stationary Gaussian Process can be used to model symmetric contact between the Polynesian languages of the dataset. We then propose a new method to estimate unidirectional contact effects from Non-Polynesian languages on Polynesian in Section 5.3, and show how we can capture varying degrees of unidirectional contact effects on Polynesian in Section 5.4. Section 5.5 addresses the Polynesian expansion and introduces non-stationary Gaussian Processes to capture it in statistical modeling in order to test for a potential serial founder effect. Section 5.6 provides an overview of the final models and their structure. For the implementation and model definition, see the supplementary materials.

5.1 Controlling for phylogenetic bias with phylogenetic regression

Working with a crosslinguistic dataset that includes related languages, we may run the risk of over-representing a particular linguistic property if we miss dependencies between languages and take them as independent datapoints. One important dependency between languages to control for is their phylogenetic relatedness, especially in our case, where we investigate Polynesian languages that are all (more closely or distantly) related. To capture these phylogenetic relations between the languages in our sample, we use phylogenetic regression (cf. Becker et al. 2023; Bentz et al. 2015; Guzmán Naranjo and Becker 2022; Verkerk and Di Garbo 2022). We will not go into the details of phylogenetic regression in this paper, but the basic idea is that we add a group-level (or so-called random) effect coefficient for each language in our sample. However, instead of allowing each coefficient to vary freely, we force the estimates of this term to be correlated between languages according to a pre-computed phylogenetic tree. Effectively, we have an intercept for each language, and the values of these intercepts follow the structure of the tree. The coefficients for two languages which are more closely related will have closer estimates than the coefficients for languages which are further apart in the phylogenetic tree. Instead of adding categorical language families to the model, this approach captures the fact that phylogenetic relations are gradual: languages can be more or less closely related and are not simply binned together in several categories. Therefore, phylogenetic regression also works well when all languages in a dataset come from the same phylogenetic unit such as Polynesian, as it can handle different degrees of relatedness as long as we have information on that in the form of a phylogenetic tree. To build a phylogenetic tree, we used the phylogenetic information from Glottolog as shown in Figure 2.
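The idea of tree-structured intercepts can be illustrated with a minimal sketch. It assumes a toy tree of four hypothetical languages with invented branch lengths, and it takes the covariance between two languages to be the branch length they share from the root, which is one common choice and not necessarily the one used in our implementation.

```python
# Toy tree (hypothetical languages and branch lengths, for illustration only).
# Each language maps to the (branch name, length) pairs on its root-to-tip path.
paths = {
    "A": [("root-n1", 1.0), ("n1-A", 1.0)],
    "B": [("root-n1", 1.0), ("n1-B", 1.0)],
    "C": [("root-n2", 0.5), ("n2-C", 1.5)],
    "D": [("root-D", 2.0)],
}


def shared_length(path_a, path_b):
    """Total length of the branches that two root-to-tip paths have in common."""
    branches_b = dict(path_b)
    return sum(length for name, length in path_a if name in branches_b)


languages = sorted(paths)
# Covariance between two languages = amount of shared evolutionary history.
cov = [[shared_length(paths[a], paths[b]) for b in languages] for a in languages]


def correlation(i, j):
    """Correlation implied by the covariance matrix for languages i and j."""
    return cov[i][j] / (cov[i][i] * cov[j][j]) ** 0.5


for name, row in zip(languages, cov):
    print(name, row)
```

In this sketch, the intercepts of A and B would be pulled towards each other because they share half of their history, whereas C and D share no branch with them and would be estimated independently.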

5.2 Modeling symmetric contact with a stationary Gaussian Process

We know that sustained long-term contact with a high degree of bilingualism can lead to the borrowing of phonemes and to an increase in phoneme inventory size (Matras 2009: 222–224). We can take this into account in modeling by using a proxy variable for potential contact, namely geographic proximity. This is an established way of operationalizing (potential) language contact regardless of the sampling method or statistical control used (e.g., Cysouw et al. 2012b; Dryer 2018; Holman et al. 2007; Jaeger et al. 2011). While the idea of using distances between languages to model contact effects is not new, we propose a relatively new method to do so, namely Gaussian Processes.

Gaussian Processes (Rasmussen 2004; Rasmussen and Williams 2006) were first introduced for use in typology by Guzmán Naranjo and Becker (2022) and Guzmán Naranjo and Mertner (2023), and their use has been further expanded by Hartmann and Jäger (2023).^[11] The basic idea behind a Gaussian Process is that we first calculate the distance between all observations in our dataset, and then use a kernel function to build a covariance matrix. This covariance matrix expresses the potential spatial correlation between observations. The key feature of a Gaussian Process is that the spatial correlation between observations decays non-linearly with distance. Put differently, two languages spoken closely together will be highly correlated, but as the distance between languages increases, this correlation can drop to effectively 0.^[12] One particular feature of the usual approach to Gaussian Processes is that they are stationary. This means that the model effects are not dependent on the absolute location of the languages, but on their locations in relation to each other. Consider the 1-dimensional toy example of a stationary Gaussian Process in Figure 6. We see a stationary data-generating function (red line), data sampled from that function (light blue dots), and a corresponding 1-dimensional stationary Gaussian Process fitted to the data (dark blue dots with uncertainty intervals).^[13] In this type of scenario, the value of y_i does not depend on the absolute value of x_i, but on its distance to its neighbors and a general behavior of the function. The example in Figure 6 shows that a Gaussian Process can track a non-linear function very well.
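The non-linear decay of correlation with distance can be made concrete with a short sketch. It assumes a squared-exponential kernel, which is a widely used stationary kernel; the amplitude and length-scale values below are arbitrary illustrations, not estimates from our models.

```python
import math


def squared_exponential(distance, amplitude=1.0, length_scale=500.0):
    """Covariance of two observations separated by `distance`.

    `distance` and `length_scale` share the same unit (here read as km).
    Both parameter defaults are assumptions chosen for illustration.
    """
    return amplitude ** 2 * math.exp(-distance ** 2 / (2 * length_scale ** 2))


# Covariance is maximal at distance 0 and falls off faster than linearly.
for d in (0, 250, 500, 1000, 2000, 4000):
    print(d, squared_exponential(d))
```

Because the function only takes the distance between two points as input, shifting all locations by the same amount leaves every covariance unchanged, which is what stationarity amounts to here.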

Figure 6:

Stationary Gaussian Process.

Since we deal with coordinates for the languages in the dataset, we need to build a two-dimensional distance matrix from the geodesic distance between all pairs of languages.^[14] We can then use the two-dimensional Gaussian Process to estimate the spatial correlation between the phoneme inventory sizes of Polynesian languages. In other words, the two-dimensional Gaussian Process estimates how much of the variation in phoneme inventory size can be attributed to the phoneme inventory sizes of the languages spoken in proximity, i.e., contact.
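A pairwise distance matrix of this kind can be sketched as follows. The sketch uses the haversine great-circle formula on a sphere as an approximation of geodesic distance, and the three coordinate pairs are hypothetical locations rather than languages from the dataset.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; a sphere only approximates the ellipsoid


def great_circle_km(p, q):
    """Haversine distance in km between two (latitude, longitude) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


# Hypothetical locations; two of them lie on either side of the antimeridian.
locations = [(-10.0, 179.0), (-10.0, -179.0), (5.0, 160.0)]

distance_matrix = [[great_circle_km(p, q) for q in locations] for p in locations]

for row in distance_matrix:
    print([round(d) for d in row])
```

Working with great-circle distances rather than raw coordinate differences matters in the Pacific, since longitudes of 179 and -179 degrees are close neighbors on the globe.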

5.3 Modeling asymmetric contact with unidirectional contact-effect estimation

As was shown in Section 4, we know from the literature that Non-Polynesian languages had an important influence on the phoneme inventories of Polynesian languages in Melanesia and Micronesia, and that this influence was mostly asymmetric. Moreover, even if Polynesian languages have also impacted neighboring Non-Polynesian languages, this influence is irrelevant for the purposes of the present study to account for the variation in phoneme inventory size in Polynesian. In this case we have theoretical and practical reasons to model language contact as unidirectional from Non-Polynesian to Polynesian. Specifically, this corresponds to a scenario in which the borrowing of a new phoneme or a new contrastive feature (e.g., length) from Non-Polynesian leads to a change in the inventory size of Polynesian languages.

In order to capture this type of asymmetric contact influence, we propose a new technique which we call unidirectional contact-effect estimation. The main idea is that we expect Non-Polynesian languages to have an effect on Polynesian languages similarly to how we expect normal contact to happen, that is, languages closer to each other are expected to be able to influence each other more than languages further apart. In our context, we assume that Polynesian languages can borrow phonemes or features from Non-Polynesian languages, and that this borrowing can lead to an increase in phoneme inventory size by direct integration of the feature. In general, as for the contact effects modeled by a stationary Gaussian Process, we assume that languages spoken in closer proximity to each other can influence each other more than languages spoken further apart. To capture such an effect from Non-Polynesian on Polynesian, our method of unidirectional contact-effect estimation involves the following three steps:

We fit a stationary Gaussian Process to a model predicting the presence or absence of a phoneme or phonological feature p in Non-Polynesian languages. We do not include Polynesian languages in this part of the model. We also include a phylogenetic term for the Non-Polynesian languages.
We predict the expected probability of p at the location of each Polynesian language L_i in our dataset. In other words, we do interpolation of probabilities based on the Gaussian Process fitted to the Non-Polynesian languages for the locations of the Polynesian languages.
In a new model for the Polynesian languages, we use the estimated probability of p to predict the phoneme inventory size of each L_i as a linear effect.
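The three steps above can be sketched in a strongly simplified form. The sketch assumes that step 1 has already produced latent (probit-scale) values for a feature p at three hypothetical donor locations, uses planar coordinates and a squared-exponential kernel, and plugs in invented coefficients for step 3. The actual models estimate all parts jointly and with full uncertainty, which this sketch does not attempt.

```python
import math


def kernel(p, q, length_scale=300.0):
    """Squared-exponential covariance between two planar locations (km)."""
    return math.exp(-math.dist(p, q) ** 2 / (2 * length_scale ** 2))


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def normal_cdf(z):
    """Standard normal CDF, i.e., the inverse probit link."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


# Step 1 (assumed done): latent values of feature p at donor languages.
donor_xy = [(0.0, 0.0), (200.0, 0.0), (0.0, 300.0)]  # hypothetical locations
donor_latent = [1.5, 1.0, -0.5]                      # hypothetical estimates


def contact_probability(xy, jitter=1e-6):
    """Step 2: interpolate the probability of p at a recipient location."""
    k = [[kernel(a, b) + (jitter if i == j else 0.0)
          for j, b in enumerate(donor_xy)] for i, a in enumerate(donor_xy)]
    weights = solve(k, donor_latent)
    latent = sum(kernel(xy, d) * w for d, w in zip(donor_xy, weights))
    return normal_cdf(latent)


def expected_inventory(prob, intercept=math.log(20), slope=0.5):
    """Step 3: linear effect of the probability on the log expected size."""
    return math.exp(intercept + slope * prob)


for xy in [(50.0, 50.0), (5000.0, 5000.0)]:
    prob = contact_probability(xy)
    print(xy, prob, expected_inventory(prob))
```

A recipient language far away from all donor languages receives a latent value close to zero, that is, a probability close to 0.5 in this sketch, since a zero-mean Gaussian Process reverts to its prior mean where it has no nearby data.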

While we can expand this approach to an arbitrary number of phonemes, we need to consider that the presence of two phonemes can be correlated with each other, e.g., [p^h] and [k^h]. Strong correlations between two phonemes can lead to multicollinearity issues in the estimates for the linear component. More importantly, strong correlations between multiple phonemes can lead to overly optimistic estimates if we model each phoneme independently. To account for this potential correlation, we do not fit independent (logistic) regression models to each phoneme. Instead, we fit a single multivariate probit model to all relevant phonemes (cf. Guzmán Naranjo and Mertner 2023). This means that we model the probability of [p^h] and that of [k^h] together rather than separately. Put differently, multivariate probit regression can model multiple binary outcomes as correlated by assuming an underlying multivariate normal distribution.
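The latent-variable view of a multivariate probit model can be illustrated by simulation. The sketch below generates two binary outcomes (which one could read as the presence of [p^h] and [k^h]) by thresholding a pair of correlated standard normal variables; the correlation value, sample size, and seed are arbitrary assumptions.

```python
import random


def simulate_pair(rho, n, seed=1):
    """Draw n pairs of binary outcomes from a bivariate probit with correlation rho."""
    rng = random.Random(seed)
    outcomes = []
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        # z2 is standard normal and has correlation rho with z1.
        z2 = rho * z1 + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
        # A feature is present when its latent variable exceeds the threshold 0.
        outcomes.append((z1 > 0, z2 > 0))
    return outcomes


def agreement(outcomes):
    """Share of observations in which both features are present or both absent."""
    return sum(a == b for a, b in outcomes) / len(outcomes)


print("independent:", agreement(simulate_pair(0.0, 20000)))
print("correlated: ", agreement(simulate_pair(0.8, 20000)))
```

With a latent correlation of 0, the two features agree in about half of the cases; with a strong positive latent correlation they agree much more often, which is the kind of dependency that separate per-phoneme models would ignore.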

Another potential confound when estimating the asymmetric effects of Non-Polynesian languages on Polynesian languages comes from phylogenetic relatedness, since a number of them are Oceanic and thus related to other Austronesian languages. To control for this, we also include a phylogenetic term in the multivariate probit model.

The last remaining question regarding the effect of Non-Polynesian languages on Polynesian phoneme inventory sizes is which phonemes should be included for the purposes of this study. Since we are interested in contact which could have led to variation in phoneme inventory size, we selected those phonemes and phonological features that cause the main amount of variation across Polynesian languages in terms of phoneme inventories.^[15] These are the presence / absence of:

vowel length
non-cardinal vowels
voiced plosives
long or aspirated plosives
two or more liquids
the phonemes /v/, /w/, /s/, /f/, /ŋ/, /ʔ/, /h/

5.4 Modeling the probability of unidirectional contact effects with mixture models

A further question that emerges from the discussion on how to capture unidirectional contact is that of detecting cases in which contact may have played a role and those in which it likely has not. In our particular case, we do not necessarily want to assume that all Non-Polynesian languages have had an impact on all Polynesian languages. Inversely, we would like to know which Polynesian languages were likely most influenced by Non-Polynesian languages and which ones were not. This last point basically boils down to a question of probabilistic classification: how likely is each observation to belong to group A (not influenced by Non-Polynesian) or group B (influenced by Non-Polynesian).

To implement this idea we use what is called a mixture model (Bradley et al. 2000; Lindsay 1995; Rasmussen 1999).^[16] In a mixture model we assume that observations come from two different distributions. For example, Figure 7 shows the mixture of two normal distributions, one with mean = 0 and sd = 1 (group A), and the other with mean = 3 and sd = 2 (group B). In this example, the observations come from two distinct groups (also called components) A and B. A mixture model can try to recover, for each observation, what the probability is of that observation belonging to either A or B.

Figure 7:

Example of mixture of two normal distributions.
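For the two normal distributions of this example, the probability that an observation belongs to group B follows directly from Bayes' rule. The sketch below assumes equal mixing weights for both groups, which is an illustrative choice; in a fitted mixture model these weights are estimated.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def prob_group_b(x, weight_b=0.5):
    """Posterior probability that x was generated by group B.

    Group A: mean = 0, sd = 1; group B: mean = 3, sd = 2.
    weight_b is the (assumed) prior share of group B in the mixture.
    """
    a = (1 - weight_b) * normal_pdf(x, 0.0, 1.0)
    b = weight_b * normal_pdf(x, 3.0, 2.0)
    return b / (a + b)


for x in (-1.0, 0.0, 1.5, 3.0, 5.0):
    print(x, prob_group_b(x))
```

Observations near 0 are attributed mostly to group A and observations around 3 or above mostly to group B, with intermediate values receiving intermediate probabilities.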

In our case, we assume that we have two distributions of observations: Polynesian languages not influenced by Non-Polynesian languages (A), and Polynesian languages influenced by Non-Polynesian languages (B). That is, one component is the model which only includes a stationary Gaussian Process (A), and the other component is the model which includes a stationary Gaussian Process and unidirectional contact (B).^[17] It is more or less equivalent to using both models at the same time on the data, and estimating which observations are more likely to be better explained by which model. We discuss the results of the mixture model in Section 6.3.

5.5 Modeling directed expansion with a non-stationary Gaussian Process

Given that serial founder effects have been proposed and studied specifically for Polynesian phoneme inventory sizes (e.g., Haudricourt 1961; Trudgill 2004), we also want to account for them in our models. In other words, we need to include a component that represents the Polynesian expansion from their historical point of origin. This in turn means that we need to be able to capture the distances between the current locations of Polynesian languages and their historical point of origin. Assuming a serial founder effect for phoneme inventory sizes, we would expect that phoneme inventories decrease in size with increasing overall distance to the point of origin.

We can use a non-stationary Gaussian Process model in such a scenario. In contrast to stationary ones, non-stationary Gaussian Processes do consider absolute location in space. This is achieved by combining two kernels: a non-linear kernel like the one used in the stationary Gaussian Process, and a linear kernel. The non-linear kernel is responsible for capturing how observations influence each other, and the linear kernel is responsible for the absolute trend. The latter corresponds to the absolute distance to the point of origin, i.e., the serial founder effect, in our case. To illustrate this, Figure 8 shows a non-stationary function (red line) together with the data sampled from that function (red triangles).^[18] In this example, the y value of observation i depends on both the absolute value of x for i, and the relative distance on x between i and its neighbors, and their values. While there are noticeable fluctuations in the value of y, overall we can expect larger values of y for larger values of x.

Figure 8:

Non-stationary Gaussian Process versus stationary Gaussian Process on non-stationary test data.

Furthermore, Figure 8 shows how a non-stationary Gaussian Process (light blue) performs predicting this kind of data in comparison to a stationary Gaussian Process (dark blue). We fitted both models to half of the data (x < 10), and then predicted the remaining half (x > 10). The difference is clear: the stationary Gaussian Process fails to capture the general upward trend in the data and makes incorrect predictions for most of the test data. In contrast, the non-stationary Gaussian Process in Figure 8 does a very good job of predicting the values of the test data including the overall upward trend.
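A comparison of this kind can be reproduced in miniature. The sketch below is not the code behind Figure 8: the data-generating function (a linear trend plus a sine wave), the noise level, and all kernel parameters are assumptions, and the kernel parameters are fixed rather than estimated.

```python
import math
import random


def se(a, b, length_scale=1.0):
    """Stationary squared-exponential kernel: depends only on a - b."""
    return math.exp(-(a - b) ** 2 / (2 * length_scale ** 2))


def linear(a, b):
    """Linear kernel: depends on the absolute values of a and b."""
    return a * b


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def gp_predict(kernel, x_train, y_train, x_test, noise_var=0.01):
    """Posterior mean of a zero-mean Gaussian Process at the test locations."""
    k = [[kernel(a, b) + (noise_var if i == j else 0.0)
          for j, b in enumerate(x_train)] for i, a in enumerate(x_train)]
    w = solve(k, y_train)
    return [sum(kernel(x, a) * wi for a, wi in zip(x_train, w)) for x in x_test]


def truth(x):
    """Assumed non-stationary function: upward trend plus local fluctuation."""
    return 0.5 * x + math.sin(x)


rng = random.Random(0)
x_train = [i * 0.5 for i in range(20)]         # training half, x < 10
y_train = [truth(x) + rng.gauss(0, 0.1) for x in x_train]
x_test = [10 + i * 0.5 for i in range(1, 21)]  # held-out half, x > 10

stationary = gp_predict(se, x_train, y_train, x_test)
non_stationary = gp_predict(lambda a, b: se(a, b) + linear(a, b),
                            x_train, y_train, x_test)


def rmse(predictions):
    """Root mean squared error against the noise-free function on the test half."""
    errors = [(p - truth(x)) ** 2 for p, x in zip(predictions, x_test)]
    return math.sqrt(sum(errors) / len(errors))


print("stationary RMSE:    ", rmse(stationary))
print("non-stationary RMSE:", rmse(non_stationary))
```

Away from the training data, the stationary posterior mean falls back to zero, while the added linear kernel lets the second model carry the trend into the region it has not seen.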

While the previous example only contains one-dimensional data (variation in y), we can expand this observation to a two-dimensional scenario like before. The stationary component is built on top of distances between observations on a two-dimensional plane, while the non-stationary component is built on top of the absolute distance of observations to a specific point on the plane. Thus, in addition to the two-dimensional stationary component described in Section 5.2, the two-dimensional non-stationary component of the Gaussian Process can be used to model the potential effect of the absolute distance between the point of origin (Samoa) and the current location of a language on its phoneme inventory size.
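In two dimensions, the combined kernel can be sketched as the sum of a stationary part based on the distance between two languages and a linear part based on each language's distance to the point of origin. The planar coordinates, the parameter values, and the exact form of the linear part are assumptions made for illustration; they are not taken from our implementation.

```python
import math


def non_stationary_cov(xy_i, xy_j, origin, length_scale=500.0, trend_scale=1e-3):
    """Covariance of two locations under a stationary-plus-linear kernel.

    All coordinates are planar (km). The parameter defaults are hypothetical.
    """
    # Stationary part: depends only on how far apart the two locations are.
    local = math.exp(-math.dist(xy_i, xy_j) ** 2 / (2 * length_scale ** 2))
    # Linear part: depends on how far each location is from the origin.
    r_i, r_j = math.dist(xy_i, origin), math.dist(xy_j, origin)
    trend = trend_scale ** 2 * r_i * r_j
    return local + trend


origin = (0.0, 0.0)  # stands in for the point of origin of the expansion
east, west = (2000.0, 0.0), (-2000.0, 0.0)

print(non_stationary_cov(origin, origin, origin))
print(non_stationary_cov(east, west, origin))
```

The two locations in the example are too far apart for the stationary part to link them, but they still covary through the linear part because both are equally distant from the origin.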

5.6 Taking stock

We have shown in this section that a complex model is necessary in order to capture (i) phylogenetic relations between the languages of the dataset, (ii) symmetric contact effects between Polynesian languages, (iii) asymmetric contact effects of Non-Polynesian languages on Polynesian, (iv) the probability of the latter by language, and (v) effects of directed expansion. This is necessary if we want to properly test for a potential serial founder effect.

Putting everything together, our model structure consists of three main components. The first component deals with asymmetric contact effects and corresponds to the unidirectional contact-effect estimation described in Section 5.3, where we use a multivariate probit model on the presence/absence of selected phonemes and phonological features in Non-Polynesian. This component includes a phylogenetic term (on each phoneme) and a stationary Gaussian Process (on each phoneme).

The second component corresponds to estimating phoneme inventory sizes in Polynesian. Phoneme inventory sizes correspond to count data, i.e., positive integers excluding zero in our case. The distribution of such count data can be captured by the Poisson distribution, which is why we use a series of Poisson models to predict the phoneme inventory sizes of Polynesian languages. Here, we include a phylogenetic term, a Gaussian Process (stationary or non-stationary) and a term based on the unidirectional contact-effect estimation from the first component to estimate the effect of Non-Polynesian on Polynesian. For this second component, we do not fit a single model including all terms. Instead, we fit a series of models including and excluding terms in order to gauge their importance and effects on the final model.
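How the terms of the second component combine can be sketched as a Poisson regression with a log link. The function and argument names are hypothetical, and all numeric values are invented; the sketch only shows the additive structure on the log scale and the likelihood of an observed count.

```python
import math


def expected_size(intercept, phylo, spatial, contact_slope, contact_prob):
    """Expected inventory size: additive terms on the log scale, then exp."""
    return math.exp(intercept + phylo + spatial + contact_slope * contact_prob)


def poisson_log_pmf(count, rate):
    """Log probability of observing `count` under a Poisson with mean `rate`."""
    return count * math.log(rate) - rate - math.lgamma(count + 1)


# Hypothetical language: baseline of 21 phonemes, small phylogenetic and
# spatial adjustments, and a moderate probability of contact-induced features.
rate = expected_size(intercept=math.log(21), phylo=0.05, spatial=-0.02,
                     contact_slope=0.4, contact_prob=0.6)

print("expected size:", rate)
for observed in (16, 21, 28, 40):
    print(observed, poisson_log_pmf(observed, rate))
```

The log link guarantees a positive expected size whatever the values of the individual terms. A plain Poisson distribution still assigns some probability to a count of zero, but for expected sizes around 20 this probability is negligible.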

The third component consists of the mixture model introduced in Section 5.4, which we use to allow for the influence of Non-Polynesian languages on Polynesian to vary across the different Polynesian languages in our dataset. More specifically, the mixture model allows for the variation in phoneme inventory sizes in our data to come from a mixture of two independent distributions, which we capture by the two models with and without unidirectional contact-effect estimation (m_s & m_{s_uni}). The mixture model predicts which proportions of mixture of the two models are most likely for each of the data points.

Table 4 shows a summary of all models fitted: m_s, m_ns, m_{s_uni}, m_{ns_uni}, and m_mix. All models include the respective terms of the second component; m_{s_uni} and m_{ns_uni} also include the first component.

Table 4:

Summary of all models fitted.

	unidirectional contact–effect estimation	(component 1)
m_s	phylogenetic term + stationary GP	(component 2)
m_ns	phylogenetic term + non-stationary GP
m_{s_uni}	phylogenetic term + stationary GP + unidirectional contact
m_{ns_uni}	phylogenetic term + non-stationary GP + unidirectional contact
m_mix	phylogenetic term + stationary GP + mixture (m_s & m_{s_uni})	(component 3)

It is important to note that we fitted the unidirectional contact effects simultaneously with the rest of the model, not sequentially. We built a single model with the first and second component, which we estimated jointly. That means all the terms (phylogenetic regression, stationary Gaussian Process, unidirectional contact-effect estimation, non-stationary Gaussian Process) are dependent on each other. This model structure is more complex than that of more common linear or generalized models used in quantitative typology. This complexity is nevertheless necessary for a serious attempt to test the effect of spatial expansion (serial founder effect), including the effect of neighboring languages (asymmetric contact), while also controlling for contact and phylogenetic relations between Polynesian languages.

6 Model results

This section presents the model results. We start by exploring the spatial patterns predicted by the four models m_s, m_ns, m_{s_uni}, and m_{ns_uni} in Section 6.1. This allows us to visually inspect the model predictions, and to compare them against each other as well as against our knowledge on language contact in the region from the literature. In Section 6.2, we then carry out a more principled comparison of model fit, which is important for understanding how well the four models capture the observed data. The results of this comparison facilitate an assessment of how important the directed expansion and the unidirectional contact effects from Non-Polynesian are in accounting for the variation in phoneme inventory size in Polynesian. We then discuss the results of the mixture model m_mix, i.e., the varying probability of unidirectional contact effects, separately in Section 6.3, as it is not directly comparable to the other models.

6.1 Predictions for spatial patterns

In this section, we explore the spatial effects that the different models predict. We use conditional effect plots to do so. A spatial conditional effect plot shows how the response variable (in this case phoneme inventory size) changes across space, given the model parameters. The plot is created by defining a rectangular grid of regularly spaced locations, and predicting the expected phoneme inventory size at each location. These predictions do not take into account other aspects of the model (i.e., we fix the non-spatial parameters), and should not be interpreted in absolute terms. In other words, we can only interpret the relative differences between predictions across space, rather than absolute phoneme inventory sizes. The predictions can be interpreted as the effect of the spatial component after controlling or accounting for all other predictors. Figures 9 and 10 show the observed inventory sizes as red dots, with larger dots representing larger inventories. The predictions are shown as colored areas ranging from dark blue / purple for smaller inventories to orange for larger inventories.
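The construction of such a plot can be illustrated with a minimal Python sketch. It assumes a hypothetical stand-in function, `toy_spatial_effect`, in place of a fitted spatial term; the function, its parameters, and the grid bounds are illustrative assumptions and do not come from the models fitted in this study. The sketch builds a rectangular grid of regularly spaced locations, evaluates the spatial term at each location, and centers the values so that only relative differences across space remain.

```python
import math

def make_grid(lon_min, lon_max, lat_min, lat_max, n_lon, n_lat):
    """Return regularly spaced (lon, lat) points covering a rectangle."""
    lon_step = (lon_max - lon_min) / (n_lon - 1)
    lat_step = (lat_max - lat_min) / (n_lat - 1)
    return [(lon_min + i * lon_step, lat_min + j * lat_step)
            for j in range(n_lat) for i in range(n_lon)]

def toy_spatial_effect(lon, lat):
    """Hypothetical stand-in for a fitted spatial term (NOT the paper's model)."""
    return 2.0 * math.exp(-((lon - 170.0) ** 2 + (lat + 15.0) ** 2) / (2 * 20.0 ** 2))

# Illustrative bounds only; longitudes run past 180 to avoid the date line.
grid = make_grid(150.0, 250.0, -45.0, 25.0, n_lon=5, n_lat=4)

# Evaluate the spatial term at every grid location.
effects = [toy_spatial_effect(lon, lat) for lon, lat in grid]

# Center the predictions: only differences across space are interpretable.
baseline = sum(effects) / len(effects)
relative = [e - baseline for e in effects]
```

In a real analysis, the stand-in function would be replaced by posterior predictions of the spatial component with all non-spatial parameters held fixed.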

Figure 9:

Spatial effects of m_s (left) and m_{s_uni} (right).

Figure 10:

Linear and non-linear Gaussian Process components of m_ns (left) and m_{ns_uni} (right).

We start by examining the conditional effects of m_s and m_{s_uni}, which both include a stationary Gaussian Process to account for symmetric contact between Polynesian languages; m_{s_uni} additionally controls for unidirectional contact effects. The conditional spatial effects of those two models are shown in Figure 9. Model m_s (left plot) finds a relatively weak areal pattern. Polynesian languages to the west (in Melanesia and central Polynesia) are estimated to have slightly larger phoneme inventories than the languages in eastern Polynesia, with a difference of about two phonemes. Since the model does not have any information about contact between Polynesian and Non-Polynesian languages, it attributes all of the spatial patterns to contact effects between Polynesian languages.

The right plot in Figure 9 shows the spatial effects predicted by m_{s_uni}. It includes a component to model unidirectional contact effects from Non-Polynesian on Polynesian, and it finds an areal pattern that is similar to, but weaker than, that of m_s. This is an important result. It means that a portion of the spatial variance found with m_s can actually be captured by contact with Non-Polynesian languages and does not simply correspond to a spatial effect within Polynesian languages. In other words, comparing the spatial effects of m_s and m_{s_uni} confirms that our technique for including unidirectional contact effects from Non-Polynesian languages reflects the spatial relations of Polynesian languages in a more realistic way, given what we know from the literature about their contact relations, especially with Melanesian languages.

We turn to models m_ns and m_{ns_uni} next, which include a non-stationary Gaussian Process to model directed expansion from Samoa.^[19] Model m_{ns_uni} additionally includes unidirectional contact effects from Non-Polynesian on Polynesian. Figure 10 shows the combination of the linear and non-linear components of the non-stationary Gaussian Process in the models m_ns (left) and m_{ns_uni} (right).^[20]

Looking at the spatial effects of m_ns on the left of Figure 10, we see that phoneme inventories in the east are predicted to be smaller with larger distances to the origin in Samoa. This is what we would expect given the serial founder effect. The effect is fairly strong, with a change of about 3–4 phonemes from Samoan to Hawai’ian and Rapanui, and a change of about 1–2 phonemes for most other eastern Polynesian languages. In the west, however, we find no difference between the origin in Samoa and western core Polynesian as well as Outlier Polynesian languages.

As we can see on the right plot in Figure 10, controlling for unidirectional contact effects from Non-Polynesian leads to a few minor changes in the structure of the expansion effects in addition to setting a different baseline. We now find a difference of one phoneme between Samoan and the remaining central Polynesian languages as well as Southern Outliers in New Caledonia, Vanuatu, and the Solomon Islands. The predicted difference between Samoa as the origin and eastern Polynesian languages is slightly higher at 2–4 phonemes.

Comparing the spatial effects of the non-stationary Gaussian Process shown in Figure 10 to the ones that are based on a stationary Gaussian Process in Figure 9, we see a stronger effect for the non-stationary Gaussian Process. In Figure 10, we see differences up to four phonemes for the eastern and the most remote languages (Hawai’ian and Rapanui). In Figure 9, where the spatial effects did not predict any expansion from the Polynesian origin around Samoa, we found no predicted difference between central and eastern Polynesian languages, with a difference of up to 1–3 phonemes in the west (Melanesia and Micronesia).

What this comparison shows is that these different models make substantially different predictions about the spatial patterns in phoneme inventory sizes in Polynesian. Comparing the predictions against the observed inventory sizes, Figures 9 and 10 suggest that the models using a stationary Gaussian Process only may fit the observed data somewhat better, with the largest inventories in the west and smaller ones in the east. We provide a more systematic comparison of the fit of the four models in Section 6.2.

The last important aspect related to the prediction of spatial patterns is the amount of certainty associated with the predicted expansion in models m_ns and m_{ns_uni}. In Figure 10, we only presented the means of the spatial predictions, as the plots were already two-dimensional. To visualize the model uncertainty, we can plot the change in expected phoneme inventory size depending on the distance to Samoa. This is shown in Figure 11 for m_ns (left) and m_{ns_uni} (right). The line corresponds to the mean effect, and the bands to the 95 % uncertainty intervals.

Figure 11:

Linear component of m_ns (left) and m_{ns_uni} (right) with uncertainty intervals.

The mean tendency is as expected under the serial founder effect hypothesis: phoneme inventory sizes become slightly smaller with increasing distance to Samoa. However, Figure 11 also shows extremely wide uncertainty intervals. In particular, in the model that includes unidirectional contact from Non-Polynesian, the mean effect is reduced to mostly noise. While there is an apparent tendency in the expected direction, this tendency is too weak to support the serial founder effect hypothesis or any other effect based on distance to the origin.
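How a mean effect and its 95 % uncertainty interval can be derived from posterior draws is sketched below in Python. The draws are simulated from an assumed normal distribution with made-up parameters; they are hypothetical and are not the posterior of m_ns or m_{ns_uni}. The helper names `quantile` and `effect_summary` are likewise introduced for illustration only.

```python
import random

def quantile(sorted_values, p):
    """Linearly interpolated quantile of an already sorted list (0 <= p <= 1)."""
    pos = p * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac

def effect_summary(slope_draws, distance):
    """Mean and 95 % interval of the linear effect at a given distance."""
    effects = sorted(s * distance for s in slope_draws)
    mean = sum(effects) / len(effects)
    return mean, quantile(effects, 0.025), quantile(effects, 0.975)

random.seed(1)
# Hypothetical posterior draws: change in inventory size per 1,000 km.
slope_draws = [random.gauss(-0.3, 0.5) for _ in range(4000)]

for distance in (1.0, 4.0, 8.0):  # distances in units of 1,000 km
    mean, low, high = effect_summary(slope_draws, distance)
    print(f"{distance:>4} | mean {mean:6.2f} | 95% interval [{low:6.2f}, {high:6.2f}]")
```

With draws like these, the mean effect is negative, but the interval spans zero at every distance. This is the situation described above: a tendency in the expected direction that is too uncertain to count as evidence.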

6.2 Model fit: how well do the models capture the observed phoneme inventory sizes?

We begin by comparing how the four models m_s, m_{s_uni}, m_ns, and m_{ns_uni} fit the observed data. This is important in order to understand which model is more suitable to capture the observed data, which in turn informs us which spatial processes are more or less likely to account for the observed variation in Polynesian phoneme inventory sizes.^[21] Specifically, comparing these four models helps us to understand the relevance of unidirectional contact effects from Non-Polynesian languages and of the Polynesian expansion, i.e., a potential serial founder effect. To compare the fit of the models, we can plot the posterior predictions of the models against the observed inventory sizes. Figures 12 and 13 show the predictions versus observations for the models m_s and m_{s_uni}.

Figure 12:

Predicted versus observed phoneme inventory size (m_s).

Figure 13:

Predicted versus observed phoneme inventory size (m_{s_uni}).

Both models include a stationary Gaussian Process to estimate the effects of contact within Polynesian and, in the case of m_{s_uni}, the effects of unidirectional contact from Non-Polynesian languages. The observed inventory sizes are shown in red, with dots for core Polynesian and triangles for the Outlier languages. The predictions are shown in the form of box plots: the central horizontal black line marks the median estimate, the blue boxes correspond to the range that 50 % of the predicted values fall into, and the whiskers indicate the maximum and minimum values.
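The summary statistics behind such a box plot can be computed from posterior predictive draws as in the following Python sketch. The draws and the function names `summarize_predictions` and `in_central_mass` are hypothetical and serve only to illustrate the median, the central 50 % range, and the minimum and maximum described above.

```python
import statistics

def summarize_predictions(draws):
    """Median, central 50 % range, and extremes of predictive draws."""
    q25, median, q75 = statistics.quantiles(draws, n=4)
    return {"min": min(draws), "q25": q25, "median": median,
            "q75": q75, "max": max(draws)}

def in_central_mass(observed, summary):
    """True if the observed value falls inside the central 50 % box."""
    return summary["q25"] <= observed <= summary["q75"]

# Hypothetical predictive draws for one language (made-up numbers).
draws = [18, 19, 20, 20, 21, 22, 22, 23, 27]
summary = summarize_predictions(draws)

print(summary)
print(in_central_mass(21, summary))  # observed value close to the predictions
print(in_central_mass(40, summary))  # observed value far from the predictions
```

Note that `statistics.quantiles` uses one particular interpolation rule by default; other plotting software may place the box edges slightly differently.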

From the wide whiskers in Figures 12 and 13 we can observe that both models have a very large amount of uncertainty in the posterior predictions. At the same time, we see that the central mass of the predictions contains the observed values for most languages, as the observed values fall within the blue boxes. In Figure 12, which shows the results of m_s with no control for unidirectional contact from Non-Polynesian, there are six languages for which the central mass of the predictions lies far from the real value. The languages are Anuta, Mele-Fila, Sikaiana, Takuu, Tuvalu, and West-Uvean. All six languages are located in close proximity to Non-Polynesian languages; we can see from Figure 13 that the model m_{s_uni}, which accounts for unidirectional contact effects from Non-Polynesian languages, does better in predicting the inventory sizes of those languages. We can thus conclude that Non-Polynesian languages likely have an impact on the phoneme inventory size of a number of Outlier Polynesian languages in Melanesia.^[22] Overall, the models fit the data relatively well, and most of the posterior mass is near the true value of the data. Mele-Fila is the only language for which the observed inventory size clearly lies outside of the central mass of the predictions, even when unidirectional contact effects from Non-Polynesian are included. As mentioned in Section 4, Mele-Fila has the largest phoneme inventory of all Polynesian languages in our dataset with 40 phonemes, which could contribute to the model underestimating its phoneme inventory size. The language includes a phonemic length distinction for both vowels and consonants and features a number of additional consonants that are uncommon in Polynesian: the affricates /tɕ, tɕː/, labialized /p^w, p^wː, m^w, m^wː/, as well as /r, rː/ (in addition to /l, lː/). The other language for which including contact effects from Non-Polynesian does not improve the predictions is Emae. Both models m_s and m_{s_uni} slightly overestimate its phoneme inventory size. This could be related to the close proximity of Emae to especially Mele-Fila but also West-Uvean, which have the two largest phoneme inventories in our dataset.

Figures 14 and 15 show the results for models m_ns and m_{ns_uni}, which include a non-stationary Gaussian Process and the unidirectional contact effect component in the case of m_{ns_uni}.

Figure 14:

Predicted versus observed phoneme inventory size (m_ns).

Figure 15:

Predicted versus observed phoneme inventory size (m_{ns_uni}).

In contrast to the previous two models, these two estimate the effect of the distance between a given Polynesian language and the Polynesian origin in Samoa.^[23] In other words, Figures 14 and 15 show the results of models that estimate a serial founder effect for phoneme inventory sizes in Polynesian. Comparing the model results to their respective stationary counterparts in Figures 12 and 13, we see that the non-stationary Gaussian Process component has little impact on model fit. Looking at the predictions from m_ns in Figure 14, which models the serial founder effect but no contact effects from Non-Polynesian, we see that the predictions for the six languages mentioned above have not noticeably improved.

We can further explore model fit by comparing the root mean square error (RMSE) for the models. In contrast to the comparison of model predictions against the observed values discussed above, this method quantifies the overall model fit, allowing us to compare how well the models capture the observed data as a whole. Simply put, RMSE values are non-negative, and the closer the value is to 0, the better the model fit. Table 5 shows the RMSE values for the four models discussed, ranked from the lowest to the highest RMSE. We can see a clear difference between the models that include the unidirectional contact term and those that do not. The two models m_{ns_uni} and m_{s_uni} show a better model fit than the models m_ns and m_s. This confirms the impression from the model predictions discussed above that including unidirectional contact effects from Non-Polynesian languages is important to account for the variation in phoneme inventory sizes in Polynesian. In addition, Table 5 suggests that there is only a very small advantage to including a non-stationary Gaussian Process in terms of model fit.

Table 5:

RMSE for the four models.

Model		RMSE
m_{ns_uni}	(non-stationary GP + unidirectional contact)	1.99
m_{s_uni}	(stationary GP + unidirectional contact)	2.15
m_ns	(non-stationary GP)	3.62
m_s	(stationary GP)	3.67
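For reference, the RMSE is the square root of the mean squared difference between observed and predicted values. The Python sketch below shows the computation on made-up numbers. It assumes that a single point prediction per language (for instance, the posterior mean) is compared with the observed inventory size; how the posterior predictions were summarized for Table 5 is not restated here, so this is an assumption.

```python
import math

def rmse(observed, predicted):
    """Root mean square error between observed values and point predictions."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical inventory sizes and point predictions (not the study's data).
observed = [18, 23, 28, 40]
predicted = [19.0, 22.0, 26.5, 34.0]

print(round(rmse(observed, predicted), 2))
```

Because errors are squared before averaging, a single badly predicted language with a very large inventory contributes disproportionately to the RMSE.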

To conclude this section, we compared the fit of the four models that capture the Polynesian expansion and the unidirectional contact effects from Non-Polynesian on Polynesian to different extents. Two different types of comparisons suggested that including unidirectional contact effects results in models that can capture the observed variation in phoneme inventory sizes better, especially those of the Outlier languages. Including the Polynesian expansion, on the other hand, did not seem to lead to a noticeable improvement of model fit.

6.3 Mixture model: better representation of unidirectional contact

In Section 6.2, we saw how including the unidirectional contact effects from Non-Polynesian on Polynesian improved model fit, suggesting that contact from Non-Polynesian is an important factor for predicting variation in Polynesian phoneme inventory sizes. Models m_{s_uni} and m_{ns_uni} assume that all Polynesian languages are subject to unidirectional contact effects from Non-Polynesian languages. As was briefly discussed in Section 4, however, we know from the literature that this mostly affects languages in western Polynesia and especially some of the Outlier languages spoken in Melanesia. For languages spoken in central and eastern Polynesia, such contact is much less likely. In other words, we know that the amount of contact with Non-Polynesian languages is not identical for all Polynesian languages in the dataset. We approach this issue with a so-called mixture model. As explained in Section 5.4, the idea of a mixture model is that we assume the response variable (phoneme inventory size) comes from two independent distributions. In our case, one distribution contains information about unidirectional contact effects, while the other distribution does not contain any information thereof. The probability of each observation then results from the mixture of both distributions with differing proportions.
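The core idea can be written down as a density that combines two component densities with a mixing proportion. The Python sketch below is a simplified illustration under stated assumptions: it uses normal component densities with made-up means and a shared standard deviation, which need not match the likelihood actually used for m_mix, and all function and parameter names are hypothetical.

```python
import math

def normal_pdf(y, mean, sd):
    """Density of a normal distribution at y."""
    return math.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def mixture_density(y, theta, mean_contact, mean_no_contact, sd):
    """Two-component mixture: theta weights the component with contact effects."""
    with_contact = theta * normal_pdf(y, mean_contact, sd)
    without_contact = (1 - theta) * normal_pdf(y, mean_no_contact, sd)
    return with_contact + without_contact

# Hypothetical values: an observed inventory of 30 phonemes, an expected size
# of 29 with contact effects and 22 without them.
for theta in (0.0, 0.5, 1.0):
    print(theta, mixture_density(30, theta, mean_contact=29, mean_no_contact=22, sd=3))
```

For an observation that is much better explained by the component with contact effects, larger values of the mixing proportion yield a higher density, which is what pulls the estimated proportion upwards for such languages.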

Figure 16 shows the estimated mixing proportions for the unidirectional contact-effects component of m_mix. Observations with high estimated mixing proportions, i.e., a high degree of influence from Non-Polynesian, are shown in red. The observations with low estimated mixing proportions or little influence from Non-Polynesian are colored in blue. It is important to note that the mixture model has no information about which languages are Outliers (or any other classification to capture more and less contact with Non-Polynesian languages). The results are striking for how well they reflect our knowledge on how contact with Non-Polynesian languages affects different (groups of) Polynesian languages.

Figure 16:

Mixing proportion for mixture model on unidirectional contact effects.

First, it is obvious that the higher mixing proportions for the unidirectional contact component are found among the Polynesian Outliers. The one exception to this overall picture is Tuvalu, which is not classified as an Outlier language but which is geographically (and genetically) very close to some of the Polynesian Outliers and likely serves as their origin (cf. Section 2.1).

Second, the language with the highest mixing proportion is Mele-Fila, which had intense contact with Non-Polynesian languages. Mele-Fila has the largest phoneme inventory of all Polynesian languages in the dataset, and as mentioned in Section 4.3, we know from the literature that long-term contact with Non-Polynesian languages played an important role in some of the innovations in Mele-Fila. Another language with one of the highest mixing proportions is West Uvean. As discussed in detail in Section 4.3, West Uvean has also been shown to have integrated both consonants and vowels into its phoneme inventory as the result of contact. Thus, in those two cases, the model picks up a strong contact signal that fits well with our knowledge from the literature.

Third, Nukuoro and Kapingamarangi are two Outlier languages spoken in Micronesia with relatively low mixing proportions. As was mentioned in Section 4.3, this reflects their high degree of geographic isolation. According to Clark (1994: 110), the closest Non-Polynesian language is spoken on the Mortlock islands, which are at a distance of about 225 km to Nukuoro and of about 475 km to Kapingamarangi. For Kapingamarangi and also for Sikaiana, the low mixing proportions estimated by the model likely also have another explanation. Both languages have fairly large phoneme inventories (28 and 30, respectively), but their main innovation is an extensive series of geminated consonants (cf. Table 2). Given that many of the northern Outliers feature additional geminated or aspirated consonants, the model can also capture these inventory sizes from the phylogenetic and contact relations within Polynesian languages. This reflects what we know from the literature. As was discussed in Section 4.1, geminated and aspirated consonants have been argued to be a shared innovation within Polynesian languages, as the result of inheritance and/or convergence, following the loss of vowels in reduplication contexts.

We also see comparatively low mixing proportions for Rennell-Bellona and Futuna-Aniwa. As mentioned in Section 4.1, Rennell-Bellona features the voiced plosives /b, g/ that have likely resulted from long-term contact with Melanesian languages. Besides this innovation, however, the phoneme inventory of Rennell-Bellona is fairly small (23). Futuna-Aniwa also has a very small phoneme inventory of 18. It does not have long vowels or consonants and its only innovative consonant is /l/, which is likely an innovation rather than a continuation from Proto-Polynesian (cf. Section 4.3).

We can also use the mixture model to explore to what extent it is able to distinguish between core and Outlier Polynesian languages. In this case, we find a very clear delimiting threshold of 0.31.^[24] All but one of the languages with mixing proportions above this threshold are Outliers, and all but a few of the languages below this threshold are core Polynesian. The languages that would not be correctly classified with this threshold are: core Polynesian Tuvalu (with mixing proportion > 0.31), and Outliers Rennell-Bellona, Futuna-Aniwa, Kapingamarangi, and Sikaiana (with mixing proportions < 0.31).
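Applying such a threshold amounts to a simple classification rule, sketched below in Python. The language labels and mixing proportions are invented placeholders (A to F), not the estimates from m_mix, and the function name `misclassified` is hypothetical.

```python
def misclassified(languages, threshold):
    """Names of languages whose Outlier status disagrees with the threshold rule.

    `languages` maps a name to (mixing proportion, is_outlier).
    A language is predicted to be an Outlier if its proportion exceeds the threshold.
    """
    return [name for name, (proportion, is_outlier) in languages.items()
            if (proportion > threshold) != is_outlier]

# Hypothetical toy data: (mixing proportion, known Outlier status).
toy_languages = {
    "A": (0.80, True),
    "B": (0.55, True),
    "C": (0.40, False),  # core language above the threshold
    "D": (0.20, True),   # Outlier below the threshold
    "E": (0.10, False),
    "F": (0.05, False),
}

print(misclassified(toy_languages, threshold=0.31))
```

The known Outlier status enters only in this after-the-fact comparison; as noted above, the mixture model itself has no access to it.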

To conclude this section, we showed how mixture models can be used to assess to what extent the different Polynesian languages are affected by contact with Non-Polynesian languages. We found that m_mix captured the contact effects from Non-Polynesian on the phoneme inventories of Polynesian languages rather well compared to what is known from the literature. Although we only showed a brief example of using mixture models, it is an important approach to scenarios in which we need to identify the strength or relevance of contact between source languages and a number of potential target languages.

7 Discussion

The results of this case study have several important consequences and implications regarding the various aspects of modeling spatial and contact relations between languages in general, and the hypothesis of a serial founder effect with phoneme inventory sizes in particular. We begin by discussing the consequences of our results with respect to a potential serial founder effect for Polynesian phoneme inventories in Section 7.1. We then turn to the consequences for phoneme inventory sizes and contact in general in Section 7.2 and to the broader implications for modeling space and languages in Section 7.3.

7.1 No evidence for a serial founder effect for phoneme inventory size in Polynesian

One of the motivations for this study was the hypothesis of a serial founder effect for phoneme inventory sizes. Evidence for this hypothesis has been debated in the literature (e.g., Atkinson 2011a; Cysouw et al. 2012a; Deshpande et al. 2009; Fort and Pérez-Losada 2016; Pérez-Losada and Fort 2018; Wang et al. 2012). The two most important methodological shortcomings of previous approaches that have argued for serial founder effects in the distribution of phoneme inventory sizes are (i) that they examine the effect on a global scale and (ii) that they do not include sufficient controls for a number of potentially confounding factors. In this study, we proposed a method to circumvent both shortcomings. We restricted our dataset to Polynesian languages, since their phylogenetic relations, spatial expansion from a common point of origin as well as contact with other languages are much better understood than they are for many other language families and areas of the world. In addition, we proposed advanced statistical methods to model spatial expansion (Section 5.5) while controlling for phylogenetic relations (Section 5.1), symmetric contact between Polynesian languages (Section 5.2), as well as asymmetric contact effects from Non-Polynesian languages on Polynesian (Section 5.3) and their probability (Section 5.4).

As was shown in Section 6, the results of our models do not support a serial founder effect for phoneme inventory size in Polynesia. Section 6.1 showed the predicted spatial effects of four different models that controlled, to different extents, for directed expansion and asymmetric contact effects from Non-Polynesian languages. The spatial predictions that reflected the observed phoneme inventory sizes best (larger inventories in the west, and smaller ones in central and eastern Polynesia) came from the two models that included potential contact effects within Polynesian and from Non-Polynesian, but no directed expansion (cf. Figure 9). This finding was supported by the model fit comparisons in Section 6.2. We saw that including a term to capture unidirectional contact effects from Non-Polynesian languages improved model fit, but including a term to model the directed expansion of Polynesian languages did not. We can therefore say with a high degree of certainty that our results do not support the serial founder effect hypothesis with respect to phoneme inventory size in Polynesian languages.

The results of this case study in the rather “controlled” context of Polynesian languages also have implications for previous global studies and their results. As we saw in Section 6, if we include contact with Non-Polynesian languages in the form of phoneme or feature borrowing into Polynesian, it can account for much of the variation of phoneme inventory sizes in Polynesian across space. This suggests that language contact is an important confound not accounted for in previous quantitative approaches to serial founder effects with phoneme inventory sizes. While it is possible that we do not see any serial founder effect at the relatively shallow time scale of the Polynesian expansion, studies on larger regions and deeper time depths need to include more detailed and careful spatial controls.

7.2 Linguistic implications for contact and phoneme inventory size

An important finding for Polynesian linguistics is a recurring pattern of Tuvalu in the models. Despite being a core Polynesian language, Tuvalu generally patterns with the Outlier languages with respect to its phoneme inventory (size). We observed this when examining model fit (Section 6.2) as well as in the mixing proportions of estimated contact from Non-Polynesian (Section 6.3). This result is in accordance with the literature about the special situation of Tuvalu with respect to Outlier Polynesian languages (cf. Sections 2.1 and 4).

Another important linguistic result is that we found strong evidence for phoneme inventory size to be affected by contact with neighboring languages. We clearly saw that Polynesian languages in Melanesia and western Polynesia with more contact to Non-Polynesian languages have larger phoneme inventories than languages further east. This finding confirms insights from the literature that long-term language contact with a high degree of bilingualism likely leads to the borrowing of phonemes and thus to an increase in inventory size (cf. Matras 2009; Nichols 1992; Trudgill 2004).

In particular, our model results reflected a strong influence of contact from Non-Polynesian on Polynesian phoneme inventories. With regard to Mele-Fila, Clark (1986) argued that many of the innovative phonemic contrasts (besides grammatical and lexical borrowings) result from the long-term contact with Non-Polynesian languages on Efate such as North/South Efate and Namakura. The other language for which our models suggested comparatively strong contact effects from Non-Polynesian is West Uvean. Also in this case, Ozanne-Rivierre (1994) showed that the innovative consonants and vowels have been borrowed from Iaai, English, and French (cf. Section 4.3). Therefore, our models, despite working with fairly simple linguistic information, are able to capture complex contact situations in a realistic way. This means that the methods described here can be a useful tool to explore contact situations from a quantitative modeling perspective in cases for which we may have less prior linguistic information.

7.3 Implications for modeling spatial relations between languages

One of the objectives of this study was to show the complexity of modeling spatial relations between languages. To capture these complex interactions, we introduced five different modeling components:

phylogenetic regression to capture the relations between the languages in the dataset,
a stationary Gaussian Process to model non-linear spatial correlations between observations, which can be used to capture symmetric contact effects between languages,
unidirectional contact-effect estimation to model contact effects from one set of source languages on another set of potential target languages,
a mixture model to estimate the probability of unidirectional contact effects in a set of potential target languages,
a non-stationary Gaussian Process to model linear spatial correlations between observations, which can be used to capture the effects of directed expansion from a point of origin, i.e., a serial founder effect.
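The difference between a stationary and a non-stationary covariance function can be illustrated with a minimal Python sketch. It uses two textbook kernels in one dimension: a squared-exponential kernel, which depends only on the distance between two locations, and a linear kernel, which depends on where the locations lie relative to an origin. These forms and their parameter values are assumptions made for illustration and are not necessarily the exact specification of the Gaussian Processes in our models.

```python
import math

def stationary_kernel(x1, x2, alpha=1.0, rho=2.0):
    """Squared-exponential kernel: depends only on the distance |x1 - x2|."""
    return alpha ** 2 * math.exp(-((x1 - x2) ** 2) / (2 * rho ** 2))

def linear_kernel(d1, d2, sigma=1.0):
    """Linear kernel on distances to an origin: depends on absolute position."""
    return sigma ** 2 * d1 * d2

# Two pairs of locations with the same separation but different positions.
near_pair = (1.0, 3.0)
far_pair = (11.0, 13.0)

print(stationary_kernel(*near_pair), stationary_kernel(*far_pair))  # identical
print(linear_kernel(*near_pair), linear_kernel(*far_pair))          # differ
```

Shifting both locations by the same amount leaves the stationary covariance unchanged, whereas the linear kernel grows with distance from the origin. This dependence on position is what allows a directed trend away from a point of origin to be represented.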

While the models with unidirectional contact-effect estimation showed promising results, we can also observe that there were some aspects of the contact situation that our models did not capture. The models struggled most when predicting the phoneme inventory sizes of Polynesian languages in close proximity to Non-Polynesian languages. This suggests that there are additional spatial and contact relations that would need to be integrated into our models.

We showed how unidirectional contact-effect estimation can be implemented in statistical modeling. Our motivation for examining unidirectional contact effects was mainly a practical one, as our focus was on modeling the variation in phoneme inventory sizes in Polynesian languages and not in Non-Polynesian languages. There is, however, also linguistic evidence to support the contention that contact effects between Melanesian languages and Polynesian Outliers have in fact been asymmetric, with a stronger influence from Melanesian languages on Polynesian than vice versa. For instance, Clark (1994) analyzes linguistic contact between the Non-Polynesian language Efate and the two Polynesian Outliers Mele-Fila and Emae in central Vanuatu. He finds a clear asymmetry in that Efate has influenced Mele-Fila and Emae to a larger extent on various linguistic levels than vice versa. To explain this, Clark (1994) distinguishes between “cultural” and “intimate” borrowing, with cultural borrowing being more superficial and intimate borrowing requiring a tighter and sustained socio-cultural interaction between the speaker populations. He argues that Melanesian languages show fewer borrowings from Polynesian because Polynesians have been a minority since the beginning of the contact situation in the Efate region. This would then put Polynesian speakers under more pressure to interact with Melanesian speakers to trade or engage in other social relations such as marriage than vice versa. Consequently, he shows that Polynesian features more intimate borrowings from Melanesian, and we rather find cultural Polynesian borrowings in Melanesian (Clark 1986: 341). This goes to show that our method of unidirectional contact-effect estimation is useful to model cases of asymmetric language contact where we have linguistic evidence for more influence from one language to the other.

As for the mixture model technique, which we used to estimate the probability of unidirectional contact effects from Non-Polynesian to Polynesian languages, our results suggested that it correctly captures the difference between Polynesian languages that are strongly influenced by Non-Polynesian languages and those that are not, as well as the degree of the contact effect. For the most part, the results of this method were in accordance with the literature on the influence of Non-Polynesian languages on Polynesian languages. While further testing of this technique is still needed, it is a promising approach to detecting contact effects between groups of languages. In theory, it is possible to build a model that knows exactly which observations belong to the Outliers and which do not. This would provide the model with important information regarding which languages are highly impacted by unidirectional contact effects from Non-Polynesian and which are not.^[25] However, this is not particularly useful. It is much more insightful to have the model estimate this information from the data itself and to compare the results with our previous assumptions based on evidence from the literature.

The mixture model method can be seen as a computational implementation of the idea proposed by Di Garbo and Napoleão de Souza (2023). They suggest a method for finding contact which relies on looking at a target language (the language with potential contact effects), a neighbor language (from which the target language might have borrowed material), and a (set of) benchmark language(s) related to the target language. In their method, if the target and neighbor language share a feature not shared by the target and benchmark language, one can conclude that borrowing between target and neighbor is likely. In our case, all Polynesian languages are target and benchmark at the same time, while the Non-Polynesian languages represent the neighbors.
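The comparison logic of this target, neighbor, and benchmark design can be paraphrased as a set operation, shown in the Python sketch below. The feature labels are placeholders, the function name `likely_borrowed` is hypothetical, and treating a feature as inherited whenever any benchmark language has it is our simplifying assumption rather than part of the original proposal.

```python
def likely_borrowed(target, neighbor, benchmarks):
    """Features shared by target and neighbor but absent from all benchmarks."""
    shared_with_neighbor = target & neighbor
    # Assumption: a feature found in any benchmark counts as potentially inherited.
    found_in_benchmarks = set().union(*benchmarks) if benchmarks else set()
    return shared_with_neighbor - found_in_benchmarks

# Placeholder feature sets (not real language data).
target = {"f1", "f2", "f3"}
neighbor = {"f3", "f4"}
benchmarks = [{"f1", "f2"}, {"f1"}]

print(likely_borrowed(target, neighbor, benchmarks))
```

The mixture model replaces this categorical decision with an estimated proportion for each language, so that the strength of the contact signal can vary gradually.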

Taking these aspects together, our case study showed that modeling spatial dynamics in relation to linguistic structure is not straightforward. Perhaps the main result is that there simply is no single best model. At most, one could argue that there is no strong justification for a model with an expansion component, but this is only so from a predictive perspective. Depending on the research question, it might make sense to include an expansion component even if it does not help in predicting new observations. This would be the case, for instance, if the researcher wanted to “control for” all potential spatial confounds, including potential expansion effects.

One general implication is that there is no one silver bullet to “control for” contact or space. Our data is small enough that we are able to include much more complex spatial relations than adding some type of area as a group-level effect. However, the methods described in this study are not easy to scale up in order to handle large, global databases like WALS or Grambank. Our data is, by comparison with other global typological samples, fairly contained and manageable. Yet, we saw that there are difficult spatial relations which cannot be simply modeled using a single spatial component, but which require at least four different spatial components. And there are additional aspects that we would ideally want to capture better in a statistical model in order to represent spatial relations between languages in a more realistic way. For instance, we know that a more accurate spatial representation of single languages would not be a single point coordinate but a polygon that covers the entire area in which this language is spoken. Ideally, languages that are in contact should also be represented as overlapping polygons, given that contact stands for groups of speakers who use both languages.

To conclude, we showed that even a fairly contained area and research question can quickly lead to very complex statistical modeling if we want to represent the linguistic realities as accurately as possible. Other regions of the world may exhibit even more complex spatial dependencies. Researchers thus need to take special care when modeling spatial structures.

8 Concluding remarks

This paper had two main objectives: to show the complexities of spatial relations in statistical modeling, and to provide a more comprehensive quantitative approach to serial founder effects with Polynesian phoneme inventories. Regarding the first objective, we proposed five different statistical modeling techniques to capture phylogenetic and spatial relations between languages, using the distribution of phoneme inventory sizes in Polynesian as a test case. The five techniques were: (i) phylogenetic regression to model the genetic relations between languages, (ii) a stationary Gaussian Process to capture symmetric contact effects between languages, (iii) unidirectional contact-effect estimation to model contact effects from one set of source languages (Non-Polynesian) on another set of potential target languages (Polynesian), (iv) a mixture model to estimate the probability of unidirectional contact effects in a set of potential target languages, and (v) a non-stationary Gaussian Process to capture the effects of directed spatial expansion from a common point of origin (serial founder effect).

Regarding the second objective, we have shown that there is no clear evidence for a serial founder effect with respect to phoneme inventory sizes in Polynesian. The method introduced in this paper can nevertheless be applied to other potential cases in which spatial expansion may play a more important role. We did find moderate evidence for unidirectional effects of Non-Polynesian languages on Polynesian in that the presence of certain phonemes or phonological features in Non-Polynesian languages led to larger phoneme inventories in neighboring Polynesian languages. Similarly, the method to model unidirectional language contact could be useful for other types of scenarios in which we expect some degree of asymmetry in contact, e.g., between languages with different degrees of prestige.

Finally, the take-home message of this study is that modeling spatial relations between languages in a more realistic way is very complex. There is no single technique that will account or control for different types of contact and spatial phenomena, and it is necessary to test and combine different methods if we want to do the linguistic reality justice in statistical models.

Corresponding author: Matías Guzmán Naranjo, Albert-Ludwigs-Universität Freiburg, Freiburg im Breisgau, Germany, E-mail: mguzmann89@gmail.com

Funding source: Deutsche Forschungsgemeinschaft

Award Identifier / Grant number: 504155622

Acknowledgments

We thank all participants of SLE in Bucharest, as well as members of the linguistics department in Freiburg for their useful comments and suggestions. This project was funded by the Emmy Noether project ‘Bayesian modelling of spatial typology’ (project number 504155622).

Appendix A

Another way of assessing model fit is to visualize the mean error (ME) for each observation. We use ME values here instead of RMSE values, since ME values indicate not only the magnitude of the errors but also their direction. Positive ME values stand for over-predictions (too large predicted inventory sizes), and negative ME values correspond to under-predictions (too small predicted inventory sizes).
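The difference between ME and RMSE can be made concrete with a small Python sketch. It assumes that the error for an observation is computed as predicted minus observed, averaged over a set of model predictions (e.g., posterior draws), which matches the sign convention stated above; the function names and the numbers are hypothetical.

```python
from math import sqrt
from statistics import mean


def mean_error(predictions, observed):
    """Signed mean error: positive for over-prediction, negative for under-prediction."""
    return mean(p - observed for p in predictions)


def rmse(predictions, observed):
    """Root mean squared error: magnitude only, the sign of the errors is lost."""
    return sqrt(mean((p - observed) ** 2 for p in predictions))


# Hypothetical predicted inventory sizes for one language with 12 phonemes.
observed = 12
draws_over = [14, 15, 16]   # predictions that are too large
draws_under = [8, 9, 10]    # predictions that are too small

me_over = mean_error(draws_over, observed)
me_under = mean_error(draws_under, observed)
print(me_over, me_under)
print(rmse(draws_over, observed), rmse(draws_under, observed))
```

Both sets of predictions yield the same RMSE, but their ME values have opposite signs, which is the information needed to distinguish over-predictions from under-predictions on a map.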

Figure 17 shows ME values for each observation for m_s, which captures neither unidirectional contact effects nor the expansion, and which had the worst model fit according to the RMSE scores.^[26] Core Polynesian languages are represented by circles, and Outlier languages by triangles. Over-predictions are shown in red, and under-predictions in black. We observe that most of the errors occur in the western Polynesian languages, with the most extreme ME values for Takuu, Sikaiana, Tikopia, Mele-Fila, and Tuvalu. Interestingly, we find both over-predictions (red) and under-predictions (black). Apart from the Outlier languages, Tuvalu is one of the languages with the largest ME magnitudes.

Figure 17:

Mean errors (ME) for m_s.

Appendix B

In Section 6.2 we looked at model fit, i.e., how well the models captured the data that they were trained on. This does not tell us how well the models perform on new data, though. Depending on the research question, it can be very important to assess how well the model handles new data and how generalizable it is. In our particular case, we do not necessarily require our models to deal with new data, since we apply them to a fairly exhaustive dataset in the region. Checking model performance can still help to detect overfitting. Overfitting can happen when the model becomes too complex in relation to the number of observations, so that it also captures random variation in the data based on individual observations, and fails to generalize to new patterns. This leads to a situation where we have to interpret the model results with care, which is why model performance is a further important step in comparing the different models fitted to capture the variation in phoneme inventory sizes in Polynesian.

To assess the performance of the four models m_s, m_{s_uni}, m_ns, and m_{ns_uni}, we performed leave-one-out cross-validation (CV). This means that we leave one observation out, fit the model on all other observations, and try to predict the left-out observation. This is then repeated for all observations. The results of the cross-validation are given in Table 6. The second column shows the difference in Expected Log Predictive Density (ΔELPD) between the models. The absolute ELPD values are irrelevant; it is the relative difference between models that can be interpreted. The best-performing model is set to 0, and all other ELPD values are given in relation to the best-performing model. The larger the difference, the worse the model’s performance. To interpret the difference between ELPD values, the third column of Table 6 shows the standard error of ΔELPD. A commonly suggested threshold is that ΔELPD should be at least four times as large as its standard error. Only then can we be reasonably confident that there is a real difference between the models and not just chance. Because ELPD values are hard to interpret, we also provide the cross-validated RMSE of each model.
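The leave-one-out procedure itself can be sketched in a few lines of Python. To keep the example self-contained, it uses a deliberately trivial stand-in model that predicts the mean of the training observations; this is an assumption for illustration and not one of the Bayesian models compared here. The function names and the toy inventory sizes are likewise hypothetical.

```python
from math import sqrt
from statistics import mean


def loo_cv_rmse(values, fit_predict):
    """Leave-one-out CV: hold out each observation in turn, fit on the rest,
    predict the held-out value, and summarize the errors as an RMSE."""
    errors = []
    for i, held_out in enumerate(values):
        training = values[:i] + values[i + 1:]  # all observations except i
        prediction = fit_predict(training)
        errors.append(prediction - held_out)
    return sqrt(mean(e ** 2 for e in errors))


def mean_model(training):
    """Stand-in model: predict the mean of the training data."""
    return mean(training)


inventory_sizes = [10, 12, 14]  # hypothetical phoneme inventory sizes
cv_rmse = loo_cv_rmse(inventory_sizes, mean_model)
print(cv_rmse)
```

With real models, each refit is expensive, so in practice leave-one-out quantities such as ELPD are often approximated rather than computed by refitting the model once per observation.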

Table 6:

Model performance.

Model	ΔELPD	SE (ΔELPD)	RMSE (CV)
m_s (stationary GP)	0.0	0.0	6.10
m_ns (non-stationary GP)	−0.5	0.5	6.07
m_{s_uni} (stationary GP + unidirectional contact)	−5.8	2.2	6.23
m_{ns_uni} (non-stationary GP + unidirectional contact)	−6.7	2.4	6.21

The ELPD results show that there is very little difference in terms of model performance between the four models. The models m_s and m_ns perform slightly better than the other two models, which include unidirectional contact from Non-Polynesian languages. The differences are, however, small relative to their standard errors, and none of them reaches four times its standard error. Therefore, we do not have strong evidence for a substantial difference in model performance. With respect to the non-stationary Gaussian Process that models the Polynesian expansion, this means that we do not have evidence supporting a serial founder scenario for phoneme inventory size in Polynesian. Regarding unidirectional contact effects from Non-Polynesian, the ELPD results suggest that including those effects does not help with out-of-sample predictions, even though we saw in Section 6.2 that they improved model fit for the training data.
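The threshold check can be reproduced directly from the ΔELPD and standard error values reported in Table 6. The following Python sketch is only an illustration of the four-standard-errors rule of thumb; the dictionary layout and the function name `clearly_different` are assumptions.

```python
# (ΔELPD, SE of ΔELPD) as reported in Table 6; m_s is the reference model.
table6 = {
    "m_s": (0.0, 0.0),
    "m_ns": (-0.5, 0.5),
    "m_s_uni": (-5.8, 2.2),
    "m_ns_uni": (-6.7, 2.4),
}


def clearly_different(delta_elpd, se, factor=4.0):
    """True if |ΔELPD| is at least `factor` times its standard error."""
    return se > 0 and abs(delta_elpd) >= factor * se


results = {
    name: clearly_different(delta, se) for name, (delta, se) in table6.items()
}
ratios = {
    name: abs(delta) / se for name, (delta, se) in table6.items() if se > 0
}
print(results)
print(ratios)
```

None of the three comparisons with the reference model passes the check: the absolute differences amount to roughly one to three standard errors, which is below the factor of four.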

In terms of RMSE values, Table 6 shows that the errors of the model predictions on out-of-sample observations are about two to three times as large as the errors for the observed data in Table 5 from Section 6.2. This means that although the models are able to track the spatial structures in the data, these spatial structures carry relatively little predictive information for new observations. The implication for Polynesia is that the amount of variance explained by the spatial distribution and contact between languages is small. Spatial relations, whether contact within Polynesian languages, contact from Non-Polynesian languages, or expansion, are weak predictors of phoneme inventory size.

References

Anderson, Victoria & Yuko Otsuka. 2006. The phonetics and phonology of “definitive accent” in Tongan. Oceanic Linguistics 45(1). 21–42. https://doi.org/10.1353/ol.2006.0002.

Atkinson, Quentin. 2011a. Linking spatial patterns of language variation to ancient demography and population migrations. Linguistic Typology 15(2). 321–332. https://doi.org/10.1515/lity.2011.022.

Atkinson, Quentin. 2011b. Phonemic diversity supports a serial founder effect model of language expansion from Africa. Science 332. 346–349. https://doi.org/10.1126/science.1199295.

Bakker, Peter. 2004. Phoneme inventories, language contact, and grammatical complexity: A critique of Trudgill. Linguistic Typology 8(3). https://doi.org/10.1515/lity.2004.8.3.368.

Bauer, Winifred. 1993. Maori. London: Routledge.

Becker, Laura, Matías Guzmán Naranjo & Samira Ochs. 2023. Socio-linguistic effects on conditional constructions: A quantitative typological study. In Silvia Ballarè & Guglielmo Inglese (eds.), Sociolinguistic and typological perspectives on language variation, 121–154. Berlin: De Gruyter. https://doi.org/10.1515/9783110781168-005.

Bellwood, Peter. 1979. Man’s conquest of the Pacific. New York: Oxford University Press.

Bentz, Christian, Annemarie Verkerk, Douwe Kiela, Felix Hill & Paula Buttery. 2015. Adaptive communication: Languages with more non-native speakers tend to have fewer word forms. PLoS One 10(6). e0128254. https://doi.org/10.1371/journal.pone.0128254.

Besnier, Niko. 2000. Tuvaluan: A Polynesian language of the Central Pacific. London: Routledge.

Betti, Lia, François Balloux, William Amos, Tsunehiko Hanihara & Andrea Manica. 2009. Distance from Africa, not climate, explains within-population phenotypic diversity in humans. Proceedings of the Royal Society B: Biological Sciences 276(1658). 809–814. https://doi.org/10.1098/rspb.2008.1563.

Biggs, Bruce. 1971. The languages of Polynesia. In Thomas Sebeok (ed.), Linguistics in Oceania, 466–506. Berlin: De Gruyter. https://doi.org/10.1515/9783111418827-015.

Biggs, Bruce. 1978. The history of Polynesian phonology. In Stephen Wurm & Louis Carrington (eds.), Second international conference on Austronesian linguistics: Proceedings, 691–716. Canberra: Pacific Linguistics.

Blust, Robert. 2013. The Austronesian languages, rev. edn. Canberra: Asia-Pacific Linguistics.

Bradley, Paul, Usama Fayyad & Cory Reina. 2000. Clustering very large databases using EM mixture models. In Proceedings 15th international conference on pattern recognition. ICPR-2000, vol. 2, 76–80. https://doi.org/10.1109/ICPR.2000.906021.

Carson, Mike. 2012. Recent developments in prehistory: Perspectives on settlement chronology, inter-community relations, and identity formation. In Richard Feinberg & Richard Scaglion (eds.), Polynesian outliers: The state of the art, 27–48. Pittsburgh, PA: University of Pittsburgh.

Clark, Ross. 1986. Linguistic convergence in Central Vanuatu. In Paul Geraghty & Louis Carrington (eds.), FOCAL II: Papers from the fourth international conference on Austronesian linguistics, 333–342. Canberra: Pacific Linguistics.

Clark, Ross. 1994. The Polynesian outliers as a locus of language contact. In Tom Dutton & Darrell Tryon (eds.), Language contact and change in the Austronesian world, 109–140. Berlin: De Gruyter. https://doi.org/10.1515/9783110883091.109.

Clark, Ross. 1995. Mele-Fila. In Darrell Tryon (ed.), Comparative Austronesian dictionary: An introduction to Austronesian studies, 947–950. Berlin: De Gruyter.

Creanza, Nicole, Merritt Ruhlen, Trevor J. Pemberton, Noah A. Rosenberg, Marcus W. Feldman & Sohini Ramachandran. 2015. A comparison of worldwide phonemic and genetic variation in human populations. Proceedings of the National Academy of Sciences 112(5). 1265–1272. https://doi.org/10.1073/pnas.1424033112.

Croft, William. 2016. Comparative concepts and language-specific categories: Theory and practice. Linguistic Typology 20(2). 377–393. https://doi.org/10.1515/lingty-2016-0012.

Cysouw, Michael, Dan Dediu, Steven Moran & Hui Li. 2012a. Comment on “Phonemic diversity supports a serial founder effect model of language expansion from Africa”. Science 335(6069). 657. https://doi.org/10.1126/science.1207846.

Cysouw, Michael, Dan Dediu & Steven Moran. 2012b. Supporting online material for: Comment on “Phonemic diversity supports a serial founder effect model of language expansion from Africa”. https://doi.org/10.1126/science.1208841.

Dempwolff, Otto. 1929. Das austronesische Sprachgut in den polynesischen Sprachen. In Festbundel, uitgegeven door het Koninklijk Bataviaasch Genootschap van Künsten en Wetenschappen bij gelegenheid van zijn 150 jarig bestaan, vol. 1, 62–86.

Deshpande, Omkar, Serafim Batzoglou, Marcus W. Feldman & L. Luca Cavalli-Sforza. 2009. A serial founder effect model for human settlement out of Africa. Proceedings of the Royal Society B: Biological Sciences 276(1655). 291–300. https://doi.org/10.1098/rspb.2008.0750.

Di Garbo, Francesca & Ricardo Napoleão de Souza. 2023. A sampling technique for worldwide comparisons of language contact scenarios. Linguistic Typology 27(3). 553–589. https://doi.org/10.1515/lingty-2022-0005.

Donohue, Mark & Johanna Nichols. 2011. Does phoneme inventory size correlate with population size? Linguistic Typology 15(2). 161–170. https://doi.org/10.1515/lity.2011.011.

Dryer, Matthew. 2018. On the order of demonstrative, numeral, adjective and noun. Language 94(4). 798–833. https://doi.org/10.1353/lan.0.0232.

Elbert, Samuel. 1953. Internal relationships of Polynesian languages and dialects. Southwestern Journal of Anthropology 9(2). 147–173. https://doi.org/10.1086/soutjanth.9.2.3628573.

Elbert, Samuel. 1965. Phonological expansion in outlier Polynesia. Lingua 14. 431–442. https://doi.org/10.1016/0024-3841(65)90055-0.

Elbert, Samuel & Albert Schütz. 1988. Echo of a culture: A grammar of Rennell and Bellona. Honolulu: University of Hawaii Press.

Fenk-Oczlon, Gertraud & Jürgen Pilz. 2021. Linguistic complexity: Relationships between phoneme inventory size, syllable complexity, word and clause length, and population size. Frontiers in Communication 6. 626032. https://doi.org/10.3389/fcomm.2021.626032.

Fort, Joaquim & Joaquim Pérez-Losada. 2016. Can a linguistic serial founder effect originating in Africa explain the worldwide phonemic cline? Journal of the Royal Society Interface 13(117). 20160185. https://doi.org/10.1098/rsif.2016.0185.

Geraghty, Paul. 1983. The history of the Fijian languages. Honolulu: University of Hawaii Press.

Goodwin, Ian, Stuart Browning & Atholl Anderson. 2014. Climate windows for Polynesian voyaging to New Zealand and Easter Island. Proceedings of the National Academy of Sciences 111(41). 14716–14721. https://doi.org/10.1073/pnas.1408918111.

Green, Roger. 1966. Linguistic subgrouping within Polynesia: The implications for prehistoric settlement. Journal of the Polynesian Society 75(1). 6–38.

Green, Roger. 1981. Location of the Polynesian homeland: A continuing problem. In Jim Hollyman & Andrew Pawley (eds.), Studies in Pacific languages and cultures in honor of Bruce Biggs, 133–158. Auckland: Linguistic Society of New Zealand.

Green, Roger & Marshall Weisler. 2002. The Mangarevan sequence and dating of the geographic expansion into southeast Polynesia. Asian Perspectives 41(2). 213–241. https://doi.org/10.1353/asi.2003.0006.

Greenhill, Simon & Ross Clark. 2011. POLLEX-online: The Polynesian lexicon project online. Oceanic Linguistics 50(2). 551–559. https://doi.org/10.1353/ol.2011.0014.

Guzmán Naranjo, Matías & Laura Becker. 2022. Statistical bias control in typology. Linguistic Typology 26(3). 605–670. https://doi.org/10.1515/lingty-2021-0002.

Guzmán Naranjo, Matías & Miri Mertner. 2023. Estimating areal effects in typology: A case study of African phoneme inventories. Linguistic Typology 27(2). 455–480. https://doi.org/10.1515/lingty-2022-0037.

Hartmann, Frederik. 2022. Methodological problems in quantitative research on environmental effects in phonology. Journal of Language Evolution 7(1). 95–119. https://doi.org/10.1093/jole/lzac003.

Hartmann, Frederik & Gerhard Jäger. 2023. Gaussian process models for geographic controls in phylogenetic trees. Open Research Europe 3(57). 57. https://doi.org/10.12688/openreseurope.15490.1.

Haspelmath, Martin. 2010. Comparative concepts and descriptive categories in crosslinguistic studies. Language 86(3). 663–687. https://doi.org/10.1353/lan.2010.0021.

Haspelmath, Martin. 2018. How comparative concepts and descriptive linguistic categories are different. In Daniël van Olmen, Tanja Mortelmans & Frank Brisard (eds.), Aspects of linguistic variation, 83–114. Berlin: De Gruyter. https://doi.org/10.1515/9783110607963-004.

Haudricourt, André. 1961. Richesse en phonèmes et richesse en locuteurs. L’Homme 1. 5–10. https://doi.org/10.3406/hom.1961.366337.

Haudricourt, André. 1968. La langue de Gomen et la langue de Touho en Nouvelle-Calédonie. Bulletin de la Société de Linguistique de Paris 63(1). 218–235.

Hay, Jennifer & Laurie Bauer. 2007. Phoneme inventory size and population size. Language 83(2). 388–400. https://doi.org/10.1353/lan.2007.0071.

Holman, Eric W., Christian Schulze, Dietrich Stauffer & Søren Wichmann. 2007. On the relation between structural diversity and geographical distance among languages: Observations and computer simulations. Linguistic Typology 11(2). 393–421. https://doi.org/10.1515/lingty.2007.027.

Irwin, Geoffrey. 1994. The prehistoric exploration and colonisation of the Pacific. Cambridge: Cambridge University Press.

Jaeger, Florian, Peter Graff, William Croft & Daniel Pontillo. 2011. Mixed effect models for genetic and areal dependencies in linguistic typology. Linguistic Typology 15(2). 281–319. https://doi.org/10.1515/lity.2011.021.

Jennings, Jesse. 1979. The prehistory of Polynesia. Cambridge, MA: Harvard University Press. https://doi.org/10.4159/harvard.9780674181267.

Kahn, Jennifer & Yosihiko Sinoto. 2017. Refining the Society Islands cultural sequence: Colonization phase and developmental phase coastal occupation on Mo’orea island. Waka Kuaka 126(1). 33–60. https://doi.org/10.15286/jps.126.1.33-60.

Kennett, Douglas, Brendan Culleton, Atholl Anderson & John Southon. 2012. A Bayesian AMS 14C chronology for the colonisation and fortification of Rapa Island. In Atholl Anderson & Douglas Kennett (eds.), Taking the high ground: The archaeology of Rapa, a fortified island in remote East Polynesia, 189–202. Canberra: ANU Press. https://doi.org/10.22459/TA37.11.2012.11.

Kieviet, Paulus. 2017. A grammar of Rapa Nui. Berlin: Language Science Press.

Kirch, Patrick. 1984a. The evolution of the Polynesian chiefdoms. Cambridge: Cambridge University Press.

Kirch, Patrick. 1984b. The Polynesian Outliers: Continuity, change, and replacement. Journal of Pacific History 19. 224–238. https://doi.org/10.1080/00223348408572496.

Kirch, Patrick. 1996. Lapita and its aftermath: The Austronesian settlement of Oceania. Transactions of the American Philosophical Society 86(5). 57–70. https://doi.org/10.2307/1006621.

Kirch, Patrick. 1997. The Lapita peoples: Ancestors of the Oceanic world. Oxford: Blackwell.

Kirch, Patrick. 2017. On the road of the winds: An archaeological history of the Pacific Islands before European contact. Oakland: University of California Press. https://doi.org/10.1525/9780520968899.

Kirch, Patrick & Roger Green. 1992. History, phylogeny, and evolution in Polynesia. Current Anthropology 33(1). 161–186. https://doi.org/10.1086/204023.

Kirch, Patrick & Roger Green. 2001. Hawaiki, ancestral Polynesia: An essay in historical anthropology. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511613678.

Krupa, Viktor. 1973. Polynesian languages: A survey of research. The Hague, Paris: Mouton. https://doi.org/10.1515/9783110899283.

Lindsay, Bruce. 1995. Mixture models: Theory, geometry, and applications. Hayward, CA: Institute of Mathematical Statistics. https://doi.org/10.1214/cbms/1462106013.

Manica, Andrea, William Amos, François Balloux & Tsunehiko Hanihara. 2007. The effect of ancient population bottlenecks on human phenotypic variation. Nature 448(7151). 346–348. https://doi.org/10.1038/nature05951.

Marck, Jeff. 2000. Topics in Polynesian language and culture history. Canberra: Pacific Linguistics.

Martinsson-Wallin, Helène & Susan J. Crockford. 2001. Early settlement of Rapa Nui (Easter Island). Asian Perspectives 40(2). 244–278. https://doi.org/10.1353/asi.2001.0016.

Matras, Yaron. 2008. The borrowability of structural categories. In Yaron Matras & Jeanette Sakel (eds.), Grammatical borrowing in cross-linguistic perspective, 31–74. Berlin: De Gruyter. https://doi.org/10.1515/9783110199192.31.

Matras, Yaron (ed.). 2009. Language contact. Cambridge: Cambridge University Press.

Milner, George. 1958. Aspiration in two Polynesian languages. Bulletin of the School of Oriental and African Studies, University of London 21(1). 368–375. https://doi.org/10.1017/s0041977x00072748.

Moran, Steven, Elad Eisen, Dmitry Nikolaev & Eitan Grossman. 2024. Operationalizing borrowability: Phonological segments as a case study. Language 100(4). 671–698. https://doi.org/10.1353/lan.2024.a947038.

Moran, Steven, Daniel McCloy & Richard Wright. 2012. Revisiting population size vs. phoneme inventory size. Language 88(4). 877–893. https://doi.org/10.1353/lan.2012.0087.

Mosel, Ulrike & Even Hovdhaugen. 1992. Samoan reference grammar. Oslo: Scandinavian University Press.

Næss, Åshild & Even Hovdhaugen. 2011. A grammar of Vaeakau-Taumako. Berlin: De Gruyter. https://doi.org/10.1515/9783110238273.

Nichols, Johanna. 1992. Linguistic diversity in space and time. Chicago: The University of Chicago Press. https://doi.org/10.7208/chicago/9780226580593.001.0001.

Ozanne-Rivierre, Françoise. 1994. Laai loanwords and phonemic changes in Fagauvea. In Tom Dutton & Darrell Tryon (eds.), Language contact and change in the Austronesian world, 523–549. Berlin: De Gruyter. https://doi.org/10.1515/9783110883091.523.

Ozanne-Rivierre, Françoise. 1995. Structural changes in the languages of Northern New Caledonia. Oceanic Linguistics 34(1). 45–72. https://doi.org/10.2307/3623111.

Pawley, Andrew. 1966. Polynesian languages: A subgrouping based on shared innovations in morphology. Journal of the Polynesian Society 75(1). 39–64.

Pawley, Andrew. 1967. The relationships of Polynesian Outlier languages. Journal of the Polynesian Society 76(3). 259–296.

Pawley, Andrew. 2007. The origins of early Lapita culture: The testimony of historical linguistics. In Stuart Bedford, Christophe Sand & Sean Connaughton (eds.), Oceanic explorations: Lapita and western Pacific settlement. Canberra: ANU Press. https://doi.org/10.22459/TA26.2007.02.

Pawley, Andrew & Roger Green. 1973. Dating the dispersal of the Oceanic languages. Oceanic Linguistics 12(1/2). 1–67. https://doi.org/10.2307/3622852.

Pawley, Andrew & Roger Green. 1984. The Proto-Oceanic language community. Journal of Pacific History 19(3). 123–146. https://doi.org/10.1080/00223348408572489.

Pérez-Losada, Joaquim & Joaquim Fort. 2018. A serial founder effect model of phonemic diversity based on phonemic loss in low-density populations. PLoS One 13(6). e0198346. https://doi.org/10.1371/journal.pone.0198346.

Pericliev, Vladimir. 2004. There is no correlation between the size of a community speaking a language and the size of the phonological inventory of that language. Linguistic Typology 8(3). 376–383. https://doi.org/10.1515/lity.2004.8.3.376.

Pierce, Amanda A., Myron P. Zalucki, Marie Bangura, Milan Udawatta, Marcus R. Kronforst, Sonia Altizer, Juan Fernández Haeger & Jacobus C. de Roode. 2014. Serial founder effects and genetic differentiation during worldwide range expansion of monarch butterflies. Proceedings of the Royal Society B: Biological Sciences 281(1797). https://doi.org/10.1098/rspb.2014.2230.

Ranacher, Peter, Nico Neureiter, Rik van Gijn, Barbara Sonnenhauser, Anastasia Escher, Robert Weibel, Pieter Muysken & Balthasar Bickel. 2021. Contact-tracing in cultural evolution: A Bayesian mixture model to detect geographic areas of language contact. Journal of the Royal Society Interface 18(181). 1–15. https://doi.org/10.1098/rsif.2020.1031.

Rasmussen, Carl. 1999. The infinite Gaussian mixture model. In Sara A. Solla, Todd K. Leen & Klaus-Robert Müller (eds.), Advances in neural information processing systems (NIPS 1999), 12.

Rasmussen, Carl. 2004. Gaussian processes in machine learning. In Olivier Bousquet, Ulrike von Luxburg & Gunnar Rätsch (eds.), Advanced lectures on machine learning, 63–71. Berlin: Springer. https://doi.org/10.1007/978-3-540-28650-9_4.

Rasmussen, Carl & Christopher Williams. 2006. Gaussian processes for machine learning. Cambridge, MA: MIT Press. https://doi.org/10.7551/mitpress/3206.001.0001.

Rice, Keren. 2004. Language contact, phonemic inventories, and the Athapaskan language family. Linguistic Typology 8(3). 321–343. https://doi.org/10.1515/lity.2004.8.3.321.

Rivierre, Jean-Claude. 1993. Tonogenesis in New Caledonia. Oceanic Linguistics Special Publications 24. 155–173.

Rolle, Nicholas. 2009. The phonetic nature of Niuean vowel length. Toronto Working Papers in Linguistics 31.

Taumoefolau, Melanaite. 2002. Stress in Tongan. MIT Working Papers in Linguistics 44. 341–354.

Trudgill, Peter. 2004. Linguistic and social typology: The Austronesian migrations and phoneme inventories. Linguistic Typology 8(3). 305–320. https://doi.org/10.1515/lity.2004.8.3.305.

Trudgill, Peter. 2011. Social structure and phoneme inventories. Linguistic Typology 15(2). 155–160. https://doi.org/10.1515/lity.2011.010.

Tryon, Darrell & B. D. Hackman. 1983. Solomon Islands languages: An internal classification. Canberra: Pacific Linguistics.

Urban, Matthias & Steven Moran. 2021. Altitude and the distributional typology of language structure: Ejectives and beyond. PLoS One 16(2). e0245522. https://doi.org/10.1371/journal.pone.0245522.

Verkerk, Annemarie & Francesca Di Garbo. 2022. Sociogeographic correlates of typological variation in northwestern Bantu gender systems. Language Dynamics and Change 1. 1–69. https://doi.org/10.1163/22105832-bja10017.

Walworth, Mary. 2014. Eastern Polynesian: The linguistic evidence revisited. Oceanic Linguistics 53(2). 256–272. https://doi.org/10.1353/ol.2014.0021.

Wang, Chuan-Chao, Qi-Liang Ding, Huan Tao & Hui Li. 2012. Comment on “Phonemic diversity supports a serial founder effect model of language expansion from Africa”. Science 335(6069). 657. https://doi.org/10.1126/science.1208841.

Ward, Gerard, John Webb & Michael Levison. 1973. The settlement of the Polynesian Outliers: A computer simulation. Journal of the Polynesian Society 82(4). 330–342.

Watson, Catherine, Margaret Maclagan, Jeanette King, Ray Harlow & Peter Keegan. 2016. Sound change in Māori and the influence of New Zealand English. Journal of the International Phonetic Association 46(2). 185–218. https://doi.org/10.1017/s0025100316000025.

Wichmann, Søren, Taraka Rama & Eric W. Holman. 2011. Phonological diversity, word length, and population sizes across languages: The ASJP evidence. Linguistic Typology 15(2). 177–197. https://doi.org/10.1515/lity.2011.013.

Williams, Christopher & Carl Rasmussen. 2006. Gaussian processes for machine learning, vol. 2. Cambridge, MA: MIT Press.

Wilmshurst, Janet, Terry Hunt, Carl Lipo & Atholl Anderson. 2011. High-precision radiocarbon dating shows recent and rapid initial human colonization of East Polynesia. Proceedings of the National Academy of Sciences 108(5). 1815–1820. https://doi.org/10.1073/pnas.1015876108.

Wilson, William. 2012. Whence the East Polynesians? Further linguistic evidence for a Northern Outlier source. Oceanic Linguistics 51(2). 289–359. https://doi.org/10.1353/ol.2012.0014.

Received: 2024-01-31

Accepted: 2025-03-06

Published Online: 2025-05-23

This work is licensed under the Creative Commons Attribution 4.0 International License.

https://doi.org/10.1515/ling-2024-0016
