Article Open Access

The realization of tones in spontaneous spoken Taiwan Mandarin: a corpus-based survey and theory-driven computational modeling

Published/Copyright: February 9, 2026

Abstract

A growing body of literature has demonstrated that semantics can codetermine fine phonetic detail. However, the complex interplay between phonetic realization and semantics remains understudied, particularly in pitch realization. The current study investigates the tonal realization of Mandarin disyllabic words with all 20 possible combinations of two tones, as found in a corpus of Taiwan Mandarin spontaneous speech. We made use of Generalized Additive Mixed Models (GAMMs) to model f0 contours as a function of a series of predictors, including gender, tonal context, tone pattern, duration, word position, bigram probability, speaker, and word. In the GAMM analysis, word and sense emerged as crucial predictors of f0 contours, with effect sizes that exceed those of tone pattern. For each word token in our dataset, we then obtained a contextualized embedding by applying the GPT-2 large language model to the context of that token in the corpus. We show that the pitch contours of word tokens can be predicted to a considerable extent from these contextualized embeddings, which approximate token-specific meanings in contexts of use. The results of our corpus study show that meaning in context and phonetic realization are far more entangled than standard linguistic theory predicts.

1 Introduction

Mandarin Chinese is a tone language with four lexical tones: a high level tone (T1), a rising tone (T2), a low falling-rising tone known as a dipping tone (T3), and a falling tone (T4). Mandarin Chinese also has a so-called neutral or floating tone (T0), which is often described as unstressed, weaker in intensity, and shorter in duration (Chao 1968). The present study reports the results of an investigation of the realization of the Mandarin tones in a corpus of Taiwan Mandarin spontaneous speech. We first present our corpus-based findings and then present a theory-driven explanation of our findings using the Discriminative Lexicon Model (Baayen et al. 2019; Heitmeier et al. 2025).

The corpus that we made use of was compiled by Fon (2004), originally with the aim of clarifying the influence of Southern Min on Mandarin Chinese as spoken in Taiwan. In what follows, we refer to this corpus as the Corpus of Spontaneous Taiwan Mandarin. The focus of our study is on the realization in this corpus of the tones in words consisting of two syllables. The tonal realization of disyllabic words has been studied before in laboratory speech. Xu (1997) examined the pitch contours of the 16 combinations of the 4 standard tones realized on the two syllables of /ma-ma/, embedded in carrier sentences, and produced by male speakers of Beijing Mandarin. Factors that are known to codetermine the realization of tones, such as speaking rate and the tones on adjacent words, were carefully controlled for. This study showed that even in laboratory speech, the tones of the single-syllable constituents are often realized somewhat differently from their canonical forms. For instance, a rising tone followed by a falling tone (T2-T4) was observed to be realized as a fall, followed by a rise, and concluded with a fall.

To our knowledge, there currently are no studies that address the tonal realizations of all tonal combinations for disyllabic words in spontaneous conversation. It is well known that spontaneous speech can differ markedly from formal speech. Given that in spontaneous speech, words are often realized with various reduced forms (see, e.g., Ernestus 2000; Johnson 2004; Chung 2006, for Dutch, English, and Mandarin Chinese, respectively), it is an open question to what extent the canonical four tones of Mandarin are preserved in spontaneous speech. This is one reason why we carried out a detailed investigation of the realization of tone in disyllabic words as found in the Corpus of Spontaneous Taiwan Mandarin. Importantly, we considered not only the 16 combinations of tones studied by Xu (1997) but also the 4 combinations of a standard tone followed by a neutral tone (T1-T0, T2-T0, T3-T0, and T4-T0).

The second reason we carried out this corpus survey is that previous corpus-based research provides strong evidence that the realization of words’ tones, as represented by their f0 (pitch) contours, is only in part determined by the canonical tones of the constituent syllables, and that, surprisingly, words’ meanings play a much more important role in shaping how the tones are actually realized (Chuang et al. 2023, 2025; Jin et al. 2025; Lu et al. 2024). In Section 2, we provide further details on these findings and also point to several other studies indicating that meaning and phonetic form are far more entangled than is generally assumed.

Here, we note that if indeed fine semantic detail is reflected in fine phonetic detail, this challenges influential axioms of linguistic theory, such as the arbitrariness of the linguistic sign (De Saussure 1966) and the dual articulation of language (Martinet 1965). Research on sound symbolism has provided evidence for nonarbitrary associations between form and meaning across languages. One of the most well-known cross-linguistic phenomena is the kiki/bouba effect, in which pseudowords like kiki are associated with angular shapes and bouba with round shapes (Bremner et al. 2013; Ćwiek et al. 2022; Maurer et al. 2006). Studies of the kiki/bouba effect using natural languages have argued that both segmental and suprasegmental information are necessary to drive the effect (Dingemanse et al. 2016; Thompson et al. 2021). Systematic relations between form and meaning have also been documented for perceptual dimensions such as size, emotion, and intensity (Adelman et al. 2018; Imai et al. 2008; Shinohara and Kawahara 2010). A recent finding by de Varda and Marelli (2025) suggests that iconicity ratings may be produced by taking into account perceptual information from multiple modalities, instead of being exclusively tied to phonological resemblance.

A recent theory of the lexicon and lexical processing that rejects the above-mentioned axioms and that is well-aligned with the body of work on sound symbolism and form-meaning iconicity is the Discriminative Lexicon Model (DLM; Baayen et al. 2019; Heitmeier et al. 2025). This model represents both words’ forms and their meanings as high-dimensional numeric vectors, and posits functions that map form vectors onto meaning vectors for comprehension, and meaning vectors onto form vectors for production. According to this model, the relation between form and meaning is accounted for by these end-to-end mappings, instead of being mediated by hierarchies of latent form units such as stems and morphemes. In other words, the DLM does not work with a semantics-free phonological system and a separate combinatorial system, as posited by the axiom of the dual articulation of language.

In the present study, we zoom in on only a small part of the production process as conceptualized by the DLM, and ask whether it is possible to start out with a word’s meaning vector (using context-specific embeddings from distributional semantics and Large Language Models) and to predict that word’s pitch contour using a general mapping from semantic vectors to pitch contours. In Section 4, we show that this is indeed possible with an accuracy that is surprisingly far above chance level. We will also show that the canonical tone pattern of a two-syllable word can be predicted from the centroid of the embeddings of the words sharing that canonical tone pattern. Our results raise many questions, for which, as will become clear in the general discussion, we only have tentative answers.

The remainder of this paper is structured as follows. Section 2 introduces the many factors that codetermine how tones are realized and also provides an overview of previous research on isomorphies between semantics and phonetic realization. Section 3 introduces the corpus that we investigated and provides details on data preprocessing and the statistical method that we used to analyze the corpus data. Section 3.4 reports our results: across all 20 tone patterns, words’ meanings provide a surprisingly good window on their pitch contours. Section 4 presents our theory-driven computational modeling study, showing that token-specific pitch contours can be predicted from token-specific embeddings calculated based on their discourse context. Finally, Section 5 presents our thoughts on the implications of our findings.

2 Semantics and phonetic realization

2.1 Spoken duration and articulation

Evidence is accumulating that subtle differences in meaning can be reflected in the fine phonetic details of how words are actually realized in corpora of natural speech, including aspects such as spoken word duration (Gahl and Baayen 2024), segment duration (Plag et al. 2017), and tongue position (Saito 2024).

Heterographic homophones are words with the same pronunciation but different spellings and meanings, such as time and thyme. For a long time, homophones were thought to sound identical (see, e.g., Jescheniak and Levelt 1994). However, Gahl (2008), using the Switchboard corpus (Godfrey et al. 1992), reported that heterographic homophones such as time and thyme tend to have different acoustic durations, with more frequent homophones (time) being pronounced with shorter durations than their less frequent homophonic counterparts (thyme). Lohmann (2018) similarly observed that the duration of words such as cut depends on whether they are used as nouns or verbs. Both studies explain these effects in terms of how frequency of use affects lexical access in speech production. However, Gahl and Baayen (2024) reported that computational modeling with the Discriminative Lexicon Model provided strong evidence that the meanings of English homophones (represented by embeddings) are strong codeterminants of their spoken word duration, even when word frequency is controlled for. They argued that a powerful predictor of a homophone’s spoken word duration is the degree of support it receives from the semantics, such that greater semantic support predicts longer spoken word duration.

In addition to durational differences at the word level, durational differences have also been observed at the phonemic level in corpus studies, particularly for the realization of word-final /s/ or /z/ (henceforth referred to as S) in English. In the Buckeye corpus (Pitt et al. 2005), word-final S has been found to vary in duration depending on its morphological function: nonmorphemic S is pronounced longer than plural S, which, in turn, is pronounced longer than clitic S (Plag et al. 2017; Tomaschek et al. 2021; Zimmermann et al. 2016). Furthermore, Plag et al. (2020) found that genitive plural S showed significantly longer durations than plural S. Schmitz (2022) utilized a pseudo-word paradigm to demonstrate that the morphological category of word-final S (nonmorphemic > plural > clitic) influences its phonetic realization.

The relationship between semantics and phonetic realization has also been demonstrated beyond durational differences. Drager (2011) found that the pronunciation of the English word like varies according to its discourse or grammatical meanings, not only in the duration of the consonants but also in the degree of diphthongization of the vowel. Furthermore, in line with Gahl and Baayen (2024), Saito (2024) and Saito et al. (2024) reported for the KEC corpus of German spontaneous speech (Arnold and Tomaschek 2016), which also registers tongue movements using electromagnetic articulography, that greater semantic support leads to a lower position of the tongue tip for the vowel /a/, indicating hyperarticulation.

2.2 Tone in connected speech

The preceding section reviewed recent evidence that semantics and phonetic realization are entangled to a greater extent than has often been assumed. In this section, we zoom in on how word meaning affects the realization of tone. To do so, we first need to provide some further background on the factors that have already been reported to codetermine the realization of tone.

It is well known that the way in which tones are realized in connected speech differs from their canonical realization. How tones are realized has been described as depending on the properties of the segments in the syllable that carries a given tone (Ho 1976a; Ohala and Eukel 1976; Xu and Xu 2003). Tonal variation in connected speech has also been reported to be shaped by the tones of adjacent words (tonal coarticulation; Xu 1997), by speaking rate (Xu and Sun 2002), by the interaction between lexical tones and intonation (Ho 1976a; Wu et al. 2020), and by a speaker’s individual speaking style (Stanford 2016).

At the socio-geographic level, cross-dialectal research has reported that different varieties of Mandarin exhibit varying tone inventories (Chang 2010; Zhao 2023). For the present study, we note that in Standard Chinese, the realization of a neutral tone following a given lexical tone has been reported to be largely determined by this preceding tone. Furthermore, the f0 contour of a neutral tone has been claimed to approach a low pitch target by the end of the carrying syllable (Xu 2024). However, in Taiwan Mandarin, the behavior of the neutral tone has been reported to be different. It can either be indistinguishable from one of the four canonical lexical tones, or it can be realized as a static mid-low pitch target (Huang 2018).

From the above overview, it will be clear that the realization of tone is codetermined by a multitude of different factors. A newcomer in this arena is word meaning. Chuang et al. (2025) studied the pitch contours of disyllabic words with an initial rising tone followed by a falling tone (henceforth, the T2-T4 tone pattern). This study used a Generalized Additive Mixed Model (GAMM; Wood 2017) to decompose an observed pitch contour into separate component contours, capturing the effects over time of predictors such as speech rate, neighboring tones, and segmental properties. Chuang et al. (2025) report that the GAMMs provide strong support for word-specific pitch contour components, while controlling for other variables such as segments, gender, speaker, speech rate, and the tones of adjacent words. They also show that the statistical evidence is even stronger for sense-specific pitch contours, which suggests that these effects are semantic in nature. The importance of words’ meanings has been replicated for Mandarin disyllabic words with the T2-T3 and T3-T3 tone patterns by Lu et al. (2024), and for monosyllabic Mandarin words by Jin et al. (2025). For disyllabic words with the T2-T3 and T3-T3 tone patterns, the variable importance of words’ meanings was on a par with that of tonal context (tone sandhi), the other most important predictor of words’ pitch contours.

To illustrate the challenges that an analysis of tones in natural speech has to face, consider Figure 1, which displays the f0 contours of a selection of tokens of word types with a falling tone followed by a rising tone (the T4-T2 tone pattern) in the Corpus of Spontaneous Taiwan Mandarin (Fon 2004). In the left panel, the pitch contours of six tokens from different word types are presented. All words have the same canonical tone pattern: T4-T2. Token XMC_GY_4119_不能 (bu4neng2, “cannot”) (indicated by light blue) shows an initial sharp f0 rise, followed by a fall, and then a shallow rise. Token XMC_GY_8107_問題 (wen4ti2, “problem”) (indicated by purple) has a much lower initial f0 than the other tokens. The initial f0 of token XMC_GY_1025_後來 (hou4lai2, “later”) (indicated by red) is unavailable due to the unvoiced initial /h/.[1] In the right panel of Figure 1, we present four tokens of the same word type 幸福 (xing4fu2, “happiness”). One of these tokens is also presented in the left panel (indicated by brown). The four tokens of 幸福 also exhibit considerable variability in their f0 contours.

Figure 1: 
A selection of tokens in spoken Taiwan Mandarin. Left panel: six tokens representing six different word types, all sharing the tone pattern T4-T2 (a falling tone followed by a rising tone). The tokens are 後來 (hou4lai2, “later”), 幸福 (xing4fu2, “happiness”), 去年 (qu4nian2, “last year”), 不能 (bu4neng2, “cannot”), 自然 (zi4ran2, “nature”), 問題 (wen4ti2, “problem”). Right panel: four tokens representing the word type 幸福 (xing4fu2, “happiness”). All f0 contours shown here were produced by the same female speaker. The vertical lines indicate the normalized syllable boundary for each token. The durations (in seconds) are shown in parentheses after each token name in the legend.

In what follows, we take on the challenge of modeling the realization of tone, taking into account the many factors reported to codetermine pitch contours, such as gender, speaker, neighboring tones, speech rate, word position, and bigram probability. Following up on earlier work (Chuang et al. 2025; Jin et al. 2025; Lu et al. 2024), the “pitch signatures” of individual words are of primary interest.

Two hypotheses guide our research.

  1. The meanings of words codetermine the phonetic details of how the tones of these words are produced.

  2. The pitch contours of word tokens as found in spontaneous Mandarin conversations can be predicted from token-specific meaning vectors with above-chance accuracy using computational modeling with the DLM.

Importantly, end-to-end modeling with the DLM does not require abstract units for syllables and segments. The second hypothesis, therefore, makes the strong empirical claim that for predicting the details of the phonetic realization of f0, such abstract units are not required.

In the next section, we describe the data that we collected from the Corpus of Spontaneous Taiwan Mandarin and present our statistical analyses. Section 4 complements this exploratory part of our study with theory-driven computational modeling.

3 Data collection and statistical analysis

3.1 The corpus

The data used in the present study come from the Taiwan Mandarin Spontaneous Conversation Corpus (Fon 2004). This corpus contains 30 h of speech from 55 native speakers of Taiwan Mandarin (31 female and 24 male, aged between 20 and 60). In unstructured interviews, participants were encouraged to speak freely, instead of being guided by a standardized set of questions. As a result, the corpus consists of naturally occurring speech with a diverse and varied set of words across speakers.

The corpus was transcribed in traditional Chinese characters at the word level. The speech data were segmented at both the syllable and word levels. Forced alignment was first performed, and the results were later manually reviewed by native Taiwan Mandarin speakers with a background in phonetics. In the current study, we followed the transcriptions and segmentation provided in the corpus.

3.2 Data selection

Disyllabic words instantiating all 20 tone patterns were extracted for analysis (see column 1 in Table 1). The original dataset comprises 93,701 tokens, representing 7,526 unique word types. Table 1 presents the counts of tokens and word types associated with each tone pattern. Among these, the T4-T4 pattern is the most frequent in disyllabic words, both token-wise and type-wise. The four tone patterns featuring a neutral tone in the second syllable (T1-T0, T2-T0, T3-T0, and T4-T0) are represented by the lowest numbers of types. There are also relatively few words with T3 (see Wu et al. 2021, for similar observations for journalistic speech).

Table 1:

Number of tokens and words grouped by tone pattern in the conversational Taiwan Mandarin corpus.

Tone pattern Tokens Word types Examples
1 T1-T1 3,501 459 應該 (ying1gai1, “should”)
2 T1-T2 3,725 458 當然 (dang1ran2, “of course”)
3 T1-T3 2,313 333 根本 (gen1ben3, “at all”)
4 T1-T4 7,524 706 接觸 (jie1chu4, “to touch”)
5 T1-T0 3,034 83 他們 (ta1men0, “they”)
6 T2-T1 2,763 286 其他 (qi2ta1, “others”)
7 T2-T2 3,043 369 同學 (tong2xue2, “classmate”)
8 T2-T3 4,539 249 結果 (jie2guo3, “result”)
9 T2-T4 9,237 687 學校 (xue2xiao4, “school”)
10 T2-T0 7,010 64 什麼 (shen2me0, “what”)
11 T3-T1 2,655 252 老師 (lao3shi1, “teacher”)
12 T3-T2 3,465 289 感覺 (gan3jue2, “feeling”)
13 T3-T3 3,896 276 了解(liao3jie3, “to know”)
14 T3-T4 7,256 595 可是 (ke3shi4, “but”)
15 T3-T0 3,295 50 我們 (wo3men0, “we”)
16 T4-T1 3,007 400 那些 (na4xie1, “those”)
17 T4-T2 3,978 451 後來 (hou4lai2, “later”)
18 T4-T3 3,302 419 父母 (fu4mu3, “parents”)
19 T4-T4 13,174 989 社會 (she4hui4, “society”)
20 T4-T0 2,984 111 爸爸 (ba4ba0, “daddy”)
Total 93,701 7,526

In our dataset, we took into account the tone sandhi for 不 (bu4) and 一 (yi1). Thus, T2 is assigned to 不 (bu4) when followed by T4, such as 不要 (bu2yao4, “no”). T2 is assigned to 一 (yi1) when followed by T4, such as 一定 (yi2ding4, “must”), and T4 is assigned to 一 (yi1) when followed by other tones (T1, T2, and T3), such as 一天 (yi4tian1, “one day”) and 一點 (yi4dian3, “a little bit”).
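
The sandhi adjustments for 不 and 一 described above can be sketched as a simple lookup. The function below is our own illustration (the function name and integer tone coding are ours, not part of the corpus pipeline):

```python
def apply_sandhi(syllable, tone, next_tone):
    """Surface tone for 不 (bu4) and 一 (yi1) given the tone of the next
    syllable. Tones are coded as integers, with 0 for the neutral tone.
    A minimal sketch of the two sandhi rules described in the text; other
    sandhi processes (e.g., T3-T3 sandhi) are deliberately not handled."""
    if syllable == "不" and tone == 4 and next_tone == 4:
        return 2                   # bu4 -> bu2 before T4, e.g. 不要 bu2yao4
    if syllable == "一" and tone == 1:
        if next_tone == 4:
            return 2               # yi1 -> yi2 before T4, e.g. 一定 yi2ding4
        if next_tone in (1, 2, 3):
            return 4               # yi1 -> yi4 before T1/T2/T3, e.g. 一天 yi4tian1
    return tone                    # all other cases keep the lexical tone
```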

Subsequently, we extracted the sound files of these disyllabic words and measured their f0 values using the To Pitch (cc) command in Praat (Boersma and Weenink 2020). For female speakers, the pitch floor was set at 75 Hz and the pitch ceiling at 400 Hz. For male speakers, the pitch floor was set at 50 Hz and the pitch ceiling at 300 Hz. The time step was set to 0.001 s, and a Gaussian window was used for optimal f0 estimation. The To PointProcess command was then applied to identify the time points of glottal pulses in the voiced sections, from which the corresponding f0 values were extracted. No f0 values were returned when there was no vocal fold vibration due to the presence of voiceless plosives or fricatives, or when creaky voice occurred.

Words with fewer than five tokens were excluded from our dataset, to ensure that each word type had a sufficient number of tokens for analysis. For high-frequency words with more than 200 tokens, we randomly sampled 200 tokens, to prevent model predictions from being biased toward high-frequency words. Furthermore, words contributed by only female speakers or only by male speakers were excluded. This ensured that the tokens of a given word type were contributed by at least two speakers, preventing bias from one speaker’s specific way of speaking.
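
These selection steps can be summarized in a short sketch. We assume token records with 'word' and 'gender' fields; the function signature, field names, and seed are hypothetical, as the original selection code is not published here:

```python
import random
from collections import defaultdict

def filter_tokens(tokens, min_count=5, max_count=200, seed=1):
    """Apply the three selection criteria described in the text to a list
    of token records, each a dict with at least 'word' and 'gender' keys
    ('F' or 'M'). Field names and the seed are illustrative."""
    rng = random.Random(seed)
    by_word = defaultdict(list)
    for tok in tokens:
        by_word[tok["word"]].append(tok)
    kept = []
    for word, toks in by_word.items():
        if len(toks) < min_count:           # drop word types with < 5 tokens
            continue
        if {t["gender"] for t in toks} != {"F", "M"}:
            continue                        # require tokens from both genders
        if len(toks) > max_count:           # cap high-frequency words at 200
            toks = rng.sample(toks, max_count)
        kept.extend(toks)
    return kept
```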

Lastly, tokens with f0 extraction errors were excluded from analysis. These errors typically resulted from pitch halving or doubling. We calculated, for each token, the standard deviation of the differences between consecutive f0 measurements. A large standard deviation indicated a high likelihood of discontinuous f0 measurements with abrupt fluctuations. Tokens with a standard deviation greater than the 9th decile of the distribution were considered outliers and removed. This resulted in a dataset with 24,284 tokens, representing 524 unique word types.
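
The outlier criterion can be illustrated as follows; this is a minimal sketch of the procedure described in the text, using Python's statistics module rather than the original implementation:

```python
import statistics

def f0_jump_score(f0_values):
    """Standard deviation of the differences between consecutive f0
    measurements; large values flag the discontinuities typical of
    pitch halving or doubling."""
    diffs = [b - a for a, b in zip(f0_values, f0_values[1:])]
    return statistics.pstdev(diffs)

def flag_outliers(f0_tracks):
    """Flag tracks whose jump score exceeds the 9th decile of all scores,
    mirroring the exclusion criterion described in the text."""
    scores = [f0_jump_score(track) for track in f0_tracks]
    threshold = statistics.quantiles(scores, n=10)[-1]  # 9th decile
    return [score > threshold for score in scores]
```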

3.3 Predictors

The response variable of interest is f0. We log-transformed f0 to obtain a response variable that approximately follows a Gaussian distribution. As our interest is in production rather than comprehension, we did not make use of perceptually motivated frequency scales such as the mel or Bark scale, which are optimized for human perception. The predictors for logf0 are as follows.

normalized_t For each token, time was normalized between 0 and 1, enabling the modeling of tokens with varying durations on a common scale. Since f0 values were measured every 15 ms, tokens with longer durations have more measurements and, consequently, more data points within the [0,1] interval of normalized time.
gender A categorical variable with two levels – female and male. Due to physiological differences, female speakers generally produce speech at a higher pitch than male speakers. Gender is included as a control variable.
speaker A factor with anonymized speaker identifiers as levels, required to account for between-speaker differences in pitch height.
tone_pattern The tonal pattern of the token, as listed in the tone pattern column in Table 1.
tonal_context preceding_tone is the tone of the syllable immediately preceding a token. following_tone is the tone of the syllable immediately following a token. If a pause occurs immediately before or after the token, it is coded as PAUSE. Thus, both preceding_tone and following_tone include six possible values: 1, 2, 3, 4, 0, and PAUSE. To represent the different tonal contexts in which the token may appear, we define tonal_context as the interaction of preceding_tone and following_tone, resulting in a factor with 36 levels.
duration The duration (in seconds) of the token. To avoid concurvity, speech_rate is not included alongside duration as a predictor, as the two variables are moderately correlated (r = −0.55).
norm_utt_pos Normalized position in the utterance represents the relative position of a word within its utterance. It is calculated by dividing the word’s position by the total number of syllables in the utterance, resulting in a value normalized on a scale from 0 to 1. Higher values indicate that the token occurs closer to the end of the utterance. For single-word utterances, the position is coded as 1.
bg_prob_prev Bigram probability quantifies how predictable a word is in its context. This measure of contextual predictability is based on the relative frequency of the word’s co-occurrence with surrounding words. A higher bigram probability indicates that the target word is more predictable within its given context. In general, higher predictability is associated with shorter word durations and greater spectral reduction (Arnon and Priva 2014; Tang and Bennett 2018). There is also some evidence showing that contextual predictability influences f0 production, as observed in English (Turnbull 2017), Taiwan Mandarin (Hsieh 2013), and Taiwan Southern Min (Wang 2024). In the present study, following Gahl (2008), bg_prob_prev is calculated as the probability of the occurrence of the target word given the preceding word.
bg_prob_fol This measure represents the bigram probability of the following word, calculated as the probability of the occurrence of the target word given the following word.
word A factor with orthographic words, as available in the corpus, as levels. For instance, the token XMC_GY_8107_問題 is coded as 問題 using traditional Chinese characters. The dataset contains 313 unique words, so there are 313 corresponding levels for word.
sense_type A word can have multiple senses, which are identified based on the contexts in which the word occurs. We used a word sense identification system, described in Hsieh et al. (2024), that utilizes BERT in combination with the Chinese WordNet (Huang et al. 2010).
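
To make the two bigram measures concrete, here is a sketch of how such conditional probabilities can be estimated from corpus counts by relative frequency (our own illustration; no smoothing is applied, and unseen pairs receive probability zero):

```python
from collections import Counter

def bigram_probs(utterances):
    """Estimate P(word | preceding word) and P(word | following word)
    from tokenized utterances by relative frequency."""
    pair = Counter()     # counts of adjacent pairs (left, right)
    unigram = Counter()  # counts of single words
    for utt in utterances:
        unigram.update(utt)
        pair.update(zip(utt, utt[1:]))
    def p_given_prev(word, prev):
        # bg_prob_prev: probability of the target given the preceding word
        return pair[(prev, word)] / unigram[prev] if unigram[prev] else 0.0
    def p_given_fol(word, fol):
        # bg_prob_fol: probability of the target given the following word
        return pair[(word, fol)] / unigram[fol] if unigram[fol] else 0.0
    return p_given_prev, p_given_fol
```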

Of the above list of predictors, the factor tonal_context poses a special challenge for the analysis. tonal_context provides information about the preceding and following tones. Due to the pervasiveness of tonal coarticulation, it is highly probable that the effect of tonal_context varies with the tone pattern of the target word. For example, a preceding high tone will have an effect on a word-initial dipping tone that differs from its effect on a word-initial rising tone. Accounting for such coarticulation is essential for modeling f0 in connected speech. In principle, one could introduce a variable that represents the interaction between tonal_context and tone_pattern (cf. Jin et al. 2025). However, for our dataset, this would result in a variable with 720 levels that is strongly confounded with word and sense_type.

Therefore, we opted to fit separate regression models for f0 across the four most frequent tonal contexts in our dataset, excluding any contexts that involved a “pause” in the preceding or following tone, i.e., the contexts 4.4, 3.4, 4.1, and 4.0 (cf. Table 2). We chose not to include tonal contexts involving a “pause,” for two reasons. First, when a tone is preceded or followed by a pause, several context-related variables, such as norm_utt_pos, bg_prob_prev, and bg_prob_fol, are missing, leading to data loss. Second, pauses in speech often signal utterance boundaries, hesitations, or breaths, making the “pause” category inherently heterogeneous.

Table 2:

Overview of the full dataset and subdatasets grouped by tonal context.

Dataset Number of tokens Number of word types Number of tone patterns
Total (all contexts) 4,283 313 20
By tonal context:
4.4 1,794 288 20
3.4 888 240 20
4.1 874 250 20
4.0 727 210 20

As shown in Table 2, the final dataset contains 4,283 tokens representing 313 unique word types. Across all tonal contexts combined, the minimum number of tokens per word type is five, and the maximum is 56. On average, each word type was produced by 9.37 different speakers (range: 2–30). Additionally, each speaker contributed an average of 53.31 different word types (range: 4–119). For each tonal context, all 20 tone patterns are represented.

3.4 Statistical analysis

In what follows, we carry out a word-based and a sense-based analysis. Following Chuang et al. (2025), we expect that including a factor smooth of word will substantially improve model fit, thereby extending the observed word-specific tonal realization to 20 tone patterns in spoken Taiwan Mandarin. Furthermore, replacing the factor smooth for word with a factor smooth for sense may lead to further improvement of the model fit. Given the complexity of GAMMs, a certain degree of concurvity is inherent and unavoidable. However, we avoid severe concurvity by specifying models that do not include both segmental predictors and word or sense smooths simultaneously.

3.4.1 Models with word as predictor

Generalized Additive Mixed Models (GAMMs; Wood 2017) were used for the statistical analyses, as implemented in the bam() function of the mgcv package (Wood 2017) for R (R Core Team 2020). Four GAMMs were fitted, one to each of the four datasets, using the same model specification:

logf0 ~ gender +
s(normalized_t, by=gender, k=4) +
s(speaker, bs="re") +
s(normalized_t, tone_pattern, bs="fs", m=1) +
s(normalized_t, word, bs="fs", m=1) +
s(duration, by=gender, k=4) +
ti(normalized_t, duration, k=c(4,4)) +
s(norm_utt_pos, k=4) +
ti(normalized_t, norm_utt_pos, k=c(4,4)) +
s(bg_prob_prev, k=4) +
ti(normalized_t, bg_prob_prev, k=c(4,4)) +
s(bg_prob_fol, k=4) +
ti(normalized_t, bg_prob_fol, k=c(4,4))

To account for differences in average pitch height between genders, we included gender as a fixed effect. We added a by-gender thin plate regression smooth of normalized_t, which allows us to capture differing relationships between normalized time and f0 across genders. Other continuous variables, including duration, norm_utt_pos, bg_prob_prev, and bg_prob_fol, were likewise modeled with thin plate regression splines. Interactions of covariates with normalized time were modeled with tensor product smooths, using the ti() function.

Furthermore, random intercepts were requested for speaker to account for individual variability in pitch height by speaker. Other discrete variables, including tone_pattern and word, were modeled using factor smooths (nonlinear random effects).

We implemented an AR(1) process (first-order autoregressive model) in the residuals to take into account the autocorrelations in the time series of pitch measurements. The inclusion of the AR(1) process with an autocorrelation coefficient of rho = 0.95 effectively removed nearly all autocorrelation from the residuals. Summaries of the four models are provided in the Appendix.
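The effect of the AR(1) correction can be illustrated numerically. The sketch below (Python, with simulated residuals rather than the paper's data) shows that subtracting rho times the previous residual, the whitening step that bam() applies when a rho value is supplied, removes the lag-1 autocorrelation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate AR(1) residuals: e[t] = rho * e[t-1] + noise,
# mimicking the strong autocorrelation in pitch time series.
rho = 0.95
n = 5000
noise = rng.normal(0.0, 1.0, n)
e = np.zeros(n)
for t in range(1, n):
    e[t] = rho * e[t - 1] + noise[t]

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a series."""
    x = x - x.mean()
    return float(np.sum(x[1:] * x[:-1]) / np.sum(x * x))

# Whitening: e'[t] = e[t] - rho * e[t-1] recovers (near-)white noise.
e_white = e[1:] - rho * e[:-1]

print(round(lag1_autocorr(e), 2))        # close to 0.95
print(round(lag1_autocorr(e_white), 2))  # close to 0
```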

Akaike’s Information Criterion (AIC) was used to assess variable importance. Figure 2 shows the increase in AIC (indicating a lower-quality fit to the data) resulting from withholding individual predictors from the model specification.[2] A greater increase in AIC when a predictor is excluded suggests a higher importance of that predictor in the model. As shown in Figure 2, across all tonal contexts, withholding the predictor word leads to a substantial increase in AIC, ranging from 7,269.27 to 12,345.54. The increase in AIC when word is omitted from the model specification substantially exceeds the corresponding change observed for any other predictor.
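The logic of this variable-importance measure can be sketched with a toy Gaussian regression (not the GAMMs of this study): withholding a predictor that matters inflates AIC sharply, while withholding an irrelevant one barely changes it. The helper aic_ols below is hypothetical and computes the Gaussian AIC up to an additive constant.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y depends on x1 but not on x2.
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 * x1 + rng.normal(size=n)

def aic_ols(X, y):
    """Gaussian AIC (up to an additive constant) for an OLS fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    k = X.shape[1] + 1  # coefficients + error variance
    return len(y) * np.log(rss / len(y)) + 2 * k

full = np.column_stack([np.ones(n), x1, x2])
without_x1 = np.column_stack([np.ones(n), x2])
without_x2 = np.column_stack([np.ones(n), x1])

# Withholding the important predictor inflates AIC far more.
print(aic_ols(without_x1, y) - aic_ols(full, y))  # large increase
print(aic_ols(without_x2, y) - aic_ols(full, y))  # near zero
```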

Figure 2: 
The increase in AIC scores when a predictor is withheld from the best-fit model. The AIC increase when word or tone_pattern is withheld is shown in red, and the increase for other predictors is shown in blue. Panels 1 to 4 represent four GAMMs with tonal contexts 4.4, 3.4, 4.1, and 4.0, respectively.

Surprisingly, withholding tone_pattern has only a small impact on model fit, with AIC increases averaging 11.81 units (20.08 units for 4.4, 9.25 units for 3.4, 3.43 units for 4.1, and 14.5 units for 4.0). One likely explanation is that word is nested within tone_pattern. When word is removed from the best-fit GAMM, withholding tone_pattern results in a much larger AIC increase: 7,223.17 units for 4.4, 4,139.11 units for 3.4, 2,973.78 units for 4.1, and 3,594.93 units for 4.0. This suggests that tone_pattern does contribute to the model fit, though not as strongly as word: when word is included in the model, the effect of tone_pattern is overshadowed by the stronger effect of word.

Concurvity, analogous to collinearity in linear regression, measures how much a predictor’s effect can be explained by other predictors in the model. Concurvity scores range from 0 to 1, with lower values indicating that the contribution of a predictor is less confounded with the contributions of other predictors. As shown in Figure 3, concurvity scores follow a similar pattern for all four GAMMs, with the lowest concurvity scores for word. speaker also has relatively low concurvity.[3]

Figure 3: 
Concurvity scores for selected terms in the four GAMMs. The concurvity scores for word and tone_pattern are shown in red, and those for other predictors are shown in blue. From left to right, it presents tonal context 3.4, 4.0, 4.1, and 4.4, respectively. Concurvity scores were calculated based on the best-fit GAMMs with all predictors included.

By contrast, the predictor tone_pattern exhibits extremely high concurvity, ranging from 0.998 to 1. This is because tone pattern is fully predictable from the word. When word is excluded from the model, the concurvity of tone_pattern drops substantially, to 0.09. In other words, the factor smooth for word implicitly captures a word's tone pattern, so when word is in the model, tone pattern is in the model as well. When only tone_pattern is specified, however, word-specific information is unavailable, resulting in substantially worse fits, in line with the AIC changes discussed in the preceding subsection.
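The nesting argument can be made concrete with a toy regression, treating concurvity like an R-squared: how much of one term's contribution the other terms can reconstruct. The word labels and helper below are hypothetical, not the study's data.

```python
import numpy as np

rng = np.random.default_rng(0)

def r_squared(X, y):
    """How much of y the columns of X can reconstruct (OLS R^2)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    tss = float(np.sum((y - y.mean()) ** 2))
    rss = float(np.sum((y - X @ beta) ** 2))
    return 1.0 - rss / tss

# Hypothetical token data: each word has a fixed tone pattern,
# so tone pattern is nested within word.
word_types = ["houlai", "buran", "mudi", "zhuyi"]
words = rng.choice(word_types, size=400)
tone_of = {"houlai": 0, "buran": 0, "mudi": 1, "zhuyi": 1}
tone = np.array([tone_of[w] for w in words], dtype=float)

# Dummy-code word and regress the tone indicator on it.
word_dummies = np.column_stack([(words == w).astype(float)
                                for w in word_types])

r2_nested = r_squared(word_dummies, tone)
print(r2_nested)  # ~1.0: tone pattern fully explained by word

# Break the nesting: random tones are no longer predictable from word.
random_tone = rng.integers(0, 2, size=400).astype(float)
r2_random = r_squared(word_dummies, random_tone)
print(r2_random)  # near 0
```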

Finally, we note that the by-gender smooths for time (normalized_t:female and normalized_t:male) show very high concurvity – unsurprisingly, as the tonal contours for both genders are highly similar (see Figure 4). These general contours reflect the overall influence of time on pitch, before word- and context-specific effects are taken into account. Both curves fall over time, pointing to a general declination trend in the pitch contours of disyllabic words.

Figure 4: 
The partial effect of general smooth for the normalized_time for female and male speakers, in different tone contexts. The orange curves indicate the general contours for female speakers, and the blue curves indicate the general contours for male speakers. Vertical gray dashed lines indicate the average syllable boundary, and the horizontal gray dashed line represents the y = 0 reference line.

Figure 5 illustrates the partial effects of the 20 tonal patterns across the four tonal contexts under investigation, using color coding to distinguish between the tonal contexts. Within each panel, the various colored curves represent specific tone patterns associated with different tonal contexts. For example, the orange curve in the upper-left panel represents the T4-T0 tone pattern in the 3.4 tonal context. In other words, it represents the tonal sequence T3-T4-T0-T4, as in the phrase 有這麼重 (you3zhe4me0zhong4, “…is this heavy”). The black curves were obtained by averaging the four colored curves representing a tone pattern across the tonal contexts. The deviations of the colored curves from the corresponding black curves highlight how the actual realization of a tonal pattern in context differs from the expected effect of that tone pattern, irrespective of context.

Figure 5: 
The effect of tone pattern. The colored curves represent the partial effects of the factor smooth for tone_pattern, combined with the general smooth of normalized_t for female speakers, based on the best-fit models that include the word effect. There is one GAMM for each tonal context, resulting in four colored curves representing, in a given panel, the four tonal contexts. The black curves present the mean f0 contours of a tone pattern, calculated by averaging the four f0 contours across the tonal contexts. Thus, the colored curves in each panel illustrate how the tonal context modulates the general curve shown in black. Vertical gray dashed lines indicate the average syllable boundary, and the horizontal gray dashed line represents the y = 0 reference line.

For most of the tone patterns, the effects of the neighboring tones on the pitch contour are relatively modest, with one glaring exception: the T2-T0 tone pattern in the 4.0 tonal context. This tonal sequence, T4-T2-T0-T0 (e.g., 對孩子的, dui4hai2zi0de0, “for children’s …”), shows an unexpectedly low f0. This is probably because the sequence is underrepresented in the dataset, with only 9 tokens representing 4 unique word types (cf. Table A.1 in Appendix 1). For the tone patterns T1-T0, T2-T0, T3-T0, and T4-T0, the contours appear to approach a similar mid-low pitch target at the end of the second syllable, regardless of the following tone.

For the 3.4 tonal context, 14 of the tonal patterns begin with the lowest f0, presumably a straightforward consequence of Tone 3 often being realized as a low tone in Taiwan Mandarin (Fon and Chiang 1999). For the 4.0 tonal context, f0 by the end of the word tends to be the lowest across all panels. This is probably due to the general curve for female speakers in the 4.0 tonal context, which shows a particularly salient falling trend (cf. Figure 4). Apparently, the following neutral tone magnifies the final downward inclination observed in the vast majority of tone patterns.

Figure 6 displays predicted pitch contours estimated by the factor smooth for word, combined with the partial effects of the factor smooth for tone_pattern. These contours are predicted from an omnibus model fitted to the entire dataset, with tonal_context included as a random intercept. These partial effects exclude the general intercept and do not account for pitch differences between female and male speakers. The red dashed curves represent the partial effect of the factor smooth for tone_pattern only, without incorporating the word-specific pitch contours, and are shown to provide a reference for assessing the word-specific effects. The deviation of the blue curves from the red dashed curves reflects the differences between the predicted pitch contours and the general tone pattern.

Figure 6: 
The predicted f0 contours for words with the T4-T2 tone pattern. These contours are estimated by combining the partial effects of the factor smooth for word and the corresponding factor smooth for tone_pattern (T4-T2). The dashed red curve represents the partial effect of the T4-T2 tone pattern alone, which is identical across all panels. Vertical gray dashed lines indicate the average syllable boundary, while the horizontal gray dashed line represents the y = 0 reference line. The “n=” in the upper-right corner of each panel shows the token counts for the word types. Panels are ordered by the token counts in the dataset, from highest to lowest.

Figure 6 presents words arranged by frequency. Each word exhibits its own “pitch signature,” independent of frequency. For further discussion of a possible role of frequency, see Appendix 2. It can be observed that the pitch contours of 後來 (hou4lai2, “later”), 不然 (bu4ran2, “otherwise”), and 不能 (bu4neng2, “cannot”) closely align with the predicted tone pattern but are overall shifted upward. Similarly, the pitch contour of 認為 (ren4wei2, “to believe”) also follows a similar shape but is shifted downward. However, other words, such as 幹嘛 (gan4ma2, “What for?”), 目前 (mu4qian2, “at present”), and 化學 (hua4xue2, “chemistry”), largely deviate from the general tone pattern. Two words beginning with 不 (bu4, expressing negation), 不然 (bu4ran2, “otherwise”) and 不能 (bu4neng2, “cannot”), have very similar contours that run parallel to the contour of the T4-T2 pattern. However, 不行 (bu4xing2, “not okay”) displays a steeper fall.

The word-specific tonal realizations observed here are similar to those reported for Mandarin disyllabic words with T2-T4 tone patterns (Chuang et al. 2025), as well as words with the T2-T3 and T3-T3 tone patterns (Lu et al. 2024).

3.4.2 Sense-specific tonal realization

In the preceding section, we documented that the tonal realization of Mandarin disyllabic words varies systematically by word. It is possible that words’ segmental make-up is the crucial factor. Alternatively, it is theoretically possible that it is words’ meanings that shape their pitch contours, just as in English, the duration of homophones is to a considerable extent codetermined by their meanings (Gahl and Baayen 2024). If this hypothesis is on the right track, then word sense should be a more precise predictor than word identity. In the following analysis, we explore whether we can replicate previous studies in which sense emerged as an even better predictor of disyllabic words’ pitch contours than the word itself (Chuang et al. 2025; Lu et al. 2024). If we can show that a word with different meanings exhibits varying pitch realizations, this will provide further evidence that words’ meanings codetermine tonal realization.

In order to explore this hypothesis, we make use of the fact that our data are taken from a corpus, rather than from a word list. As a consequence, we can estimate a word token’s most likely sense in the exact context in which it was used. To determine these most likely senses in context, we made use of the sense identification system proposed by Hsieh et al. (2024), which uses BERT in combination with the Chinese WordNet (Huang et al. 2010). For example, this system assigns the word 先生 (xian1sheng1, “husband, sir”) the sense “a woman’s spouse in a marital relationship” in sentences such as 我先生認為 (wo3xian1sheng1ren4wei2, “My husband thinks …”) or 我先生去睡覺 (wo3xian1sheng1qu4shui4jiao4, “My husband went to sleep …”). It assigns 先生 the sense “a man addressed in a social context” when it appears in the phrase 那位先生 (na4wei4xian1sheng1, “That gentleman over there …”).

Since not all words in the dataset could be assigned a sense, we first excluded words for which no sense type was identified. Second, we removed sense types represented by fewer than six tokens, to ensure that each sense type had sufficient data for meaningful analysis. Third, to prevent the model’s predictions from being biased toward high-frequency sense types, we capped the number of tokens per sense type: for any sense type represented by more than 50 tokens, we randomly sampled 50 tokens. We then grouped the dataset by tonal context, as in the previous analysis, resulting in four subdatasets (see Table 3). The final dataset consists of 3,525 tokens representing 290 unique sense types. After trimming, 252 of the initial 313 unique word types remain. All 20 tone patterns are present for each tonal context. The distribution of sense types and word types follows a similar pattern to that of the dataset shown in Table 1.
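The three trimming steps can be sketched with pandas. The function, column names, and toy data below are hypothetical stand-ins for the actual corpus pipeline:

```python
import numpy as np
import pandas as pd

def trim_senses(df, min_tokens=6, max_tokens=50, seed=0):
    """Sketch of the trimming steps (hypothetical column names):
    1. Drop tokens with no identified sense.
    2. Drop sense types with fewer than `min_tokens` tokens.
    3. Downsample sense types with more than `max_tokens` tokens.
    """
    df = df.dropna(subset=["sense_type"])
    counts = df["sense_type"].value_counts()
    df = df[df["sense_type"].isin(counts[counts >= min_tokens].index)]
    return (df.groupby("sense_type", group_keys=False)
              .apply(lambda g: g.sample(min(len(g), max_tokens),
                                        random_state=seed)))

# Toy data: sense "a" is too rare, "c" is oversampled, NaN rows are dropped.
toy = pd.DataFrame({
    "sense_type": ["a"] * 3 + ["b"] * 10 + ["c"] * 60 + [np.nan] * 2,
    "logf0": 0.0,
})
trimmed = trim_senses(toy)
print(sorted(trimmed["sense_type"].value_counts().items()))  # [('b', 10), ('c', 50)]
```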

Table 3:

Overview of trimmed datasets grouped by the four tonal contexts, for the sense analysis.

Tonal context Tokens Sense types Word types Tone patterns
4.4 1,512 266 233 20
3.4 740 220 195 20
4.1 716 228 200 20
4.0 557 171 157 20
Total 3,525 290 252 20

For the sense analysis, we replaced the factor smooth for word with a factor smooth for sense_type, while keeping all other predictors from the previous analysis.

logf0 ~ gender +
s(normalized_t, by=gender, k=4) +
s(speaker, bs="re") +
s(normalized_t, tone_pattern, bs="fs", m=1) +
s(normalized_t, sense_type, bs="fs", m=1) +
s(duration, by=gender, k=4) +
ti(normalized_t, duration, k=c(4,4)) +
s(norm_utt_pos, k=4) +
ti(normalized_t, norm_utt_pos, k=c(4,4)) +
s(bg_prob_prev, k=4) +
ti(normalized_t, bg_prob_prev, k=c(4,4)) +
s(bg_prob_fol, k=4) +
ti(normalized_t, bg_prob_fol, k=c(4,4))

An AR(1) process in the errors was also included to account for the autocorrelation in the pitch time series. The model summary is available in Appendix 1.

To assess the relative importance of sense_type, word, and tone_pattern, we compared models with different predictor structures: (1) all other predictors + tone_pattern, (2) all other predictors + word, (3) all other predictors + tone_pattern + word, and (4) all other predictors + tone_pattern + sense_type. Table 4 presents the AIC differences resulting from adding the given variable(s), relative to the all other predictors model.

Table 4:

AIC scores for models with different structures of word, sense_type, and tone_pattern, fitted separately to datasets for the four tonal contexts.

Tonal context Model AIC ΔAIC
4.4 All other predictors −210,904.67
4.4 All other predictors + tone_pattern −217,017.19 −6,112.52
4.4 All other predictors + word −226,514.74 −15,610.07
4.4 All other predictors + tone_pattern + word −226,532.42 −15,627.76
4.4 All other predictors + tone_pattern + sense_type −226,981.87 −16,077.20
3.4 All other predictors −107,031.04
3.4 All other predictors + tone_pattern −111,086.38 −4,055.34
3.4 All other predictors + word −116,996.00 −9,964.96
3.4 All other predictors + tone_pattern + word −116,989.72 −9,958.67
3.4 All other predictors + tone_pattern + sense_type −117,572.92 −10,541.88
4.1 All other predictors −104,637.05
4.1 All other predictors + tone_pattern −107,224.95 −2,587.90
4.1 All other predictors + word −113,207.07 −8,570.02
4.1 All other predictors + tone_pattern + word −113,206.72 −8,569.67
4.1 All other predictors + tone_pattern + sense_type −113,808.24 −9,171.19
4.0 All other predictors −85,980.81
4.0 All other predictors + tone_pattern −88,837.72 −2,856.91
4.0 All other predictors + word −93,647.14 −7,666.33
4.0 All other predictors + tone_pattern + word −93,659.45 −7,678.64
4.0 All other predictors + tone_pattern + sense_type −93,876.38 −7,895.57
  1. ΔAIC represents the difference in AIC between a given model and the baseline model including all other predictors (ΔAIC = AICmodel − AICbaseline), with more negative values indicating improved model fit relative to the baseline. The model with the greatest AIC improvement is highlighted in bold.

First, comparing row 2 and row 5, the inclusion of tone_pattern and sense_type (ΔAIC: −16,077.20) improved model fit far more substantially than tone_pattern by itself (ΔAIC: −6,112.52). This indicates that tone_pattern contributes to the model fit, albeit with a relatively minor effect compared to sense_type. The same AIC pattern across the 3.4, 4.1, and 4.0 tonal contexts further reinforces the stronger influence of sense_type over tone_pattern in modeling f0 contours.

Second, consider the GAMMs where word was replaced by sense_type. In the case of the 4.4 tonal context, replacing word with sense_type (comparing row 4 and row 5) led to a substantial AIC decrease of 449.44 units, suggesting that sense_type is a stronger predictor than word for modeling f0 contours. The stronger effect of sense is also observed across other tonal contexts.

Figure 7 displays the predicted tonal contours for different sense types of 另外 (ling4wai4, “in addition”), calculated by combining the partial effects of sense_type and tone_pattern. Similar to the red dashed curves in Figure 6, the red dashed curves in Figure 7 again represent the general tone pattern, which is T4-T4 in this case. The three sense types of 另外 (ling4wai4, “in addition”) are “others” (sense1), “in addition to” (sense2), and “totally different” (sense3). The word 另外 (ling4wai4, “in addition”) exhibits clear variations across the three sense types compared to the general tone pattern. The pitch contours of sense 1 (shown in purple) are generally shifted below the general tone pattern, while those of sense 2 (shown in blue) are shifted above it. The pitch contour of sense 3 (shown in yellow) displays two rises, as in the general tone pattern, but is shifted upward.

Figure 7: 
A selection of the predicted f0 contours for different sense_type of 另外 (ling4wai4, “in addition”) across tonal contexts. The predicted pitch contours represent the partial effect of the factor smooth for sense_type, combined with the corresponding factor smooth for tone_pattern (T4-T4 in this case). The red dashed curves represent the partial effect of T4-T4 tone pattern alone, averaged across four tonal contexts, so the red dashed curve is the same across all panels. Vertical dashed lines indicate the average syllable boundary, and the horizontal gray dashed line represents the y = 0 reference line.

3.5 Summary

This section addressed our first hypothesis, namely, that the meanings of words codetermine the phonetic details of how the tones of these words are produced. Our results show that word emerged as a more powerful predictor than all other predictors. Surprisingly, the variable importance of word was substantially greater than that of tone_pattern. The strong effect of word that we observed is in line with the results of Jin et al. (2025) and Lu et al. (2024). Jin et al. (2025) also observed, albeit for monosyllabic words, that word was a stronger predictor than tone_pattern. In the study by Lu et al. (2024), however, the variable importance of word was similar to that of tone_pattern.

A further analysis clarified that sense_type is an even better predictor of pitch contours than word. The substantial improvement in model fit contributed by sense_type provides further support for the hypothesis that words’ meanings codetermine the fine detail of their pitch contours, replicating the findings of earlier studies (Chuang et al. 2025; Jin et al. 2025; Lu et al. 2024).

4 Theory-driven computational modeling

In this section, we turn to our second hypothesis, exploring whether the tonal realization of a given token can be predicted with reasonable accuracy based on its context-specific meaning using computational modeling. To do so, we make use of the general conceptual framework of the Discriminative Lexicon Model (DLM, Heitmeier et al. 2024, 2025), a computationally implemented theory that was developed independently of the present data, but that turns out to provide exactly the right approach to predict tonal realization from semantics.

In the introduction, we already explained that the DLM seeks to predict words’ forms from their meanings. Both forms and meanings are represented by numeric vectors, and in the simplest possible set-up, a linear mapping transforms a meaning vector into a form vector (for mappings using deep neural networks, see Heitmeier et al. 2024, 2025). For the present study, we are not interested in predicting full word forms, but rather words’ pitch contours. What we need, then, are numerical representations of the present Mandarin word tokens’ pitch contours on the one hand, and their meanings on the other hand. Following Chuang et al. (2025), we represent words’ forms using fixed-length vectors representing pitch contours, and we represent words’ meanings using contextualized embeddings obtained with the GPT-2 transformer technology. Importantly, both the pitch vectors and the semantic vectors are context-specific and thus vary from word token to word token. Chuang et al. (2025) demonstrated that the tonal contours of a given token with T2-T4 tone pattern can be predicted from its context-specific meaning with above-chance accuracy using a linear mapping. In what follows, we consider whether this result generalizes to all 20 tone patterns attested for two-syllable words. As a first step, we explain how we obtained fixed-length pitch vectors.

4.1 Fixed-length pitch vectors

To implement a linear mapping within the DLM framework, given n words, we need an n × p matrix C to represent words’ pitch contours, and an n × q matrix S for words’ meanings. Consider the form matrix C, and recall that the tokens in our dataset have unequal numbers of pitch measurement points, because tokens with longer durations contain more measurement points. Furthermore, the raw data include missing values due to gaps in the pitch contours. The row vectors of C, however, need to have the same fixed length p. To achieve this, we used GAMMs to obtain pitch contours represented by p = 100-dimensional vectors in normalized time. Such fixed-length vectors can be generated in several ways, of which we explored three.

Method I The first method fitted a separate GAMM to the f0 contour of each individual token, i.e., 4,283 independent GAM models, and then extracted the predicted contours. This method generates pitch contours that stay as close as possible to the empirical pitch measurements. However, it inevitably includes by-token measurement noise in the estimation of the contours. In simple univariate linear regression, the predicted value for a data point (on the regression line) deviates from the observed value for that data point; taking the observed data point as the gold standard is at odds with statistical modeling. Similarly, for the present time series of measurements, the observed curves are not a given gold standard. There are several sources of noise: stochastic noise in articulation, noise in the audio recordings, and noise in the pitch measurements. Method I incorporates the combined noise from these sources and therefore serves as a baseline that we expect to yield the least precise results. Methods II and III implement two ways of reducing this measurement noise in the observed pitch contours.
Method II The second method fitted a GAMM to the f0 contours of all the tokens of words with a given tone pattern, extracting the smooth for time and the word-specific smooth, and combining these to obtain word-type–specific smooths. This method abstracts away from the influences of contextual factors on the realization of pitch. The resulting pitch vectors are identical for all the tokens of a given word type. We anticipated that this would be the optimal situation for learning, as within-type variation is eliminated. This method also has a theoretical motivation, namely, that it is unlikely that the contextualized embeddings generated by an AI model will capture the full richness of the thought of human speakers engaged in real, 30-minute long conversations.
Method III The third method, following Chuang et al. (2025), obtains token-specific pitch vectors predicted by GAMMs with all contextual factors included. For our data, we used the four GAMMs fitted to the four tonal contexts, as reported above in Section 3. This method has the advantage, compared to Method I, of removing by-observation noise. Furthermore, compared to Method II, it has the advantage of having pitch vectors that vary from token to token. Thus, this method is optimal for detecting the extent to which by-token semantics and by-token phonetics are aligned. The more the contextualized embeddings diverge from the true semantic intentions of the speakers, the less well this method will perform.
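All three methods yield p = 100-dimensional vectors in normalized time. The study derives these from GAMMs; as a simplified stand-in, the sketch below shows the fixed-length idea using plain linear interpolation of a variable-length f0 track with unvoiced gaps (the function and data are hypothetical, not the authors' pipeline):

```python
import numpy as np

def to_fixed_length(times, f0, p=100):
    """Resample a variable-length f0 track onto p points in normalized
    time [0, 1], skipping unvoiced gaps (NaNs). Linear interpolation is
    a stand-in for the GAMM-based smoothing used in the study."""
    times = np.asarray(times, dtype=float)
    f0 = np.asarray(f0, dtype=float)
    voiced = ~np.isnan(f0)
    t_norm = (times - times.min()) / (times.max() - times.min())
    grid = np.linspace(0.0, 1.0, p)
    return np.interp(grid, t_norm[voiced], f0[voiced])

# A token with 7 measurement points and one gap becomes a 100-point vector.
track = to_fixed_length([0, 10, 20, 30, 40, 50, 60],
                        [200, 210, np.nan, 230, 235, 228, 215])
print(track.shape)  # (100,)
```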

After obtaining the estimated pitch vectors from the GAMMs, we applied by-token normalization by centering and scaling each predicted pitch vector. By doing so, the mapping from meaning to form is forced to learn to predict the shape of pitch contours rather than the absolute pitch values of each token, which vary substantially across word types and speakers.
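The by-token normalization amounts to z-scoring each pitch vector, as in this minimal sketch (hypothetical helper name):

```python
import numpy as np

def normalize_token(pitch_vec):
    """Center and scale one predicted pitch vector, so that only the
    contour shape (not absolute pitch level or range) is learned."""
    v = np.asarray(pitch_vec, dtype=float)
    return (v - v.mean()) / v.std()

z = normalize_token([200.0, 210.0, 220.0, 210.0])
print(z.mean(), z.std())  # ~0.0, ~1.0
```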

4.2 Contextualized embeddings

We made use of Contextualized Embeddings (CEs) to represent words’ meanings. Word embeddings (semantic vectors) represent meanings in a distributed manner, building on the hypothesis that similar words occur in similar contexts (Harris 1954; Mikolov et al. 2013). Semantic vectors are more than mathematical constructs: they offer a cognitive framework for modeling human semantic memory, as shown by Landauer and Dumais (1997), and Bruni et al. (2014) show that embeddings accurately predict human semantic similarity ratings. An increasing body of research shows that semantic vector spaces encode meaningful psychological dimensions and may capture aspects of conceptual representation that are salient to human cognition (Westbury and Wurm 2022; Westbury et al. 2025). Contextualized embeddings probe aspects of semantic structure that are difficult to capture with traditional linguistic tools (Günther et al. 2019). For instance, Shafaei-Bajestan et al. (2024) show, using embeddings, that in English the semantics of the plural vary by semantic class.

The first generation of semantic embeddings, such as fastText (Bojanowski et al. 2017), is fully determined by words’ orthographic forms. However, a single orthographic form can express different meanings (e.g., English “bank”), or different senses (e.g., Mandarin 水平 [shui3ping2, “level or horizontal position” or “skill or proficiency”]). Typically, the context in which a word is used provides disambiguating information. Contextualized Embeddings (CEs) were developed to provide token-specific, context-sensitive embeddings that capture the subtle differences in what a word may actually mean in context.

The CEs used in the current study were derived from a pretrained unidirectional language model based on the GPT-2 architecture, developed by CKIP, Academia Sinica. Each token in our dataset was assigned a 768-dimensional vector representing its contextualized embedding.

To inspect the quality of the contextualized embedding space, we reduced the 768-dimensional semantic space to two dimensions using t-SNE (Van der Maaten and Hinton 2008). Figure 8 displays embeddings in the resulting 2-D plane, with convex hulls highlighting clusters of tokens corresponding to different word types. Tokens clearly cluster by word, as expected. Furthermore, some semantically related words have clusters that are close to each other. For instance, in the middle-right of the Figure, the tokens of 大學 (da4xue2, “university”), 學校 (xue2xiao4, “school”), 國中 (guo2zhong1, “middle school”), and 高中 (gao1zhong1, “high school”) occur closely together, which makes sense as they are all semantically related to educational institutions. Other school-related words such as 學生 (xue2sheng1, “students”) and 老師 (lao3shi1, “teacher”) also appear near these words. Some word clusters contain outliers. For instance, in the center of the figure, 上面 (shang4mian4, “above”) has an outlier positioned near 以後 (yi3hou4, “in the future”), and 後來 (hou4lai2, “afterwards”) has an outlier near 之後 (zhi1hou4, “after”).
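A dimensionality reduction of this kind can be sketched with scikit-learn's TSNE on toy stand-in embeddings (random clusters, not the actual CEs):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Toy stand-ins for the CEs: 60 "tokens" in 768 dimensions,
# drawn from three artificial clusters.
emb = np.concatenate([rng.normal(loc=c, scale=0.5, size=(20, 768))
                      for c in (-2.0, 0.0, 2.0)])

# Reduce to two dimensions for plotting; perplexity must be < n tokens.
coords = TSNE(n_components=2, perplexity=10.0,
              random_state=0).fit_transform(emb)
print(coords.shape)  # (60, 2)
```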

Figure 8: 
Contextualized embeddings, obtained from a pretrained Chinese GPT-2 model, are shown in a two-dimensional plane obtained with t-SNE. Convex hulls (gray polygons) highlight the clusters of word types.

4.3 Methods

Modeling was conducted using the same dataset that we used above for the word-type–based analysis (see Table 2), which contains 4,283 tokens. This dataset, comprising all four tonal contexts, was split into a training dataset (80.39 %) and a testing dataset (19.61 %). Every word type was represented in both datasets, with each word type’s tokens split roughly proportionally: 80 % into the training dataset and 20 % into the testing dataset.
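A per-word-type proportional split of this kind can be sketched as follows (the helper split_by_word is hypothetical, not the study's code):

```python
from collections import defaultdict

import numpy as np

def split_by_word(words, test_frac=0.2, seed=0):
    """Split token indices so that each word type contributes roughly
    (1 - test_frac) of its tokens to training and test_frac to testing,
    keeping every word type represented in both sets."""
    rng = np.random.default_rng(seed)
    by_word = defaultdict(list)
    for i, w in enumerate(words):
        by_word[w].append(i)
    train, test = [], []
    for idx in by_word.values():
        idx = np.array(idx)
        rng.shuffle(idx)
        n_test = max(1, round(len(idx) * test_frac))
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)

# Toy token list: one word with 10 tokens, one with 5.
words = ["houlai"] * 10 + ["buran"] * 5
train, test = split_by_word(words)
print(len(train), len(test))  # 12 3
```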

Using the training data, we computed a linear mapping G from a 3,443 × 768 semantic matrix S to a 3,443 × 100 form matrix C by solving SG = C (for technical details, see Gahl and Baayen 2024; Heitmeier et al. 2025). We then evaluated the quality of the mapping for both the training and the testing dataset.
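The mapping G can be estimated with multivariate least squares. Below is a minimal, scaled-down sketch; the dimensions and data are illustrative stand-ins, not the paper's 3,443 × 768 and 3,443 × 100 matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.normal(size=(300, 20))      # stand-in semantic matrix (tokens x dims)
G_true = rng.normal(size=(20, 8))   # ground-truth mapping for this toy setup
C = S @ G_true                      # stand-in form (pitch-vector) matrix

# solve S G = C in the least-squares sense; lstsq is numerically stable
# compared to forming (S^T S)^{-1} S^T C explicitly
G, *_ = np.linalg.lstsq(S, C, rcond=None)
C_hat = S @ G                       # predicted form vectors
print(np.allclose(C_hat, C))        # True: exact in this noise-free toy
```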

The accuracy of a predicted pitch vector ĉ was evaluated as follows. For a given ĉ, we calculated its Euclidean distance to all gold-standard pitch vectors in C. We then identified the nearest form neighbor of ĉ. If this nearest neighbor was a token of the same word type as the target token, the predicted form vector was assessed as correct; otherwise, it was counted as incorrect.
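This nearest-neighbor evaluation can be sketched as follows; the function name and toy vectors are illustrative.

```python
import numpy as np

def nn_accuracy(C_hat, C_gold, word_labels):
    """C_hat, C_gold: (n, d) arrays of predicted and gold pitch vectors;
    word_labels: length-n word types. A prediction counts as correct if
    its nearest gold vector (Euclidean distance) shares the word type."""
    correct = 0
    for i, c_hat in enumerate(C_hat):
        dists = np.linalg.norm(C_gold - c_hat, axis=1)
        nearest = int(np.argmin(dists))
        if word_labels[nearest] == word_labels[i]:
            correct += 1
    return correct / len(C_hat)

C_gold = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = ["A", "A", "B", "B"]
C_hat = C_gold + 0.01                       # near-perfect toy predictions
print(nn_accuracy(C_hat, C_gold, labels))   # 1.0
```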

4.4 Results

We estimated three linear mappings from the same semantic matrix S with CEs to three different form matrices C , one for each of the three kinds of smoothed pitch contours introduced above.

The mean accuracy of method I was 2.8 % on the training dataset and 1.4 % on the testing dataset. The mean accuracy of method II was 23.5 % on the training dataset and 15.1 % on the testing dataset. The mean accuracy of method III was 12.0 % on the training dataset and 6.9 % on the testing dataset. All accuracies were above a permutation baseline of 0.4 % and a majority baseline of 1.3 %, albeit by only a small margin for method I. That method I has the lowest accuracy is unsurprising: fitting GAMMs to individual pitch contours unavoidably comes with overfitting and a loss of generalizability. Methods II and III gain strength from other tokens and incorporate less by-item observation noise. For these two methods, the mapping from meaning to pitch contours is substantially more accurate than would be expected under chance conditions. The best-performing method is method II, which abstracts away from the influences of contextual factors on the realization of pitch. This suggests that some abstraction away from the immediate context is helpful, possibly because the contextualized embeddings are not precise enough. After all, these embeddings come from a general large language model trained on large volumes of text that most likely diverge considerably from the language experience of the speakers interviewed for the Corpus of Spoken Taiwan Mandarin.
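A majority baseline of the kind mentioned above always predicts the most frequent word type in the training data. The sketch below illustrates the idea with toy counts; the words and frequencies are invented for illustration, not the corpus counts.

```python
from collections import Counter

# toy label distributions (illustrative, not the paper's data)
train_labels = ["然後"] * 12 + ["今天"] * 5 + ["兒子"] * 3
test_labels = ["然後"] * 3 + ["今天"] * 2 + ["兒子"] * 5

# the majority baseline predicts the most frequent training label throughout
majority = Counter(train_labels).most_common(1)[0][0]
acc = sum(label == majority for label in test_labels) / len(test_labels)
print(majority, acc)  # 然後 0.3
```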

The results obtained with methods II and III in particular clarify that there is considerable isomorphy between the contextualized embedding space and the pitch space of word tokens. This isomorphy implies that if we take the most typical embedding for a given tone pattern and map it into the pitch space, the resulting predicted pitch contour should closely resemble the pitch contour identified by the GAMM for that tone pattern. Figure 9 shows that this prediction is on the right track. The pitch contours shown in black are the contours predicted by the GAMMs for the different tone patterns. They represent the best denoised estimates of the average tone-pattern–specific pitch contours and serve as our gold-standard pitch contours. These GAMM-based pitch contours were obtained by first extracting the tone-pattern–specific pitch contours for each of the four tonal contexts, and then averaging these (these GAMM-based contours were shown above in red in Figure 5). An alternative method, which results in nearly indistinguishable pitch contours, combines the data for all four contexts and adds the tone-pattern–specific contours to the general contour for female speakers.

Figure 9: 
Pitch contours of the 20 tone patterns in the four selected tonal contexts. The black curves present the pitch contours identified by GAMM, estimated by the partial effect for tone_pattern, combined with the general contours over time for female speakers (shown as the red curves in Figure 5, which is reproduced here). The blue, green, and red curves represent pitch contours predicted from the centroid of the contextualized embeddings using the three methods, respectively. For the blue curves, form vectors were obtained with GAMM smooths fitted to individual word tokens. For the green curves, form vectors were obtained from a GAMM fitted to the tokens of all words with a given tone pattern. For the red curves, form vectors were obtained with GAMM smooths that included all contextual factors.

We now consider how well these average pitch contours can be predicted from words’ contextualized embeddings. The most typical embedding for a given tone pattern can be approximated by calculating the centroid of the contextualized embeddings of the tokens with this particular tone pattern. The centroid is simply the mean of the semantic vectors. When we think of embeddings as points in a high-dimensional space, the centroid is located at the center of these points. To obtain the centroid of a given tone pattern, we first obtained the centroid of every relevant word type by averaging the CEs of its tokens. We then obtained the centroid of the tone pattern by averaging the centroids of the word types associated with that tone pattern. In this way, every word type is given equal weight when determining the centroids for the tone patterns.
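The two-step centroid computation described above can be sketched as follows; the function name and toy embeddings are illustrative.

```python
import numpy as np
from collections import defaultdict

def tone_pattern_centroids(embeddings, word_of_token, pattern_of_word):
    """Step 1: centroid of each word type = mean over its token embeddings.
    Step 2: centroid of each tone pattern = mean over its word-type
    centroids, so every word type carries equal weight."""
    by_word = defaultdict(list)
    for emb, w in zip(embeddings, word_of_token):
        by_word[w].append(emb)
    word_centroid = {w: np.mean(v, axis=0) for w, v in by_word.items()}
    by_pattern = defaultdict(list)
    for w, c in word_centroid.items():
        by_pattern[pattern_of_word[w]].append(c)
    return {p: np.mean(v, axis=0) for p, v in by_pattern.items()}

# toy example: word "wa" has two tokens, "wb" and "wc" one each
emb = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 4.0], [10.0, 10.0]])
cents = tone_pattern_centroids(emb, ["wa", "wa", "wb", "wc"],
                               {"wa": "T2-T4", "wb": "T2-T4", "wc": "T1-T1"})
print(cents["T2-T4"])  # [2.5 2.5]: mean of the wa and wb centroids
```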

In order to get a sense of the semantics represented by these centroid vectors, we calculated, for each tone pattern, which contextualized embeddings are closest to the corresponding centroids. Table 5 lists, for each tone pattern, the two word types with embeddings that are closest to the centroids. These two word types provide an indication of the prototypical meaning of a tone pattern. For example, 她們 (ta1men0, “they (female)”) and 他們 (ta1men0, “they (male or mixed-gender groups)”) are the most prototypical word types for the T1-T0 tone pattern. The tone pattern T2-T0 appears to have 兒子 (er2zi0, “son”) and 孩子 (hai2zi0, “child”) as prototypical members. Most tone patterns, however, are characterized by function words.[4]

Table 5:

The top two word types that have contextualized embeddings that are closest to the centroid embedding of the 20 tone patterns. Proximity is evaluated using Euclidean distance.

Tone pattern Closest word type Second-closest word type
T1-T0 她們 (ta1men0, “they”) 他們 (ta1men0, “they”)
T1-T1 今天 (jin1tian1, “today”) 當兵 (dang1bing1, “to serve in the army”)
T1-T2 當然 (dang1ran2, “of course”) 之前 (zhi1qian2, “before”)
T1-T3 剛好 (gang1hao3, “just right”) 開始 (kai1shi3, “to begin”)
T1-T4 之後 (zhi1hou4, “afterwards”) 之類 (zhi1lei4, “and so on”)
T2-T0 兒子 (er2zi0, “son”) 孩子 (hai2zi0, “child”)
T2-T1 人家 (ren2jia1, “others”) 國中 (guo2zhong1, “middle school”)
T2-T2 其實 (qi2shi2, “actually”) 別人 (bie2ren2, “others”)
T2-T3 還有 (hai2you3, “also”) 還好 (hai2hao3, “it’s okay”)
T2-T4 然後 (ran2hou4, “then”) 一樣 (yi2yang4, “the same”)
T3-T0 你們 (ni3men0, “you all”) 我們 (wo3men0, “we”)
T3-T1 很多 (hen3duo1, “many”) 女生 (nv3sheng1, “girls”)
T3-T2 起來 (qi3lai2, “get up”) 以前 (yi3qian2, “before”)
T3-T3 只有 (zhi3you3, “only”) 有點 (you3dian3, “a bit”)
T3-T4 以後 (yi3hou4, “afterwards”) 好像 (hao3xiang4, “seems like”)
T4-T0 這麼 (zhe4me0, “so”) 那麼 (na4me0, “that”)
T4-T1 那些 (na4xie1, “those”) 那邊 (na4bian1, “over there”)
T4-T2 個人 (ge4ren2, “individual”) 不然 (bu4ran2, “otherwise”)
T4-T3 一起 (yi1qi3, “together”) 那裡 (na4li3, “there”)
T4-T4 算是 (suan4shi4, “considered as”) 上面 (shang4mian4, “above”)
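The proximity ranking underlying Table 5 can be sketched as follows, assuming centroid vectors have already been computed; the function name and toy vectors are illustrative.

```python
import numpy as np

def closest_word_types(pattern_centroid, word_centroids, k=2):
    """Rank word types by the Euclidean distance between their centroid
    embeddings and a tone pattern's centroid; return the k nearest."""
    ranked = sorted(word_centroids,
                    key=lambda w: np.linalg.norm(word_centroids[w]
                                                 - pattern_centroid))
    return ranked[:k]

# toy 2-D "embeddings" for three word types and one pattern centroid
centroid = np.array([1.0, 1.0])
word_centroids = {"然後": np.array([1.1, 0.9]),
                  "一樣": np.array([0.7, 1.2]),
                  "以前": np.array([3.0, 3.0])}
print(closest_word_types(centroid, word_centroids))  # ['然後', '一樣']
```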

To obtain the pitch contours predicted for the tone patterns, we provided the centroids of the 20 tone patterns as input to the three linear mappings defined above. The resulting predicted pitch contours are shown in Figure 9. Each panel in this trellis graph presents the estimates for a given tone pattern. The gold-standard pitch contours (obtained with our GAMM models as described above) are presented in black, and the contours predicted by the three DLM mappings are color-coded. The contours obtained with method I are shown in blue, those obtained with method II in green, and those obtained with method III in red. For all three methods, the resulting predicted contours are similar, and often remarkably similar, to the gold-standard contours.

To assess the similarity between the gold-standard pitch contours and the pitch contours predicted with the meaning-to-pitch mappings, we first calculated the cosine similarity, averaged across the 20 tone patterns, between the gold-standard contours and each of the three DLM pitch contours. The contours from method II show a closer fit (cosine similarity 0.93) than the contours from methods I (cosine similarity 0.73) and III (cosine similarity 0.82). The mean correlation between the GAMM-predicted pitch contours and the three DLM-derived contours is 0.85, 0.98, and 0.96 for methods I, II, and III, respectively. However, when similarity is evaluated with Euclidean distance, method II scores slightly worse than the other two (1.24, 1.50, and 1.09, respectively). Figure 10 presents the distributions of these three measures for the three methods, using boxplots. The three methods perform with comparable accuracy, with a slight advantage for method II when evaluated with the correlation and cosine similarity measures.
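The three similarity measures can be sketched as follows, with an illustrative pair of contours. Note that correlation, unlike cosine similarity and Euclidean distance, is invariant to scaling and shifting of a contour, which is one reason the three measures can rank the methods differently.

```python
import numpy as np

def contour_similarity(gold, pred):
    """Pearson correlation, cosine similarity, and Euclidean distance
    between two pitch contours sampled at the same time points."""
    corr = float(np.corrcoef(gold, pred)[0, 1])
    cos = float(gold @ pred / (np.linalg.norm(gold) * np.linalg.norm(pred)))
    eucl = float(np.linalg.norm(gold - pred))
    return corr, cos, eucl

# illustrative contours: pred is a scaled, slightly shifted copy of gold
t = np.linspace(0.0, 1.0, 100)
gold = np.sin(2 * np.pi * t)
pred = 0.9 * gold + 0.05
corr, cos, eucl = contour_similarity(gold, pred)
print(round(corr, 3))  # 1.0: correlation ignores the scale and offset
```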

Figure 10: 
Boxplot showing the correlation, cosine similarity, and Euclidean distance between GAMM-predicted pitch contours and DLM-derived pitch contours across 20 tone patterns.

4.5 Summary

In this section, we have shown that a simple linear mapping can predict token-specific pitch contours from token-specific meanings in context with above-chance accuracy. This finding extends the earlier results of Chuang et al. (2025), which focused on disyllabic words with one specific tone pattern only (the rise-fall tone pattern T2-T4). Mapping accuracy for all 20 tone patterns is, unsurprisingly, somewhat lower than that observed by Chuang et al. (2025) for the T2-T4 tone pattern (30 %–40 % for training data and 27 %–35 % for testing data). Nevertheless, even for the present, more varied dataset, accuracy is substantially above the majority baseline. This result is surprising in light of the measurement noise that is inevitably present both in our pitch measurements and in the contextualized embeddings. The contextualized embeddings represent the knowledge of an artificial intelligence trained on vast amounts of text. The embeddings it generates must diverge from the meanings that the individual speakers had in mind. Nevertheless, the contextualized embeddings are sufficiently precise to enable far-above-chance prediction accuracy for word tokens’ pitch contours. Interestingly, the 20 canonical tone patterns are surprisingly well approximated by projecting the centroids of the contextualized embeddings of the words with these tone patterns into the f0 space. In other words, the average pitch contours identified by the GAMMs correspond to average contextualized embeddings in semantic space.

5 General discussion

The current study investigated the realization of pitch contours of disyllabic words in a corpus of spontaneous spoken Taiwan Mandarin. We first made use of Generalized Additive Mixed Models to decompose f0 contours into a series of nonlinear functions of normalized time, with each function representing the way in which a predictor modulates the pitch contour over time. A range of predictors was taken into account, including normalized time, gender, tonal context, tone pattern, speech rate, word position, bigram probability, and speaker. Surprisingly, the GAMMs provided strong support for word-specific modulations of the pitch contours. Replacing word by word sense further improved model fit, which suggests that the effect of word may be semantic in nature. If so, the theory of the Discriminative Lexicon Model predicts that the token-specific pitch contours observed in the corpus should be well approximated by predicted token-specific pitch contours obtained with mappings that take the contextualized embeddings of the words in the corpus as input and produce the corresponding pitch contours as output. We found that, indeed, a mapping from GPT-2-generated contextualized embeddings to 100-dimensional fixed-length pitch vectors predicts words’ pitch contours with accuracies far above a majority-choice baseline. Thus, our study successfully extends the meaning-to-pitch mapping from the T2-T4 tone pattern studied by Chuang et al. (2025) to all tone patterns in Taiwan Mandarin. Our study also dovetails well with the evidence for the importance of word and sense as codeterminants of pitch contours reported by Lu et al. (2024) and Jin et al. (2025).

A remarkable finding is that words and their meanings codetermine the realization of the f0 contours in disyllabic words with effect sizes that considerably exceed those of tone pattern. This finding for disyllabic words aligns with previous research on Mandarin monosyllabic words (Jin et al. 2025), which reported that while the effect of tone pattern on pitch contours is modest, the effect of word is substantial. For disyllabic words, the stronger effect of word largely overshadows the effect of tone pattern.

Our results suggest that there are not only remarkable similarities, but also some clear differences, between tonal realization in laboratory speech and tonal realization in the Corpus of Spoken Taiwan Mandarin. Xu (1997) analyzed the pitch contours of 16 bi-tonal combinations using the /ma-ma/ sequence. Among these combinations, only ma1ma1 corresponds to a real word in Mandarin, 媽媽 (ma1ma0, “mother”);[5] all the others are nonsensical combinations that are unnatural for native speakers. In that study, the f0 contours were carefully controlled, accounting for factors such as gender and speaking rate. Although laboratory speech and spontaneous speech differ in many ways, it is still informative to compare the two registers. We, therefore, reproduced Figure 3 from Xu (1997) (blue curves) and overlaid it with the DLM-derived f0 contours from Figure 9 (orange curves). In Figure 11, the pitch contours from Xu (1997) and our predicted contours are remarkably similar for several tone patterns, including T4-T4, T2-T3, and T2-T4. However, some tone patterns, such as T1-T1 and T1-T2, exhibit noticeable differences in pitch contours. These differences are likely due to dialect differences, differences between spontaneous speech and laboratory speech, and differences between meaningful and meaningless words.

Figure 11: 
The f0 contours for 16 tone patterns from the current study, based on the Corpus of Spoken Taiwan Mandarin, are compared with f0 contours from a previous study (Xu 1997) on carefully controlled laboratory speech. The blue curves represent the f0 contours for 16 tone patterns from the controlled laboratory speech, reproduced from Figure 3 in Xu (1997). The orange curves correspond to the three LDL-predicted pitch contours from Method I, II, and III, as shown in Figure 9, and are reproduced here. These f0 contours are overlaid for comparison. Since the neutral tone (T0) was not included in Xu (1997), only 16 bi-tonal combinations are presented here.

At this point, it might be objected that in this approach to Mandarin tone, it is unclear how tone sandhi could be accounted for. How would it be possible that, if indeed every word has its own pitch contour, all words with the T3-T3 tone pattern undergo the same phonological process, such that they become indistinguishable from words with the T2-T3 tone pattern? Our answer to this question has an empirical part and a theoretical part.

On the empirical side, in conversational Taiwan Mandarin, the two tone patterns are basically identical. For instance, in Figure 9, the tone patterns for T2-T3 and T3-T3 are quite similar. A detailed study of this tone sandhi (Lu et al. 2024) supports complete neutralization for Taiwan Mandarin. In other words, the words with the T3-T3 tone pattern can simply be reclassified as words with the T2-T3 tone pattern. There is no need to call on a rule of tone sandhi. In fact, even for Standard Chinese, as gauged by Xu (1997), the differences between T3-T3 and T2-T3 are hardly visible to the eye. However, T3-T3 tone sandhi has been reported to be incomplete for Standard Chinese (Yuan and Chen 2014).

This brings us to the theoretical aspect of the question raised above, namely, how to account for tone sandhi processes in general. Within the framework of the Discriminative Lexicon model, as a model for highly automatized lexical processing, it is impossible to derive forms from forms, a method widely used in educational settings. Forms are predicted from meanings. Importantly, Figures 9 and 11 show that the tone contours associated with tone patterns emerge straightforwardly from the meaning-to-form mapping, without the model ever being informed about tone patterns. In other words, a “word and paradigm” approach (Blevins 2016; Heitmeier et al. 2021) to tonal realization appears to be quite feasible.

A question for further research is how to interpret the present findings for tone patterns with the neutral tone. As can be seen in Figure 9, for three of the four tone patterns, the overall pitch contour appears to be an almost linearly descending pitch contour. This could be viewed as another instance of tone sandhi in classical phonology, whereas within the DLM, this patterning would follow from words’ contextual meanings. We leave this question for further research.

Another question for future research is the role of prosodic hierarchical structure, as documented for Mandarin in studies such as Chen (2022) and Liu et al. (2016). The control variable norm_utt_pos is designed as a coarse-grained proxy for positional effects that correlate with some aspects of sentence intonation in spontaneous speech. Undoubtedly, this measure is too simple to capture the more subtle effects of prosodic structure that have been reported in the literature. However, given that the variable importance of norm_utt_pos, evaluated with the ΔAIC, is relatively small compared to that of speaker and word, it is unlikely that including further prosodic variables would lead to very different conclusions. Corpus-based research on the consequences of these finer details of hierarchical prosodic structure for tonal realization in spontaneous speech will need to find reliable ways of assigning prosodic trees to utterances characterized by false starts, hesitations, as well as segment and syllable deletions (Cheng and Xu 2015; Ernestus 2000; Johnson 2004), and prosodic effects observed for careful speech may not generalize to colloquial speech (for register-determined prosodic variation, see Biber and Conrad 2019). A possibility that requires further exploration is that aspects of prosodic hierarchies may well be captured by contextualized embeddings, which would imply that the predictivity of embeddings for f0 contours is partially grounded in prosodic structure (see also Orzechowska and Baayen 2025 for evidence that phonotactic structure is predictable from embeddings). However, since contextualized embeddings cluster strongly by word meaning (see Figure 8), their predictivity for a word’s pitch contour is most likely driven by semantics.

The results obtained in the present study have several theoretical implications. First, we have documented that the mapping from context-sensitive meaning to pitch contours is machine-learnable. It remains an open question whether human learners also generate pitch contours from semantics. The finding that just a linear mapping (from a statistical perspective, a straightforward multivariate multiple regression model) is all that is needed suggests that human speakers should also be able to learn this simple mapping between meaning and form. Importantly, our results are based on patterns of usage in a corpus of unscripted spontaneous speech, and the mere existence of these patterns indicates that language users must be absorbing community norms for tonal realization, albeit most likely subliminally.

Second, our findings challenge the axioms of the arbitrariness of the sign and the dual articulation of language. If the relation between form and meaning were truly and fundamentally arbitrary, learning words and their meanings would be extremely difficult and would not allow any generalization. All one could do is learn by heart that a form x is associated with a meaning y. However, our simple linear mapping generalizes to held-out data. This falsifies the axiom that the relation between form and meaning (here, pitch and meaning) is completely arbitrary.

Third, preliminary results, reported in Chuang et al. (2023), suggest that English two-syllable words with left stress also have pitch contours that have strong word-specific pitch components. The study by Schmitz et al. (2025) reports similar findings for English three-constituent compounds. The accumulating evidence poses a new challenge for linguistics: understanding why these isomorphies between form and meaning exist, irrespective of whether a language is a tone language or a stress language.


Corresponding author: Yuxin Lu, Quantitative Linguistics, Eberhard Karls Universität Tübingen, Tübingen, Germany, E-mail:

Award Identifier / Grant number: Grant SUBLIMINAL (#101054902)

Acknowledgments

The authors thank Yu-Hsiang Tseng for identifying word sense for the dataset in the current manuscript.

Research funding: This work was supported by the European Research Council under Grant SUBLIMINAL (#101054902) awarded to R. Harald Baayen.

Appendix 1: Dataset and model summary

Table A.1:

Overview of the four subdatasets grouped by the four tonal contexts, with the number of word types for each tone pattern.

Tone pattern Tonal context
4.4 3.4 4.1 4.0
T1-T0 5 5 5 3
T1-T1 12 12 9 10
T1-T2 17 14 16 10
T1-T3 6 6 4 4
T1-T4 23 19 18 22
T2-T0 7 6 6 4
T2-T1 11 9 9 11
T2-T2 14 9 11 8
T2-T3 14 12 13 12
T2-T4 23 19 22 12
T3-T0 4 3 4 4
T3-T1 10 8 6 7
T3-T2 16 12 11 12
T3-T3 11 10 9 8
T3-T4 18 15 17 11
T4-T0 3 4 3 1
T4-T1 12 11 14 12
T4-T2 25 21 18 16
T4-T3 11 10 10 7
T4-T4 46 35 45 36
Total 288 240 250 210
Table A.2:

Summary of the model fitted with word for the 4.4 tonal context dataset.

A. Parametric coefficients Estimate Std. error t-Value p-Value
(Intercept) 5.2983 0.0253 209.8128 <0.0001
gendermale −0.5381 0.0339 −15.8936 <0.0001

B. Smooth terms edf Ref.df F-value p-Value

s(normalized_t):genderfemale 2.5049 2.7690 10.0326 <0.0001
s(normalized_t):gendermale 1.1436 1.1879 6.7627 0.0054
s(speaker) 51.0171 53.0000 70.5020 <0.0001
s(normalized_t,word) 1,826.1279 2,592.0000 6.8053 <0.0001
s(normalized_t,tone_pattern) 113.4202 179.0000 2.4614 <0.0001
s(duration):genderfemale 2.3380 2.6000 7.2846 0.0001
s(duration):gendermale 2.4935 2.7500 18.6290 <0.0001
ti(normalized_t,duration) 7.4861 8.2802 25.7720 <0.0001
s(norm_utt_pos) 1.7466 2.1185 95.1552 <0.0001
ti(normalized_t,norm_utt_pos) 6.2524 7.6596 5.8497 <0.0001
s(bg_prob_prev) 2.8614 2.9779 31.6669 <0.0001
ti(normalized_t,bg_prob_prev) 5.2394 6.4955 2.4577 0.0198
s(bg_prob_fol) 1.3373 1.5788 4.8197 0.0102
ti(normalized_t,bg_prob_fol) 7.8502 8.6232 10.2035 <0.0001
Table A.3:

Summary of the model fitted with word for the 3.4 tonal context dataset.

A. Parametric coefficients Estimate Std. error t-Value p-Value
(Intercept) 5.2976 0.0238 222.5837 <0.0001
gendermale −0.5188 0.0306 −16.9328 <0.0001

B. Smooth terms edf Ref.df F-value p-Value

s(normalized_t):genderfemale 2.5660 2.6813 5.0659 0.0155
s(normalized_t):gendermale 2.0481 2.2102 0.5644 0.6399
s(speaker) 49.8943 54.0000 24.1642 <0.0001
s(normalized_t,word) 1,426.5404 2,160.0000 5.7964 <0.0001
s(normalized_t,tone_pattern) 95.2819 179.0000 1.4428 <0.0001
s(duration):genderfemale 1.0004 1.0008 0.0020 0.9689
s(duration):gendermale 1.0005 1.0010 18.3616 <0.0001
ti(normalized_t,duration) 8.6397 8.9350 20.9207 <0.0001
s(norm_utt_pos) 1.0005 1.0010 146.3152 <0.0001
ti(normalized_t,norm_utt_pos) 6.5969 7.7844 4.2409 <0.0001
s(bg_prob_prev) 2.5411 2.8021 16.2505 <0.0001
ti(normalized_t,bg_prob_prev) 8.5273 8.8687 26.7273 <0.0001
s(bg_prob_fol) 1.0190 1.0360 8.3620 0.0035
ti(normalized_t,bg_prob_fol) 2.3449 2.6936 3.4017 0.0134
Table A.4:

Summary of the model fitted with word for the 4.1 tonal context dataset.

A. Parametric coefficients Estimate Std. error t-Value p-Value
(Intercept) 5.2727 0.0257 204.9588 <0.0001
gendermale −0.4900 0.0331 −14.8242 <0.0001

B. Smooth terms edf Ref.df F-value p-Value

s(normalized_t):genderfemale 1.0008 1.0010 14.5966 0.0001
s(normalized_t):gendermale 2.7108 2.9174 5.0698 0.0050
s(speaker) 50.3897 54.0000 28.2290 <0.0001
s(normalized_t,word) 1,517.7123 2,250.0000 5.6099 <0.0001
s(normalized_t,tone_pattern) 90.4451 179.0000 1.3437 <0.0001
s(duration):genderfemale 1.7741 2.1060 11.7587 <0.0001
s(duration):gendermale 2.3729 2.7157 9.2331 <0.0001
ti(normalized_t,duration) 6.7314 7.7618 13.9824 <0.0001
s(norm_utt_pos) 1.0009 1.0018 30.1652 <0.0001
ti(normalized_t,norm_utt_pos) 7.2456 8.3005 2.1423 0.0149
s(bg_prob_prev) 2.3510 2.6877 12.6272 <0.0001
ti(normalized_t,bg_prob_prev) 4.4876 5.7934 2.9332 0.0079
s(bg_prob_fol) 2.4577 2.7678 2.5053 0.0385
ti(normalized_t,bg_prob_fol) 7.9226 8.6620 7.9004 <0.0001
Table A.5:

Summary of the model fitted with word for the 4.0 tonal context dataset.

A. Parametric coefficients Estimate Std. error t-Value p-Value
(Intercept) 5.2497 0.0303 173.3027 <0.0001
gendermale −0.4817 0.0403 −11.9586 <0.0001

B. Smooth terms edf Ref.df F-value p-Value

s(normalized_t):genderfemale 2.9382 2.9859 34.0398 <0.0001
s(normalized_t):gendermale 1.0005 1.0007 5.1790 0.0229
s(speaker) 49.0065 52.0000 29.1850 <0.0001
s(normalized_t,word) 1,250.8152 1,890.0000 6.4572 <0.0001
s(normalized_t,tone_pattern) 94.0207 179.0000 1.4661 <0.0001
s(duration):genderfemale 2.2768 2.6514 2.9000 0.0321
s(duration):gendermale 2.8663 2.9781 9.3139 <0.0001
ti(normalized_t,duration) 7.6599 8.5313 8.0197 <0.0001
s(norm_utt_pos) 2.2454 2.5976 39.0812 <0.0001
ti(normalized_t,norm_utt_pos) 7.0805 8.1311 11.6934 <0.0001
s(bg_prob_prev) 2.2296 2.5837 11.8605 <0.0001
ti(normalized_t,bg_prob_prev) 5.7542 6.9355 4.6567 <0.0001
s(bg_prob_fol) 1.0126 1.0226 9.3906 0.0021
ti(normalized_t,bg_prob_fol) 5.8866 7.1033 4.2724 0.0001
Table A.6:

Summary of the model with sense_type for the 4.4 tonal context dataset.

A. Parametric coefficients Estimate Std. error t-Value p-Value
(Intercept) 5.2908 0.0255 207.6907 <0.0001
gendermale −0.5265 0.0338 −15.5930 <0.0001

B. Smooth terms edf Ref.df F-value p-Value

s(normalized_t):genderfemale 1.9327 2.1244 9.0877 0.0001
s(normalized_t):gendermale 1.8634 2.0753 4.9456 0.0063
s(speaker) 50.7258 53.0000 57.8277 <0.0001
s(normalized_t,sense_type) 1,643.4567 2,394.0000 6.1366 <0.0001
s(normalized_t,tone_pattern) 115.3360 179.0000 2.4197 <0.0001
s(duration):genderfemale 2.8641 2.9780 15.3587 <0.0001
s(duration):gendermale 1.0016 1.0032 46.1671 <0.0001
ti(normalized_t,duration) 8.0854 8.7497 19.0268 <0.0001
s(norm_utt_pos) 2.3383 2.6936 51.8508 <0.0001
ti(normalized_t,norm_utt_pos) 5.2953 6.7676 4.4134 0.0001
s(bg_prob_prev) 2.8716 2.9790 25.0855 <0.0001
ti(normalized_t,bg_prob_prev) 5.5277 6.7714 2.7595 0.0085
s(bg_prob_fol) 1.5652 1.8939 8.2246 0.0003
ti(normalized_t,bg_prob_fol) 7.8475 8.6188 12.1362 <0.0001
Table A.7:

Summary of the model with sense_type for the 3.4 tonal context dataset.

A. Parametric coefficients Estimate Std. error t-Value p-Value
(Intercept) 5.3160 0.0242 219.4006 <0.0001
gendermale −0.5285 0.0318 −16.6345 <0.0001

B. Smooth terms edf Ref.df F-value p-Value

s(normalized_t):genderfemale 2.7729 2.9280 11.5416 <0.0001
s(normalized_t):gendermale 1.0003 1.0004 0.2905 0.5900
s(speaker) 49.2847 53.0000 22.9455 <0.0001
s(normalized_t,sense_type) 1,304.8474 1,980.0000 5.5029 <0.0001
s(normalized_t,tone_pattern) 91.7902 179.0000 1.3253 <0.0001
s(duration):genderfemale 1.0019 1.0035 2.5510 0.1098
s(duration):gendermale 2.5340 2.8234 6.4906 0.0014
ti(normalized_t,duration) 8.1763 8.7515 18.8380 <0.0001
s(norm_utt_pos) 2.6258 2.8772 46.0702 <0.0001
ti(normalized_t,norm_utt_pos) 7.0479 8.1229 4.8243 <0.0001
s(bg_prob_prev) 2.4103 2.7048 12.0135 <0.0001
ti(normalized_t,bg_prob_prev) 8.5543 8.8700 33.9299 <0.0001
s(bg_prob_fol) 1.0013 1.0024 8.6491 0.0033
ti(normalized_t,bg_prob_fol) 2.3490 2.6950 3.0126 0.0202
Table A.8:

Summary of the model with sense_type for the 4.1 tonal context dataset.

A. Parametric coefficients Estimate Std. error t-Value p-Value
(Intercept) 5.2811 0.0265 199.0773 <0.0001
gendermale −0.4818 0.0333 −14.4546 <0.0001

B. Smooth terms edf Ref.df F-value p-Value

s(normalized_t):genderfemale 1.0005 1.0006 10.5033 0.0012
s(normalized_t):gendermale 2.5035 2.8049 5.7086 0.0046
s(speaker) 49.4701 53.0000 25.2971 <0.0001
s(normalized_t,sense_type) 1,387.4472 2,052.0000 5.3175 <0.0001
s(normalized_t,tone_pattern) 91.5630 179.0000 1.2863 <0.0001
s(duration):genderfemale 1.7694 2.0984 3.1271 0.0492
s(duration):gendermale 2.4841 2.7819 9.3805 0.0001
ti(normalized_t,duration) 7.0274 8.0034 11.9947 <0.0001
s(norm_utt_pos) 1.0006 1.0011 21.6378 <0.0001
ti(normalized_t,norm_utt_pos) 3.5709 4.9092 1.8554 0.0967
s(bg_prob_prev) 2.3064 2.6443 4.7008 0.0052
ti(normalized_t,bg_prob_prev) 6.0423 7.3314 4.0254 0.0002
s(bg_prob_fol) 2.6997 2.9104 4.0497 0.0068
ti(normalized_t,bg_prob_fol) 7.3377 8.3136 3.8195 0.0001
Table A.9:

Summary of the model with sense_type for the 4.0 tonal context dataset.

A. Parametric coefficients Estimate Std. error t-Value p-Value
(Intercept) 5.2554 0.0323 162.6305 <0.0001
gendermale −0.4787 0.0421 −11.3827 <0.0001

B. Smooth terms edf Ref.df F-value p-Value

s(normalized_t):genderfemale 2.9464 2.9870 33.5275 <0.0001
s(normalized_t):gendermale 1.0010 1.0014 6.8630 0.0088
s(speaker) 48.1211 53.0000 23.2266 <0.0001
s(normalized_t,sense_type) 993.7681 1,539.0000 5.3101 <0.0001
s(normalized_t,tone_pattern) 87.2132 179.0000 1.1985 <0.0001
s(duration):genderfemale 2.4536 2.7740 3.2343 0.0203
s(duration):gendermale 2.8016 2.9561 6.8539 0.0001
ti(normalized_t,duration) 7.8260 8.6169 5.2433 <0.0001
s(norm_utt_pos) 1.7137 2.0551 51.4802 <0.0001
ti(normalized_t,norm_utt_pos) 7.2760 8.3101 10.0548 <0.0001
s(bg_prob_prev) 2.3109 2.6472 12.8859 <0.0001
ti(normalized_t,bg_prob_prev) 6.8041 7.8499 6.0670 <0.0001
s(bg_prob_fol) 1.3832 1.6100 4.3188 0.0147
ti(normalized_t,bg_prob_fol) 2.1106 2.4571 0.5490 0.4627

Appendix 2: The effects of segments and word frequency

Our findings demonstrate that the word itself is a strong predictor of pitch contours in disyllabic words. However, one might question whether this robust effect is at least partially influenced by segmental properties, given existing evidence on the impact of vowel height and onset consonants on Mandarin tones (Ho 1976b; Ladd and Silverman 1984; Ohala and Eukel 1976; Whalen and Levitt 1995). Additionally, lexical frequency has long been recognized as a factor influencing f0 contours, with lower-frequency words being produced with higher pitch (Zhao and Jurafsky 2007). To address these concerns, additional analyses are presented that clarify the effects of words’ segmental composition and lexical frequency on pitch contours.

We first consider words’ segmental properties. Following Chuang et al. (2025), we coded four segmental predictors for our disyllabic words. vowel1_height and vowel2_height are the vowel heights of the first and second syllable, respectively; each has five levels: “high,” “mid,” “low,” “low-high,” and “mid-high.” onset1_type and onset2_type are the types of the onset consonants of the first and second syllable, respectively; each has seven levels: “aspirated-affricate,” “aspirated-stop,” “unaspirated-affricate,” “unaspirated-stop,” “voiceless-fricative,” “voiced,” and “null.”
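To illustrate, this coding can be thought of as a simple lookup from syllable onsets and vowels to categories. The mappings below are a minimal hypothetical sketch covering only a few segments, not the actual scheme of Chuang et al. (2025):

```python
# Hypothetical sketch of the segmental coding described above. The example
# mappings are illustrative only; the full scheme distinguishes 5 vowel-height
# and 7 onset-type levels.

VOWEL_HEIGHT = {"i": "high", "e": "mid", "a": "low"}
ONSET_TYPE = {
    "t": "unaspirated-stop",
    "th": "aspirated-stop",
    "s": "voiceless-fricative",
    "m": "voiced",
    "": "null",
}

def code_word(syllable1, syllable2):
    """Return the four segmental predictors for a disyllabic word.

    Each syllable is an (onset, vowel) pair.
    """
    return {
        "onset1_type": ONSET_TYPE[syllable1[0]],
        "vowel1_height": VOWEL_HEIGHT[syllable1[1]],
        "onset2_type": ONSET_TYPE[syllable2[0]],
        "vowel2_height": VOWEL_HEIGHT[syllable2[1]],
    }
```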

As a first step, we established a baseline model that included gender, tonal context, duration, speaker, utterance position, and bigram probability but excluded tone pattern and word. To simplify the analysis, this baseline model was built on an omnibus dataset combining all four tonal contexts. The model specification is as follows:

logf0 ∼ gender +
s(normalized_t, by=gender, k=4) +
s(speaker, bs="re") +
s(tonal_context, bs="re") +
s(duration, by=gender, k=4) +
ti(normalized_t, duration, k=c(4,4)) +
s(norm_utt_pos, k=4) +
ti(normalized_t, norm_utt_pos, k=c(4,4)) +
s(bg_prob_prev, k=4) +
ti(normalized_t, bg_prob_prev, k=c(4,4)) +
s(bg_prob_fol, k=4) +
ti(normalized_t, bg_prob_fol, k=c(4,4))

To examine the effects of the segmental predictors, four factor smooth terms were added to the model: one for vowel1_height, one for vowel2_height, one for onset1_type, and one for onset2_type. These four predictors were added to the baseline model together.

baseline + … +
s(normalized_t, vowel1_height, bs="fs", m=1) +
s(normalized_t, vowel2_height, bs="fs", m=1) +
s(normalized_t, onset1_type, bs="fs", m=1) +
s(normalized_t, onset2_type, bs="fs", m=1)

To evaluate the effect size of the additional predictors, we added them individually or in combination to the baseline model. Table A.10 presents the improvements in model fit relative to the baseline model, as measured by the decrease in AIC.

Table A.10:

Improvement of model fit measured by AIC reduction.

Description Model AIC ΔAIC
Baseline Baseline −605,715.1

Tone pattern vs. word

Add tone pattern Baseline + tone pattern −620,928.6 −15,213.55
Add word Baseline + word −641,469.5 −35,754.39
Add both Baseline + tone pattern + word (best fit) −659,371.8 −53,656.71

Segments vs. word

Add segments (w/o word) Baseline + tone pattern + four segmental predictors −643,447.6 −37,732.48

Frequency vs. word

Add frequency (w/o word) Baseline + tone pattern + frequency −638,308.0 −32,592.89
Add frequency (w/ word) Baseline + tone pattern + frequency + word −659,376.7 −53,661.59

Complementing the evaluation of variable importance used previously, we here present the reduction in AIC when a predictor or set of predictors is added to a model. As can be seen in Table A.10, adding word to the baseline model yielded a substantially better fit (ΔAIC: −35,754.39) than adding tone_pattern (ΔAIC: −15,213.55). Including both tone_pattern and word produced the best-fitting GAMM, with an AIC reduction of 53,656.71 units.
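The ΔAIC values in Table A.10 are plain differences between each model’s AIC and the baseline AIC, as the following sketch with the table’s values illustrates:

```python
# Delta-AIC relative to the baseline model, using the AIC values from Table A.10.
baseline_aic = -605_715.1

model_aic = {
    "tone_pattern": -620_928.6,
    "word": -641_469.5,
    "tone_pattern + word": -659_371.8,
}

# Negative delta-AIC means a better fit than the baseline.
delta_aic = {model: aic - baseline_aic for model, aic in model_aic.items()}

best = min(delta_aic, key=delta_aic.get)  # "tone_pattern + word"
```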

The baseline + tone pattern + word model has a lower AIC than the baseline + tone pattern + four segmental predictors model, suggesting that word is a more powerful predictor than the four segmental predictors combined. However, although baseline + tone pattern + four segmental predictors + word has the lowest AIC, concurvity scores are all 1 for the four segmental predictors and 0.21 for word, indicating that the segmental predictors are highly collinear with word; specifying both in the same model therefore yields a model whose partial effects are difficult to assess.

Next, we consider word frequency as an additional predictor. In what follows, frequency represents the log-transformed count of occurrences of a word type in the entire spoken corpus of Taiwan Mandarin. To examine the effect of frequency, we added a smooth term for frequency, together with its interaction with normalized_t, to the baseline + tone pattern model.

baseline + … +
s(frequency, k=4) +
ti(normalized_t, frequency, k=c(4,4))
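Since frequency is simply the natural logarithm of a word type’s corpus count, equal count ratios translate into equal additive differences on the frequency scale. A minimal sketch with hypothetical counts:

```python
import math

# Hypothetical corpus counts for three word types; in the study, the counts
# come from the entire Taiwan Mandarin spoken corpus.
counts = {"word_a": 2048, "word_b": 512, "word_c": 7}

# frequency = log-transformed count, as entered into the smooth term above.
frequency = {word: math.log(count) for word, count in counts.items()}

# A fourfold count ratio yields a constant additive difference of log(4).
gap = frequency["word_a"] - frequency["word_b"]
```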

As can be seen in Table A.10, the baseline + tone pattern + word model has a lower AIC than the baseline + tone pattern + frequency model. When frequency was added on top of the best-fitting GAMM, it resulted in a further AIC decrease of a mere 4.88 units. However, in this model, concurvity was very high for frequency (0.99) and low for word (0.21), and the smooth term s(frequency) was not well supported (p = 0.0632).

Higher-frequency words tend to be realized with more reduction than lower-frequency words (Cheng and Xu 2015; Meunier and Espesser 2011). This raises the possibility that higher-frequency words have less clear tonal realizations than lower-frequency words; in other words, low-frequency words would be more likely to be realized with their canonical tone patterns. To test this hypothesis, we conducted an additional analysis on the low-frequency words that we had previously excluded because they had fewer than five tokens each. This resulted in a new dataset with 181 word types and 447 tokens. We grouped these words by their canonical tone pattern and replaced their word identifiers with a novel “word” identifier marking them as low-frequency words with one of the 20 tone patterns. We added this new dataset to the original dataset and fitted the same GAMM as outlined in Section 3.4.1. If the lowest-frequency words are indeed words with little or no tonal reduction, the prediction is that their canonical tone patterns will be detected by the GAMM.
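The relabeling step can be sketched as follows; the word identifiers, counts, and pooled label format are hypothetical:

```python
# Pool low-frequency words (fewer than five tokens) into one pseudo-word
# per canonical tone pattern. Word identifiers, counts, and the "LOWFREQ_"
# label format are hypothetical illustrations.

CANONICAL_TONE = {"wordX": "T3-T4", "wordY": "T3-T4", "wordZ": "T1-T0"}

def pool_low_frequency(tokens, type_counts, threshold=5):
    """Replace the identifier of each low-frequency token by a pooled label."""
    return [
        f"LOWFREQ_{CANONICAL_TONE[w]}" if type_counts[w] < threshold else w
        for w in tokens
    ]

relabeled = pool_low_frequency(
    ["wordX", "wordY", "wordZ"],
    {"wordX": 2, "wordY": 3, "wordZ": 12},
)
# wordX and wordY now share a single "word" level (LOWFREQ_T3-T4);
# wordZ keeps its own identifier.
```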

However, as shown in Figure A.1, the GAMM revealed that these low-frequency words in spoken Mandarin also do not exhibit pitch contours that resemble their canonical tone patterns. For many low-frequency tone patterns, there is no effect whatsoever (e.g., all words with a neutral tone on the second syllable). Where some effects are visible, e.g., for the T3-T4 pattern, this may well be an average of the tonal signatures of the different words that we treated – ex hypothesi – as words with the same canonical tone pattern. As a similar result has been observed for T1, T3, and T2 in low-frequency monosyllables as well (see Jin et al. 2025), we conclude that the word-specific f0 contours are highly unlikely to be the consequence of tonal reduction.

Figure A.1:
The partial effect of tone patterns from the GAMM fitted to the dataset including low-frequency words.

References

Adelman, James S., Zachary Estes & Martina Cossu. 2018. Emotional sound symbolism: Languages rapidly signal valence via phonemes. Cognition 175. 122–130. https://doi.org/10.1016/j.cognition.2018.02.007.Search in Google Scholar

Arnold, Denis & Fabian Tomaschek. 2016. The Karl Eberhards corpus of spontaneously spoken southern German in dialogues — Audio and articulatory recordings. In Christoph Draxler & Felicitas Kleber (eds.), Tagungsband der 12. tagung phonetik und phonologie im deutschsprachigen raum, 10–13.Search in Google Scholar

Arnon, Inbal & Uriel Cohen Priva. 2014. Time and again: The changing effect of word and multiword frequency on phonetic duration for highly frequent sequences. The Mental Lexicon 9(3). 377–400. https://doi.org/10.1075/ml.9.3.01arn.Search in Google Scholar

Baayen, R. Harald, Yu-Ying Chuang, Elnaz Shafaei-Bajestan & James P. Blevins. 2019. The discriminative lexicon: A unified computational model for the lexicon and lexical processing in comprehension and production grounded not in (de)composition but in linear discriminative learning. Complexity 2019. 4895891. https://doi.org/10.1155/2019/4895891.Search in Google Scholar

Biber, Douglas & Susan Conrad. 2019. Register, genre, and style. Cambridge: Cambridge University Press.10.1017/9781108686136Search in Google Scholar

Blevins, James P. 2016. Word and paradigm morphology. Oxford: Oxford University Press.10.1093/acprof:oso/9780199593545.001.0001Search in Google Scholar

Boersma, Paul & David Weenink. 2020. Praat: Doing phonetics by computer [Computer program] Version 6.0.37, released 2018. http://www.praat.org/.Search in Google Scholar

Bojanowski, Piotr, Edouard Grave, Armand Joulin & Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5. 135–146. https://doi.org/10.1162/tacl_a_00051.Search in Google Scholar

Bremner, Andrew J., Serge Caparos, Jules Davidoff, Jan De Fockert, Karina J. Linnell & Charles Spence. 2013. “Bouba” and “Kiki” in Namibia? A remote culture make similar shape–sound matches, but different shape–taste matches to Westerners. Cognition 126(2). 165–172. https://doi.org/10.1016/j.cognition.2012.09.007.Search in Google Scholar

Bruni, Elia, Nam Khanh Tran & Marco Baroni. 2014. Multimodal distributional semantics. Journal of Artificial Intelligence Research 49. 1–47. https://doi.org/10.1613/jair.4135.Search in Google Scholar

Chang, Chiung-Yun. 2010. Dialect differences in the production and perception of Mandarin Chinese tones. The Ohio State University dissertation.Search in Google Scholar

Chao, Yuen Ren. 1968. A grammar of spoken Chinese. Berkeley: University of California Press.Search in Google Scholar

Chen, Yiya. 2022. Mind the subtle f0 modifications: The interaction of tone and intonation in Sinitic varieties. Stellenbosch Papers in Linguistics Plus 62(2). 113–136. https://doi.org/10.5842/62-2-904.Search in Google Scholar

Cheng, Chierh & Yi Xu. 2015. Mechanism of disyllabic tonal reduction in Taiwan Mandarin. Language and Speech 58(3). 281–314. https://doi.org/10.1177/0023830914543286.Search in Google Scholar

Chuang, Yu-Ying, R. Harald Baayen & Melanie J. Bell. 2023. Do words sing their own tunes? Word-specific pitch realizations in Mandarin and English. In Proceedings of the 20th international congress of phonetic sciences [ICPhS] 2023.Search in Google Scholar

Chuang, Yu-Ying, Melanie J. Bell, Yu-Hsiang Tseng & R. Harald Baayen. 2025. Word-specific tonal realizations in Mandarin. Accepted for publication in Language. https://arxiv.org/abs/2405.07006.Search in Google Scholar

Chung, Karen Steffen. 2006. Contraction and backgrounding in Taiwan Mandarin. Concentric: Studies in Linguistics 32(1). 69–88. http://ntur.lib.ntu.edu.tw//handle/246246/204805.Search in Google Scholar

Ćwiek, Aleksandra, Susanne Fuchs, Christoph Draxler, Eva Liina Asu, Dan Dediu, Katri Hiovain, Shigeto Kawahara, Sofia Koutalidis, Manfred Krifka, Pärtel Lippus, Gary Lupyan, Grace E. Oh, Jing Paul, Caterina Petrone, Rachid Ridouane, Sabine Reiter, Nathalie Schümchen, Ádám Szalontai, Özlem Ünal-Logacev, Jochen Zeller, Marcus Perlman & Bodo Winter. 2022. The bouba/kiki effect is robust across cultures and writing systems. Philosophical Transactions of the Royal Society B 377(1841). 20200390. https://doi.org/10.1098/rstb.2020.0390.Search in Google Scholar

De Saussure, Ferdinand. 1966. Course in general linguistics. New York: McGraw-Hill.Search in Google Scholar

de Varda, Andrea Gregor & Marco Marelli. 2025. Cracking arbitrariness: A data-driven study of auditory iconicity in spoken English. Psychonomic Bulletin & Review. 1–18. https://doi.org/10.3758/s13423-024-02630-0.Search in Google Scholar

Dingemanse, Mark, Will Schuerman, Eva Reinisch, Sylvia Tufvesson & Holger Mitterer. 2016. What sound symbolism can and cannot do: Testing the iconicity of ideophones from five languages. Language 92(2). e117–e133. https://doi.org/10.1353/lan.2016.0034.Search in Google Scholar

Drager, Katie K. 2011. Sociophonetic variation and the lemma. Journal of Phonetics 39(4). 694–707. https://doi.org/10.1016/j.wocn.2011.08.005.Search in Google Scholar

Ernestus, Mirjam. 2000. Voice assimilation and segment reduction in casual Dutch. A corpus-based study of the phonology-phonetics interface. Utrecht: LOT.Search in Google Scholar

Fon, Janice. 2004. A preliminary construction of Taiwan Southern Min spontaneous speech corpus. Tech. rep. NSC-92-2411-H-003-050. Taiwan: National Science Council.Search in Google Scholar

Fon, Janice & Wen-Yu Chiang. 1999. What does Chao have to say about tones? A case study of Taiwan Mandarin. Journal of Chinese Linguistics. 13–37.Search in Google Scholar

Gahl, Susanne. 2008. Time and thyme are not homophones: The effect of lemma frequency on word durations in spontaneous speech. Language 84(3). 474–496. https://doi.org/10.1353/lan.0.0035.Search in Google Scholar

Gahl, Susanne & R. Harald Baayen. 2024. Time and thyme again: Connecting English spoken word duration to models of the mental lexicon. Language 100(4). 623–670. https://doi.org/10.1353/lan.2024.a947037.Search in Google Scholar

Godfrey, John J., Edward C. Holliman & Jane McDaniel. 1992. Switchboard: Telephone speech corpus for research and development. In Proceedings of ICASSP-92: 1992 IEEE international conference on acoustics, speech, and signal processing, Vol. 1, 517–520.10.1109/ICASSP.1992.225858Search in Google Scholar

Günther, Fritz, Luca Rinaldi & Marco Marelli. 2019. Vector-space models of semantic representation from a cognitive perspective: A discussion of common misconceptions. Perspectives on Psychological Science 14(6). 1006–1033. https://doi.org/10.1177/1745691619861372.Search in Google Scholar

Harris, Zellig S. 1954. Distributional structure. Word 10(2–3). 146–162. https://doi.org/10.1080/00437956.1954.11659520.Search in Google Scholar

Heitmeier, Maria, Yu-Ying Chuang & R. Harald Baayen. 2021. Modeling morphology with linear discriminative learning: Considerations and design choices. Frontiers in Psychology 12. 720713. https://doi.org/10.3389/fpsyg.2021.720713.Search in Google Scholar

Heitmeier, Maria, Yu-Ying Chuang & R. Harald Baayen. 2025. The discriminative lexicon: Theory and implementation in the Julia package JudiLing. Cambridge: Cambridge University Press, in press.Search in Google Scholar

Heitmeier, Maria, Valeria Schmidt, Hendrik Lensch & R. Harald Baayen. 2024. Is deeper always better? Replacing linear mappings with deep learning networks in the discriminative lexicon model. To appear in Linguistics Vanguard. https://arxiv.org/abs/2410.04259.10.1515/lingvan-2024-0192Search in Google Scholar

Ho, Aichen Ting. 1976a. Mandarin tones in relation to sentence intonation and grammatical structure. Journal of Chinese Linguistics. 1–13.Search in Google Scholar

Ho, Aichen Ting. 1976b. The acoustic variation of Mandarin tones. Phonetica 33(5). 353–367. https://doi.org/10.1159/000259792.Search in Google Scholar

Hsieh, Po-Jen. 2013. Prosodic markings of semantic predictability in Taiwan Mandarin. In Proceedings of interspeech 2013, 553–557.10.21437/Interspeech.2013-154Search in Google Scholar

Hsieh, Shu-Kai, Yu-Hsiang Tseng, Hsin-Yu Chou, Ching-Wen Yang & Yu-Yun Chang. 2024. Resolving regular polysemy in named entities. http://arxiv.org/abs/2401.09758.Search in Google Scholar

Huang, Karen. 2018. Phonological identity of the neutral-tone syllables in Taiwan Mandarin: An acoustic study. Acta Linguistica Asiatica 8(2). 9–50. https://doi.org/10.4312/ala.8.2.9-50.Search in Google Scholar

Huang, Chu-Ren, Shu-Kai Hsieh, Jia-Fei Hong, Yun-Zhu Chen, I-Li Su, Yong-Xiang Chen & Sheng-Wei Huang. 2010. Chinese wordnet: Design, implementation, and application of an infrastructure for cross-lingual knowledge processing. Journal of Chinese Information Processing 24(2). 14–23.Search in Google Scholar

Imai, Mutsumi, Sotaro Kita, Miho Nagumo & Hiroyuki Okada. 2008. Sound symbolism facilitates early verb learning. Cognition 109(1). 54–65. https://doi.org/10.1016/j.cognition.2008.07.015.Search in Google Scholar

Jescheniak, Jörg D. & W. J. Willem Levelt. 1994. Word frequency effects in speech production: Retrieval of syntactic information and of phonological form. Journal of Experimental Psychology: Learning, Memory and Cognition 20(4). 824–843. https://doi.org/10.1037/0278-7393.20.4.824.Search in Google Scholar

Jin, Xiaoyun, Mirjam Ernestus & R. Harald Baayen. 2025. The new kid on the block: The role of word meaning in the realization of tone in conversational Mandarin speech. Under revision for Journal of Phonetics. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5304595.10.2139/ssrn.5304595Search in Google Scholar

Johnson, Keith. 2004. Massive reduction in conversational American English. In Spontaneous speech: Data and analysis. Proceedings of the 1st session of the 10th international symposium, 29–54. Tokyo, Japan.Search in Google Scholar

Ladd, Robert & Kim E. A. Silverman. 1984. Vowel intrinsic pitch in connected speech. Phonetica 41(1). 31–40. https://doi.org/10.1159/000261708.Search in Google Scholar

Landauer, Tomas K. & Susan T. Dumais. 1997. A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction and representation of knowledge. Psychological Review 104(2). 211–240. https://doi.org/10.1037/0033-295X.104.2.211.Search in Google Scholar

Liu, Min, Yiya Chen & Niels O. Schiller. 2016. Online processing of tone and intonation in Mandarin: Evidence from ERPs. Neuropsychologia 91. 307–317. https://doi.org/10.1016/j.neuropsychologia.2016.08.025.Search in Google Scholar

Lohmann, Arne. 2018. Cut (n) and cut (v) are not homophones: Lemma frequency affects the duration of noun–verb conversion pairs. Journal of Linguistics 54(4). 753–777. https://doi.org/10.1017/S0022226717000378.Search in Google Scholar

Lu, Yuxin, Yu-Ying Chuang & R. Harald Baayen. 2024. Form and meaning co-determine the realization of tone in Taiwan Mandarin spontaneous speech: The case of Tone 3 sandhi. Under revision for Journal of Chinese Linguistics. https://arxiv.org/abs/2408.15747.Search in Google Scholar

Martinet, André. 1965. La linguistique synchronique: etudes et recherches [synchronic linguistics: studies and research]. Paris: Presses Universitaires de France.Search in Google Scholar

Maurer, Daphne, Thanujeni Pathman & Catherine J. Mondloch. 2006. The shape of boubas: Sound–shape correspondences in toddlers and adults. Developmental Science 9(3). 316–322. https://doi.org/10.1111/j.1467-7687.2006.00495.x.Search in Google Scholar

Meunier, Christine & Robert Espesser. 2011. Vowel reduction in conversational speech in French: The role of lexical factors. Journal of Phonetics 39(3). 271–278. https://doi.org/10.1016/j.wocn.2010.11.008.Search in Google Scholar

Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado & Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of advances in neural information processing systems 26 (NIPS 2013), Vol. 26, 3111–3119.Search in Google Scholar

Ohala, John J. & Brian W. Eukel. 1976. Explaining the intrinsic pitch of vowels. The Journal of the Acoustical Society of America 60(S1). S44. https://doi.org/10.1121/1.2003351.Search in Google Scholar

Orzechowska, Paula & R. Harald Baayen. 2025. Polish phonology and morphology through the lens of distributional semantics. Manuscript: University of Poznan and University of Tübingen.Search in Google Scholar

Pitt, Mark A., Keith Johnson, Elizabeth Hume, Scott Kiesling & William Raymond. 2005. The Buckeye corpus of conversational speech: Labeling conventions and a test of transcriber reliability. Speech Communication 45(1). 89–95. https://doi.org/10.1016/j.specom.2004.09.001.Search in Google Scholar

Plag, Ingo, Julia Homann & Gero Kunter. 2017. Homophony and morphology: The acoustics of word-final S in English. Journal of Linguistics 53(1). 181–216. https://doi.org/10.1017/S0022226715000183.Search in Google Scholar

Plag, Ingo, Arne Lohmann, Sonia Ben Hedia & Julia Zimmermann. 2020. An ˂s> is an ˂s’>, or is it? Plural and genitive-plural are not homophonous. In Lívia Körtvélyessy & Pavol Štekauer (eds.), Complex words: Advances in morphology, 260–285. Cambridge: Cambridge University Press.10.1017/9781108780643.015Search in Google Scholar

Saito, Motoki. 2024. Enhancement effects of frequency: An explanation from the perspective of discriminative learning. University of Tübingen Doctoral Dissertation.Search in Google Scholar

Saito, Motoki, Fabian Tomaschek, Ching-Chu Sun & R. Harald Baayen. 2024. Articulatory effects of frequency modulated by semantics. In Marcel Schlechtweg (ed.), Interfaces of phonetics, vol. 38, 125. Berlin & Boston: Walter de Gruyter GmbH & Co KG.10.1515/9783110783452-005Search in Google Scholar

Schmitz, Dominic. 2022. Production, perception, and comprehension of subphonemic detail: Word-final /s/ in English. Berlin: Language Science Press.Search in Google Scholar

Schmitz, Dominic, Ingo Plag & Melanie J. Bell. 2025. Modeling the relationship between prominence and semantics in English compounds. Paper presented at the Workshop on Morphological Variation, 47th Annual Meeting of the German Linguistic Society (DGfS), Mainz, Germany. https://hdl.handle.net/10779/aru.28585379.v1.Search in Google Scholar

Shafaei-Bajestan, Elnaz, Masoumeh Moradipour-Tari, Peter Uhrig & R. Harald Baayen. 2024. The pluralization palette: Unveiling semantic clusters in English nominal pluralization through distributional semantics. Morphology 34. 369–413. https://doi.org/10.1007/s11525-024-09428-9.Search in Google Scholar

Shinohara, Kazuko & Shigeto Kawahara. 2010. A cross-linguistic study of sound symbolism: The images of size. In Proceedings of annual meeting of the Berkeley linguistics society, 396–410.10.3765/bls.v36i1.3926Search in Google Scholar

Stanford, James N. 2016. Sociotonetics using connected speech: A study of Sui tone variation in free-speech style. Asia-Pacific Language Variation 2(1). 48–82. https://doi.org/10.1075/aplv.2.1.02sta.Search in Google Scholar

Tang, Kevin & Ryan Bennett. 2018. Contextual predictability influences word and morpheme duration in a morphologically complex language (Kaqchikel Mayan). The Journal of the Acoustical Society of America 144(2). 997–1017. https://doi.org/10.1121/1.5046095.Search in Google Scholar

R Core Team. 2020. R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing. https://www.R-project.org/.Search in Google Scholar

Thompson, Arthur Lewis, Thomas Van Hoey & Youngah Do. 2021. Articulatory features of phonemes pattern to iconic meanings: Evidence from cross-linguistic ideophones. Cognitive Linguistics 32(4). 563–608. https://doi.org/10.1515/cog-2020-0055.Search in Google Scholar

Tomaschek, Fabian, Ingo Plag, Mirjam Ernestus & R. Harald Baayen. 2021. Phonetic effects of morphology and context: Modeling the duration of word-final S in English with naïve discriminative learning. Journal of Linguistics 57(1). 123–161. https://doi.org/10.1017/S0022226719000203.Search in Google Scholar

Turnbull, Rory. 2017. The role of predictability in intonational variability. Language and Speech 60(1). 123–153. https://doi.org/10.1177/00238309166470.Search in Google Scholar

Van der Maaten, Laurens & Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9. 2579–2605.Search in Google Scholar

Wang, Sheng-Fu. 2024. Contrast and predictability in the variability of tonal realizations in Taiwan Southern Min. In Proceedings of speech Prosody 2024, 542–546.10.21437/SpeechProsody.2024-110Search in Google Scholar

Westbury, Chris & Lee H. Wurm. 2022. Is it you you’re looking for? Personal relevance as a principal component of semantics. The Mental Lexicon 17(1). 1–33. https://doi.org/10.1075/ml.20031.wes.Search in Google Scholar

Westbury, Chris, Michelle Yang & Kris Anderson. 2025. The principal components of meaning, revisited. Psychonomic Bulletin & Review 32(1). 203–225. https://doi.org/10.3758/s13423-024-02551-y.Search in Google Scholar

Whalen, Douglas H. & Andrea G. Levitt. 1995. The universality of intrinsic f0 of vowels. Journal of Phonetics 23(3). 349–366. https://doi.org/10.1016/s0095-4470(95)80165-0.Search in Google Scholar

Wood, Simon N. 2017. Generalized additive models: An introduction with R, 2nd edn. New York: CRC Press.10.1201/9781315370279Search in Google Scholar

Wu, Yaru, Martine Adda-Decker & Lori Lamel. 2020. Mandarin lexical tones: A corpus-based study of word length, syllable position and prosodic position on duration. In Proceedings of interspeech 2020, 1908–1912. https://hal.science/hal-03153402.10.21437/Interspeech.2020-1614Search in Google Scholar

Wu, Yaru, Lori Lamel & Martine Adda-Decker. 2021. Tone realization in Mandarin speech: A large corpus based study of disyllabic words. In Proceedings of 2021 12th international symposium on Chinese spoken language processing [ISCSLP], 1–5.10.1109/ISCSLP49672.2021.9362073Search in Google Scholar

Xu, Yi. 1997. Contextual tonal variations in Mandarin. Journal of Phonetics 25(1). 61–83. https://doi.org/10.1006/jpho.1996.0034.Search in Google Scholar

Xu, Chenzi. 2024. Cross-dialectal perspectives on Mandarin neutral tone. Journal of Phonetics 106. 101341. https://doi.org/10.1016/j.wocn.2024.101341.Search in Google Scholar

Xu, Yi & Xuejing Sun. 2002. Maximum speed of pitch change and how it may relate to speech. The Journal of the Acoustical Society of America 111(3). 1399–1413. https://doi.org/10.1121/1.1445789.Search in Google Scholar

Xu, Ching X. & Yi Xu. 2003. Effects of consonant aspiration on Mandarin tones. Journal of the International Phonetic Association 33(2). 165–181. https://doi.org/10.1017/S0025100303001270.Search in Google Scholar

Yuan, Jiahong & Yiya Chen. 2014. 3rd tone sandhi in standard Chinese: A corpus approach. Journal of Chinese Linguistics 42(1). 218–237. http://www.jstor.org/stable/23753947.Search in Google Scholar

Zhao, Liang. 2023. Production and perception of lexical tone variation in Mandarin dialects. University of York dissertation.Search in Google Scholar

Zhao, Yuan & Dan Jurafsky. 2007. The effect of lexical frequency on tone production. In Proceedings of the 16th international congress of phonetic sciences [ICPhS] 2007, 477–480.Search in Google Scholar

Zimmermann, Julia, Christopher Carignan & Michael D. Tyler. 2016. Morphological status and acoustic realization: Findings from New Zealand English. In Proceedings of the 16th australasian international conference on speech science and technology, 6–9.Search in Google Scholar

Received: 2025-03-29
Accepted: 2025-09-09
Published Online: 2026-02-09

© 2026 the author(s), published by De Gruyter, Berlin/Boston

This work is licensed under the Creative Commons Attribution 4.0 International License.
