
Reintroducing and testing the Probabilistic Sliding Template Model of vowel perception

Santiago Barreda and T. Florian Jaeger
Published/Copyright: September 29, 2025
From the journal Linguistics Vanguard

Abstract

Normalization of the speech signal onto comparatively invariant phonetic representations is critical to speech perception. Assumptions about this process also play a central role in phonetics and phonology. Yet, despite the importance of these assumptions, popular normalization accounts continue to incorrectly assume that the relevant normalization parameters are “known” – for example, because the researcher can estimate them from a fully balanced set of recordings. Listeners, however, have to incrementally infer these parameters from the speech input. We reintroduce a seminal, but still underappreciated, model of this inference process – the Probabilistic Sliding Template Model of vowel normalization and perception (PSTM) – and show why researchers cannot afford to ignore it. We provide the R package STM, which makes it trivial to apply the PSTM to new data. We present the first quantitative assessment of the PSTM against human perception. We show that the model excels at predicting listeners’ vowel categorization behavior, clearly outperforming the popular Lobanov normalization.


Corresponding author: Santiago Barreda, Department of Linguistics, University of California Davis, Davis, CA, USA

Appendix: Bootstrap details

The Hillenbrand et al. (1995) data include acoustic measures from 139 speakers, each producing one instance of 12 vowels, with formant measures at multiple time slices. We used the formant measurements at 20 % and 80 % of the vowel duration, so that each vowel token was represented by a vector of length six: two measurements for each of three formants. Following Nearey and Assmann (2007), we used two time points, capturing the fact that formant dynamics affect vowel perception in the dialect in the database (Hillenbrand and Nearey 1999). By fitting multivariate normal distributions in this six-dimensional space, we assume that listeners have learned expectations about the (co)variance between formants within and across time points.
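To make this representation concrete, the following sketch (in base R plus the mvtnorm package, not code from the STM package) fits one six-dimensional multivariate normal per vowel category. The data frame vowel_data and its column names (f1_20 through f3_80, plus speaker and vowel labels) are hypothetical stand-ins for a suitably arranged version of the Hillenbrand et al. (1995) measurements.

    # Hypothetical layout: one row per token, columns f1_20 ... f3_80
    # holding F1-F3 at 20 % and 80 % of the vowel duration.
    library(mvtnorm)  # dmvnorm() for multivariate normal densities

    formant_cols <- c("f1_20", "f2_20", "f3_20", "f1_80", "f2_80", "f3_80")

    # One multivariate normal per vowel category, fit on log-formants,
    # so each template encodes (co)variance within and across time points.
    fit_category <- function(d) {
      X <- log(as.matrix(d[, formant_cols]))
      list(mean = colMeans(X), cov = cov(X))
    }
    templates <- lapply(split(vowel_data, vowel_data$vowel), fit_category)

    # Log-density of a single token under one category's distribution
    # ("iy" follows Hillenbrand's vowel codes):
    token <- log(unlist(vowel_data[1, formant_cols]))
    dmvnorm(token, templates[["iy"]]$mean, templates[["iy"]]$cov, log = TRUE)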

The Hillenbrand et al. (1995) data also contain 12-alternative forced-choice categorization responses for each vowel token, aggregated across 20 listeners of the same dialect as the speakers (average accuracy = 95 %).[4] Crucially, these responses were elicited over stimuli from the different speakers in the database, presented in randomized order – that is, precisely the type of input for which the naive estimation of $\psi_s$ (Equation (6)) is expected to yield unreliable results for listeners (confirmed in Figure 4).

We bootstrapped the following process over 1,000 iterations for each method (all functions referred to are from the STM package); a schematic R sketch of one iteration follows the list:

  1. Randomly divide the data into a 79-speaker training set and a 60-speaker testing set. Resample the training data (with replacement) at the speaker level. Sixty-three vowel tokens (4 %) had one or two missing formant measurements (out of 6), representing 0.09 % of the total formant measurements in the data. The missing values were imputed using stochastic regression imputation, independently for each iteration (using the impute_NA function).

  2. Uniformly scale the training data (the normalize function) by estimating each speaker’s $\psi_s$ from their complete vowel system as in Equation (6) and normalizing as in Equation (3).

  3. Estimate the dialect-specific category means and covariance matrices of all vowels using the normalized training data (with create_template).[5] Each iteration of the bootstrap simulates a “listener” of the dialect with slightly different speech experience.

  4. For each token in the testing data, estimate $\hat{\psi}_{s,v}$ for each vowel category using each method. Use these values of $\hat{\psi}_{s,v}$ to obtain the winning estimate $\hat{\psi}_s$ and calculate the posterior probability of each vowel category.

  5. Calculate performance metrics A–C described in Section 5.
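The following schematic sketch shows how steps 1 through 3 might look in R. impute_NA, normalize, and create_template are the STM functions named above, but the argument conventions and return values shown here are assumptions rather than the package’s documented interface; h95 stands for a data frame of the Hillenbrand et al. (1995) measurements with a speaker column.

    library(STM)

    one_iteration <- function(h95, n_train = 79) {
      # Step 1: speaker-level train/test split; training speakers are then
      # resampled with replacement (duplicated speakers keep all their rows).
      speakers  <- unique(h95$speaker)
      train_ids <- sample(speakers, n_train)
      boot_ids  <- sample(train_ids, replace = TRUE)
      train <- do.call(rbind, lapply(boot_ids, function(s) h95[h95$speaker == s, ]))
      test  <- h95[!(h95$speaker %in% train_ids), ]
      train <- impute_NA(train)   # stochastic regression imputation

      # Step 2: uniform scaling via each speaker's psi_s (Equations (6) and (3)).
      train <- normalize(train)

      # Step 3: one simulated "listener": category means and covariance matrices.
      template <- create_template(train)

      # Steps 4-5 (per-method psi-hat estimation and scoring) would operate
      # on `template` and `test`; they are omitted from this sketch.
      list(template = template, test = test)
    }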

We compare eight methods that differ in whether and how they estimate $\psi_s$:

  1. No normalization

    1. input is F (Hz)

    2. input is G (log Hz), allowing an assessment of whether log-transformation in and of itself helps

  2. Normalization with $\hat{\psi}_s$ estimated using the “classic approach” in Equation (6)

    1. over the entire training data (“PSTM1 (balanced data)”)

    2. based on the individual token (“PSTM1 (single trial)”), as a more direct baseline for the remaining methods (all of which are based on individual tokens)

  3. Normalization using the most commonly used method (Lobanov 1971)

    1. using the entire training data (Hz); this method standardizes each speaker’s formants by subtracting the mean and dividing by the standard deviation of each formant, independently for each time point (sketched in base R after this list)

  4. Normalization with $\hat{\psi}_s$ estimated using PSTM

    1. method 2

    2. method 3

    3. method 6
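Of these, the Lobanov baseline (method 3) is simple enough to sketch directly in base R; the column layout is the same hypothetical one used in the sketches above.

    # Lobanov (1971) normalization as described under method 3: z-score each
    # speaker's formants, independently for each formant and time point.
    lobanov <- function(d, cols = c("f1_20", "f2_20", "f3_20",
                                    "f1_80", "f2_80", "f3_80")) {
      for (s in unique(d$speaker)) {
        rows <- d$speaker == s
        d[rows, cols] <- scale(d[rows, cols])  # (x - mean(x)) / sd(x) per column
      }
      d
    }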

References

Assmann, Peter F. & William F. Katz. 2000. Time-varying spectral change in the vowels of children and adults. Journal of the Acoustical Society of America 108(4). 1856–1866. https://doi.org/10.1121/1.1289363.

Barreda, Santiago. 2012. Vowel normalization and the perception of speaker changes: An exploration of the contextual tuning hypothesis. Journal of the Acoustical Society of America 132(5). 3453–3464. https://doi.org/10.1121/1.4747011.

Barreda, Santiago. 2020. Vowel normalization as perceptual constancy. Language 96(2). 224–254. https://doi.org/10.1353/lan.0.0242.

Barreda, Santiago. 2021. Perceptual validation of vowel normalization methods for variationist research. Language Variation and Change 33(1). 27–53. https://doi.org/10.1017/s0954394521000016.

Barreda, Santiago & Terrance M. Nearey. 2012. The direct and indirect roles of fundamental frequency in vowel perception. Journal of the Acoustical Society of America 131(1). 466–477. https://doi.org/10.1121/1.3662068.

Barreda, Santiago & Terrance M. Nearey. 2018. A regression approach to vowel normalization for missing and unbalanced data. Journal of the Acoustical Society of America 144(1). 500–520. https://doi.org/10.1121/1.5047742.

Charlton, Benjamin D., William A. H. Ellis, Rebecca Larkin & W. Tecumseh Fitch. 2012. Perception of size-related formant information in male koalas (Phascolarctos cinereus). Animal Cognition 15. 999–1006. https://doi.org/10.1007/s10071-012-0527-5.

Chernyak, Bronya R., Ann R. Bradlow, Joseph Keshet & Matthew Goldrick. 2024. A perceptual similarity space for speech based on self-supervised speech representations. Journal of the Acoustical Society of America 155(6). 3915–3929. https://doi.org/10.1121/10.0026358.

Hillenbrand, James M. & Terrance M. Nearey. 1999. Identification of resynthesized /hVd/ utterances: Effects of formant contour. Journal of the Acoustical Society of America 105(6). 3509–3523. https://doi.org/10.1121/1.424676.

Hillenbrand, James M., Laura A. Getty, Michael J. Clark & Kevin Wheeler. 1995. Acoustic characteristics of American English vowels. Journal of the Acoustical Society of America 97(5). 3099–3111. https://doi.org/10.1121/1.413041.

Jin, Zhengyang, Yuhao Zhu & T. Florian Jaeger. 2025. Latent speech representations learned through self-supervised learning predict listeners’ generalization of adaptation across talkers. In Proceedings of the annual meeting of the cognitive science society, vol. 47. https://escholarship.org/uc/item/465543dw (accessed 23 August 2025).

Johnson, Keith. 1997. Speech perception without speaker normalization. In Keith Johnson & John W. Mullennix (eds.), Talker variability in speech processing, 145–146. San Diego: Academic Press.

Kim, Seung-Eun, Bronya R. Chernyak, Olga Seleznova, Joseph Keshet, Matthew Goldrick & Ann R. Bradlow. 2024. Automatic recognition of second language speech-in-noise. JASA Express Letters 4(2). https://doi.org/10.1121/10.0024877.

Kleinschmidt, Dave F. & T. Florian Jaeger. 2015. Robust speech perception: Recognize the familiar, generalize to the similar, and adapt to the novel. Psychological Review 122(2). 148–203. https://doi.org/10.1037/a0038695.

Kleinschmidt, Dave F., Kodi Weatherholtz & T. Florian Jaeger. 2018. Sociolinguistic perception as inference under uncertainty. Topics in Cognitive Science 10(4). 818–834. https://doi.org/10.1111/tops.12331.

Lobanov, Boris M. 1971. Classification of Russian vowels spoken by different speakers. Journal of the Acoustical Society of America 49(2B). 606–608. https://doi.org/10.1121/1.1912396.

Luce, Paul A. & David B. Pisoni. 1998. Recognizing spoken words: The neighborhood activation model. Ear and Hearing 19(1). 1–36. https://doi.org/10.1097/00003446-199802000-00001.

Magnuson, James S. & Howard C. Nusbaum. 2007. Acoustic differences, listener expectations, and the perceptual accommodation of talker variability. Journal of Experimental Psychology: Human Perception and Performance 33(2). 391–409. https://doi.org/10.1037/0096-1523.33.2.391.

Massaro, Dominic W. & Daniel Friedman. 1990. Models of integration given multiple sources of information. Psychological Review 97(2). 225–252. https://doi.org/10.1037/0033-295X.97.2.225.

Nearey, Terrance M. & Peter F. Assmann. 2007. Probabilistic “sliding-template” models for indirect vowel normalization. In Maria-Josep Solé, Patrice Speeter Beddor & Manjari Ohala (eds.), Experimental approaches to phonology, 246–269. Oxford: Oxford University Press. https://doi.org/10.1093/oso/9780199296675.003.0016.

Norris, Dennis & James M. McQueen. 2008. Shortlist B: A Bayesian model of continuous speech recognition. Psychological Review 115(2). 357–395. https://doi.org/10.1037/0033-295x.115.2.357.

Persson, Anna, Santiago Barreda & T. Florian Jaeger. 2025. Comparing accounts of formant normalization against US English listeners’ vowel perception. Journal of the Acoustical Society of America 157. 1458–1482. https://doi.org/10.1121/10.0035476.

Peterson, Gordon E. & Harold L. Barney. 1952. Control methods used in a study of the vowels. Journal of the Acoustical Society of America 24(2). 175–184. https://doi.org/10.1121/1.1906875.

R Core Team. 2023. R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing. Available at: https://www.R-project.org/.

Reby, David, Karen McComb, Bruno Cargnelutti, Chris Darwin, W. Tecumseh Fitch & Tim Clutton-Brock. 2005. Red deer stags use formants as assessment cues during intrasexual agonistic interactions. Proceedings of the Royal Society B: Biological Sciences 272(1566). 941–947. https://doi.org/10.1098/rspb.2004.2954.

Richter, Caitlin, Naomi H. Feldman, Harini Salgado & Aren Jansen. 2017. Evaluating low-level speech features against human perceptual data. Transactions of the Association for Computational Linguistics 5. 425–440. https://doi.org/10.1162/tacl_a_00071.

RStudio Team. 2024. RStudio: Integrated development environment for R. Boston, MA: RStudio, PBC. Available at: https://www.rstudio.com/.

Sumner, Meghan. 2011. The role of variation in the perception of accented speech. Cognition 119(1). 131–136. https://doi.org/10.1016/j.cognition.2010.10.018.

Taylor, A. M., David Reby & Karen McComb. 2010. Size communication in domestic dog, Canis familiaris, growls. Animal Behaviour 79(1). 205–210. https://doi.org/10.1016/j.anbehav.2009.10.030.

Xie, Xin, T. Florian Jaeger & Chigusa Kurumada. 2023. What we do (not) know about the mechanisms underlying adaptive speech perception: A computational framework and review. Cortex 166. 377–424. https://doi.org/10.1016/j.cortex.2023.05.003.


Supplementary Material

This article contains supplementary material (https://doi.org/10.1515/lingvan-2024-0239).


Received: 2024-11-23
Accepted: 2025-06-17
Published Online: 2025-09-29

© 2025 Walter de Gruyter GmbH, Berlin/Boston
