
Vowel formant track normalization using discrete cosine transform coefficients

Published/Copyright: August 20, 2025

Abstract

This paper provides an overview of the discrete cosine transform (DCT) as a method for smoothing vowel formant tracks, as well as a procedure to take any speaker normalization method that has been defined for formant point measurements and define an equivalent method to be applied directly to DCT coefficients. This procedure is followed for three established normalization methods, and the difference between DCT normalization and formant point normalization is found to be marginal.


Corresponding author: Josef Fruehwald, University of Kentucky, Lexington, USA, E-mail:
I would like to thank Santiago Barreda, Kevin McGowan, Dan Villarreal, Jack Rechsteiner, the journal’s anonymous reviewers, and the audience at NWAV 52 for feedback on this work.

Appendix A: The scipy implementation of the DCT

The scipy documentation for the DCT describes three ways the DCT can be “normalized”, and two settings for whether it is “orthogonalized” or “non-orthogonalized”. All of these options alter only the terms to the left of the sum in the DCT formula. Let’s define and simplify these components.

I will use S to indicate the sum function, which is defined as

$$S_k(x) = \sum_{j=0}^{N-1} x_j \cos\left(\frac{\pi k (2j + 1)}{2N}\right)$$

This term is unaltered by any of the different options scipy offers. Any given DCT implementation can be given as

$$y_k = o \, c \, S_k(x),$$

where o is the orthogonalization term and c is the normalization constant.

The orthogonalization term is the easiest to define:

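$$o = \begin{cases} 1 & \text{if orth = False} \\ \frac{1}{\sqrt{2}} & \text{if orth = True} \end{cases}$$

As discussed in Appendix C, the orthogonalization term applies to the 0th coefficient only; for every other coefficient, o = 1.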

The scipy documentation provides the mathematical definition for the “backward” normalization constant only, but the “forward” normalization constant can be inferred from its output:

$$c = \begin{cases} 2 & \text{if norm = backward} \\ \frac{1}{N} & \text{if norm = forward} \end{cases}$$

As a demonstration by example, we can define a Python function for just the sum function (Listing 2), then apply it to the formant track in Figure 16:

Figure 16: Demonstration formant track.

Listing 2: Definition of the DCT sum function.
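A minimal sketch of such a sum function, written directly from the definition of S_k(x) above. The name `dct_sum` and the values in `track` are hypothetical stand-ins chosen for illustration; they are not the paper's own code or the data behind Figure 16.

```python
import math


def dct_sum(x, k):
    """Sum term S_k(x) of the DCT, with no normalization or orthogonalization."""
    n = len(x)
    return sum(
        x_j * math.cos(math.pi * k * (2 * j + 1) / (2 * n))
        for j, x_j in enumerate(x)
    )


# Hypothetical formant track in Hz (a stand-in, not the data behind Figure 16)
track = [500.0, 520.0, 560.0, 610.0, 650.0, 670.0, 660.0, 630.0]

s_0 = dct_sum(track, 0)  # for k = 0 every cosine is 1, so this is the plain sum
s_1 = dct_sum(track, 1)
print(s_0, s_1)
```

Because every cosine equals 1 when k = 0, `s_0` is simply the sum of the track values, which makes a convenient sanity check.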

We can get the result of the sum function for the 0th and 1st DCT coefficients to then examine the outcome of the different normalizations, as in Listing 3.

Listing 3: Sum terms of the DCT.

At this point, we can also get the 0th and 1st DCT coefficients from the scipy implementation (Listing 4):

Listing 4: Application of the scipy DCT.

The normalizing constant for norm = “backward” is documented to be 2, so multiplying s_0 and s_1 by 2 should give the 0th and 1st coefficients in dct_backward (Listing 5):

Listing 5: Backward DCT normalization.
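The backward check can be sketched as follows, assuming `scipy.fft.dct` with `type=2` and a scipy version that supports the `orthogonalize` argument. The helper `dct_sum` and the values in `track` are hypothetical, not taken from the paper. The last two lines also illustrate the orthogonalization term: with `orthogonalize=True`, only the 0th coefficient is divided by the square root of 2.

```python
import numpy as np
from scipy.fft import dct


def dct_sum(x, k):
    """Sum term S_k(x): no normalization constant, no orthogonalization."""
    j = np.arange(x.size)
    return np.sum(x * np.cos(np.pi * k * (2 * j + 1) / (2 * x.size)))


# Hypothetical formant track in Hz (a stand-in, not the data behind Figure 16)
track = np.array([500.0, 520.0, 560.0, 610.0, 650.0, 670.0, 660.0, 630.0])
s_0, s_1 = dct_sum(track, 0), dct_sum(track, 1)

# c = 2 for norm="backward", o = 1 without orthogonalization
dct_backward = dct(track, type=2, norm="backward", orthogonalize=False)
print(np.isclose(2 * s_0, dct_backward[0]))
print(np.isclose(2 * s_1, dct_backward[1]))

# With orthogonalization, o = 1/sqrt(2) applies to the 0th coefficient only
dct_backward_orth = dct(track, type=2, norm="backward", orthogonalize=True)
print(np.isclose(2 * s_0 / np.sqrt(2), dct_backward_orth[0]))
print(np.isclose(2 * s_1, dct_backward_orth[1]))
```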

If the normalizing constant for norm = “forward” is $\frac{1}{N}$, dividing s_0 and s_1 by the length of the input vector should give the 0th and 1st coefficients in dct_forward (Listing 6):

Listing 6: Forward DCT normalization.
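A sketch of the forward check, again assuming `scipy.fft.dct` with `type=2`. The helper `dct_sum` and the values in `track` are hypothetical. The final comparison shows how the forward output relates to the backward output in this setup, which is how the constant can be inferred empirically.

```python
import numpy as np
from scipy.fft import dct


def dct_sum(x, k):
    """Sum term S_k(x): no normalization constant, no orthogonalization."""
    j = np.arange(x.size)
    return np.sum(x * np.cos(np.pi * k * (2 * j + 1) / (2 * x.size)))


# Hypothetical formant track in Hz (a stand-in, not the data behind Figure 16)
track = np.array([500.0, 520.0, 560.0, 610.0, 650.0, 670.0, 660.0, 630.0])
n = track.size
s_0, s_1 = dct_sum(track, 0), dct_sum(track, 1)

dct_forward = dct(track, type=2, norm="forward", orthogonalize=False)
dct_backward = dct(track, type=2, norm="backward", orthogonalize=False)

# c = 1/N for norm="forward"
print(np.isclose(s_0 / n, dct_forward[0]))
print(np.isclose(s_1 / n, dct_forward[1]))

# Equivalently, the forward output is the backward output divided by 2N
print(np.allclose(dct_forward, dct_backward / (2 * n)))
```

One consequence of c = 1/N is that the non-orthogonalized 0th coefficient under forward normalization is the mean of the track.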

Admittedly, it would be preferable to cite the forward normalization constant directly from the scipy documentation, but it is not provided there.

Appendix B: The DCT basis

While the formula in Equation (1) can be used to calculate the DCT coefficients, the formula to calculate the DCT basis functions in Figure 2 is different. If B is a matrix of the basis functions, the kth basis function is its kth column. To get B, we apply the DCT with backward normalization to an identity matrix I (that is, a matrix with 1s along the diagonal and 0s elsewhere). The orthogonalization term o is included in Equation (45).

$$B_{jk} = 2 o \sum_{j=0}^{N-1} I_{jk} \cos\left(\frac{\pi k (2j + 1)}{2N}\right) \tag{45}$$

This can be quickly implemented using the scipy DCT implementation like so (Listing 7):

Listing 7: Getting the DCT basis functions.
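One way this could look, assuming `scipy.fft.dct` with `type=2` applied along the default last axis. The variable names and the track length `n_points` are hypothetical choices for illustration.

```python
import numpy as np
from scipy.fft import dct

n_points = 8  # hypothetical number of time points in a formant track

# Backward-normalized, orthogonalized DCT of an identity matrix.
# Row m is the transform of a unit impulse at time m, so column k
# holds the kth basis function evaluated at each time point.
basis = dct(np.eye(n_points), type=2, norm="backward", orthogonalize=True)

# Compare column k = 1 against the cosine term written out by hand
j = np.arange(n_points)
k = 1
by_hand = 2 * np.cos(np.pi * k * (2 * j + 1) / (2 * n_points))
print(np.allclose(basis[:, k], by_hand))

# Column 0 is constant: 2 times the orthogonalization term 1/sqrt(2)
print(np.allclose(basis[:, 0], 2 / np.sqrt(2)))
```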

Appendix C: The choice of orthogonalization

The choice of “orthogonalizing” the DCT coefficients, that is, dividing the 0th coefficient by $\sqrt{2}$, does introduce some awkwardness into the normalization procedures described here. Using the orthogonalized DCT was a design decision within fasttrackpy due to its reliance on regression-based DCT coefficients.

As a practical issue, formant tracking sometimes returns missing (NA) values for some, but not all, time points along a formant track. With missing values, the DCT cannot be directly applied. However, the DCT coefficients can be approximated by linear regression, using the DCT basis as the “predictors” (Listing 8):

Listing 8: Comparison of direct versus regression-based DCT.
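A minimal sketch of the comparison, not the fasttrackpy implementation. It assumes forward normalization for the direct DCT, a backward-normalized basis as in Appendix B, and ordinary least squares via `numpy.linalg.lstsq`. The function name `dct_regression`, the choice of five coefficients, and the track values are all hypothetical.

```python
import numpy as np
from scipy.fft import dct


def dct_regression(track, n_coefs=5, orthogonalize=True):
    """Least-squares estimate of the first n_coefs DCT coefficients, skipping NaNs."""
    basis = dct(
        np.eye(track.size), type=2, norm="backward", orthogonalize=orthogonalize
    )
    observed = ~np.isnan(track)
    # Regress the observed points on the matching rows of the first n_coefs columns
    coefs, *_ = np.linalg.lstsq(
        basis[observed, :n_coefs], track[observed], rcond=None
    )
    return coefs


# Hypothetical smooth formant track in Hz (not data from the paper)
track = 600.0 + 80.0 * np.sin(np.linspace(0.0, np.pi, 12))

direct = dct(track, type=2, norm="forward", orthogonalize=True)[:5]
regression = dct_regression(track, n_coefs=5)
print(np.allclose(direct, regression))

# With missing time points, the regression can still be fit to the observed ones
gappy = track.copy()
gappy[[3, 7]] = np.nan
print(dct_regression(gappy, n_coefs=5))
```

On a complete track the two sets of coefficients agree because the columns of the orthogonalized basis are mutually orthogonal; with missing points the regression gives an approximation rather than an exact match.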

Orthogonalizing the first coefficient was the only option that resulted in the same coefficients for both regression and direct DCT within the scipy implementation. Without orthogonalizing the first coefficient, the 0th coefficient is not equal between the regression-based DCT and direct DCT (Listing 9):

Listing 9: Comparison of direct versus regression-based non-orthogonalized DCT.
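The mismatch can be sketched under the same assumptions as before: forward normalization for the direct DCT, a backward-normalized basis, and least squares via `numpy.linalg.lstsq`, now with `orthogonalize=False` throughout. The track values are hypothetical.

```python
import numpy as np
from scipy.fft import dct

# Hypothetical formant track in Hz (not data from the paper)
track = np.array([500.0, 520.0, 560.0, 610.0, 650.0, 670.0, 660.0, 630.0])

# Non-orthogonalized basis and non-orthogonalized direct DCT
basis = dct(np.eye(track.size), type=2, norm="backward", orthogonalize=False)
direct = dct(track, type=2, norm="forward", orthogonalize=False)

# Regression-based coefficients: solve basis @ coefs = track by least squares
regression, *_ = np.linalg.lstsq(basis, track, rcond=None)

print(np.isclose(regression[0], direct[0]))      # 0th coefficient differs
print(np.allclose(regression[1:], direct[1:]))   # the remaining coefficients agree
print(np.isclose(2 * regression[0], direct[0]))  # here the regression value is half
```

In this sketch the regression estimate of the 0th coefficient comes out at half the direct value, because the 0th basis column is a constant 2 while the direct 0th coefficient is the track mean.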

Since the design decision to orthogonalize the DCT coefficients was made within fasttrackpy, which was the tool used to arrive at these DCT coefficients in this paper, this was also the version of the DCT used here.

References

Adank, Patti, Roel Smits & Roeland van Hout. 2004. A comparison of vowel normalization procedures for language variation research. The Journal of the Acoustical Society of America 116(5). 3099–3107. https://doi.org/10.1121/1.1795335.

Barreda, Santiago. 2021a. Fast track: Fast (nearly) automatic formant-tracking using Praat. Linguistics Vanguard 7(1). 20200051. https://doi.org/10.1515/lingvan-2020-0051.

Barreda, Santiago. 2021b. Perceptual validation of vowel normalization methods for variationist research. Language Variation and Change 33(1). 27–53. https://doi.org/10.1017/S0954394521000016.

Cox, Felicity & Sallyanne Palethorpe. 2019. Vowel variation in a standard context across four major Australian cities. In Sasha Calhoun, Paola Escudero, Marija Tabain & Paul Warren (eds.), Proceedings of the 19th international congress of phonetic sciences, 577–581. Melbourne: Australasian Speech Science and Technology Association Inc & International Phonetic Association. Available at: https://assta.org/proceedings/ICPhS2019/papers/ICPhS_626.pdf.

Docherty, Gerard, Simón Gonzalez & Nathaniel Mitchell. 2015. Static vs dynamic perspectives on the realization of vowel nucleii in West Australian English. In The Scottish Consortium for ICPhS 2015 (ed.), Proceedings of the 18th international congress of phonetic sciences. Glasgow: University of Glasgow. Available at: https://www.internationalphoneticassociation.org/icphs-proceedings/ICPhS2015/Papers/ICPHS0956.pdf.

Fox, Robert Allen & Ewa Jacewicz. 2009. Cross-dialectal variation in formant dynamics of American English vowels. The Journal of the Acoustical Society of America 126(5). 2603–2618. https://doi.org/10.1121/1.3212921.

Fruehwald, Josef. 2025. new-fave, version 1.1.1 [python package]. Available at: https://pypi.org/project/new-fave/.

Fruehwald, Josef & Santiago Barreda. 2024. fasttrackpy, version 0.5.3 [python package]. Available at: https://pypi.org/project/fasttrackpy/.

Gubian, Michele, Francisco Torreira & Lou Boves. 2015. Using functional data analysis for investigating multidimensional dynamic phonetic contrasts. Journal of Phonetics 49. 16–40. https://doi.org/10.1016/j.wocn.2014.10.001.

Guzik, Karita M. & Jonathan Harrington. 2007. The quantification of place of articulation assimilation in electropalatographic data using the similarity index (SI). Advances in Speech Language Pathology 9(1). 109–119. https://doi.org/10.1080/07268600601094294.

Hillenbrand, James M., Michael J. Clark & Terrance M. Nearey. 2001. Effects of consonant environment on vowel formant patterns. The Journal of the Acoustical Society of America 109(2). 748–763. https://doi.org/10.1121/1.1337959.

Jannedy, Stefanie & Melanie Weirich. 2017. Spectral moments vs discrete cosine transformation coefficients: Evaluation of acoustic measures distinguishing two merging German fricatives. The Journal of the Acoustical Society of America 142(1). 395–405. https://doi.org/10.1121/1.4991347.

Jochim, Markus, Raphael Winkelmann, Klaus Jaensch, Steve Cassidy & Jonathan Harrington. 2024. emuR: Main package of the EMU speech database management system, version 2.5.0 [R package]. Available at: https://cran.r-project.org/web/packages/emuR/.

Johnson, Keith. 2020. The ΔF method of vocal tract length normalization for vowels. Laboratory Phonology: Journal of the Association for Laboratory Phonology 11(1). 10. https://doi.org/10.5334/labphon.196.

Labov, William. 2001. Principles of linguistic change, vol. 2, Social factors (Language in Society). Oxford: Blackwell.

Labov, William & Ingrid Rosenfelder. 2011. The Philadelphia neighborhood corpus. Philadelphia: University of Pennsylvania. Available at: http://fave.ling.upenn.edu/pnc.html.

Labov, William, Sherry Ash & Charles Boberg. 2006. The atlas of North American English: Phonetics, phonology and sound change. New York: Mouton de Gruyter. https://doi.org/10.1515/9783110167467.

Lobanov, Boris. 1971. Classification of Russian vowels spoken by different listeners. The Journal of the Acoustical Society of America 49. 606–608. https://doi.org/10.1121/1.1912396.

Mersmann, Olaf. 2024. fftw: Fast FFT and DCT based on the FFTW library, version 1.0-9 [R package]. Available at: https://CRAN.R-project.org/package=fftw.

Morrison, Geoffrey Stewart. 2009. Likelihood-ratio forensic voice comparison using parametric representations of the formant trajectories of diphthongs. The Journal of the Acoustical Society of America 125(4). 2387–2397. https://doi.org/10.1121/1.3081384.

Nearey, Terrance M. 1978. Phonetic feature systems for vowels. Edmonton: University of Alberta PhD thesis. Available at: https://sites.ualberta.ca/~tnearey/Nearey1978_compressed.pdf.

Nearey, Terrance M. & Peter F. Assmann. 1986. Modeling the role of inherent spectral change in vowel identification. The Journal of the Acoustical Society of America 80(5). 1297–1308. https://doi.org/10.1121/1.394433.

Ramsay, James & Bernard W. Silverman. 2006. Functional data analysis. New York: Springer. https://doi.org/10.1007/b98888.

Risdal, Megan L. & Mary E. Kohn. 2014. Ethnolectal and generational differences in vowel trajectories: Evidence from African American English and the Southern Vowel System. Penn Working Papers in Linguistics 20(2). 139–148. https://doi.org/20.500.14332/45004.

Rosenfelder, Ingrid, Josef Fruehwald, Keelan Evanini, Scott Seyfarth, Christian Brickhouse, Kyle Gorman, Hillary Prichard & Jiahong Yuan. 2024. FAVE: Forced alignment and vowel extraction, version 2.0.3 [python package]. Available at: https://pypi.org/project/fave/.

Sóskuthy, Márton. 2017. Generalised additive mixed models for dynamic analysis in linguistics: A practical introduction. arXiv. https://doi.org/10.48550/arXiv.1703.05339.

Sóskuthy, Márton. 2021. Evaluating generalised additive mixed modelling strategies for dynamic speech analysis. Journal of Phonetics 84. 101017. https://doi.org/10.1016/j.wocn.2020.101017.

Tanner, James, Morgan Sonderegger & Jane Stuart-Smith. 2022. Multidimensional acoustic variation in vowels across English dialects. In Garrett Nicolai & Eleanor Chodroff (eds.), Proceedings of the 19th SIGMORPHON workshop on computational research in phonetics, phonology, and morphology, 72–82. Seattle, WA: Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.sigmorphon-1.8.

Virtanen, Pauli, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C. J. Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa & Paul van Mulbregt. 2020. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods 17(3). 261–272. https://doi.org/10.1038/s41592-019-0686-2.

Wallace, Gregory K. 1992. The JPEG still picture compression standard. IEEE Transactions on Consumer Electronics 38(1). xviii–xxxiv. https://doi.org/10.1109/30.125072.

Watson, Catherine I. & Jonathan Harrington. 1999. Acoustic evidence for dynamic formant trajectories in Australian English vowels. The Journal of the Acoustical Society of America 106(1). 458–468. https://doi.org/10.1121/1.427069.

Williams, Daniel & Paola Escudero. 2014. A cross-dialectal acoustic comparison of vowels in Northern and Southern British English. The Journal of the Acoustical Society of America 136(5). 2751–2761. https://doi.org/10.1121/1.4896471.

Williams, Daniel, Jan-Willem van Leussen & Paola Escudero. 2015. Beyond North American English: Modelling vowel inherent spectral change in British English and Dutch. In The Scottish Consortium for ICPhS 2015 (ed.), Proceedings of the 18th international congress of phonetic sciences. Glasgow: University of Glasgow. Available at: https://www.internationalphoneticassociation.org/icphs-proceedings/ICPhS2015/Papers/ICPHS0596.pdf.

Zahorian, Stephen A. & Amir J. Jagharghi. 1991. Speaker normalization of static and dynamic vowel spectral features. The Journal of the Acoustical Society of America 90(1). 67–75. https://doi.org/10.1121/1.402350.

Zahorian, Stephen A. & Amir Jalali Jagharghi. 1993. Spectral-shape features versus formants as acoustic correlates for vowels. The Journal of the Acoustical Society of America 94(4). 1966–1982. https://doi.org/10.1121/1.407520.

Received: 2024-05-10
Accepted: 2025-06-04
Published Online: 2025-08-20

© 2025 Walter de Gruyter GmbH, Berlin/Boston

Downloaded on 26.4.2026 from https://www.degruyterbrill.com/document/doi/10.1515/lingvan-2024-0095/html?lang=en