Vowel formant track normalization using discrete cosine transform coefficients

Josef Fruehwald

doi:10.1515/lingvan-2024-0095

Published/Copyright: August 20, 2025

From the journal Linguistics Vanguard Volume 11 Issue 1

Abstract

This paper provides an overview of the discrete cosine transform (DCT) as a method for smoothing vowel formant tracks, as well as a procedure to take any speaker normalization method that has been defined for formant point measurements and define an equivalent method to be applied directly to DCT coefficients. This procedure is followed for three established normalization methods, and the difference between DCT normalization and formant point normalization is found to be marginal.

Keywords: vowel formants; speaker normalization; vowel intrinsic spectral change; discrete cosine transform

Corresponding author: Josef Fruehwald, University of Kentucky, Lexington, USA, E-mail: josef.fruehwald@uky.edu

I would like to thank Santiago Barreda, Kevin McGowan, Dan Villarreal, Jack Rechsteiner, the journal’s anonymous reviewers, and the audience at NWAV 52 for feedback on this work.

Appendix A: The scipy implementation of the DCT

The scipy documentation for the DCT describes three ways the DCT can be “normalized” and two orthogonalization settings (“orthogonalized” or “non-orthogonalized”). All of these options alter only the terms to the left of the sum in the DCT formula. Let’s define and simplify these components.

I will use S to indicate the sum function, which is defined as

$$S_k(x) = \sum_{j=0}^{N-1} x_j \cos\left(\frac{\pi k (2j+1)}{2N}\right)$$

This term is unaltered by any of the different options scipy offers. Any given DCT implementation can be given as

$$y_k = o \, c \, S_k(x),$$

where o is the orthogonalization term and c is the normalization constant.

The orthogonalization term is the easiest to define:

$$o = \begin{cases} 1 & \text{if orth = False} \\ \frac{1}{\sqrt{2}} & \text{if orth = True} \end{cases}$$
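As a hedged illustration of this term, the sketch below compares scipy's type-II DCT with and without orthogonalization. It assumes `scipy.fft.dct` (where the keyword is spelled `orthogonalize`) and uses a hypothetical formant track, not data from the paper. For the type-II DCT, the term affects only the 0th coefficient, as discussed in Appendix C.

```python
import numpy as np
from scipy.fft import dct

# Hypothetical formant track (Hz); not data from the paper.
x = np.array([500.0, 530.0, 575.0, 620.0, 640.0, 645.0])

plain = dct(x, type=2, norm="backward", orthogonalize=False)
orth = dct(x, type=2, norm="backward", orthogonalize=True)

# Only the 0th coefficient is rescaled, by o = 1 / sqrt(2).
first_scaled = np.isclose(orth[0], plain[0] / np.sqrt(2))
rest_unchanged = np.allclose(orth[1:], plain[1:])
```

Both `first_scaled` and `rest_unchanged` should evaluate to true under these assumptions.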

The scipy documentation provides the mathematical definition for the “backward” normalization constant only, but the “forward” normalization constant can be inferred from its output:

$$c = \begin{cases} 2 & \text{if norm = backward} \\ \frac{1}{N} & \text{if norm = forward} \end{cases}$$

As a demonstration by example, we can define a Python function for just the sum function (Listing 2), then apply it to the formant track in Figure 16:

Figure 16: Demonstration formant track.

Listing 2: Definition of the DCT sum function.
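The body of this listing is not reproduced in this copy. A minimal sketch of such a sum function, assuming NumPy and a hypothetical function name `dct_sum`, might look like the following. It implements only the summation $S_k(x)$ defined above, with no normalization or orthogonalization, and is applied to a hypothetical track rather than the one in Figure 16.

```python
import numpy as np


def dct_sum(x, k):
    """Return S_k(x) = sum_j x_j * cos(pi * k * (2j + 1) / (2N))."""
    x = np.asarray(x, dtype=float)
    N = x.size
    j = np.arange(N)
    return float(np.sum(x * np.cos(np.pi * k * (2 * j + 1) / (2 * N))))


# Hypothetical formant track (Hz); not the track shown in Figure 16.
track = np.array([500.0, 530.0, 575.0, 620.0, 640.0, 645.0])

s_0 = dct_sum(track, 0)  # for k = 0 every cosine is 1, so this is the plain sum
s_1 = dct_sum(track, 1)  # weights run from positive to negative across the track
```

Because the hypothetical track rises over time, `s_1` comes out negative: the later, higher values receive the negative cosine weights.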

We can get the result of the sum function for the 0th and 1st DCT coefficients to then examine the outcome of the different normalizations, as in Listing 3.

Listing 3: Sum terms of the DCT.

At this point, we can also get the 0th and 1st DCT coefficients from the scipy implementation (Listing 4):

Listing 4: Application of the scipy DCT.

The normalizing constant for norm=“backward” is documented to be 2, so multiplying s_0 and s_1 by 2 should yield the 0th and 1st coefficients in dct_backward (Listing 5):

Listing 5: Backward DCT normalization.

If the normalizing constant for norm=“forward” is $\frac{1}{N}$, dividing s_0 and s_1 by the length of the input vector should yield the 0th and 1st coefficients in dct_forward (Listing 6):

Listing 6: Forward DCT normalization.
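The backward and forward checks described above can be sketched together as follows. This is not the original listing code: it assumes `scipy.fft.dct` with orthogonalization switched off, computes the sum terms for every k at once as a matrix product, and uses a hypothetical formant track.

```python
import numpy as np
from scipy.fft import dct

# Hypothetical formant track (Hz); not the track shown in Figure 16.
x = np.array([500.0, 530.0, 575.0, 620.0, 640.0, 645.0])
N = x.size

# Sum terms S_k(x) for every k at once, as a matrix-vector product.
k = np.arange(N)[:, None]
j = np.arange(N)[None, :]
S = np.cos(np.pi * k * (2 * j + 1) / (2 * N)) @ x

dct_backward = dct(x, type=2, norm="backward", orthogonalize=False)
dct_forward = dct(x, type=2, norm="forward", orthogonalize=False)

backward_matches = np.allclose(dct_backward, 2 * S)  # c = 2
forward_matches = np.allclose(dct_forward, S / N)    # c = 1 / N
```

Under these assumptions both comparisons hold for all coefficients, not only the 0th and 1st.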

Admittedly, it would be preferable to reference the actual forward normalization constant from the scipy documentation, but it is not provided there.

Appendix B: The DCT basis

While the formula in Equation (1) can be used to calculate the DCT coefficients, the formula to calculate the DCT basis functions in Figure 2 is different. If B is a matrix of the basis functions, the kth basis function will be in its columns. To get B, we apply the DCT with backward normalization to an identity matrix I (that is, a matrix with 1s along the diagonal, and 0s elsewhere). The orthogonalization term o is included in Equation (45).

$$B_{jk} = 2 o \sum_{j=0}^{N-1} I_{jk} \cos\left(\frac{\pi k (2j+1)}{2N}\right) \tag{45}$$

This can be quickly implemented using the scipy DCT implementation like so (Listing 7):

Listing 7: Getting the DCT basis functions.
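The body of this listing is not reproduced in this copy. One way the procedure could look, assuming `scipy.fft.dct` with orthogonalization switched on and a hypothetical track length `N`, is sketched below. Applying the transform to each row of the identity matrix places the kth basis function in column k, and the result can be checked against a closed form in which o is $1/\sqrt{2}$ for k = 0 and 1 otherwise.

```python
import numpy as np
from scipy.fft import dct

N = 6  # hypothetical number of time points

# Apply the backward-normalized, orthogonalized DCT to each row of an
# N x N identity matrix; column k then holds the kth basis function.
B = dct(np.eye(N), type=2, norm="backward", orthogonalize=True, axis=-1)

# Closed form: B[j, k] = 2 * o * cos(pi * k * (2j + 1) / (2N)),
# with o = 1/sqrt(2) for k = 0 and o = 1 otherwise.
j = np.arange(N)[:, None]
k = np.arange(N)[None, :]
o = np.where(k == 0, 1 / np.sqrt(2), 1.0)
B_closed = 2 * o * np.cos(np.pi * k * (2 * j + 1) / (2 * N))

basis_matches = np.allclose(B, B_closed)
# With orthogonalization, the columns are mutually orthogonal and each
# has squared length 2N.
columns_orthogonal = np.allclose(B.T @ B, 2 * N * np.eye(N))
```

The orthogonality of the columns is what makes a regression on this basis recover the DCT coefficients, which is the subject of Appendix C.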

Appendix C: The choice of orthogonalization

The choice of “orthogonalizing” the DCT coefficients, that is, dividing the 0th coefficient by $\sqrt{2}$, does introduce some awkwardness into the normalization procedures described here. Using the orthogonalized DCT was a design decision within fasttrackpy due to its reliance on regression-based DCT coefficients.

As a practical issue, formant tracking sometimes returns missing (NA) values for some, but not all, time points along a formant track. With missing values, the DCT cannot be directly applied. However, the DCT coefficients can be approximated by linear regression, using the DCT basis as the “predictors” (Listing 8):

Listing 8: Comparison of direct versus regression-based DCT.
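The body of this listing is not reproduced in this copy. The sketch below shows one way the comparison could be set up. It assumes the basis is built from the backward-normalized, orthogonalized DCT of an identity matrix (as in Appendix B), that the direct DCT uses forward normalization, and that the regression is an ordinary least-squares fit via `numpy.linalg.lstsq`. The track, the position of the missing value, and the number of retained coefficients (`order`) are all hypothetical.

```python
import numpy as np
from scipy.fft import dct

# Hypothetical formant track (Hz); not data from the paper.
x = np.array([500.0, 530.0, 575.0, 620.0, 640.0, 645.0])
N = x.size

# Basis as in Appendix B; direct coefficients with forward normalization.
B = dct(np.eye(N), type=2, norm="backward", orthogonalize=True)
direct = dct(x, type=2, norm="forward", orthogonalize=True)

# Regression-based coefficients: least-squares fit of x on the basis.
regression, *_ = np.linalg.lstsq(B, x, rcond=None)
complete_match = np.allclose(regression, direct)

# With a missing value, drop that time point from both x and the basis,
# then fit only the first few basis functions.
x_missing = x.copy()
x_missing[2] = np.nan
keep = ~np.isnan(x_missing)
order = 3  # hypothetical number of coefficients to retain
approx, *_ = np.linalg.lstsq(B[keep, :order], x_missing[keep], rcond=None)
```

On the complete track the regression and direct coefficients agree under these assumptions. On the track with a missing value, `approx` is only an approximation of the first `order` coefficients.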

Orthogonalizing the first coefficient was the only option that resulted in the same coefficients for both regression and direct DCT within the scipy implementation. Without orthogonalizing the first coefficient, the 0th coefficient is not equal between the regression-based DCT and direct DCT (Listing 9):

Listing 9: Comparison of direct versus regression-based non-orthogonalized DCT.
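The body of this listing is not reproduced in this copy. A sketch of the non-orthogonalized comparison, under the same assumptions as before (backward-normalized basis from an identity matrix, forward-normalized direct DCT, least-squares fit, hypothetical track), might look like this:

```python
import numpy as np
from scipy.fft import dct

# Hypothetical formant track (Hz); not data from the paper.
x = np.array([500.0, 530.0, 575.0, 620.0, 640.0, 645.0])
N = x.size

# Same setup as the orthogonalized case, but with orthogonalize=False.
B = dct(np.eye(N), type=2, norm="backward", orthogonalize=False)
direct = dct(x, type=2, norm="forward", orthogonalize=False)
regression, *_ = np.linalg.lstsq(B, x, rcond=None)

zeroth_differs = not np.isclose(regression[0], direct[0])
higher_match = np.allclose(regression[1:], direct[1:])
```

In this setup the 0th basis column has squared length 4N while the others have 2N, so the regression-based 0th coefficient comes out at half the direct one, and the higher coefficients still agree.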

Since the design decision to orthogonalize the DCT coefficients was made within fasttrackpy, which was the tool used to arrive at these DCT coefficients in this paper, this was also the version of the DCT used here.

References

Adank, Patti, Roel Smits & Roeland van Hout. 2004. A comparison of vowel normalization procedures for language variation research. The Journal of the Acoustical Society of America 116(5). 3099–3107. https://doi.org/10.1121/1.1795335.

Barreda, Santiago. 2021a. Fast track: Fast (nearly) automatic formant-tracking using Praat. Linguistics Vanguard 7(1). 20200051. https://doi.org/10.1515/lingvan-2020-0051.

Barreda, Santiago. 2021b. Perceptual validation of vowel normalization methods for variationist research. Language Variation and Change 33(1). 27–53. https://doi.org/10.1017/S0954394521000016.

Cox, Felicity & Sallyanne Palethorpe. 2019. Vowel variation in a standard context across four major Australian cities. In Sasha Calhoun, Paola Escudero, Marija Tabain & Paul Warren (eds.), Proceedings of the 19th international congress of phonetic sciences, 577–581. Melbourne: Australasian Speech Science and Technology Association Inc & International Phonetic Association. Available at: https://assta.org/proceedings/ICPhS2019/papers/ICPhS_626.pdf.

Docherty, Gerard, Simón Gonzalez & Nathaniel Mitchell. 2015. Static vs dynamic perspectives on the realization of vowel nucleii in West Australian English. In The Scottish Consortium for ICPhS 2015 (ed.), Proceedings of the 18th international congress of phonetic sciences. Glasgow: University of Glasgow. Available at: https://www.internationalphoneticassociation.org/icphs-proceedings/ICPhS2015/Papers/ICPHS0956.pdf.

Fox, Robert Allen & Ewa Jacewicz. 2009. Cross-dialectal variation in formant dynamics of American English vowels. The Journal of the Acoustical Society of America 126(5). 2603–2618. https://doi.org/10.1121/1.3212921.

Fruehwald, Josef. 2025. new-fave, version 1.1.1 [python package]. Available at: https://pypi.org/project/new-fave/.

Fruehwald, Josef & Santiago Barreda. 2024. fasttrackpy, version 0.5.3 [python package]. Available at: https://pypi.org/project/fasttrackpy/.

Gubian, Michele, Francisco Torreira & Lou Boves. 2015. Using functional data analysis for investigating multidimensional dynamic phonetic contrasts. Journal of Phonetics 49. 16–40. https://doi.org/10.1016/j.wocn.2014.10.001.

Guzik, Karita M. & Jonathan Harrington. 2007. The quantification of place of articulation assimilation in electropalatographic data using the similarity index (SI). Advances in Speech Language Pathology 9(1). 109–119. https://doi.org/10.1080/07268600601094294.

Hillenbrand, James M., Michael J. Clark & Terrance M. Nearey. 2001. Effects of consonant environment on vowel formant patterns. The Journal of the Acoustical Society of America 109(2). 748–763. https://doi.org/10.1121/1.1337959.

Jannedy, Stefanie & Melanie Weirich. 2017. Spectral moments vs discrete cosine transformation coefficients: Evaluation of acoustic measures distinguishing two merging German fricatives. The Journal of the Acoustical Society of America 142(1). 395–405. https://doi.org/10.1121/1.4991347.

Jochim, Markus, Raphael Winkelmann, Klaus Jaensch, Steve Cassidy & Jonathan Harrington. 2024. emuR: Main package of the EMU speech database management system, version 2.5.0 [R package]. Available at: https://cran.r-project.org/web/packages/emuR/.

Johnson, Keith. 2020. The ΔF method of vocal tract length normalization for vowels. Laboratory Phonology: Journal of the Association for Laboratory Phonology 11(1). 10. https://doi.org/10.5334/labphon.196.

Labov, William. 2001. Principles of linguistic change, vol. 2, Social factors (Language in Society). Oxford: Blackwell.

Labov, William & Ingrid Rosenfelder. 2011. The Philadelphia neighborhood corpus. Philadelphia: University of Pennsylvania. Available at: http://fave.ling.upenn.edu/pnc.html.

Labov, William, Sherry Ash & Charles Boberg. 2006. The atlas of North American English: Phonetics, phonology and sound change. New York: Mouton de Gruyter. https://doi.org/10.1515/9783110167467.

Lobanov, Boris. 1971. Classification of Russian vowels spoken by different listeners. Journal of the Acoustical Society of America 49. 606–608. https://doi.org/10.1121/1.1912396.

Mersmann, Olaf. 2024. Fftw: Fast FFT and DCT based on the FFTW library, version 1.0-9 [R package]. Available at: https://CRAN.R-project.org/package=fftw.

Morrison, Geoffrey Stewart. 2009. Likelihood-ratio forensic voice comparison using parametric representations of the formant trajectories of diphthongs. The Journal of the Acoustical Society of America 125(4). 2387–2397. https://doi.org/10.1121/1.3081384.

Nearey, Terrance M. 1978. Phonetic feature systems for vowels. Edmonton: University of Alberta PhD thesis. Available at: https://sites.ualberta.ca/~tnearey/Nearey1978_compressed.pdf.

Nearey, Terrance M. & Peter F. Assmann. 1986. Modeling the role of inherent spectral change in vowel identification. The Journal of the Acoustical Society of America 80(5). 1297–1308. https://doi.org/10.1121/1.394433.

Ramsay, James & Bernard W. Silverman. 2006. Functional data analysis. New York: Springer. https://doi.org/10.1007/b98888.

Risdal, Megan L. & Mary E. Kohn. 2014. Ethnolectal and generational differences in vowel trajectories: Evidence from African American English and the Southern Vowel System. Penn Working Papers in Linguistics 20(2). 139–148. https://doi.org/20.500.14332/45004.

Rosenfelder, Ingrid, Josef Fruehwald, Keelan Evanini, Scott Seyfarth, Christian Brickhouse, Kyle Gorman, Hillary Prichard & Jiahong Yuan. 2024. FAVE: Forced alignment and vowel extraction, version 2.0.3 [python package]. Available at: https://pypi.org/project/fave/.

Sóskuthy, Márton. 2017. Generalised additive mixed models for dynamic analysis in linguistics: A practical introduction. arXiv. https://doi.org/10.48550/arXiv.1703.05339.

Sóskuthy, Márton. 2021. Evaluating generalised additive mixed modelling strategies for dynamic speech analysis. Journal of Phonetics 84. 101017. https://doi.org/10.1016/j.wocn.2020.101017.

Tanner, James, Morgan Sonderegger & Jane Stuart-Smith. 2022. Multidimensional acoustic variation in vowels across English dialects. In Garrett Nicolai & Eleanor Chodroff (eds.), Proceedings of the 19th SIGMORPHON workshop on computational research in phonetics, phonology, and morphology, 72–82. Seattle, WA: Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.sigmorphon-1.8.

Virtanen, Pauli, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C. J. Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt. 2020. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods 17(3). 261–272. https://doi.org/10.1038/s41592-019-0686-2.

Wallace, Gregory K. 1992. The JPEG still picture compression standard. IEEE Transactions on Consumer Electronics 38(1). xviii–xxxiv. https://doi.org/10.1109/30.125072.

Watson, Catherine I. & Jonathan Harrington. 1999. Acoustic evidence for dynamic formant trajectories in Australian English vowels. The Journal of the Acoustical Society of America 106(1). 458–468. https://doi.org/10.1121/1.427069.

Williams, Daniel & Paola Escudero. 2014. A cross-dialectal acoustic comparison of vowels in Northern and Southern British English. The Journal of the Acoustical Society of America 136(5). 2751–2761. https://doi.org/10.1121/1.4896471.

Williams, Daniel, Jan-Willem van Leussen & Paola Escudero. 2015. Beyond North American English: Modelling vowel inherent spectral change in British English and Dutch. In The Scottish Consortium for ICPhS 2015 (ed.), Proceedings of the 18th international congress of phonetic sciences. Glasgow: University of Glasgow. Available at: https://www.internationalphoneticassociation.org/icphs-proceedings/ICPhS2015/Papers/ICPHS0596.pdf.

Zahorian, Stephen A. & Amir J. Jagharghi. 1991. Speaker normalization of static and dynamic vowel spectral features. The Journal of the Acoustical Society of America 90(1). 67–75. https://doi.org/10.1121/1.402350.

Zahorian, Stephen A. & Amir Jalali Jagharghi. 1993. Spectral-shape features versus formants as acoustic correlates for vowels. The Journal of the Acoustical Society of America 94(4). 1966–1982. https://doi.org/10.1121/1.407520.

Received: 2024-05-10

Accepted: 2025-06-04

Published Online: 2025-08-20
