Testing the effect of speech separation on vowel formant estimates

Joseph A. Stanley; Lisa Morgan Johnson; Earl Kjar Brown

doi:10.1515/lingvan-2024-0152

Published/Copyright: January 30, 2025

From the journal Linguistics Vanguard Volume 11 Issue 1

Abstract

While recent advances in sociophonetic data processing have made it possible to analyze large datasets and audio not originally intended for linguistic analysis, overlapping speech in recordings with multiple speakers continues to be an issue that results in lost data. We evaluate whether current source separation models produce audio that is clean enough to yield reliable measurements for sociophonetic analysis. We compare formant estimates from a pair of pristine recordings and merged-and-separated versions of those same recordings using the Libri2mix, Whamr16K, and WSJ02mix source separation models. Based on auditory inspection of the separated files, visualization of vowel formant estimates, and statistical analysis, Libri2 performed best and WSJ02 performed worst. While differences in the mean formant measurements per vowel were usually small, differences for individual observations were larger and varied in unpredictable ways. We are cautiously optimistic about using these tools in sociophonetic analysis, so long as analysis is conducted on vowel means. We conclude with recommendations that researchers can implement when using source separation in sociophonetic research.
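The contrast between per-vowel means and individual observations can be illustrated with a minimal sketch. This is not the authors' analysis: the function name `compare_formants`, the vowel labels, and the F1 values are all hypothetical and chosen only to show how token-level errors can cancel out in a vowel mean.

```python
from collections import defaultdict
from statistics import mean


def compare_formants(tokens):
    """Summarize original vs. separated formant estimates per vowel.

    tokens: iterable of (vowel, original_hz, separated_hz) tuples.
    Hypothetical helper for illustration only.
    """
    by_vowel = defaultdict(list)
    for vowel, original, separated in tokens:
        by_vowel[vowel].append((original, separated))

    summary = {}
    for vowel, pairs in by_vowel.items():
        mean_original = mean(o for o, _ in pairs)
        mean_separated = mean(s for _, s in pairs)
        summary[vowel] = {
            # Difference between the two vowel means (signed, in Hz).
            "mean_diff": mean_separated - mean_original,
            # Average size of the per-token difference (unsigned, in Hz).
            "mean_abs_token_diff": mean(abs(s - o) for o, s in pairs),
        }
    return summary


# Invented F1 values (Hz); not data from the study.
tokens = [
    ("FLEECE", 300.0, 340.0),
    ("FLEECE", 320.0, 280.0),
    ("TRAP", 700.0, 650.0),
    ("TRAP", 680.0, 730.0),
]
summary = compare_formants(tokens)
for vowel, stats in summary.items():
    print(vowel, stats)
```

In this toy input the per-token differences are 40 to 50 Hz, yet they point in opposite directions, so the difference between vowel means is zero. Real data would not cancel so neatly, but the same logic suggests why an analysis based on vowel means may be more robust than one based on individual tokens.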

Keywords: source separation; sociophonetics; methods; data processing; vowel formant analysis

Corresponding author: Joseph A. Stanley, Brigham Young University, Provo, USA, E-mail: joey_stanley@byu.edu

Supplementary Material

This article contains supplementary material (https://doi.org/10.1515/lingvan-2024-0152).

Received: 2024-07-26

Accepted: 2024-12-06

Published Online: 2025-01-30
