Testing the effect of speech separation on vowel formant estimates

  • Joseph A. Stanley, Lisa Morgan Johnson and Earl Kjar Brown
Published/Copyright: January 30, 2025

Abstract

While recent advances in sociophonetic data processing have made it possible to analyze large datasets and audio not originally intended for linguistic analysis, overlapping speech in recordings with multiple speakers continues to be an issue that results in lost data. We evaluate whether current source separation models produce audio that is clean enough to yield reliable measurements for sociophonetic analysis. We compare formant estimates from a pair of pristine recordings with estimates from merged-and-separated versions of those same recordings, using the Libri2mix, Whamr16K, and WSJ02mix source separation models. Based on auditory inspection of the separated files, visualization of vowel formant estimates, and statistical analysis, Libri2mix performed best and WSJ02mix performed worst. While differences in mean formant measurements per vowel were usually small, differences for individual observations were larger and varied in unpredictable ways. We are cautiously optimistic about using these tools in sociophonetic analysis, so long as analysis is conducted on vowel means. We conclude with recommendations that researchers can implement when using source separation in sociophonetic research.
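The contrast between per-vowel means and individual observations can be illustrated with a small sketch. It is not the authors' analysis: the vowel labels, the F1 values, and the helper names `per_token_abs_diff` and `per_vowel_mean_diff` are all hypothetical, chosen only to show how token-level errors in opposite directions can largely cancel once tokens are averaged by vowel.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical F1 estimates (Hz) for the same vowel tokens, measured once in
# the original recording and once in a merged-and-separated version.
# These numbers are invented for illustration; they are not from the study.
tokens = [
    # (vowel, f1_original, f1_separated)
    ("FLEECE", 310.0, 330.0),
    ("FLEECE", 295.0, 270.0),
    ("FLEECE", 305.0, 312.0),
    ("TRAP", 690.0, 650.0),
    ("TRAP", 710.0, 745.0),
    ("TRAP", 700.0, 702.0),
]


def per_token_abs_diff(rows):
    """Absolute difference between the two estimates for every token."""
    return [abs(sep - orig) for _, orig, sep in rows]


def per_vowel_mean_diff(rows):
    """Difference between per-vowel means (separated minus original)."""
    grouped = defaultdict(lambda: ([], []))
    for vowel, orig, sep in rows:
        grouped[vowel][0].append(orig)
        grouped[vowel][1].append(sep)
    return {v: mean(s) - mean(o) for v, (o, s) in grouped.items()}


token_diffs = per_token_abs_diff(tokens)
mean_diffs = per_vowel_mean_diff(tokens)

print("mean absolute per-token difference (Hz):", mean(token_diffs))
for vowel, diff in mean_diffs.items():
    print(f"{vowel}: difference in vowel means (Hz) = {diff:.2f}")
```

In this invented dataset the individual tokens differ by tens of hertz while the vowel means differ by about one hertz, which is the kind of pattern that would make vowel means a safer unit of analysis than single observations.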


Corresponding author: Joseph A. Stanley, Brigham Young University, Provo, USA


Supplementary Material

This article contains supplementary material (https://doi.org/10.1515/lingvan-2024-0152).


Received: 2024-07-26
Accepted: 2024-12-06
Published Online: 2025-01-30

© 2025 Walter de Gruyter GmbH, Berlin/Boston

Downloaded on 23.1.2026 from https://www.degruyterbrill.com/document/doi/10.1515/lingvan-2024-0152/html