Abstract
While recent advances in sociophonetic data processing have made it possible to analyze large datasets and audio not originally intended for linguistic analysis, overlapping speech in recordings with multiple speakers remains a problem that results in lost data. We evaluate whether current source separation models yield audio clean enough to support reliable measurements for sociophonetic analysis. We compare formant estimates from a pair of pristine recordings with estimates from merged-and-separated versions of those same recordings, using the Libri2mix, Whamr16K, and WSJ02mix source separation models. Based on auditory inspection of the separated files, visualization of vowel formant estimates, and statistical analysis, Libri2mix performed best and WSJ02mix performed worst. While differences in mean formant measurements per vowel were usually small, differences for individual observations were larger and varied in unpredictable ways. We are cautiously optimistic about using these tools in sociophonetic analysis, so long as analysis is conducted on vowel means. We conclude with recommendations that researchers can implement when using source separation in sociophonetic research.
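The contrast between per-vowel means and individual observations can be illustrated with a minimal sketch. It is not the authors' analysis: the function names, the data layout, and the F1 values are all hypothetical and invented for illustration, and none of the numbers come from the study.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical paired F1 estimates in Hz: (vowel, original, separated).
# These values are invented for illustration; they are NOT from the study.
tokens = [
    ("IY", 300.0, 340.0),
    ("IY", 320.0, 280.0),
    ("AA", 700.0, 760.0),
    ("AA", 740.0, 690.0),
]


def per_token_abs_diff(rows):
    """Absolute original-vs-separated difference for each observation."""
    return [abs(sep - orig) for _, orig, sep in rows]


def per_vowel_mean_diff(rows):
    """Absolute difference between per-vowel means of the two versions."""
    by_vowel = defaultdict(lambda: ([], []))
    for vowel, orig, sep in rows:
        by_vowel[vowel][0].append(orig)
        by_vowel[vowel][1].append(sep)
    return {v: abs(mean(s) - mean(o)) for v, (o, s) in by_vowel.items()}


print(per_token_abs_diff(tokens))   # token-level differences
print(per_vowel_mean_diff(tokens))  # differences between vowel means
```

In this toy data, token-level errors of opposite sign cancel when averaged, so the per-vowel mean differences are much smaller than the differences for any single observation.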
Supplementary Material
This article contains supplementary material (https://doi.org/10.1515/lingvan-2024-0152).
© 2025 Walter de Gruyter GmbH, Berlin/Boston