Abstract
While recent advances in sociophonetic data processing have made it possible to analyze large datasets and audio not originally intended for linguistic analysis, overlapping speech in recordings with multiple speakers remains a problem that results in lost data. We evaluate whether current source separation models produce audio clean enough to yield reliable measurements for sociophonetic analysis. We compare formant estimates from a pair of pristine recordings with estimates from merged-and-separated versions of those same recordings, using the Libri2mix, Whamr16K, and WSJ02mix source separation models. Based on auditory inspection of the separated files, visualization of vowel formant estimates, and statistical analysis, Libri2 performed best and WSJ02 performed worst. While differences in mean formant measurements per vowel were usually small, differences for individual observations were larger and varied in unpredictable ways. We are cautiously optimistic about using these tools in sociophonetic analysis, so long as the analysis is conducted on vowel means. We conclude with recommendations that researchers can implement when using source separation in sociophonetic research.
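The contrast between per-vowel means and per-observation differences can be illustrated with a minimal sketch. It assumes that each vowel token has one formant estimate from the original recording and one from the separated recording, paired by position. The function name `compare_formants`, the data layout, and the numeric values are hypothetical illustrations and are not taken from the study.

```python
from collections import defaultdict
from statistics import mean


def compare_formants(original, separated):
    """Summarize separated-minus-original formant differences per vowel.

    Both arguments are lists of (vowel_label, formant_hz) tuples that are
    paired by index (hypothetical layout, assumed for this sketch).
    """
    if len(original) != len(separated):
        raise ValueError("inputs must contain the same number of tokens")

    diffs_by_vowel = defaultdict(list)
    for (vowel_o, hz_o), (vowel_s, hz_s) in zip(original, separated):
        if vowel_o != vowel_s:
            raise ValueError("paired tokens must share a vowel label")
        diffs_by_vowel[vowel_o].append(hz_s - hz_o)

    summary = {}
    for vowel, diffs in diffs_by_vowel.items():
        summary[vowel] = {
            # Signed mean: token-level errors in opposite directions cancel out.
            "mean_diff": mean(diffs),
            # Mean absolute difference: typical size of the error per token.
            "mean_abs_diff": mean(abs(d) for d in diffs),
        }
    return summary


# Toy F1 values in Hz, invented purely for illustration.
original = [("IY", 300.0), ("IY", 320.0), ("AA", 700.0), ("AA", 740.0)]
separated = [("IY", 330.0), ("IY", 290.0), ("AA", 660.0), ("AA", 780.0)]

summary = compare_formants(original, separated)
for vowel, stats in summary.items():
    print(vowel, stats)
```

In this toy data the signed errors cancel within each vowel, so the vowel means are unchanged even though every individual token is off by 30 to 40 Hz. This is the kind of pattern that would make vowel means more trustworthy than single observations.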
Supplementary Material
This article contains supplementary material (https://doi.org/10.1515/lingvan-2024-0152).
© 2025 Walter de Gruyter GmbH, Berlin/Boston