Introducing Bed Word: a new automated speech recognition tool for sociolinguistic interview transcription

Marcus Ma; Lelia Glass; James Stanford

doi:10.1515/lingvan-2023-0073


Published/Copyright: May 7, 2024


From the journal Linguistics Vanguard, Volume 10, Issue 1

Abstract

We present Bed Word, a tool leveraging industrial automatic speech recognition (ASR) to transcribe sociophonetic data. While we find lower accuracy for minoritized English varieties, the resulting vowel measurements are overall very close to those derived from human-corrected gold data, so fully automated transcription may be suitable for some research purposes. For purposes requiring greater accuracy, we present a pipeline for human post-editing of automatically generated drafts, which we show is far faster than transcribing from scratch. Thus, we offer two ways to leverage ASR in sociolinguistic research: full automation and human post-editing. Augmenting the DARLA tool developed by Reddy and Stanford (2015b. Toward completely automated vowel extraction: Introducing DARLA. Linguistics Vanguard 1(1). 15–28), we hope that this resource can help speed up transcription for sociophonetic research.

Keywords: automated transcription; automated speech recognition; linguistics; sociophonetics

Corresponding author: Marcus Ma, Georgia Institute of Technology, 564 Centennial Olympic Park Dr NW, Atlanta, GA, 30313, USA, E-mail: mma81@gatech.edu

Acknowledgments

We express gratitude to our speakers from the Roswell Voices project, Atlanta Speech Project, and Georgia Tech. We thank Sravana Reddy for her knowledge of and help with the DARLA system. We are indebted to the Georgia Tech VIP Team, Language and Identity in the New South, for inspiration and testing of Bed Word. Finally, we thank Joseph A. Stanley, Margaret Renwick, and Jon Forrest for invaluable feedback during the development of Bed Word.

Appendix

Regression coefficients for word error rate model

As described above, we ran a linear regression predicting each speaker’s WER as a function of their gender (male or female), ethnicity (Black or White), and the age of the data (legacy or recent), as well as the only interaction that improved the model as measured by the Akaike information criterion: an interaction between data age and ethnicity (9).

(9)	lm(WER ~ DataAge * Ethnicity + Gender, data = wer)

The full model output is given in Table 3.

Table 3: Output of word error rate regression.

Term	Estimate	SE	t value	Pr(>|t|)
(Intercept)	0.28	0.04	6.41	2.21e−07***
DataAgeLegacy	−0.02	0.06	−0.39	0.70
EthnicityBlack	0.01	0.06	0.24	0.81
GenderM	0.01	0.04	0.17	0.87
DataAgeLegacy:EthnicityBlack	0.32	0.08	4.04	0.0003***
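To make the interaction term concrete, the following minimal sketch adds up the Table 3 coefficients to obtain the fitted WER for each combination of data age and ethnicity. It assumes the reference levels implied by the coefficient names (recent data, White, female); the function name `predicted_wer` and the dictionary keys are illustrative and do not come from the original analysis. Because the table reports coefficients rounded to two decimals, the results are approximate.

```python
# Coefficients copied from Table 3. Reference levels (recent data, White,
# female) are an assumption inferred from the coefficient names.
COEF = {
    "intercept": 0.28,
    "legacy": -0.02,
    "black": 0.01,
    "male": 0.01,
    "legacy_x_black": 0.32,
}


def predicted_wer(legacy: bool, black: bool, male: bool = False) -> float:
    """Fitted WER for one demographic cell of the model in (9)."""
    wer = COEF["intercept"]
    wer += COEF["legacy"] * legacy
    wer += COEF["black"] * black
    wer += COEF["male"] * male
    # The interaction term applies only to legacy recordings of Black speakers.
    wer += COEF["legacy_x_black"] * (legacy and black)
    return round(wer, 2)


for legacy in (False, True):
    for black in (False, True):
        label = f"{'legacy' if legacy else 'recent'}, {'Black' if black else 'White'}"
        print(label, predicted_wer(legacy, black))
```

Under these assumptions, the fitted WER for female speakers is roughly 0.28 (recent, White), 0.26 (legacy, White), 0.29 (recent, Black), and 0.59 (legacy, Black), so the elevated error rate is concentrated in the legacy recordings of Black speakers.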

Comparison of available industrial ASR models

To select the industrial ASR model underlying Bed Word, we compared the models offered by Amazon, Google, Microsoft, and Deepgram. We evaluated WER on the same Georgia corpus used throughout the study (Table 4).

Table 4: Evaluations of ASR models on our Georgia speaker corpus.

Metric	Amazon	Google	Microsoft	Deepgram
Word error rate	0.562	0.555	0.564	0.375
Cost per audio hour	$0.96	$1.44	$1.00	$0.25
Ease of use (subjective)	Medium	Easy	Hard	Easy

Deepgram had the lowest WER (0.375) and the lowest cost ($0.25 per audio hour) and was easy to use, so overall we judge it superior to the other three models.
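For readers unfamiliar with the metric, WER is conventionally the word-level Levenshtein distance (substitutions, deletions, and insertions) between the ASR output and the reference transcript, divided by the number of reference words. The sketch below is a generic implementation of that definition, not the evaluation script used for Table 4. The lowercasing and whitespace tokenization are assumptions, since the text normalization applied in the study is not described in this appendix.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("we went to school today", "we went to skull today"))
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, because the denominator counts only reference words.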

Word error rates by vowel type

Here, we present the WER for Atl_002 broken down by vowel type. We manually compared the gold handwritten transcription with Bed Word’s silver auto-generated transcription and noted transcription errors (substitutions, deletions, and insertions) where the vowel type is mistaken. For example, if school was mistranscribed as skull, that would count as an error for the GOOSE vowel (mischaracterizing it as the STRUT vowel), while mistranscribing state as steak does not count as an error (because both share the same FACE vowel). In Figure 5, we see that the percentage of errors for each vowel type closely mirrors its overall frequency, meaning that Bed Word is likely not biased towards transcribing certain vowels with more errors than others.

Figure 5: For each of the 12 most frequent vowel types, its percentage of all vowel tokens and its percentage of all vowel-mistaking transcription errors, for Atl_002.
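The comparison described above was carried out by hand, but the decision rule can be sketched in code. The example below is a hypothetical illustration that covers substitutions only: it looks up the lexical set of the gold word and of the silver word and flags the pair only when the two differ. The four-word `LEXICAL_SET` lexicon and the function names are invented for this illustration. A real implementation would need a full pronunciation dictionary and separate handling of deletions and insertions.

```python
from collections import Counter

# Hypothetical mini-lexicon: word -> lexical set of its stressed vowel.
LEXICAL_SET = {
    "school": "GOOSE",
    "skull": "STRUT",
    "state": "FACE",
    "steak": "FACE",
}


def vowel_mistaking(gold_word: str, silver_word: str):
    """Return (gold set, silver set) if the substitution changes the vowel
    class, or None if both words share the same class."""
    gold = LEXICAL_SET[gold_word]
    silver = LEXICAL_SET[silver_word]
    return (gold, silver) if gold != silver else None


def shares(labels):
    """Proportion of each vowel class in a list of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {vowel: n / total for vowel, n in counts.items()}


print(vowel_mistaking("school", "skull"))  # GOOSE mistaken for STRUT
print(vowel_mistaking("state", "steak"))   # same FACE vowel, not counted
```

Applying `shares` once to the vowel classes of all tokens and once to the classes of the flagged errors gives the two percentages that Figure 5 compares for each vowel type.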

References

Akaike, Hirotugu. 1974. A new look at the statistical model identification. IEEE (Institute of Electrical and Electronics Engineers) Transactions on Automatic Control 19(6). 716–723. https://doi.org/10.1109/tac.1974.1100705.

Baron, Reuben M. & David A. Kenny. 1986. The moderator–mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology 51(6). 1173–1182. https://doi.org/10.1037/0022-3514.51.6.1173.

Becker, Kara (ed.). 2019. The low-back-merger shift: Uniting the Canadian vowel shift, the California vowel shift, and short front vowel shifts across North America. [Special Issue]. American Speech 104.

Benzeghiba, Mohamed, Renato De Mori, Olivier Deroo, Stephane Dupont, Teodora Erbes, Denis Jouvet, Luciano Fissore, Pietro Laface, Alfred Mertins, Christophe Ris, Richard Rose, Vivek Tyagi & Christian Wellekens. 2007. Automatic speech recognition and speech variability: A review. Speech Communication 49(10–11). 763–786. https://doi.org/10.1016/j.specom.2007.02.006.

Bhattacharyya, Anil. 1946. On a measure of divergence between two multinomial populations. Sankhyā: The Indian Journal of Statistics 7(4). 401–406.

Boberg, Charles. 2005. The Canadian shift in Montreal. Language Variation and Change 17(2). 133–154. https://doi.org/10.1017/s0954394505050064.

Boersma, Paul & David Weenink. 2024. Praat: Doing phonetics by computer. Version 6.4.08 [Computer program]. Available at: http://www.praat.org/.

Brozovsky, Erica. 2020. Taiwanese Texans: A sociolinguistic study of language and cultural identity. Austin: The University of Texas at Austin PhD dissertation.

Cangemi, Francesco, Jessica Fründt, Harriet Hanekamp & Martine Grice. 2019. A semi-automatic workflow for orthographic transcription and syllabic segmentation. In XV AISV Conference: Audio archives at the crossroads of Speech Sciences, Digital Humanities and Digital Heritage, vol. 6, 419–425. Arezzo, Italy.

Chen, Guoguo, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, Sanjeev Khudanpur, Shinji Watanabe, Shuaijiang Zhao, Wei Zou, Xiangang Li, Xuchen Yao, Yongqing Wang, Zhau You & Zhiyong Yan. 2021. Gigaspeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio. In Hynek Heřmanský, Honza Černocký, Lukáš Burget, Lori Lamel, Odette Scharenborg & Petr Motlicek (eds.), Proceedings of interspeech. Brno, Czech Republic: International Speech Communication Association (ISCA). Available at: https://arxiv.org/abs/2106.06909. https://doi.org/10.21437/Interspeech.2021-1965.

Choe, June, Yiran Chen, May Pik Yu Chan, Aini Li, Xin Gao & Nicole Holliday. 2022. Language-specific effects on automatic speech recognition errors for world Englishes. In Nicoletta Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Warner, Key-Sun Choi, Pum-Mo Ryu, Hsin-Hsi Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Paggio, Nianwen Xue, Seokhwan Kim, Younggyun Hahm, Zhong He, Tony Kyungil Lee, Enrico Santus, Francis Bond & Seung-Hoon Na (eds.), Proceedings of the 29th international conference on computational linguistics, 7177–7186. Gyeongju, Republic of Korea: International Committee on Computational Linguistics. Available at: https://aclanthology.org/2022.coling-1.628.

Clark, Herbert H. & Jean E. Fox Tree. 2002. Using uh and um in spontaneous speaking. Cognition 84(1). 73–111. https://doi.org/10.1016/s0010-0277(02)00017-3.

Cohn, Abigail C. 1990. Phonetic and phonological rules of nasalization. Los Angeles: University of California PhD dissertation.

Coto-Solano, Rolando. 2022. Computational sociophonetics using automatic speech recognition. Language and Linguistics Compass 16(9). e12474. https://doi.org/10.1111/lnc3.12474.

Coto-Solano, Rolando, James N. Stanford & Sravana K. Reddy. 2021. Advances in completely automated vowel analysis for sociophonetics: Using end-to-end speech recognition systems with DARLA. Frontiers in Artificial Intelligence 4. 1–19. https://doi.org/10.3389/frai.2021.662097.

Cukor-Avila, Patricia & Guy Bailey. 2001. The effects of the race of the interviewer on sociolinguistic fieldwork. Journal of Sociolinguistics 5(2). 252–270. https://doi.org/10.1111/1467-9481.00150.

Dodsworth, Robin & Mary Kohn. 2012. Urban rejection of the vernacular: The SVS undone. Language Variation and Change 24(2). 221–245. https://doi.org/10.1017/s0954394512000105.

Eckert, Penelope. 2012. Three waves of variation study: The emergence of meaning in the study of sociolinguistic variation. Annual Review of Anthropology 41. 87–100. https://doi.org/10.1146/annurev-anthro-092611-145828.

Farrington, Charlie, Sharese King & Mary Kohn. 2021. Sources of variation in the speech of African Americans: Perspectives from sociophonetics. Wiley Interdisciplinary Reviews: Cognitive Science 12(3). e1550. https://doi.org/10.1002/wcs.1550.

Galvez, Daniel, Greg Diamos, Juan Ciro, Juan Felipe Cerón, Keith Achorn, Anjali Gopi, David Kanter, Maximilian Lam, Mark Mazumder & Vijay Janapa Reddi. 2021. The people’s speech: A large-scale diverse English speech recognition dataset for commercial usage. In Joaquin Vanschoren & Serena Yeung (eds.), Neural Information Processing Systems (NeurIPS) track on datasets and benchmarks, vol. 35. Curran Associates, Inc. https://arxiv.org/pdf/2111.09344.pdf (accessed 19 April 2024).

Green, Spence, Jeffrey Heer & Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Wendy E. Mackay, Stephen Brewster & Susanne Bødker (eds.), Proceedings of the special interest group on computer-human interaction (SIGCHI) conference on human factors in computing systems, 439–448. Paris: Association for Computing Machinery. https://doi.org/10.1145/2470654.2470718.

Johnson, Daniel Ezra. 2015. Quantifying overlap with Bhattacharyya’s affinity and other measures. Paper presented at NWAV (New Ways of Analyzing Variation) 44, Toronto, Canada, Oct 22–25, 2015.

Jones, Taylor, Jessica Rose Kalbfeld, Ryan Hancock & Robin Clark. 2019. Testifying while Black: An experimental study of court reporter accuracy in transcription of African American English. Language 95(2). e216–e252. https://doi.org/10.1353/lan.2019.0042.

Kendall, Tyler & Valerie Fridland. 2012. Variation in perception and production of mid front vowels in the US Southern vowel shift. Journal of Phonetics 40(2). 289–306. https://doi.org/10.1016/j.wocn.2011.12.002.

Koenecke, Allison, Andrew Nam, Emily Lake, Joe Nudell, Minnie Quartey, Zion Mengesha, Connor Toups, John R. Rickford, Dan Jurafsky & Sharad Goel. 2020. Racial disparities in automated speech recognition. Proceedings of the National Academy of Sciences 117(14). 7684–7689. https://doi.org/10.1073/pnas.1915768117.

Kominek, John & Alan W. Black. 2004. The CMU Arctic speech databases. In Alan W. Black & Kevin Lenzo (eds.), International speech communication association (ISCA) workshop on speech synthesis, vol. 5. Pittsburgh, PA: International Speech Communication Association (ISCA).

Kretzschmar, William A. 2015. African American voices in Atlanta. In Sonja Lanehart (ed.), The Oxford handbook of African American Language, 219–235. Oxford, UK: Oxford University Press.

Kretzschmar, William A. 2016. Roswell voices: Community language in a living laboratory. In Karen P. Corrigan & Adam Mearns (eds.), Creating and digitizing language corpora, volume 3: Databases for public engagement, 159–175. London: Palgrave Macmillan. https://doi.org/10.1057/978-1-137-38645-8_6.

Kretzschmar, William A., Sonja Lanehart, Bridget L. Anderson & Becky Childs. 2003. Roswell voices: A community oral history and dialect study. Roswell, GA: Roswell Folk and Heritage Bureau.

Kretzschmar, William A., Sonja Lanehart, Betsy Barry, Iyabo Osiapem & Mi-Ran Kim. 2004. Atlanta in Black and White: A new random sample of urban speech. Presentation at NWAV (New Ways of Analyzing Variation) 33.

Kretzschmar, William A., Claire Andres, Rachel Votta & Sasha Johnson. 2006. Roswell voices: A community oral history and dialect study, phase II. Roswell, GA: Roswell Folk and Heritage Bureau.

Labov, William. 1966. The effect of social mobility on linguistic behavior. Sociological Inquiry 36(2). 186–203. https://doi.org/10.1111/j.1475-682x.1966.tb00624.x.

Labov, William. 1972. Language in the inner city: Studies in the Black English vernacular, vol. 3. Philadelphia, PA: University of Pennsylvania Press.

Labov, William, Sharon Ash & Charles Boberg. 2006. The atlas of North American English: Phonetics, phonology and sound change. Berlin: Mouton de Gruyter. https://doi.org/10.1515/9783110167467.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions, and reversals (translated from the 1965 Russian original). Doklady Physics 10(8). 707–710.

Lobanov, Boris M. 1971. Classification of Russian vowels spoken by different speakers. The Journal of the Acoustical Society of America 49(2B). 606–608. https://doi.org/10.1121/1.1912396.

MacKenzie, Laurel & Danielle Turton. 2020. Assessing the accuracy of existing forced alignment software on varieties of British English. Linguistics Vanguard 6(s1). 20180061. https://doi.org/10.1515/lingvan-2018-0061.

McAuliffe, Michael, Michaela Socolof, Sarah Mihuc, Michael Wagner & Morgan Sonderegger. 2017. Montreal forced aligner: Trainable text-speech alignment using Kaldi. In Proceedings of interspeech, vol. 2017, 498–502. Stockholm, Sweden: International Speech Communication Association (ISCA). https://doi.org/10.21437/Interspeech.2017-1386.

Meier, Paul. 1997. International dialects of English archive. Available at: https://www.dialectsarchive.com/.

Nesbitt, Monica. 2018. Economic change and the decline of raised TRAP in Lansing, MI. University of Pennsylvania Working Papers in Linguistics 24(2). 9.

Pratap, Vineel, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve & Ronan Collobert. 2020. MLS: A large-scale multilingual dataset for speech research. In Interspeech 2020. ISCA. https://doi.org/10.21437/Interspeech.2020-2826.

R Core Team. 2012. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Available at: http://www.R-project.org/.

Reddy, Sravana & James Stanford. 2015a. A web application for automated dialect analysis. In Proceedings of the North American chapter of the Association for Computational Linguistics (NAACL): Demonstrations, 71–75. Denver, CO: Association for Computational Linguistics (ACL) anthology. https://doi.org/10.3115/v1/N15-3015.

Reddy, Sravana & James N. Stanford. 2015b. Toward completely automated vowel extraction: Introducing DARLA. Linguistics Vanguard 1(1). 15–28. https://doi.org/10.1515/lingvan-2015-0002.

Renwick, Margaret E.L., Joseph A. Stanley, Jon Forrest & Lelia Glass. 2023. Boomer peak or Gen X cliff? from SVS to LBMS in Georgia English. Language Variation and Change 35. 175–197. https://doi.org/10.1017/s095439452300011x.

Rickford, John R. & Sharese King. 2016. Language and linguistics on trial: Hearing Rachel Jeantel (and other vernacular speakers) in the courtroom and beyond. Language 92(4). 948–988. https://doi.org/10.1353/lan.2016.0078.

Rickford, John R. & Faye McNair-Knox. 1994. Addressee- and topic-influenced style shift: A quantitative sociolinguistic study. In Douglas Biber & Edward Finegan (eds.), Sociolinguistic perspectives on register, 235–276. New York: Oxford University Press. https://doi.org/10.1093/oso/9780195083644.003.0011.

Rosenfelder, Ingrid, Josef Fruehwald, Keelan Evanini, Scott Seyfarth, Kyle Gorman, Hilary Prichard & Jiahong Yuan. 2014. FAVE (forced alignment and vowel extraction) suite version 1.1.3. Version v1 [Computer program]. Available at: https://zenodo.org/records/9846.

Tatman, Rachael. 2017. Gender and dialect bias in YouTube’s automatic captions. In Dirk Hovy, Shannon Spruit, Margaret Mitchell, Emily M. Bender, Michael Strube & Hanna Wallach (eds.), Proceedings of the first Association for Computational Linguistics (ACL) workshop on ethics in natural language processing, 53–59. Valencia, Spain: Association for Computational Linguistics (ACL) Anthology. https://doi.org/10.18653/v1/W17-1606.

Tatman, Rachael & Conner Kasten. 2017. Effects of Talker dialect, gender & race on accuracy of bing speech and YouTube automatic captions. In Proc. interspeech 2017, 934–938. https://doi.org/10.21437/Interspeech.2017-1746.

Thomas, Erik R. 2003. Secrets revealed by Southern vowel shifting. American Speech 78(2). 150–170. https://doi.org/10.1215/00031283-78-2-150.

Thomas, Erik R. 2007. Phonological and phonetic characteristics of African American Vernacular English. Language and Linguistics Compass 1(5). 450–475. https://doi.org/10.1111/j.1749-818x.2007.00029.x.

Wassink, Alicia, Rob Squizzero, Campion Fellin & David Nichols. 2018. Client libraries oxford (CLOx): Automated transcription for sociolinguistic interviews. Version 7.17.2021. [Computer program]. Available at: https://clox.ling.washington.edu.

Wells, John Corson. 1982. Accents of English, vol. 1. Cambridge, UK: Cambridge University Press. https://doi.org/10.1017/CBO9780511611759.

West, Paula. 1999. The extent of coarticulation of English liquids: An acoustic and articulatory study. In Proceedings of the international congress of phonetic sciences (ICPhS), vol. 14, 1901–1904. San Francisco, CA. http://www.phon.ox.ac.uk/files/people/west/icphswest.pdf (accessed 19 April 2024).

Received: 2023-05-18

Accepted: 2023-09-28

Published Online: 2024-05-07
