Abstract
Variationist sociolinguistic methodology is grounded in the principle of accountability, which requires researchers to identify all of the contexts in which a given variable occurs or fails to occur. For morphosyntactic, lexical, and discourse variables, this process is notoriously time- and labor-intensive, as researchers manually sift through raw data in search of tokens to analyze. In this article, we demonstrate the usability of pretrained computational language models to automatically identify tokens of sociolinguistic variables in raw text. We focus on two English-language variables from different linguistic domains: intensifier choice (lexical; e.g., she is {very, really, so} smart) and complementizer selection (morphosyntactic; e.g., they thought {that, Ø} I understood). Text classifiers built with Bidirectional Encoder Representations from Transformers (BERT) achieve high precision and recall metrics for both variables, even with relatively little hand-annotated training data. Our findings suggest that computational language models can dramatically reduce the burden of preparing data for variationist analysis. Furthermore, by inspecting the classifiers’ scores for individual sentences, researchers can observe patterns that should be written into the description of the variable context for further study.
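The precision and recall figures mentioned above can be illustrated with a minimal sketch. It assumes a classifier has already assigned each sentence a score between 0 and 1 for containing a token of the variable, and that hand annotations are available for comparison. The function name `precision_recall`, the 0.5 threshold, and the toy sentences, labels, and scores are all invented for illustration; they are not the authors' data, code, or results.

```python
from typing import List, Tuple


def precision_recall(
    gold: List[bool], scores: List[float], threshold: float = 0.5
) -> Tuple[float, float]:
    """Compare thresholded classifier scores against hand annotations."""
    # A sentence is predicted to contain a token when its score meets the threshold.
    predicted = [score >= threshold for score in scores]
    true_pos = sum(p and g for p, g in zip(predicted, gold))
    false_pos = sum(p and not g for p, g in zip(predicted, gold))
    false_neg = sum(g and not p for p, g in zip(predicted, gold))
    # Guard against division by zero when nothing is predicted or annotated.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


# Hypothetical toy data: (sentence, hand annotation, classifier score).
examples = [
    ("she is very smart", True, 0.9),
    ("that was so cool", True, 0.8),
    ("so we went home", False, 0.6),
    ("he is really tall", True, 0.4),
    ("we left early", False, 0.1),
]
gold_labels = [label for _, label, _ in examples]
model_scores = [score for _, _, score in examples]

precision, recall = precision_recall(gold_labels, model_scores)
print(f"precision={precision:.2f} recall={recall:.2f}")

# Sentences where the score and the annotation disagree are the ones worth
# reading closely when refining the description of the variable context.
for sentence, label, score in examples:
    if (score >= 0.5) != label:
        print(f"inspect: {sentence!r} (gold={label}, score={score})")
```

Sorting or filtering sentences by score in this way is one simple route to the kind of inspection the abstract describes; the threshold would in practice be tuned on held-out annotated data.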
Acknowledgments
We gratefully acknowledge David Bamman, Lucy Li, Nicholas Tomlin, and audiences at Methods in Dialectology XVII, New Ways of Analyzing Variation 50, and the Sociolinguistics Lab at Berkeley for their feedback on this project. We also thank Sali Tagliamonte for sending us a dataset on intensifier variation (described in Tagliamonte and Roberts 2005), without which this research would not have been possible.
References
Adli, Aria & Gregory R. Guy. 2022. Globalising the study of language variation and change: A manifesto on cross-cultural sociolinguistics. Language and Linguistics Compass 16(5–6). 1–15. https://doi.org/10.1111/lnc3.12452.
Bayley, Robert. 2013. The quantitative paradigm. In J. K. Chambers & Natalie Schilling (eds.), The handbook of language variation and change, 85–107. Malden, MA: Wiley-Blackwell.
Bleaman, Isaac L., Katie Cugno & Annie Helms. 2022. Medium-shifting and intraspeaker variation in conversational interviews. Language Variation and Change 34. 305–329. https://doi.org/10.1017/S0954394522000151.
Campbell-Kibler, Kathryn. 2007. Accent, (ING), and the social logic of listener perceptions. American Speech 82(1). 32–64. https://doi.org/10.1215/00031283-2007-002.
Demszky, Dorottya, Devyani Sharma, Jonathan H. Clark, Vinodkumar Prabhakaran & Jacob Eisenstein. 2021. Learning to recognize dialect features. Proceedings of the 2021 conference of the North American chapter of the Association for Computational Linguistics: Human language technologies, 2315–2338. Association for Computational Linguistics. Available at: https://aclanthology.org/2021.naacl-main.184/. https://doi.org/10.18653/v1/2021.naacl-main.184.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee & Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 conference of the North American chapter of the Association for Computational Linguistics: Human language technologies, 1, 4171–4186. Association for Computational Linguistics. Available at: https://aclanthology.org/N19-1423/.
Fischer, John L. 1958. Social influences on the choice of a linguistic variant. Word 14(1). 47–56. https://doi.org/10.1080/00437956.1958.11659655.
Forrest, Jon. 2017. The dynamic interaction between lexical and contextual frequency: A case study of (ING). Language Variation and Change 29(2). 129–156. https://doi.org/10.1017/S0954394517000072.
Gordon, Matthew J. 2013. Labov: A guide for the perplexed. London: Bloomsbury. https://doi.org/10.5040/9781472541673.
Hazen, Kirk. 2008. (ING): A vernacular baseline for English in Appalachia. American Speech 83(2). 116–140. https://doi.org/10.1215/00031283-2008-008.
Kroch, Anthony & Cathy Small. 1978. Grammatical ideology and its effect on speech. In David Sankoff (ed.), Linguistic variation: Models and methods, 45–55. New York: Academic Press.
Labov, William. 1969. Contraction, deletion, and inherent variability of the English copula. Language 45(4). 715–762. https://doi.org/10.2307/412333.
Labov, William. 1972a. Sociolinguistic patterns. Philadelphia: University of Pennsylvania Press.
Labov, William. 1972b. Some principles of linguistic methodology. Language in Society 1(1). 97–120. https://doi.org/10.1017/S0047404500006576.
Labov, William. 1978. Where does the linguistic variable stop? A response to Beatriz Lavandera. In Richard Bauman & Joel Sherzer (eds.), Working papers in sociolinguistics, vol. 44, 1–17. Austin, TX: Southwest Educational Development Laboratory. Available at: https://eric.ed.gov/?id=ED157378.
Labov, William. 2004. Quantitative analysis of linguistic variation. In Ulrich Ammon, Norbert Dittmar, Klaus J. Mattheier & Peter Trudgill (eds.), Sociolinguistics: An international handbook of the science of language and society, 2nd edn, vol. 1, 6–21. Berlin: Walter de Gruyter.
Lavandera, Beatriz R. 1978. Where does the sociolinguistic variable stop? Language in Society 7(2). 171–182. https://doi.org/10.1017/S0047404500005510.
Liang, Yiming, Pascal Amsili & Heather Burnett. 2021. New ways of analyzing complementizer drop in Montréal French: Exploration of cognitive factors. Language Variation and Change 33(3). 359–385. https://doi.org/10.1017/S0954394521000223.
Marcus, Mitchell P., Beatrice Santorini, Mary Ann Marcinkiewicz & Ann Taylor. 1999. Treebank-3 LDC99T42. Philadelphia: Linguistic Data Consortium.
Masis, Tessa, Anissa Neal, Lisa Green & Brendan O’Connor. 2022. Corpus-guided contrast sets for morphosyntactic feature detection in low-resource English varieties. In Proceedings of the first workshop on NLP applications to field linguistics, COLING, 11–25. International Conference on Computational Linguistics. Available at: https://aclanthology.org/2022.fieldmatters-1.2.
Meyerhoff, Miriam. 2011. Introducing sociolinguistics, 2nd edn. London & New York: Routledge.
Meyerhoff, Miriam & Naomi Nagy. 2008. Introduction: Social lives in language. In Miriam Meyerhoff & Naomi Nagy (eds.), Social lives in language – sociolinguistics and multilingual speech communities: Celebrating the work of Gillian Sankoff, 1–16. Amsterdam: John Benjamins. https://doi.org/10.1075/impact.24.02nag.
Milroy, Lesley & Matthew Gordon. 2003. Sociolinguistics: Method and interpretation. Malden, MA: Blackwell. https://doi.org/10.1002/9780470758359.
Nguyen, Dong, A. Seza Doğruöz, Carolyn P. Rosé & Franciska de Jong. 2016. Computational sociolinguistics: A survey. Computational Linguistics 42(3). 537–593. https://doi.org/10.1162/coli_a_00258.
Rickford, John R., Arnetha Ball, Renee Blake, Raina Jackson & Nomi Martin. 1991. Rappin on the copula coffin: Theoretical and methodological issues in the analysis of copula variation in African-American Vernacular English. Language Variation and Change 3(1). 103–132. https://doi.org/10.1017/S0954394500000466.
Rissanen, Matti. 1991. On the history of that/zero as object clause links in English. In Karin Aijmer & Bengt Altenberg (eds.), English corpus linguistics: Studies in honour of Jan Svartvik, 272–289. New York & London: Routledge.
Rodríguez Riccelli, Adrián. 2018. Espero estén todos: The distribution of the null subordinating complementizer in two varieties of Spanish. In Jeremy King & Sandro Sessarego (eds.), Language variation and contact-induced change: Spanish across space and time, 299–333. Amsterdam: John Benjamins. https://doi.org/10.1075/cilt.340.14ric.
Rohdenburg, Günter. 1996. Cognitive complexity and increased grammatical explicitness in English. Cognitive Linguistics 7. 149–182. https://doi.org/10.1515/cogl.1996.7.2.149.
Romaine, Suzanne. 1984. On the problem of syntactic variation and pragmatic meaning in sociolinguistic theory. Folia Linguistica 18. 409–437. https://doi.org/10.1515/flin.1984.18.3-4.0.
Sankoff, Gillian. 1990. The grammaticalization of tense and aspect in Tok Pisin and Sranan. Language Variation and Change 2(3). 295–312. https://doi.org/10.1017/S0954394500000387.
Stanford, James N. & Dennis R. Preston. 2009. The lure of a distant horizon: Variation in indigenous minority languages. In James N. Stanford & Dennis R. Preston (eds.), Variation in indigenous minority languages, 1–20. Amsterdam: John Benjamins. https://doi.org/10.1075/impact.25.01sta.
Szmrecsanyi, Benedikt, Jason Grafmiller, Joan Bresnan, Anette Rosenbach, Sali Tagliamonte & Simon Todd. 2017. Spoken syntax in a comparative perspective: The dative and genitive alternation in varieties of English. Glossa 2(1). 1–27. https://doi.org/10.5334/gjgl.310.
Tagliamonte, Sali A. 2006. Analysing sociolinguistic variation. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511801624.
Tagliamonte, Sali A. 2012. Variationist sociolinguistics: Change, observation, interpretation. Malden, MA: Wiley-Blackwell.
Tagliamonte, Sali A. 2016. Teen talk: The language of adolescents. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9781139583800.
Tagliamonte, Sali & Chris Roberts. 2005. So weird; so cool; so innovative: The use of intensifiers in the television series Friends. American Speech 80(3). 280–300. https://doi.org/10.1215/00031283-80-3-280.
Tagliamonte, Sali & Jennifer Smith. 2005. No momentary fancy! The zero “complementizer” in English dialects. English Language and Linguistics 9(2). 289–309. https://doi.org/10.1017/S1360674305001644.
Torres Cacoullos, Rena & James A. Walker. 2009. On the persistence of grammar in discourse formulas: A variationist study of that. Linguistics 47(1). 1–43. https://doi.org/10.1515/LING.2009.001.
Turc, Iulia, Ming-Wei Chang, Kenton Lee & Kristina Toutanova. 2019. Well-read students learn better: On the importance of pre-training compact models. arXiv preprint. https://doi.org/10.48550/arXiv.1908.08962.
Walker, James A. 2010. Variation in linguistic systems. New York: Routledge.
Walker, James A. 2013. Variation analysis. In Robert J. Podesva & Devyani Sharma (eds.), Research methods in linguistics, 440–459. Cambridge: Cambridge University Press.
© 2024 Walter de Gruyter GmbH, Berlin/Boston