A computational approach to detecting the envelope of variation

  • Isaac L. Bleaman and Rhea Kommerell
Published/Copyright: September 25, 2024

Abstract

Variationist sociolinguistic methodology is grounded in the principle of accountability, which requires researchers to identify all of the contexts in which a given variable occurs or fails to occur. For morphosyntactic, lexical, and discourse variables, this process is notoriously time- and labor-intensive, as researchers manually sift through raw data in search of tokens to analyze. In this article, we demonstrate the usability of pretrained computational language models to automatically identify tokens of sociolinguistic variables in raw text. We focus on two English-language variables from different linguistic domains: intensifier choice (lexical; e.g., she is {very, really, so} smart) and complementizer selection (morphosyntactic; e.g., they thought {that, Ø} I understood). Text classifiers built with Bidirectional Encoder Representations from Transformers (BERT) achieve high precision and recall metrics for both variables, even with relatively little hand-annotated training data. Our findings suggest that computational language models can dramatically reduce the burden of preparing data for variationist analysis. Furthermore, by inspecting the classifiers’ scores for individual sentences, researchers can observe patterns that should be written into the description of the variable context for further study.
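The evaluation idea in the abstract (precision and recall for a classifier that flags sentences containing a token of the variable, plus inspection of per-sentence scores) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function name `precision_recall`, the 0.5 threshold, the gold labels, and the scores are invented toy values, not the authors' code or results, and no BERT model is actually run here.

```python
from typing import List, Tuple


def precision_recall(gold: List[int], scores: List[float],
                     threshold: float = 0.5) -> Tuple[float, float]:
    """Precision and recall for a binary 'contains a token of the variable' label.

    gold      -- hand-annotated labels (1 = sentence is in the variable context)
    scores    -- classifier scores in [0, 1] (hypothetical values here)
    threshold -- score at or above which a sentence is flagged (assumed 0.5)
    """
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data for the intensifier variable; labels and scores are invented.
sentences = [
    "she is very smart",               # intensifier token
    "she is so smart",                 # intensifier token
    "she is really smart",             # intensifier token, scored low (a miss)
    "they thought that I understood",  # no intensifier
    "so I went home",                  # 'so' is not an intensifier here
]
gold = [1, 1, 1, 0, 0]
scores = [0.97, 0.91, 0.40, 0.05, 0.62]

precision, recall = precision_recall(gold, scores)

# Rank sentences by score so borderline cases can be read by hand.
ranked = sorted(zip(scores, sentences), reverse=True)
for score, sentence in ranked:
    print(f"{score:.2f}  {sentence}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Reading the ranked list from the middle outward is one way to spot borderline contexts, such as non-intensifier uses of so, that a researcher might then write into the description of the variable context.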


Corresponding author: Isaac L. Bleaman, University of California, Berkeley, USA

Acknowledgments

We gratefully acknowledge David Bamman, Lucy Li, Nicholas Tomlin, and audiences at Methods in Dialectology XVII, New Ways of Analyzing Variation 50, and the Sociolinguistics Lab at Berkeley for their feedback on this project. We also thank Sali Tagliamonte for sending us a dataset on intensifier variation (described in Tagliamonte and Roberts 2005), without which this research would not have been possible.

References

Adli, Aria & Gregory R. Guy. 2022. Globalising the study of language variation and change: A manifesto on cross-cultural sociolinguistics. Language and Linguistics Compass 16(5–6). 1–15. https://doi.org/10.1111/lnc3.12452.

Bayley, Robert. 2013. The quantitative paradigm. In J. K. Chambers & Natalie Schilling (eds.), The handbook of language variation and change, 85–107. Malden, MA: Wiley-Blackwell.

Bleaman, Isaac L., Katie Cugno & Annie Helms. 2022. Medium-shifting and intraspeaker variation in conversational interviews. Language Variation and Change 34. 305–329. https://doi.org/10.1017/S0954394522000151.

Campbell-Kibler, Kathryn. 2007. Accent, (ING), and the social logic of listener perceptions. American Speech 82(1). 32–64. https://doi.org/10.1215/00031283-2007-002.

Demszky, Dorottya, Devyani Sharma, Jonathan H. Clark, Vinodkumar Prabhakaran & Jacob Eisenstein. 2021. Learning to recognize dialect features. Proceedings of the 2021 conference of the North American chapter of the Association for Computational Linguistics: Human language technologies, 2315–2338. Association for Computational Linguistics. Available at: https://aclanthology.org/2021.naacl-main.184/. https://doi.org/10.18653/v1/2021.naacl-main.184.

Devlin, Jacob, Ming-Wei Chang, Kenton Lee & Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 conference of the North American chapter of the Association for Computational Linguistics: Human language technologies, 1, 4171–4186. Association for Computational Linguistics. Available at: https://aclanthology.org/N19-1423/.

Fischer, John L. 1958. Social influences on the choice of a linguistic variant. Word 14(1). 47–56. https://doi.org/10.1080/00437956.1958.11659655.

Forrest, Jon. 2017. The dynamic interaction between lexical and contextual frequency: A case study of (ING). Language Variation and Change 29(2). 129–156. https://doi.org/10.1017/S0954394517000072.

Gordon, Matthew J. 2013. Labov: A guide for the perplexed. London: Bloomsbury. https://doi.org/10.5040/9781472541673.

Hazen, Kirk. 2008. (ING): A vernacular baseline for English in Appalachia. American Speech 83(2). 116–140. https://doi.org/10.1215/00031283-2008-008.

Kroch, Anthony & Cathy Small. 1978. Grammatical ideology and its effect on speech. In David Sankoff (ed.), Linguistic variation: Models and methods, 45–55. New York: Academic Press.

Labov, William. 1969. Contraction, deletion, and inherent variability of the English copula. Language 45(4). 715–762. https://doi.org/10.2307/412333.

Labov, William. 1972a. Sociolinguistic patterns. Philadelphia: University of Pennsylvania Press.

Labov, William. 1972b. Some principles of linguistic methodology. Language in Society 1(1). 97–120. https://doi.org/10.1017/S0047404500006576.

Labov, William. 1978. Where does the linguistic variable stop? A response to Beatriz Lavandera. In Richard Bauman & Joel Sherzer (eds.), Working papers in sociolinguistics, vol. 44, 1–17. Austin, TX: Southwest Educational Development Laboratory. Available at: https://eric.ed.gov/?id=ED157378.

Labov, William. 2004. Quantitative analysis of linguistic variation. In Ulrich Ammon, Norbert Dittmar, Klaus J. Mattheier & Peter Trudgill (eds.), Sociolinguistics: An international handbook of the science of language and society, 2nd edn, vol. 1, 6–21. Berlin: Walter de Gruyter.

Lavandera, Beatriz R. 1978. Where does the sociolinguistic variable stop? Language in Society 7(2). 171–182. https://doi.org/10.1017/S0047404500005510.

Liang, Yiming, Pascal Amsili & Heather Burnett. 2021. New ways of analyzing complementizer drop in Montréal French: Exploration of cognitive factors. Language Variation and Change 33(3). 359–385. https://doi.org/10.1017/S0954394521000223.

Marcus, Mitchell P., Beatrice Santorini, Mary Ann Marcinkiewicz & Ann Taylor. 1999. Treebank-3 LDC99T42. Philadelphia: Linguistic Data Consortium.

Masis, Tessa, Anissa Neal, Lisa Green & Brendan O’Connor. 2022. Corpus-guided contrast sets for morphosyntactic feature detection in low-resource English varieties. In Proceedings of the first workshop on NLP applications to field linguistics, COLING, 11–25. International Conference on Computational Linguistics. Available at: https://aclanthology.org/2022.fieldmatters-1.2.

Meyerhoff, Miriam. 2011. Introducing sociolinguistics, 2nd edn. London & New York: Routledge.

Meyerhoff, Miriam & Naomi Nagy. 2008. Introduction: Social lives in language. In Miriam Meyerhoff & Naomi Nagy (eds.), Social lives in language – sociolinguistics and multilingual speech communities: Celebrating the work of Gillian Sankoff, 1–16. Amsterdam: John Benjamins. https://doi.org/10.1075/impact.24.02nag.

Milroy, Lesley & Matthew Gordon. 2003. Sociolinguistics: Method and interpretation. Malden, MA: Blackwell. https://doi.org/10.1002/9780470758359.

Nguyen, Dong, A. Seza Doğruöz, Carolyn P. Rosé & Franciska de Jong. 2016. Computational sociolinguistics: A survey. Computational Linguistics 42(3). 537–593. https://doi.org/10.1162/coli_a_00258.

Rickford, John R., Arnetha Ball, Renee Blake, Raina Jackson & Nomi Martin. 1991. Rappin on the copula coffin: Theoretical and methodological issues in the analysis of copula variation in African-American Vernacular English. Language Variation and Change 3(1). 103–132. https://doi.org/10.1017/S0954394500000466.

Rissanen, Matti. 1991. On the history of that/zero as object clause links in English. In Karin Aijmer & Bengt Altenberg (eds.), English corpus linguistics: Studies in honour of Jan Svartvik, 272–289. New York & London: Routledge.

Rodríguez Riccelli, Adrián. 2018. Espero estén todos: The distribution of the null subordinating complementizer in two varieties of Spanish. In Jeremy King & Sandro Sessarego (eds.), Language variation and contact-induced change: Spanish across space and time, 299–333. Amsterdam: John Benjamins. https://doi.org/10.1075/cilt.340.14ric.

Rohdenburg, Günter. 1996. Cognitive complexity and increased grammatical explicitness in English. Cognitive Linguistics 7. 149–182. https://doi.org/10.1515/cogl.1996.7.2.149.

Romaine, Suzanne. 1984. On the problem of syntactic variation and pragmatic meaning in sociolinguistic theory. Folia Linguistica 18. 409–437. https://doi.org/10.1515/flin.1984.18.3-4.0.

Sankoff, Gillian. 1990. The grammaticalization of tense and aspect in Tok Pisin and Sranan. Language Variation and Change 2(3). 295–312. https://doi.org/10.1017/S0954394500000387.

Stanford, James N. & Dennis R. Preston. 2009. The lure of a distant horizon: Variation in indigenous minority languages. In James N. Stanford & Dennis R. Preston (eds.), Variation in indigenous minority languages, 1–20. Amsterdam: John Benjamins. https://doi.org/10.1075/impact.25.01sta.

Szmrecsanyi, Benedikt, Jason Grafmiller, Joan Bresnan, Anette Rosenbach, Sali Tagliamonte & Simon Todd. 2017. Spoken syntax in a comparative perspective: The dative and genitive alternation in varieties of English. Glossa 2(1). 1–27. https://doi.org/10.5334/gjgl.310.

Tagliamonte, Sali A. 2006. Analysing sociolinguistic variation. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511801624.

Tagliamonte, Sali A. 2012. Variationist sociolinguistics: Change, observation, interpretation. Malden, MA: Wiley-Blackwell.

Tagliamonte, Sali A. 2016. Teen talk: The language of adolescents. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9781139583800.

Tagliamonte, Sali & Chris Roberts. 2005. So weird; so cool; so innovative: The use of intensifiers in the television series Friends. American Speech 80(3). 280–300. https://doi.org/10.1215/00031283-80-3-280.

Tagliamonte, Sali & Jennifer Smith. 2005. No momentary fancy! The zero “complementizer” in English dialects. English Language and Linguistics 9(2). 289–309. https://doi.org/10.1017/S1360674305001644.

Torres Cacoullos, Rena & James A. Walker. 2009. On the persistence of grammar in discourse formulas: A variationist study of that. Linguistics 47(1). 1–43. https://doi.org/10.1515/LING.2009.001.

Turc, Iulia, Ming-Wei Chang, Kenton Lee & Kristina Toutanova. 2019. Well-read students learn better: On the importance of pre-training compact models. arXiv preprint. https://doi.org/10.48550/arXiv.1908.08962.

Walker, James A. 2010. Variation in linguistic systems. New York: Routledge.

Walker, James A. 2013. Variation analysis. In Robert J. Podesva & Devyani Sharma (eds.), Research methods in linguistics, 440–459. Cambridge: Cambridge University Press.

Received: 2023-10-23
Accepted: 2024-06-10
Published Online: 2024-09-25

© 2024 Walter de Gruyter GmbH, Berlin/Boston

Source: https://www.degruyterbrill.com/document/doi/10.1515/lingvan-2023-0157/html