Abstract
Usage-based linguistics abounds with studies that use statistical classification models to analyze either textual corpus data or behavioral experimental data. Yet before conclusions drawn from statistical models of empirical data can be fed back into cognitive linguistic theory, we need to assess whether the text-based models are cognitively plausible and whether the behavior-based models are linguistically accurate. In this paper, we review four case studies that evaluate statistical classification models of richly annotated linguistic data by explicitly comparing the performance of a corpus-based model to the behavior of native speakers. The data come from four different languages (Arabic, English, Estonian, and Russian) and pertain to both lexical and syntactic near-synonymy. We show that behavioral evidence is needed to fine-tune and improve statistical models built on corpus data. We argue that methodological pluralism is the key to a cognitively realistic linguistic theory.
©2016 by De Gruyter Mouton
Articles in this issue
- Frontmatter
- Usage-based cognitive-functional linguistics: From theory to method and back again
- The cognitive plausibility of statistical classification models: Comparing textual and behavioral evidence
- The usage and spread of sentence-internal capitalization in Early New High German: A multifactorial approach
- Quantifying polysemy: Corpus methodology for prototype theory
- A cognitive-constructionist approach to Spanish creo Ø and creo yo ‘[I] think’
- A corpus-based, cross-linguistic approach to mental predicates and their complementation: Performativity and descriptivity vis-à-vis boundedness and picturability
- Why we need a token-based typology: A case study of analytic and lexical causatives in fifteen European languages
- Constructional contamination: How does it work and how do we measure it?
- Regular FoL papers
- Mutual intelligibility of spoken Maltese, Libyan Arabic, and Tunisian Arabic functionally tested: A pilot study
- Intermediate information status for non-nominal constituents: Evidence from Spanish secondary predicates in adversatives
- Lower domain language shift in Taiwan: The case of Southern Min
- Book Reviews
- Crespo-Fernández, Eliecer: Sex in language: Euphemistic and dysphemistic metaphors in internet forums
- Sonnenhauser, Barbara and Patrizia Noel Aziz Hanna: Vocative! Addressing between system and performance
- Erratum
- Erratum