Abstract
Classification trees and random forests offer a number of attractive features to corpus data analysts. However, the way in which these models are typically reported – a decision tree and/or set of variable importance scores – offers insufficient information if interest centers on the (form of the) relationship between (multiple) predictors and the outcome. This paper develops predictive margins as an interpretative approach to ensemble techniques such as random forests. These are model summaries in the form of adjusted predictions, which provide a clearer picture of patterns in the data and allow us to query a model on potential nonlinear associations and interactions among predictor variables. The present paper outlines the general strategy for forming predictive margins and addresses methodological issues from an explicitly (corpus) linguistic perspective. For illustration, we use data on the English genitive alternation and provide an R package and code for their implementation.
Acknowledgements
We would like to thank Bodo Winter and the two anonymous reviewers for their constructive and insightful comments on earlier versions of this paper.
© 2023 Walter de Gruyter GmbH, Berlin/Boston
Articles in this issue
- Frontmatter
- Coalescence and contraction of V-to-Vinf sequences in American English – Evidence from spoken language
- Metaphorical language change is Self-Organized Criticality
- A corpus-based quantitative study of numeral classifiers in Nepali
- They worked their hardest on the construction’s history: Superlative Objoid Constructions in Late Modern American English
- Large-scale patterns of number use in spoken and written English
- Seeing the wood for the trees: predictive margins for random forests
- A multifactorial aspectual analysis of verb concatenation with imperfective markers zhe in Mandarin
- To drop or not to drop? Predicting the omission of the infinitival marker in a Swedish future construction