Seeing the wood for the trees: predictive margins for random forests

Lukas Sönning; Jason Grafmiller

doi:10.1515/cllt-2022-0083


Published/Copyright: March 28, 2023


From the journal Corpus Linguistics and Linguistic Theory, Volume 20, Issue 1

Abstract

Classification trees and random forests offer a number of attractive features to corpus data analysts. However, the way in which these models are typically reported – a decision tree and/or set of variable importance scores – offers insufficient information if interest centers on the (form of) relationship between (multiple) predictors and the outcome. This paper develops predictive margins as an interpretative approach to ensemble techniques such as random forests. These are model summaries in the form of adjusted predictions, which provide a clearer picture of patterns in the data and allow us to query a model on potential nonlinear associations and interactions among predictor variables. The present paper outlines the general strategy for forming predictive margins and addresses methodological issues from an explicitly (corpus) linguistic perspective. For illustration, we use data on the English genitive alternation and provide an R package and code for their implementation.
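As a rough illustration of what "adjusted predictions" can mean in practice, the following minimal Python sketch computes predictive margins in one common way: for each value of a focal predictor, every observation is set to that value, the model is queried, and the predictions are averaged over the data. This is a sketch under stated assumptions, not the authors' R implementation. The function `predictive_margins`, the stand-in model `toy_predict`, the predictor names and all numbers are hypothetical and chosen only for illustration.

```python
from statistics import mean


def predictive_margins(predict, rows, feature, values):
    """Average prediction per value of `feature`.

    For each candidate value, every row is copied with `feature` set to
    that value (all other predictors keep their observed values), the
    model is queried, and the predictions are averaged over all rows.
    """
    margins = {}
    for value in values:
        counterfactual_rows = [{**row, feature: value} for row in rows]
        margins[value] = mean([predict(row) for row in counterfactual_rows])
    return margins


def toy_predict(row):
    """Hypothetical stand-in for a fitted forest's predicted probability.

    The rule and its numbers are invented for illustration only.
    """
    p = 0.2
    if row["animate"]:
        p += 0.5
    if row["length"] > 3:
        p -= 0.1
    return p


# Invented toy observations with two predictors.
rows = [
    {"animate": True, "length": 2},
    {"animate": False, "length": 5},
    {"animate": False, "length": 1},
    {"animate": True, "length": 4},
]

margins = predictive_margins(toy_predict, rows, "animate", [False, True])
for value, margin in margins.items():
    print(f"animate={value}: average predicted probability {margin:.2f}")
```

Because `predictive_margins` only needs a prediction function, the same averaging step could in principle be wrapped around any fitted ensemble. The original rows are copied rather than modified, so the observed data stay intact.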

Keywords: average predictive comparisons; classification trees; interpretable machine learning; predictive modeling; random forests

Corresponding author: Lukas Sönning, English Linguistics, University of Bamberg, Germany, E-mail: lukas.soenning@uni-bamberg.de

Acknowledgements

We would like to thank Bodo Winter and the two anonymous reviewers for their constructive and insightful comments on earlier versions of this paper.


Received: 2022-09-27

Accepted: 2023-03-06

Published Online: 2023-03-28

Published in Print: 2024-02-26
