Seeing the wood for the trees: predictive margins for random forests

  • Lukas Sönning and Jason Grafmiller
Published/Copyright: 28 March 2023

Abstract

Classification trees and random forests offer a number of attractive features to corpus data analysts. However, the way in which these models are typically reported – a decision tree and/or set of variable importance scores – offers insufficient information if interest centers on the (form of) relationship between (multiple) predictors and the outcome. This paper develops predictive margins as an interpretative approach to ensemble techniques such as random forests. These are model summaries in the form of adjusted predictions, which provide a clearer picture of patterns in the data and allow us to query a model on potential nonlinear associations and interactions among predictor variables. The present paper outlines the general strategy for forming predictive margins and addresses methodological issues from an explicitly (corpus) linguistic perspective. For illustration, we use data on the English genitive alternation and provide an R package and code for their implementation.


Corresponding author: Lukas Sönning, English Linguistics, University of Bamberg, Germany

Acknowledgements

We would like to thank Bodo Winter and the two anonymous reviewers for their constructive and insightful comments on earlier versions of this paper.

Received: 2022-09-27
Accepted: 2023-03-06
Published Online: 2023-03-28
Published in Print: 2024-02-26

© 2023 Walter de Gruyter GmbH, Berlin/Boston

Downloaded on 6 November 2025 from https://www.degruyterbrill.com/document/doi/10.1515/cllt-2022-0083/html?lang=de