Seeing the wood for the trees: predictive margins for random forests

Lukas Sönning and Jason Grafmiller
Published/Copyright: March 28, 2023

Abstract

Classification trees and random forests offer a number of attractive features to corpus data analysts. However, the way in which these models are typically reported – a decision tree and/or set of variable importance scores – offers insufficient information if interest centers on the (form of) relationship between (multiple) predictors and the outcome. This paper develops predictive margins as an interpretative approach to ensemble techniques such as random forests. These are model summaries in the form of adjusted predictions, which provide a clearer picture of patterns in the data and allow us to query a model on potential nonlinear associations and interactions among predictor variables. The present paper outlines the general strategy for forming predictive margins and addresses methodological issues from an explicitly (corpus) linguistic perspective. For illustration, we use data on the English genitive alternation and provide an R package and code for their implementation.
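As a rough illustration of what an adjusted prediction involves, the sketch below computes predictive margins for one categorical predictor in plain Python. It is not the R package mentioned above. The data, the predictor names (`animacy`, `length`), and the stand-in function `predict_proba` are all hypothetical; in a real analysis, the prediction method of a fitted random forest would take its place.

```python
from statistics import mean

# Toy observations (hypothetical values, not taken from the paper's data).
rows = [
    {"animacy": "animate", "length": 1},
    {"animacy": "animate", "length": 3},
    {"animacy": "inanimate", "length": 2},
    {"animacy": "inanimate", "length": 5},
]


def predict_proba(row):
    """Hypothetical stand-in for a fitted model's predicted probability.

    The rule includes an interaction between animacy and length; a real
    analysis would query a fitted random forest here instead.
    """
    if row["animacy"] == "animate":
        return max(0.0, 0.8 - 0.05 * row["length"])
    return max(0.0, 0.2 - 0.02 * row["length"])


def predictive_margin(rows, predict, focal, value):
    """Average prediction after setting `focal` to `value` in every row.

    All other predictors keep their observed values, so the average is
    taken over their distribution in the sample.
    """
    adjusted = [{**row, focal: value} for row in rows]
    return mean(predict(row) for row in adjusted)


# One adjusted prediction per level of the focal predictor.
margins = {
    value: predictive_margin(rows, predict_proba, "animacy", value)
    for value in ("animate", "inanimate")
}
print(margins)
```

Because every observation is assigned the same value of the focal predictor before averaging, the difference between the two margins reflects the model's response to that predictor rather than differences in how the other predictors happen to be distributed across its levels.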


Corresponding author: Lukas Sönning, English Linguistics, University of Bamberg, Germany

Acknowledgements

We would like to thank Bodo Winter and the two anonymous reviewers for their constructive and insightful comments on earlier versions of this paper.

Received: 2022-09-27
Accepted: 2023-03-06
Published Online: 2023-03-28
Published in Print: 2024-02-26

© 2023 Walter de Gruyter GmbH, Berlin/Boston

Downloaded on 6.11.2025 from https://www.degruyterbrill.com/document/doi/10.1515/cllt-2022-0083/html