Abstract
Survival analysis is widely used to relate a time-to-event outcome to a set of potential covariates. Accurately predicting the time of an event of interest is of primary importance in survival analysis. Many different algorithms have been proposed for survival prediction. However, for a given prediction problem it is rarely, if ever, possible to know in advance which algorithm will perform best. In this paper we propose two algorithms for constructing super learners for survival data prediction, where the individual algorithms are based on proportional hazards. A super learner is a flexible approach to statistical learning that finds the best weighted ensemble of the individual algorithms. Finding the optimal combination of the individual algorithms by minimizing cross-validated risk controls for over-fitting of the final ensemble learner. Candidate algorithms may range from a basic Cox model to tree-based machine learning algorithms, provided that all candidate algorithms are based on the proportional hazards framework. The ensemble weights are estimated by minimizing the cross-validated negative log partial likelihood. We compare the performance of the proposed super learners with that of existing models through extensive simulation studies. In all simulation scenarios, the proposed super learners are either the best fit or near the best fit. The performance of the newly proposed algorithms is also demonstrated with clinical data examples.
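The weighting step described above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes only two candidate learners, assumes their out-of-fold (cross-validated) risk scores are already available, restricts the weights to a convex combination, searches a simple grid rather than using a numerical optimizer, and evaluates the Cox partial likelihood on the pooled out-of-fold scores with no tied event times. The function names and the simulated data are hypothetical.

```python
import numpy as np

def neg_log_partial_likelihood(eta, time, event):
    """Cox negative log partial likelihood for risk scores eta.

    Assumes no tied event times (a simplification for this sketch).
    """
    order = np.argsort(-time)                 # latest observed time first
    eta_sorted = eta[order]
    event_sorted = event[order]
    # Entry i is log(sum_j exp(eta_j)) over subjects still at risk at time_i.
    log_risk_set = np.logaddexp.accumulate(eta_sorted)
    return -np.sum((eta_sorted - log_risk_set)[event_sorted == 1])

def best_convex_weight(z1, z2, time, event, grid_size=101):
    """Grid search for alpha in [0, 1] minimizing the loss of alpha*z1 + (1-alpha)*z2."""
    alphas = np.linspace(0.0, 1.0, grid_size)
    losses = np.array([
        neg_log_partial_likelihood(a * z1 + (1.0 - a) * z2, time, event)
        for a in alphas
    ])
    best = int(np.argmin(losses))
    return alphas[best], losses[best]

# Hypothetical toy data: exponential event times under proportional hazards,
# with independent exponential censoring.
rng = np.random.default_rng(0)
n = 200
true_score = rng.normal(size=n)
event_time = rng.exponential(scale=np.exp(-true_score))
censor_time = rng.exponential(scale=2.0, size=n)
time = np.minimum(event_time, censor_time)
event = (event_time <= censor_time).astype(int)

# Stand-ins for the out-of-fold risk scores of two candidate learners.
z1 = true_score + rng.normal(scale=0.5, size=n)
z2 = true_score + rng.normal(scale=1.0, size=n)

alpha, loss = best_convex_weight(z1, z2, time, event)
print(f"weight on learner 1: {alpha:.2f}, ensemble loss: {loss:.2f}")
```

Because the grid includes the endpoints 0 and 1, the selected ensemble can never have a larger loss on these scores than either candidate alone. With more than two candidates, the grid would typically be replaced by a constrained numerical optimizer.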
Conflict of Interest: None.
© 2020 Walter de Gruyter GmbH, Berlin/Boston