Abstract
Risk prediction models can link high-dimensional molecular measurements, such as DNA methylation, to clinical endpoints. For biological interpretation, a sparse fit is often desirable. Different molecular aggregation levels, such as DNA methylation considered at the CpG, gene, or chromosome level, might demand different degrees of sparsity. Hence, model building and estimation techniques should be able to adapt their sparsity to the setting. Underestimation of coefficients, a typical problem of sparse techniques, should also be addressed. We propose a comprehensive approach, based on a boosting technique, that allows flexible adaptation of model sparsity and addresses these problems in an integrative way. The main motivation is automatic sparsity adaptation. In a simulation study, we show that this approach reduces underestimation in sparse settings and selects more adequate model sizes than the corresponding non-adaptive boosting technique in non-sparse settings. Using different aggregation levels of DNA methylation data from a study in kidney carcinoma patients, we illustrate how automatically selected values of the sparsity tuning parameter can reflect the underlying structure of the data. In addition, prediction performance and variable selection stability are compared with those of the non-adaptive boosting approach.
©2014 by Walter de Gruyter Berlin/Boston