Abstract
In many applications, it is of interest to identify a parsimonious set of features, or panel, from multiple candidates that achieves a desired level of performance in predicting a response. This task is often complicated in practice by missing data arising from the sampling design or other random mechanisms. Most recent work on variable selection in missing data contexts relies at least in part on a finite-dimensional statistical model, e.g., a generalized or penalized linear model. In cases where this model is misspecified, the selected variables may not all be scientifically relevant, and the resulting panels may have suboptimal classification performance. To address this limitation, we propose a nonparametric variable selection algorithm combined with multiple imputation to develop flexible panels in the presence of missing-at-random data. We outline strategies based on the proposed algorithm that achieve control of commonly used error rates. Through simulations, we show that our proposal has good operating characteristics and results in panels with higher classification and variable selection performance compared to several existing penalized regression approaches in cases where a generalized linear model is misspecified. Finally, we use the proposed method to develop biomarker panels for separating pancreatic cysts with differing malignancy potential in a setting where complicated missingness in the biomarkers arose due to limited specimen volumes.
Funding source: Office of the Director
Award Identifier / Grant number: S10OD028685
Funding source: National Cancer Institute
Award Identifier / Grant number: U24CA086368
Funding source: National Institute of Allergy and Infectious Diseases
Award Identifier / Grant number: R37AI054165
Funding source: National Institute of General Medical Sciences
Award Identifier / Grant number: R01GM106177
- Research ethics: Not applicable.
- Author contributions: The authors have accepted responsibility for the entire content of this manuscript and approved its submission.
- Competing interests: The authors state no conflict of interest.
- Research funding: This work was supported by the National Institutes of Health (NIH) grants R37AI054165, R01GM106177, U24CA086368 and S10OD028685. The opinions expressed in this article are those of the authors and do not necessarily represent the official views of the NIH.
- Data availability: The data that support the findings of this paper are not publicly available due to privacy or ethical concerns. The statistical programs for implementing the methods are available from the corresponding author upon request.
Supplementary Material
This article contains supplementary material (https://doi.org/10.1515/ijb-2023-0059).
© 2024 Walter de Gruyter GmbH, Berlin/Boston
Articles in the same Issue
- Frontmatter
- Research Articles
- Random forests for survival data: which methods work best and under what conditions?
- Flexible variable selection in the presence of missing data
- An interpretable cluster-based logistic regression model, with application to the characterization of response to therapy in severe eosinophilic asthma
- MBPCA-OS: an exploratory multiblock method for variables of different measurement levels. Application to study the immune response to SARS-CoV-2 infection and vaccination
- Detecting differentially expressed genes from RNA-seq data using fuzzy clustering
- Hypothesis testing for detecting outlier evaluators
- Response to comments on ‘sensitivity of estimands in clinical trials with imperfect compliance’
- Commentary
- Comments on “sensitivity of estimands in clinical trials with imperfect compliance” by Chen and Heitjan
- Research Articles
- Optimizing personalized treatments for targeted patient populations across multiple domains
- Statistical models for assessing agreement for quantitative data with heterogeneous random raters and replicate measurements
- History-restricted marginal structural model and latent class growth analysis of treatment trajectories for a time-dependent outcome
- Revisiting incidence rates comparison under right censorship
- Ensemble learning methods of inference for spatially stratified infectious disease systems
- The survival function NPMLE for combined right-censored and length-biased right-censored failure time data: properties and applications
- Hybrid classical-Bayesian approach to sample size determination for two-arm superiority clinical trials
- Estimation of a decreasing mean residual life based on ranked set sampling with an application to survival analysis
- Improving the mixed model for repeated measures to robustly increase precision in randomized trials
- Bayesian second-order sensitivity of longitudinal inferences to non-ignorability: an application to antidepressant clinical trial data
- A modified rule of three for the one-sided binomial confidence interval
- Kalman filter with impulse noised outliers: a robust sequential algorithm to filter data with a large number of outliers
- Bayesian estimation and prediction for network meta-analysis with contrast-based approach
- Testing for association between ordinal traits and genetic variants in pedigree-structured samples by collapsing and kernel methods