Abstract
External validation is critical in risk prediction model development to assess a model's performance on independent datasets before clinical application. Sample size determination (SSD) is a pivotal aspect of external validation, ensuring sufficient precision, adequate power, sound ethical conduct, and efficient resource allocation. SSD for external validation of a risk model with binary outcomes can target the most commonly used evaluation criterion, the area under the receiver operating characteristic (AUROC) curve, under two study designs: one based on the margin of error and the other on hypothesis testing. Sample sizes for external validation are then estimated using various AUROC variance estimators. In this article, we conduct extensive simulations to compare the performance of different AUROC-based SSD methods for external validation under the two study designs. These simulations involve models developed using various machine learning (ML) algorithms, different sets of predictors, and data generated under diverse distributions. The simulation results show that sample size estimates differ substantially across SSD methods. Based on this empirical evidence, we recommend the Hanley and McNeil method or the rules-of-thumb approach for margin-of-error designs, given their balance of reliability and feasibility. For hypothesis testing designs, the Hanley and McNeil method demonstrates superior operating characteristics and is recommended for achieving adequate power. We also demonstrate the application of these AUROC-based SSD methods in the external validation of a risk prediction model forecasting 1-year overall survival (OS) in pediatric patients who underwent allogeneic hematopoietic cell transplantation (alloHCT). Our findings highlight the importance of selecting appropriate SSD methods for external validation and provide practical guidance for implementing SSD in clinical prediction model validation.
Funding source: National Institutes of Health/National Cancer Institute
Award Identifier / Grant number: P30 CA021765
Funding source: American Lebanese Syrian Associated Charities
Research ethics: Not applicable.
Informed consent: Not applicable.
Author contributions: All authors have accepted responsibility for the entire content of this manuscript and approved its submission. Y. Zhou and L. Tang conceptualized the study, developed simulations, interpreted results, and wrote the manuscript; Y. Zhou and Y. Zheng implemented the computational algorithms, conducted simulations, and performed real data analysis; C. Li and A. Sharma provided inputs to the study and reviewed the analysis; and all authors approved the final version of the manuscript.
Use of Large Language Models, AI and Machine Learning Tools: During the preparation of this work, the authors used OpenAI's ChatGPT (2024) for grammatical revisions. After using this tool/service, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.
Conflict of interest: The authors state no conflict of interest.
Research funding: The authors are supported for their work at St Jude Children’s Research Hospital by the American Lebanese Syrian Associated Charities and National Institutes of Health/National Cancer Institute grant (P30 CA021765).
Data availability: Not applicable.
Appendix A: Algorithms of SSD based on AUROC
Algorithm 1: SSD for Design 1 – based on the length of the CI of AUROC (margin-of-error design)

1: Input: θ, h, p, α
2: if n cannot be directly calculated from the requirement 2 z_{1−α/2} √(Var(θ̂)) ≤ h then
3:   for l = 1, …, 5,000 do
4:     Calculate Var_l(θ̂) based on the inputs θ, l, p, taking n_1 = lp events and n_0 = l(1 − p) non-events
5:     Calculate the CI length L_l = 2 z_{1−α/2} √(Var_l(θ̂))
6:   end for
7:   Find the smallest index l* for which L_{l*} ≤ h
8:   Set n = l*
9: else
10:   Directly calculate n from the requirement in step 2
11: end if
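A minimal Python sketch of Algorithm 1 follows, using the Hanley and McNeil (1982) variance as one concrete choice of Var(θ̂); the function names and the use of expected (possibly fractional) event counts n_1 = np and n_0 = n(1 − p) are our illustrative assumptions, not the paper's own code.

```python
# Sketch of Algorithm 1 (margin-of-error design) with the Hanley-McNeil
# variance estimator. Function names are illustrative assumptions.
from math import sqrt

from scipy.stats import norm


def hanley_mcneil_var(theta, n1, n0):
    """Hanley-McNeil variance of the empirical AUROC estimate."""
    q1 = theta / (2 - theta)
    q2 = 2 * theta ** 2 / (1 + theta)
    return (theta * (1 - theta)
            + (n1 - 1) * (q1 - theta ** 2)
            + (n0 - 1) * (q2 - theta ** 2)) / (n1 * n0)


def ssd_design1(theta, h, p, alpha=0.05, n_max=5000):
    """Smallest n whose two-sided (1 - alpha) CI for the AUROC has length <= h."""
    z = norm.ppf(1 - alpha / 2)
    for n in range(2, n_max + 1):
        n1, n0 = n * p, n * (1 - p)  # expected events / non-events
        if min(n1, n0) < 1:
            continue
        if 2 * z * sqrt(hanley_mcneil_var(theta, n1, n0)) <= h:
            return n
    return None  # requirement not met within n_max


# Example: anticipated AUROC 0.75, prevalence 0.2, 95 % CI length <= 0.1
print(ssd_design1(theta=0.75, h=0.1, p=0.2))
```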
Algorithm 2: SSD for Design 2 – based on the hypothesis test of AUROC (historical control design)

1: Input: θ_0, θ_1, p, α, β
2: if n cannot be directly calculated from the requirement z_{1−α} √(Var_0(θ̂)) + z_{1−β} √(Var_1(θ̂)) ≤ θ_1 − θ_0 then
3:   for l = 1, …, 5,000 do
4:     Calculate Var_{0,l}(θ̂) under H_0 based on the inputs θ_0, l, p
5:     Calculate Var_{1,l}(θ̂) under H_a based on the inputs θ_1, l, p
6:     Calculate the power 1 − β_l = Φ((θ_1 − θ_0 − z_{1−α} √(Var_{0,l}(θ̂))) / √(Var_{1,l}(θ̂)))
7:   end for
8:   Find the smallest index l* for which 1 − β_{l*} ≥ 1 − β
9:   Set n = l*
10: else
11:   Directly calculate n from the requirement in step 2
12: end if
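A corresponding sketch of Algorithm 2 is below, reusing hanley_mcneil_var() from the previous sketch; a one-sided test is assumed here, consistent with H_a: θ_1 > θ_0, and again the names and defaults are our own illustrative choices.

```python
# Sketch of Algorithm 2 (historical control design), reusing the
# hanley_mcneil_var() helper defined in the previous sketch.
from math import sqrt

from scipy.stats import norm


def ssd_design2(theta0, theta1, p, alpha=0.05, beta=0.2, n_max=5000):
    """Smallest n with power >= 1 - beta for the one-sided test
    H0: theta = theta0 vs. Ha: theta = theta1 (> theta0)."""
    z_alpha = norm.ppf(1 - alpha)
    for n in range(2, n_max + 1):
        n1, n0 = n * p, n * (1 - p)
        if min(n1, n0) < 1:
            continue
        sd0 = sqrt(hanley_mcneil_var(theta0, n1, n0))  # SE under H0
        sd1 = sqrt(hanley_mcneil_var(theta1, n1, n0))  # SE under Ha
        power = norm.cdf((theta1 - theta0 - z_alpha * sd0) / sd1)
        if power >= 1 - beta:
            return n
    return None


# Example: 80 % power to detect an AUROC of 0.75 against a historical 0.70
print(ssd_design2(theta0=0.70, theta1=0.75, p=0.2))
```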
Appendix B: Simulation settings
We generated a population of size N=100,000, which we considered large enough to serve as the entire population. Let i index individuals, where i=1, …, N. Each individual's disease status was labeled by the variable y_i, where y_i=1 indicated a diseased patient and y_i=0 a non-diseased individual. We assumed that each subject had a d-dimensional feature vector X_i = (X_{i1}, …, X_{id}), with d=60 predictors X_1, …, X_60 in our simulations.
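The exact generating distributions varied across simulation scenarios; purely for illustration, one population draw under an assumed logistic outcome model, with three tiers of predictor strength mirroring the grouping {X_1, …, X_20}, {X_21, …, X_40}, {X_41, …, X_60} used below, might look like:

```python
# Illustrative population draw only: the logistic link, standard normal
# predictors, and coefficient tiers are assumptions of this sketch, not the
# paper's exact generating mechanism.
import numpy as np

rng = np.random.default_rng(2025)
N, d = 100_000, 60                   # population size and predictor count
X = rng.standard_normal((N, d))
coef = np.r_[np.full(20, 0.30),      # assumed strong tier: X1-X20
             np.full(20, 0.10),      # assumed moderate tier: X21-X40
             np.full(20, 0.02)]      # assumed weak tier: X41-X60
logit = X @ coef - 1.5               # intercept tunes the disease proportion p
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
print("disease proportion p =", y.mean())
```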
In real data analysis, it is common to collect information on only a subset of predictors, with some predictors remaining unknown or unmeasured. In the simulation, we therefore considered scenarios where the number of collected predictors, denoted as m, varied within the set {20, 30, 50}. For each value of m, we randomly selected 40 % from {X_1, …, X_20}, 30 % from {X_21, …, X_40}, and the remaining 30 % from {X_41, …, X_60}. This selection process ensured that our model included only a proportion of the important predictors. Before conducting SSD, we drew a dataset of 1,000 individuals from the population. This dataset, referred to as D_prior, was used as a prior dataset to estimate the disease proportion p and to fit a risk model using the selected m predictors. We then calculated the AUROC, θ_1, which was considered the underlying AUROC value needed for SSD. D_prior was randomly split into 80 % training and 20 % testing sets 10 times. The risk model was constructed on the training set using different ML algorithms, including Logistic Regression (LR), Naive Bayes (NB), Random Forest (RF), Support Vector Machine (SVM) with the Radial Basis Function (RBF) kernel, and Neural Network (NN). Five-fold cross-validation was performed during model training for algorithms requiring parameter tuning (i.e., NB, RF, SVM, and NN). The average AUROC value across the 10 testing sets was reported as θ_1, which serves as the benchmark AUROC achievable in the external validation study. This process mimics prior knowledge, such as estimates from published studies, as commonly practiced.
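A sketch of this θ_1 estimation step follows, with logistic regression standing in for the full set of ML algorithms compared in the paper; the helper name and seed handling are our own.

```python
# Sketch of estimating theta_1 on D_prior: 10 random 80/20 splits, a model fit
# on each training part, and the AUROC averaged over the 10 test parts.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def estimate_theta1(X_prior, y_prior, n_splits=10, seed=0):
    aucs = []
    for s in range(n_splits):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_prior, y_prior, test_size=0.2, random_state=seed + s)
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        aucs.append(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
    return float(np.mean(aucs))
```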
The remaining patients who were not part of D_prior were designated as D_remain. We conducted separate simulation experiments for Design 1 and Design 2. In Design 1, we assumed that, based on the prior study using D_prior, the AUROC achievable in the external validation study would be θ_1. The objective was to determine the sample size required for the external validation data to estimate the AUROC with a margin of error of 5 % from a 95 % CI (i.e., α=0.05 and a 95 % CI length of L<h=0.1). To assess the sensitivity of our simulation results to the choice of h, we conducted additional simulations under alternative settings (h=0.06 and h=0.14), with results presented in the Supplementary Materials. In Design 2, we considered a scenario where an investigator aimed to test the hypothesis H_0: θ=θ_0=θ_1−0.05 vs. H_a: θ=θ_1>θ_0. In the external validation study, the investigator sought 80 % power (i.e., type II error β=0.2) to detect this difference while maintaining a type I error rate of α=0.05, as in other clinical studies. Similarly, to evaluate the sensitivity of the simulation results to the choice of effect size (θ_1−θ_0), we conducted additional simulations using alternative values (θ_1−θ_0=0.03 and θ_1−θ_0=0.07); the results can be found in the Supplementary Materials.
For both Design 1 and Design 2, we estimated the required sample sizes based on the variance of the AUROC, following the steps outlined in Figure 1. Additionally, we compared the performance of the AUROC-based SSD methods with the rules-of-thumb approach, which commonly requires a minimum of 100 events (i.e., n_1=100 and n_0=(1−p)n_1/p) for an external validation study. It is worth noting that the rules-of-thumb approach does not consider the specific type of study design and therefore yields the same required sample size for both Design 1 and Design 2. Once the sample sizes were estimated, we randomly sampled a total of n subjects from D_remain, with n_1=np patients from the disease group and n_0=n(1−p) patients from the non-disease group. We then constructed risk models using the sampled data. As in the estimation of θ_1, the AUROC value from this external validation set was reported as the average AUROC over 10 random 20 % test splits. This random sampling process was repeated for 500 replicates on D_remain to generate the empirical distribution of the AUROC. The entire process, starting from generating the population of N subjects, was repeated 100 times to derive summary statistics for evaluating SSD with both the rules-of-thumb approach and the different AUROC variance estimators.
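The rules-of-thumb sample size used for comparison reduces to a one-line calculation from the disease proportion p; a minimal sketch (the function name is ours):

```python
# Rules-of-thumb SSD: at least 100 events, with non-events scaled by p.
import math


def ssd_rules_of_thumb(p, min_events=100):
    n1 = min_events                   # events
    n0 = math.ceil((1 - p) * n1 / p)  # non-events
    return n1 + n0


print(ssd_rules_of_thumb(p=0.2))  # 100 events + 400 non-events = 500
```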
References
1. Alonzo, TA. Clinical prediction models: a practical approach to development, validation, and updating: by Ewout W. Steyerberg. Am J Epidemiol 2009;170:528. https://doi.org/10.1093/aje/kwp129.
2. Royston, P, Moons, KG, Altman, DG, Vergouwe, Y. Prognosis and prognostic research: developing a prognostic model. BMJ 2009;338:b604. https://doi.org/10.1136/bmj.b604.
3. Steyerberg, EW, Moons, KG, van der Windt, DA, Hayden, JA, Perel, P, Schroter, S, et al. Prognosis research strategy (PROGRESS) 3: prognostic model research. PLoS Med 2013;10:e1001381. https://doi.org/10.1371/journal.pmed.1001381.
4. Harrell, FE. Regression modeling strategies: with applications to linear models, logistic regression, and survival analysis. New York, NY: Springer; 2001. https://doi.org/10.1007/978-1-4757-3462-1.
5. Riley, RD, van der Windt, D, Croft, P, Moons, KG. Prognosis research in healthcare: concepts, methods, and impact. New York, NY: Oxford University Press; 2019. https://doi.org/10.1093/med/9780198796619.001.0001.
6. Kalayeh, H, Landgrebe, DA. Predicting the required number of training samples. IEEE Trans Pattern Anal Mach Intell 1983:664–7. https://doi.org/10.1109/tpami.1983.4767459.
7. Raudys, SJ, Jain, AK. Small sample size effects in statistical pattern recognition: recommendations for practitioners. IEEE Trans Pattern Anal Mach Intell 1991;13:252–64. https://doi.org/10.1109/34.75512.
8. Harrell, FE Jr, Lee, KL, Mark, DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med 1996;15:361–87. https://doi.org/10.1002/(SICI)1097-0258(19960229)15:4<361::AID-SIM168>3.0.CO;2-4.
9. Vergouwe, Y, Steyerberg, EW, Eijkemans, MJ, Habbema, JDF. Substantial effective sample sizes were required for external validation studies of predictive logistic regression models. J Clin Epidemiol 2005;58:475–83. https://doi.org/10.1016/j.jclinepi.2004.06.017.
10. Collins, GS, Ogundimu, EO, Altman, DG. Sample size considerations for the external validation of a multivariable prognostic model: a resampling study. Stat Med 2016;35:214–26. https://doi.org/10.1002/sim.6787.
11. Ogundimu, EO, Altman, DG, Collins, GS. Adequate sample size for developing prediction models is not simply related to events per variable. J Clin Epidemiol 2016;76:175–82. https://doi.org/10.1016/j.jclinepi.2016.02.031.
12. Newcombe, RG. Confidence intervals for an effect size measure based on the Mann–Whitney statistic. Part 2: asymptotic methods and evaluation. Stat Med 2006;25:559–73. https://doi.org/10.1002/sim.2324.
13. Pavlou, M, Qu, C, Omar, RZ, Seaman, SR, Steyerberg, EW, White, IR, et al. Estimation of required sample size for external validation of risk models for binary outcomes. Stat Methods Med Res 2021;30:2187–206. https://doi.org/10.1177/09622802211007522.
14. Birnbaum, ZW, Klose, OM. Bounds for the variance of the Mann–Whitney statistic. Ann Math Stat 1957:933–45. https://doi.org/10.1214/aoms/1177706794.
15. Hanley, JA, McNeil, BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982;143:29–36. https://doi.org/10.1148/radiology.143.1.7063747.
16. Rosner, B, Glynn, RJ. Power and sample size estimation for the Wilcoxon rank sum test with application to comparisons of C statistics from alternative prediction models. Biometrics 2009;65:188–97. https://doi.org/10.1111/j.1541-0420.2008.01062.x.
17. Lehmann, EL, D'Abrera, H. Nonparametrics: statistical methods based on ranks, rev. ed. Englewood Cliffs, NJ: Prentice-Hall; 1998.
18. Lehmann, EL. Elements of large-sample theory. New York, NY: Springer; 1999. https://doi.org/10.1007/b98855.
19. Bamber, D. The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. J Math Psychol 1975;12:387–415. https://doi.org/10.1016/0022-2496(75)90001-2.
20. Mee, RW. Confidence intervals for probabilities and tolerance regions based on a generalization of the Mann–Whitney statistic. J Am Stat Assoc 1990;85:793–800. https://doi.org/10.2307/2290017.
21. Auletta, JJ, Kou, J, Chen, M, Bolon, Y-T, Broglie, L, Bupp, C, et al. Real-world data showing trends and outcomes by race and ethnicity in allogeneic hematopoietic cell transplantation: a report from the Center for International Blood and Marrow Transplant Research. Transplant Cell Ther 2023;29:346.e1. https://doi.org/10.1016/j.jtct.2023.03.007.
22. Fausser, J-L, Tavenard, A, Rialland, F, Le Moine, P, Minckes, O, Jourdain, A, et al. Should we pay attention to the delay before admission to a pediatric intensive care unit for children with cancer? Impact on 1-month mortality. A report from the French Children's Oncology Study Group, GOCE. J Pediatr Hematol Oncol 2017;39:e244–8. https://doi.org/10.1097/mph.0000000000000816.
23. Lee, D-S, Suh, GY, Ryu, J-A, Chung, CR, Yang, JH, Park, C-M, et al. Effect of early intervention on long-term outcomes of critically ill cancer patients admitted to ICUs. Crit Care Med 2015;43:1439–48. https://doi.org/10.1097/ccm.0000000000000989.
24. Zhou, Y, Smith, J, Keerthi, D, Li, C, Sun, Y, Mothi, SS, et al. Longitudinal clinical data improves survival prediction after hematopoietic cell transplantation using machine learning. Blood Adv 2023. https://doi.org/10.1182/bloodadvances.2023011752.
25. Gratwohl, A, Stern, M, Brand, R, Apperley, J, Baldomero, H, de Witte, T, et al. Risk score for outcome after allogeneic hematopoietic stem cell transplantation: a retrospective analysis. Cancer 2009;115:4715–26. https://doi.org/10.1002/cncr.24531.
26. Sorror, ML, Maris, MB, Storb, R, Baron, F, Sandmaier, BM, Maloney, DG, et al. Hematopoietic cell transplantation (HCT)-specific comorbidity index: a new tool for risk assessment before allogeneic HCT. Blood 2005;106:2912–19. https://doi.org/10.1182/blood-2005-05-2004.
27. Armand, P, Kim, HT, Logan, BR, Wang, Z, Alyea, EP, Kalaycio, ME, et al. Validation and refinement of the disease risk index for allogeneic stem cell transplantation. Blood 2014;123:3664–71. https://doi.org/10.1182/blood-2014-01-552984.
28. Broglie, L, Ruiz, J, Jin, Z, Kahn, JM, Bhatia, M, George, D, et al. Limitations of applying the hematopoietic cell transplantation comorbidity index in pediatric patients receiving allogeneic hematopoietic cell transplantation. Transplant Cell Ther 2021;27:74.e1. https://doi.org/10.1016/j.bbmt.2020.10.003.
29. Nakaya, A, Mori, T, Tanaka, M, Tomita, N, Nakaseko, C, Yano, S, et al. Does the hematopoietic cell transplantation specific comorbidity index (HCT-CI) predict transplantation outcomes? A prospective multicenter validation study of the Kanto Study Group for Cell Therapy. Biol Blood Marrow Transplant 2014;20:1553–9. https://doi.org/10.1016/j.bbmt.2014.06.005.
30. Raimondi, R, Tosetto, A, Oneto, R, Cavazzina, R, Rodeghiero, F, Bacigalupo, A, et al. Validation of the hematopoietic cell transplantation-specific comorbidity index: a prospective, multicenter GITMO study. Blood 2012;120:1327–33. https://doi.org/10.1182/blood-2012-03-414573.
31. Versluis, J, Labopin, M, Niederwieser, D, Socie, G, Schlenk, RF, Milpied, N, et al. Prediction of non-relapse mortality in recipients of reduced intensity conditioning allogeneic stem cell transplantation with AML in first complete remission. Leukemia 2015;29:51–7. https://doi.org/10.1038/leu.2014.164.
32. Bantis, LE, Brewer, B, Nakas, CT, Reiser, B. Statistical inference for Box–Cox based receiver operating characteristic curves. Stat Med 2024;43:6099–122. https://doi.org/10.1002/sim.10252.
33. Hanley, JA. The robustness of the "binormal" assumptions used in fitting ROC curves. Med Decis Mak 1988;8:197–203. https://doi.org/10.1177/0272989x8800800308.
34. Pepe, MS. The statistical evaluation of medical tests for classification and prediction. New York, NY: Oxford University Press; 2003. https://doi.org/10.1093/oso/9780198509844.001.0001.
35. Reiser, B, Faraggi, D, Guttman, I. Choice of sample size for testing the P(X > Y). Commun Stat Theor Methods 1992;21:559–69. https://doi.org/10.1080/03610929208830798.
36. Reiser, B, Guttman, I. Sample size choice for reliability verification in strength-stress models. Can J Stat 1989;17:253–9. https://doi.org/10.2307/3315521.
37. Owen, DB. Tables for computing bivariate normal probabilities. Ann Math Stat 1956;27:1075–90. https://doi.org/10.1214/aoms/1177728074.
Supplementary Material
This article contains supplementary material (https://doi.org/10.1515/em-2025-0012).
© 2025 Walter de Gruyter GmbH, Berlin/Boston