
Sample size determination for external validation of risk models with binary outcomes using the area under the ROC curve

  • Yiwang Zhou, Ying Zheng, Cai Li, Akshay Sharma and Li Tang
Published/Copyright: November 4, 2025

Abstract

External validation is critical in risk prediction model development to assess the model’s performance on independent datasets before clinical application. Sample size determination (SSD) is a pivotal aspect of external validation, ensuring sufficient precision, adequate power, appropriate ethical considerations, and affordable resource allocation. SSD for external validation of a risk model with binary outcomes can be performed targeting the most commonly used evaluation criterion, the area under the receiver operating characteristic (AUROC) curve, for two study designs: one focusing on margin-of-error and another on hypothesis testing. Sample sizes for external validation are estimated using various AUROC variance estimators. In this article, we conduct extensive simulations to compare the performance of different AUROC-based SSD methods for external validation under the two study designs. These simulations involve models developed using various machine learning (ML) algorithms, different sets of predictors, and data generated under diverse distributions. The simulation results show that sample size estimates can differ substantially depending on the chosen SSD method. Based on empirical evidence from these simulations, we recommend the Hanley and McNeil method or the rules-of-thumb approach for margin-of-error–based designs, given their balance between reliability and feasibility. For hypothesis testing designs, the Hanley and McNeil method demonstrates superior operating characteristics and is recommended for achieving adequate power. We also demonstrate the application of these AUROC-based SSD methods in the external validation of a risk prediction model for forecasting 1-year overall survival (OS) in pediatric patients who underwent allogeneic hematopoietic cell transplantation (alloHCT).
Our findings highlight the importance of selecting appropriate SSD methods for external validation and provide practical guidance for implementing SSD in clinical prediction model validation.


Corresponding authors: Yiwang Zhou and Li Tang, Department of Biostatistics, St. Jude Children’s Research Hospital, Memphis, TN, USA.

Funding source: National Institutes of Health/National Cancer Institute

Award Identifier / Grant number: P30 CA021765

  1. Research ethics: Not applicable.

  2. Informed consent: Not applicable.

  3. Author contributions: All authors have accepted responsibility for the entire content of this manuscript and approved its submission. Y. Zhou and L. Tang conceptualized the study, developed simulations, interpreted results, and wrote the manuscript; Y. Zhou and Y. Zheng implemented the computational algorithms, conducted simulations, and performed real data analysis; C. Li and A. Sharma provided inputs to the study and reviewed the analysis; and all authors approved the final version of the manuscript.

  4. Use of Large Language Models, AI and Machine Learning Tools: During the preparation of this work the authors used OpenAI’s ChatGPT (2024) for grammatical revisions. After using this tool/service, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.

  5. Conflict of interest: The authors state no conflict of interest.

  6. Research funding: The authors are supported for their work at St. Jude Children’s Research Hospital by the American Lebanese Syrian Associated Charities and National Institutes of Health/National Cancer Institute grant (P30 CA021765).

  7. Data availability: Not applicable.

Appendix A: Algorithms of SSD based on AUROC

Algorithm 1: SSD for Design 1 – Based on the length of the CI of AUROC (margin-of-error design)
1: Input: θ, h, p, α
2: if n cannot be directly calculated from the requirement 2 Z_{1−α/2} √Var(θ̂) = h then
3:   for l = 1, …, 5,000 do
4:     Calculate Var(θ̂) based on inputs θ, l, p
5:     Calculate Δ_l = Var(θ̂) − (h / (2 Z_{1−α/2}))²
6:   end for
7:   Find the index l* where Δ_{l*} ≤ 0 and Δ_{l*−1} > 0
8:   Set n = l*
9: else
10:   Directly calculate n from the requirement 2 Z_{1−α/2} √Var(θ̂) = h
11: end if
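As a concrete illustration, the grid-search branch of Algorithm 1 can be sketched in a few lines of Python, here paired with the Hanley and McNeil (1982) variance estimator (one of several estimators compared in the article; the function names and defaults below are ours):

```python
from statistics import NormalDist

def hanley_mcneil_var(theta, n, p):
    """Hanley-McNeil variance of the AUROC estimate for a total
    sample size n with disease proportion p (n1 = n*p events)."""
    n1, n0 = n * p, n * (1 - p)
    q1 = theta / (2 - theta)
    q2 = 2 * theta**2 / (1 + theta)
    return (theta * (1 - theta)
            + (n1 - 1) * (q1 - theta**2)
            + (n0 - 1) * (q2 - theta**2)) / (n1 * n0)

def ssd_margin_of_error(theta, h, p, alpha=0.05, max_n=5000):
    """Algorithm 1: smallest n whose two-sided (1-alpha) CI for the
    AUROC has total length at most h, i.e. 2*Z_{1-a/2}*SE <= h."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    target = (h / (2 * z)) ** 2   # largest admissible variance
    for n in range(2, max_n + 1):
        if hanley_mcneil_var(theta, n, p) <= target:
            return n
    raise ValueError("no n <= max_n meets the precision requirement")
```

For example, `ssd_margin_of_error(0.8, 0.1, 0.3)` returns the smallest n for which a 95 % CI around an AUROC of 0.8, with 30 % events, is no longer than 0.1.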
Algorithm 2: SSD for Design 2 – Based on the hypothesis test of AUROC (historical control design)
1: Input: θ₀, θ₁, p, α, β
2: if n cannot be directly calculated from the requirement β = Φ{(Z_{1−α/2} √Var₀(θ̂) + θ₀ − θ₁) / √Var₁(θ̂)} then
3:   for l = 1, …, 5,000 do
4:     Calculate Var₀(θ̂) based on inputs θ₀, l, p
5:     Calculate Var₁(θ̂) based on inputs θ₁, l, p
6:     Calculate Δ_l = Z_{1−α/2} √Var₀(θ̂) − Z_β √Var₁(θ̂) + θ₀ − θ₁
7:   end for
8:   Find the index l* where Δ_{l*} ≤ 0 and Δ_{l*−1} > 0
9:   Set n = l*
10: else
11:   Directly calculate n from the requirement β = Φ{(Z_{1−α/2} √Var₀(θ̂) + θ₀ − θ₁) / √Var₁(θ̂)}
12: end if
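The grid-search branch of Algorithm 2 admits the same kind of sketch, again using the Hanley and McNeil variance as one possible choice of estimator (function names are ours; note that Z_{1−β} = −Z_β, so Δ_l ≤ 0 is equivalent to the condition tested below):

```python
import math
from statistics import NormalDist

def hanley_mcneil_var(theta, n, p):
    """Hanley-McNeil variance of the AUROC estimate."""
    n1, n0 = n * p, n * (1 - p)
    q1 = theta / (2 - theta)
    q2 = 2 * theta**2 / (1 + theta)
    return (theta * (1 - theta)
            + (n1 - 1) * (q1 - theta**2)
            + (n0 - 1) * (q2 - theta**2)) / (n1 * n0)

def ssd_hypothesis_test(theta0, theta1, p, alpha=0.05, beta=0.2,
                        max_n=5000):
    """Algorithm 2: smallest n achieving power 1-beta for
    H0: theta = theta0 vs Ha: theta = theta1 > theta0, i.e. the
    first n with Z_{1-a/2}*SE0 + Z_{1-b}*SE1 <= theta1 - theta0."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_1b = NormalDist().inv_cdf(1 - beta)   # Z_{1-beta} = -Z_beta
    for n in range(2, max_n + 1):
        se0 = math.sqrt(hanley_mcneil_var(theta0, n, p))
        se1 = math.sqrt(hanley_mcneil_var(theta1, n, p))
        if z_a * se0 + z_1b * se1 <= theta1 - theta0:
            return n
    raise ValueError("no n <= max_n achieves the requested power")
```

With θ₀ = 0.75, θ₁ = 0.80, p = 0.3, α = 0.05 and β = 0.2, the required n is roughly an order of magnitude larger than under the margin-of-error design with h = 0.1, illustrating how strongly the design choice drives the sample size.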

Appendix B: Simulation settings

We generated a population of size N=100,000 and considered it large enough to serve as the entire population. Denote i as the index of individuals, where i = 1, …, N. Each individual’s disease status was labeled by the variable y, where y=1 indicated a diseased patient and y=0 a non-diseased individual. Assume that each subject had a d-dimensional feature vector X = (X1, …, Xd), d = 60. The features followed a multivariate normal distribution, X ∼ MVN(0, Σ), with Σ the covariance matrix of X and diag(Σ)=1. A randomly selected set of 30 features was correlated with each other, with a pairwise correlation of 0.2, while all other features were mutually independent. We simulated the disease status y of each subject based on two underlying models: (1) Logistic model: the probability of subject i developing the disease condition was P(y_i = 1) = exp(η_i)/(1 + exp(η_i)), where η_i = −8 + f(X_i); the disease status y_i of each subject was drawn from a Bernoulli distribution with the calculated probability. (2) Weibull model: the survival time for each subject i was simulated from a Weibull distribution with a shape parameter of 2 and a scale parameter of 0.005 exp(−8) exp(f(X_i)); the administrative censoring time was set at 10, and the label for each subject i was determined by whether the event occurred before the censoring time.
To introduce nonlinear dependence on the features, we define f(X) = log(1.8)(X1 + … + X5) + log(1.5)(X6 + … + X15) + log(1.1)(X16 + … + X20) + X21(1 + X22) + X23(1.5 + X24) + X25(2 + X26) + X27^2 + X28^2 + X29^3 + X30^3 + sin(X31 + X32 + X33 + X34 + X35) + cos(X36 + X37 + X38 + X39 + X40) + 0.001 exp(X41 + X42 + X43) + 0.001 exp(X44 + X45 + X46) + 0.001 exp(X47 + X48 + X49 + X50) + log|X51 + X52 + X53 + X54| + log(X55^2 + X56^2 + X57^2 + X58^2) + max(0, X59) + min(0, X60).
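A minimal sketch of this data-generating step under the logistic model might look as follows; for brevity it keeps only the first three linear terms of f(X), so it is an illustration of the mechanics rather than the full simulation:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100_000, 60

# Unit-variance covariance: a random block of 30 features shares a
# pairwise correlation of 0.2; all other features are independent.
Sigma = np.eye(d)
block = rng.choice(d, size=30, replace=False)
for i in block:
    for j in block:
        if i != j:
            Sigma[i, j] = 0.2
X = rng.multivariate_normal(np.zeros(d), Sigma, size=N)

# Truncated stand-in for f(X): only the first three linear terms
# of the full definition above.
f = (np.log(1.8) * X[:, 0:5].sum(axis=1)
     + np.log(1.5) * X[:, 5:15].sum(axis=1)
     + np.log(1.1) * X[:, 15:20].sum(axis=1))

# Logistic model: P(y_i = 1) = expit(-8 + f(X_i)); Bernoulli labels.
prob = 1.0 / (1.0 + np.exp(-(-8.0 + f)))
y = rng.binomial(1, prob)
```

The intercept of −8 makes the event rare, so the simulated disease proportion p is small, consistent with the imbalanced settings studied in the article.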

In real data analysis, it is common to collect information on only a subset of predictors, with some predictors remaining unknown or unmeasured. In the simulation, we considered scenarios where the number of predictors collected, denoted m, varied within the set {20, 30, 50}. For each value of m, we randomly selected 40 % from {X1, …, X20}, 30 % from {X21, …, X40}, and the remaining 30 % from {X41, …, X60}. This selection process ensured that our model included only a proportion of the important predictors. Before conducting SSD, we drew a dataset of 1,000 individuals from the population of size N. This dataset, referred to as D prior, was used as a prior dataset to estimate the disease proportion p and to fit a risk model using the selected m predictors. We then calculated the AUROC, θ1, which was taken as the underlying AUROC value needed for SSD. D prior was randomly split into 80 % training and 20 % testing sets 10 times. The risk model was constructed on the training set using different ML algorithms, including Logistic Regression (LR), Naive Bayes (NB), Random Forest (RF), Support Vector Machine (SVM) with the Radial Basis Function (RBF) kernel, and Neural Network (NN). Five-fold cross-validation was performed during model training for algorithms requiring parameter tuning (i.e., NB, RF, SVM, and NN). The average AUROC value over the testing sets was reported as θ1, which serves as the benchmark AUROC achievable in the external validation study. This process mimics prior knowledge, such as estimates from published studies, as commonly practiced.
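The AUROC reported as θ1 here is the empirical Mann–Whitney statistic, the probability that a randomly chosen diseased subject receives a higher risk score than a randomly chosen non-diseased subject. A minimal stdlib implementation (function name ours) is:

```python
def auroc(scores_pos, scores_neg):
    """Empirical AUROC: the Mann-Whitney probability that a diseased
    subject's score exceeds a non-diseased subject's score, with
    ties counted as 1/2."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

A model that ranks every diseased subject above every non-diseased one attains an AUROC of 1.0, while uninformative scores hover around 0.5. (In practice a rank-based O(n log n) version, e.g. `sklearn.metrics.roc_auc_score`, would be used instead of this quadratic double loop.)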

The remaining patients who were not part of D prior were designated as D remain. We conducted separate simulation experiments for Design 1 and Design 2. In Design 1, we assumed that, based on the prior study using D prior, the AUROC achievable in the external validation study would be θ1. The objective was to determine the sample size required to estimate the AUROC with a margin of error of 5 % at the 95 % confidence level (i.e., α = 0.05 and a 95 % CI length L ≤ h = 0.1). To assess the sensitivity of our simulation results to the choice of h, we conducted additional simulations under alternative settings (h = 0.06 and h = 0.14), with results presented in the Supplementary Materials. In Design 2, we considered a scenario where an investigator aimed to test the hypothesis H0: θ = θ0 = θ1 − 0.05 vs. Ha: θ = θ1 > θ0. In the external validation study, the investigator sought 80 % power (i.e., type II error β = 0.2) to detect this difference while maintaining a type I error rate of α = 0.05, as in other clinical studies. Similarly, to evaluate the sensitivity of the simulation results to the choice of effect size (θ1 − θ0), we conducted additional simulations using alternative values (θ1 − θ0 = 0.03 and θ1 − θ0 = 0.07). The results can be found in the Supplementary Materials.

For both Design 1 and Design 2, we estimated the required sample sizes based on the variance of the AUROC, following the steps outlined in Figure 1. Additionally, we compared the performance of the AUROC-based SSD methods with the rules-of-thumb approach, which commonly requires a minimum of 100 events (i.e., n1 = 100 and n0 = (1 − p)n1/p) for an external validation study. It is worth noting that the rules-of-thumb approach does not consider the specific type of study design and therefore yields the same required sample size for both Design 1 and Design 2. Once the sample sizes were estimated, we randomly sampled a total of n subjects from D remain, with n1 = np patients from the disease group and n0 = n(1 − p) patients from the non-disease group. We then constructed risk models using the sampled data. As in the estimation of θ1, the AUROC value from this external validation set was reported as the average AUROC over 10 randomly split 20 % testing sets. This random sampling process was repeated for 500 replicates on D remain to generate the empirical distribution of the AUROC. The entire process, starting from generating the population of N subjects, was repeated 100 times to derive summary statistics for evaluating SSD using both the rules-of-thumb approach and the different AUROC variance estimators.
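The rules-of-thumb calculation described above reduces to a one-liner; the sketch below uses our own function name, with the 100-event minimum from the text as the default:

```python
import math

def rules_of_thumb_n(p, min_events=100):
    """Rules-of-thumb SSD: fix n1 = min_events events, then size the
    non-event group as n0 = (1 - p) * n1 / p; total n = n1 + n0."""
    n1 = min_events
    n0 = math.ceil((1 - p) * n1 / p)
    return n1 + n0
```

Because the rule depends only on the disease proportion p, a rarer outcome inflates the total: p = 0.2 requires 500 subjects, while p = 0.1 requires 1,000.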

References

1. Alonzo, TA. Clinical prediction models: a practical approach to development, validation, and updating: by Ewout W. Steyerberg. Am J Epidemiol 2009;170:528. https://doi.org/10.1093/aje/kwp129.

2. Royston, P, Moons, KG, Altman, DG, Vergouwe, Y. Prognosis and prognostic research: developing a prognostic model. BMJ 2009;338. https://doi.org/10.1136/bmj.b604.

3. Steyerberg, EW, Moons, KG, van der Windt, DA, Hayden, JA, Perel, P, Schroter, S, et al. Prognosis research strategy (PROGRESS) 3: prognostic model research. PLoS Med 2013;10:e1001381. https://doi.org/10.1371/journal.pmed.1001381.

4. Harrell, FE. Regression modeling strategies: with applications to linear models, logistic regression, and survival analysis. New York, NY: Springer; 2001. https://doi.org/10.1007/978-1-4757-3462-1.

5. Riley, RD, van der Windt, D, Croft, P, Moons, KG. Prognosis research in healthcare: concepts, methods, and impact. New York, NY: Oxford University Press; 2019. https://doi.org/10.1093/med/9780198796619.001.0001.

6. Kalayeh, H, Landgrebe, DA. Predicting the required number of training samples. IEEE Trans Pattern Anal Mach Intell 1983:664–7. https://doi.org/10.1109/tpami.1983.4767459.

7. Raudys, SJ, Jain, AK. Small sample size effects in statistical pattern recognition: recommendations for practitioners. IEEE Trans Pattern Anal Mach Intell 1991;13:252–64. https://doi.org/10.1109/34.75512.

8. Harrell, FE Jr, Lee, KL, Mark, DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med 1996;15:361–87. https://doi.org/10.1002/(SICI)1097-0258(19960229)15:4<361::AID-SIM168>3.0.CO;2-4.

9. Vergouwe, Y, Steyerberg, EW, Eijkemans, MJ, Habbema, JDF. Substantial effective sample sizes were required for external validation studies of predictive logistic regression models. J Clin Epidemiol 2005;58:475–83. https://doi.org/10.1016/j.jclinepi.2004.06.017.

10. Collins, GS, Ogundimu, EO, Altman, DG. Sample size considerations for the external validation of a multivariable prognostic model: a resampling study. Stat Med 2016;35:214–26. https://doi.org/10.1002/sim.6787.

11. Ogundimu, EO, Altman, DG, Collins, GS. Adequate sample size for developing prediction models is not simply related to events per variable. J Clin Epidemiol 2016;76:175–82. https://doi.org/10.1016/j.jclinepi.2016.02.031.

12. Newcombe, RG. Confidence intervals for an effect size measure based on the Mann–Whitney statistic. Part 2: asymptotic methods and evaluation. Stat Med 2006;25:559–73. https://doi.org/10.1002/sim.2324.

13. Pavlou, M, Qu, C, Omar, RZ, Seaman, SR, Steyerberg, EW, White, IR, et al. Estimation of required sample size for external validation of risk models for binary outcomes. Stat Methods Med Res 2021;30:2187–206. https://doi.org/10.1177/09622802211007522.

14. Birnbaum, ZW, Klose, OM. Bounds for the variance of the Mann–Whitney statistic. Ann Math Stat 1957:933–45. https://doi.org/10.1214/aoms/1177706794.

15. Hanley, JA, McNeil, BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982;143:29–36. https://doi.org/10.1148/radiology.143.1.7063747.

16. Rosner, B, Glynn, RJ. Power and sample size estimation for the Wilcoxon rank sum test with application to comparisons of C statistics from alternative prediction models. Biometrics 2009;65:188–97. https://doi.org/10.1111/j.1541-0420.2008.01062.x.

17. Lehmann, EL, D’Abrera, H. Nonparametrics: statistical methods based on ranks, rev. ed. Englewood Cliffs, NJ: Prentice-Hall; 1998.

18. Lehmann, EL. Elements of large-sample theory. New York, NY: Springer; 1999. https://doi.org/10.1007/b98855.

19. Bamber, D. The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. J Math Psychol 1975;12:387–415. https://doi.org/10.1016/0022-2496-75-90001-2.

20. Mee, RW. Confidence intervals for probabilities and tolerance regions based on a generalization of the Mann–Whitney statistic. J Am Stat Assoc 1990;85:793–800. https://doi.org/10.2307/2290017.

21. Auletta, JJ, Kou, J, Chen, M, Bolon, Y-T, Broglie, L, Bupp, C, et al. Real-world data showing trends and outcomes by race and ethnicity in allogeneic hematopoietic cell transplantation: a report from the Center for International Blood and Marrow Transplant Research. Transplant Cell Ther 2023;29:346.e1. https://doi.org/10.1016/j.jtct.2023.03.007.

22. Fausser, J-L, Tavenard, A, Rialland, F, Le Moine, P, Minckes, O, Jourdain, A, et al. Should we pay attention to the delay before admission to a pediatric intensive care unit for children with cancer? Impact on 1-month mortality. A report from the French Children’s Oncology Study Group, GOCE. J Pediatr Hematol Oncol 2017;39:e244–8. https://doi.org/10.1097/mph.0000000000000816.

23. Lee, D-S, Suh, GY, Ryu, J-A, Chung, CR, Yang, JH, Park, C-M, et al. Effect of early intervention on long-term outcomes of critically ill cancer patients admitted to ICUs. Crit Care Med 2015;43:1439–48. https://doi.org/10.1097/ccm.0000000000000989.

24. Zhou, Y, Smith, J, Keerthi, D, Li, C, Sun, Y, Mothi, SS, et al. Longitudinal clinical data improves survival prediction after hematopoietic cell transplantation using machine learning. Blood Adv 2023. https://doi.org/10.1182/bloodadvances.2023011752.

25. Gratwohl, A, Stern, M, Brand, R, Apperley, J, Baldomero, H, de Witte, T, et al. Risk score for outcome after allogeneic hematopoietic stem cell transplantation: a retrospective analysis. Cancer 2009;115:4715–26. https://doi.org/10.1002/cncr.24531.

26. Sorror, ML, Maris, MB, Storb, R, Baron, F, Sandmaier, BM, Maloney, DG, et al. Hematopoietic cell transplantation (HCT)-specific comorbidity index: a new tool for risk assessment before allogeneic HCT. Blood 2005;106:2912–19. https://doi.org/10.1182/blood-2005-05-2004.

27. Armand, P, Kim, HT, Logan, BR, Wang, Z, Alyea, EP, Kalaycio, ME, et al. Validation and refinement of the disease risk index for allogeneic stem cell transplantation. Blood 2014;123:3664–71. https://doi.org/10.1182/blood-2014-01-552984.

28. Broglie, L, Ruiz, J, Jin, Z, Kahn, JM, Bhatia, M, George, D, et al. Limitations of applying the hematopoietic cell transplantation comorbidity index in pediatric patients receiving allogeneic hematopoietic cell transplantation. Transplant Cell Ther 2021;27:74.e1. https://doi.org/10.1016/j.bbmt.2020.10.003.

29. Nakaya, A, Mori, T, Tanaka, M, Tomita, N, Nakaseko, C, Yano, S, et al. Does the hematopoietic cell transplantation specific comorbidity index (HCT-CI) predict transplantation outcomes? A prospective multicenter validation study of the Kanto Study Group for Cell Therapy. Biol Blood Marrow Transplant 2014;20:1553–9. https://doi.org/10.1016/j.bbmt.2014.06.005.

30. Raimondi, R, Tosetto, A, Oneto, R, Cavazzina, R, Rodeghiero, F, Bacigalupo, A, et al. Validation of the hematopoietic cell transplantation-specific comorbidity index: a prospective, multicenter GITMO study. Blood 2012;120:1327–33. https://doi.org/10.1182/blood-2012-03-414573.

31. Versluis, J, Labopin, M, Niederwieser, D, Socie, G, Schlenk, RF, Milpied, N, et al. Prediction of non-relapse mortality in recipients of reduced intensity conditioning allogeneic stem cell transplantation with AML in first complete remission. Leukemia 2015;29:51–7. https://doi.org/10.1038/leu.2014.164.

32. Bantis, LE, Brewer, B, Nakas, CT, Reiser, B. Statistical inference for Box–Cox based receiver operating characteristic curves. Stat Med 2024;43:6099–122. https://doi.org/10.1002/sim.10252.

33. Hanley, JA. The robustness of the “binormal” assumptions used in fitting ROC curves. Med Decis Mak 1988;8:197–203. https://doi.org/10.1177/0272989x8800800308.

34. Pepe, MS. The statistical evaluation of medical tests for classification and prediction. New York, NY: Oxford University Press; 2003. https://doi.org/10.1093/oso/9780198509844.001.0001.

35. Reiser, B, Faraggi, D, Guttman, I. Choice of sample size for testing the P(X>Y). Commun Stat Theor Methods 1992;21:559–69. https://doi.org/10.1080/03610929208830798.

36. Reiser, B, Guttman, I. Sample size choice for reliability verification in strength-stress models. Can J Stat 1989;17:253–9. https://doi.org/10.2307/3315521.

37. Owen, DB. Tables for computing bivariate normal probabilities. Ann Math Stat 1956;27:1075–90. https://doi.org/10.1214/aoms/1177728074.


Supplementary Material

This article contains supplementary material (https://doi.org/10.1515/em-2025-0012).


Received: 2025-03-18
Accepted: 2025-10-02
Published Online: 2025-11-04

© 2025 Walter de Gruyter GmbH, Berlin/Boston
