Abstract
External validation is critical in risk prediction model development to assess a model's performance on independent datasets before clinical application. Sample size determination (SSD) is a pivotal aspect of external validation, ensuring sufficient precision, adequate power, sound ethical conduct, and efficient resource allocation. SSD for external validation of a risk model with binary outcomes can target the most commonly used evaluation criterion, the area under the receiver operating characteristic (AUROC) curve, under two study designs: one based on the margin of error and the other on hypothesis testing. Sample sizes for external validation are then estimated using various AUROC variance estimators. In this article, we conduct extensive simulations to compare the performance of different AUROC-based SSD methods for external validation under the two study designs. These simulations involve models developed using various machine learning (ML) algorithms, different sets of predictors, and data generated under diverse distributions. The simulation results show that sample size estimates differ substantially across SSD methods. Based on this empirical evidence, we recommend the Hanley and McNeil method or the rules-of-thumb approach for margin-of-error designs, given their balance of reliability and feasibility. For hypothesis testing designs, the Hanley and McNeil method demonstrates superior operating characteristics and is recommended for achieving adequate power. We also demonstrate the application of these AUROC-based SSD methods in the external validation of a risk prediction model forecasting 1-year overall survival (OS) in pediatric patients who underwent allogeneic hematopoietic cell transplantation (alloHCT). Our findings highlight the importance of selecting appropriate SSD methods for external validation and provide practical guidance for implementing SSD in clinical prediction model validation.
Funding source: National Institutes of Health/National Cancer Institute
Award Identifier / Grant number: P30 CA021765
Funding source: American Lebanese Syrian Associated Charities
Research ethics: Not applicable.
Informed consent: Not applicable.
Author contributions: All authors have accepted responsibility for the entire content of this manuscript and approved its submission. Y. Zhou and L. Tang conceptualized the study, developed simulations, interpreted results, and wrote the manuscript; Y. Zhou and Y. Zheng implemented the computational algorithms, conducted simulations, and performed real data analysis; C. Li and A. Sharma provided inputs to the study and reviewed the analysis; and all authors approved the final version of the manuscript.
Use of Large Language Models, AI and Machine Learning Tools: During the preparation of this work, the authors used OpenAI's ChatGPT (2024) for grammatical revisions. After using this tool/service, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.
Conflict of interest: The authors state no conflict of interest.
Research funding: The authors are supported for their work at St Jude Children’s Research Hospital by the American Lebanese Syrian Associated Charities and National Institutes of Health/National Cancer Institute grant (P30 CA021765).
Data availability: Not applicable.
Appendix A: Algorithms of SSD based on AUROC
Algorithm 1: SSD for Design 1 – based on the length of the CI of AUROC (margin-of-error design)

1: Input: θ, h, p, α
2: if n cannot be directly calculated from the requirement 2 z_{1−α/2} √(Var(θ̂)) ≤ h then
3:   for l = 1, …, 5,000 do
4:     Calculate Var_l(θ̂) based on the inputs θ, l, p, taking n_1 = lp events and n_0 = l(1 − p) non-events
5:     Calculate the CI length L_l = 2 z_{1−α/2} √(Var_l(θ̂))
6:   end for
7:   Find the smallest index l* for which L_{l*} ≤ h
8:   Set n = l*
9: else
10:   Directly calculate n from the requirement in step 2
11: end if
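A minimal Python sketch of Algorithm 1 follows, using the Hanley and McNeil (1982) variance as one concrete choice of Var(θ̂); the function names and the use of expected (possibly fractional) event counts n_1 = np and n_0 = n(1 − p) are our illustrative assumptions, not the paper's own code.

```python
# Sketch of Algorithm 1 (margin-of-error design) with the Hanley-McNeil
# variance estimator. Function names are illustrative assumptions.
from math import sqrt

from scipy.stats import norm


def hanley_mcneil_var(theta, n1, n0):
    """Hanley-McNeil variance of the empirical AUROC estimate."""
    q1 = theta / (2 - theta)
    q2 = 2 * theta ** 2 / (1 + theta)
    return (theta * (1 - theta)
            + (n1 - 1) * (q1 - theta ** 2)
            + (n0 - 1) * (q2 - theta ** 2)) / (n1 * n0)


def ssd_design1(theta, h, p, alpha=0.05, n_max=5000):
    """Smallest n whose two-sided (1 - alpha) CI for the AUROC has length <= h."""
    z = norm.ppf(1 - alpha / 2)
    for n in range(2, n_max + 1):
        n1, n0 = n * p, n * (1 - p)  # expected events / non-events
        if min(n1, n0) < 1:
            continue
        if 2 * z * sqrt(hanley_mcneil_var(theta, n1, n0)) <= h:
            return n
    return None  # requirement not met within n_max


# Example: anticipated AUROC 0.75, prevalence 0.2, 95 % CI length <= 0.1
print(ssd_design1(theta=0.75, h=0.1, p=0.2))
```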
Algorithm 2: SSD for Design 2 – based on the hypothesis test of AUROC (historical control design)

1: Input: θ_0, θ_1, p, α, β
2: if n cannot be directly calculated from the requirement z_{1−α} √(Var_0(θ̂)) + z_{1−β} √(Var_1(θ̂)) ≤ θ_1 − θ_0 then
3:   for l = 1, …, 5,000 do
4:     Calculate Var_{0,l}(θ̂) under H_0 based on the inputs θ_0, l, p
5:     Calculate Var_{1,l}(θ̂) under H_a based on the inputs θ_1, l, p
6:     Calculate the power 1 − β_l = Φ((θ_1 − θ_0 − z_{1−α} √(Var_{0,l}(θ̂))) / √(Var_{1,l}(θ̂)))
7:   end for
8:   Find the smallest index l* for which 1 − β_{l*} ≥ 1 − β
9:   Set n = l*
10: else
11:   Directly calculate n from the requirement in step 2
12: end if
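A corresponding sketch of Algorithm 2 is below, reusing hanley_mcneil_var() from the previous sketch; a one-sided test is assumed here, consistent with H_a: θ_1 > θ_0, and again the names and defaults are our own illustrative choices.

```python
# Sketch of Algorithm 2 (historical control design), reusing the
# hanley_mcneil_var() helper defined in the previous sketch.
from math import sqrt

from scipy.stats import norm


def ssd_design2(theta0, theta1, p, alpha=0.05, beta=0.2, n_max=5000):
    """Smallest n with power >= 1 - beta for the one-sided test
    H0: theta = theta0 vs. Ha: theta = theta1 (> theta0)."""
    z_alpha = norm.ppf(1 - alpha)
    for n in range(2, n_max + 1):
        n1, n0 = n * p, n * (1 - p)
        if min(n1, n0) < 1:
            continue
        sd0 = sqrt(hanley_mcneil_var(theta0, n1, n0))  # SE under H0
        sd1 = sqrt(hanley_mcneil_var(theta1, n1, n0))  # SE under Ha
        power = norm.cdf((theta1 - theta0 - z_alpha * sd0) / sd1)
        if power >= 1 - beta:
            return n
    return None


# Example: 80 % power to detect an AUROC of 0.75 against a historical 0.70
print(ssd_design2(theta0=0.70, theta1=0.75, p=0.2))
```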
Appendix B: Simulation settings
We generated a population of size N=100,000, which we considered large enough to serve as the entire population. Let i index individuals, where i=1, …, N. Each individual's disease status was labeled by the variable y_i, where y_i=1 indicated a diseased patient and y_i=0 a non-diseased individual. We assumed that each subject had a d-dimensional feature vector X_i = (X_{i1}, …, X_{id}), with d=60 predictors X_1, …, X_60 in our simulations.
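The exact generating distributions varied across simulation scenarios; purely for illustration, one population draw under an assumed logistic outcome model, with three tiers of predictor strength mirroring the grouping {X_1, …, X_20}, {X_21, …, X_40}, {X_41, …, X_60} used below, might look like:

```python
# Illustrative population draw only: the logistic link, standard normal
# predictors, and coefficient tiers are assumptions of this sketch, not the
# paper's exact generating mechanism.
import numpy as np

rng = np.random.default_rng(2025)
N, d = 100_000, 60                   # population size and predictor count
X = rng.standard_normal((N, d))
coef = np.r_[np.full(20, 0.30),      # assumed strong tier: X1-X20
             np.full(20, 0.10),      # assumed moderate tier: X21-X40
             np.full(20, 0.02)]      # assumed weak tier: X41-X60
logit = X @ coef - 1.5               # intercept tunes the disease proportion p
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
print("disease proportion p =", y.mean())
```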
In real data analysis, it is common to collect information on only a subset of predictors, with some predictors remaining unknown or unmeasured. In the simulation, we therefore considered scenarios where the number of collected predictors, denoted as m, varied within the set {20, 30, 50}. For each value of m, we randomly selected 40 % from {X_1, …, X_20}, 30 % from {X_21, …, X_40}, and the remaining 30 % from {X_41, …, X_60}. This selection process ensured that our model included only a proportion of the important predictors. Before conducting SSD, we drew a dataset of 1,000 individuals from the population. This dataset, referred to as D_prior, was used as a prior dataset to estimate the disease proportion p and to fit a risk model using the selected m predictors. We then calculated the AUROC, θ_1, which was considered the underlying AUROC value needed for SSD. D_prior was randomly split into 80 % training and 20 % testing sets 10 times. The risk model was constructed on the training set using different ML algorithms, including Logistic Regression (LR), Naive Bayes (NB), Random Forest (RF), Support Vector Machine (SVM) with the Radial Basis Function (RBF) kernel, and Neural Network (NN). Five-fold cross-validation was performed during model training for algorithms requiring parameter tuning (i.e., NB, RF, SVM, and NN). The average AUROC value across the 10 testing sets was reported as θ_1, which serves as the benchmark AUROC achievable in the external validation study. This process mimics prior knowledge, such as estimates from published studies, as commonly practiced.
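A sketch of this θ_1 estimation step follows, with logistic regression standing in for the full set of ML algorithms compared in the paper; the helper name and seed handling are our own.

```python
# Sketch of estimating theta_1 on D_prior: 10 random 80/20 splits, a model fit
# on each training part, and the AUROC averaged over the 10 test parts.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def estimate_theta1(X_prior, y_prior, n_splits=10, seed=0):
    aucs = []
    for s in range(n_splits):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_prior, y_prior, test_size=0.2, random_state=seed + s)
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        aucs.append(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
    return float(np.mean(aucs))
```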
The remaining patients who were not part of D_prior were designated as D_remain. We conducted separate simulation experiments for Design 1 and Design 2. In Design 1, we assumed that, based on the prior study using D_prior, the AUROC achievable in the external validation study would be θ_1. The objective was to determine the sample size required for the external validation data to estimate the AUROC with a margin of error of 5 % from a 95 % CI (i.e., α=0.05 and a 95 % CI length of L<h=0.1). To assess the sensitivity of our simulation results to the choice of h, we conducted additional simulations under alternative settings (h=0.06 and h=0.14), with results presented in the Supplementary Materials. In Design 2, we considered a scenario where an investigator aimed to test the hypothesis H_0: θ=θ_0=θ_1−0.05 vs. H_a: θ=θ_1>θ_0. In the external validation study, the investigator sought 80 % power (i.e., type II error β=0.2) to detect this difference while maintaining a type I error rate of α=0.05, as in other clinical studies. Similarly, to evaluate the sensitivity of the simulation results to the choice of effect size (θ_1−θ_0), we conducted additional simulations using alternative values (θ_1−θ_0=0.03 and θ_1−θ_0=0.07); the results can be found in the Supplementary Materials.
For both Design 1 and Design 2, we estimated the required sample sizes based on the variance of the AUROC, following the steps outlined in Figure 1. Additionally, we compared the performance of the AUROC-based SSD methods with the rules-of-thumb approach, which commonly requires a minimum of 100 events (i.e., n_1=100 and n_0=(1−p)n_1/p) for an external validation study. It is worth noting that the rules-of-thumb approach does not consider the specific type of study design and therefore yields the same required sample size for both Design 1 and Design 2. Once the sample sizes were estimated, we randomly sampled a total of n subjects from D_remain, with n_1=np patients from the disease group and n_0=n(1−p) patients from the non-disease group. We then constructed risk models using the sampled data. As in the estimation of θ_1, the AUROC value from this external validation set was reported as the average AUROC over 10 random 20 % test splits. This random sampling process was repeated for 500 replicates on D_remain to generate the empirical distribution of the AUROC. The entire process, starting from generating the population of N subjects, was repeated 100 times to derive summary statistics for evaluating SSD with both the rules-of-thumb approach and the different AUROC variance estimators.
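The rules-of-thumb sample size used for comparison reduces to a one-line calculation from the disease proportion p; a minimal sketch (the function name is ours):

```python
# Rules-of-thumb SSD: at least 100 events, with non-events scaled by p.
import math


def ssd_rules_of_thumb(p, min_events=100):
    n1 = min_events                   # events
    n0 = math.ceil((1 - p) * n1 / p)  # non-events
    return n1 + n0


print(ssd_rules_of_thumb(p=0.2))  # 100 events + 400 non-events = 500
```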
References
1. Alonzo, TA. Clinical prediction models: a practical approach to development, validation, and updating: by Ewout W. Steyerberg. Am J Epidemiol 2009;170:528. https://doi.org/10.1093/aje/kwp129.
2. Royston, P, Moons, KG, Altman, DG, Vergouwe, Y. Prognosis and prognostic research: developing a prognostic model. BMJ 2009;338:b604. https://doi.org/10.1136/bmj.b604.
3. Steyerberg, EW, Moons, KG, van der Windt, DA, Hayden, JA, Perel, P, Schroter, S, et al. Prognosis research strategy (PROGRESS) 3: prognostic model research. PLoS Med 2013;10:e1001381. https://doi.org/10.1371/journal.pmed.1001381.
4. Harrell, FE. Regression modeling strategies: with applications to linear models, logistic regression, and survival analysis. New York, NY: Springer; 2001. https://doi.org/10.1007/978-1-4757-3462-1.
5. Riley, RD, van der Windt, D, Croft, P, Moons, KG. Prognosis research in healthcare: concepts, methods, and impact. New York, NY: Oxford University Press; 2019. https://doi.org/10.1093/med/9780198796619.001.0001.
6. Kalayeh, H, Landgrebe, DA. Predicting the required number of training samples. IEEE Trans Pattern Anal Mach Intell 1983:664–7. https://doi.org/10.1109/tpami.1983.4767459.
7. Raudys, SJ, Jain, AK. Small sample size effects in statistical pattern recognition: recommendations for practitioners. IEEE Trans Pattern Anal Mach Intell 1991;13:252–64. https://doi.org/10.1109/34.75512.
8. Harrell, FE Jr, Lee, KL, Mark, DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med 1996;15:361–87. https://doi.org/10.1002/(SICI)1097-0258(19960229)15:4<361::AID-SIM168>3.0.CO;2-4.
9. Vergouwe, Y, Steyerberg, EW, Eijkemans, MJ, Habbema, JDF. Substantial effective sample sizes were required for external validation studies of predictive logistic regression models. J Clin Epidemiol 2005;58:475–83. https://doi.org/10.1016/j.jclinepi.2004.06.017.
10. Collins, GS, Ogundimu, EO, Altman, DG. Sample size considerations for the external validation of a multivariable prognostic model: a resampling study. Stat Med 2016;35:214–26. https://doi.org/10.1002/sim.6787.
11. Ogundimu, EO, Altman, DG, Collins, GS. Adequate sample size for developing prediction models is not simply related to events per variable. J Clin Epidemiol 2016;76:175–82. https://doi.org/10.1016/j.jclinepi.2016.02.031.
12. Newcombe, RG. Confidence intervals for an effect size measure based on the Mann–Whitney statistic. Part 2: asymptotic methods and evaluation. Stat Med 2006;25:559–73. https://doi.org/10.1002/sim.2324.
13. Pavlou, M, Qu, C, Omar, RZ, Seaman, SR, Steyerberg, EW, White, IR, et al. Estimation of required sample size for external validation of risk models for binary outcomes. Stat Methods Med Res 2021;30:2187–206. https://doi.org/10.1177/09622802211007522.
14. Birnbaum, ZW, Klose, OM. Bounds for the variance of the Mann–Whitney statistic. Ann Math Stat 1957:933–45. https://doi.org/10.1214/aoms/1177706794.
15. Hanley, JA, McNeil, BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982;143:29–36. https://doi.org/10.1148/radiology.143.1.7063747.
16. Rosner, B, Glynn, RJ. Power and sample size estimation for the Wilcoxon rank sum test with application to comparisons of C statistics from alternative prediction models. Biometrics 2009;65:188–97. https://doi.org/10.1111/j.1541-0420.2008.01062.x.
17. Lehmann, EL, D'Abrera, H. Nonparametrics: statistical methods based on ranks, rev. ed. Englewood Cliffs, NJ: Prentice-Hall; 1998.
18. Lehmann, EL. Elements of large-sample theory. New York, NY: Springer; 1999. https://doi.org/10.1007/b98855.
19. Bamber, D. The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. J Math Psychol 1975;12:387–415. https://doi.org/10.1016/0022-2496(75)90001-2.
20. Mee, RW. Confidence intervals for probabilities and tolerance regions based on a generalization of the Mann–Whitney statistic. J Am Stat Assoc 1990;85:793–800. https://doi.org/10.2307/2290017.
21. Auletta, JJ, Kou, J, Chen, M, Bolon, Y-T, Broglie, L, Bupp, C, et al. Real-world data showing trends and outcomes by race and ethnicity in allogeneic hematopoietic cell transplantation: a report from the Center for International Blood and Marrow Transplant Research. Transplant Cell Ther 2023;29:346.e1. https://doi.org/10.1016/j.jtct.2023.03.007.
22. Fausser, J-L, Tavenard, A, Rialland, F, Le Moine, P, Minckes, O, Jourdain, A, et al. Should we pay attention to the delay before admission to a pediatric intensive care unit for children with cancer? Impact on 1-month mortality. A report from the French Children's Oncology Study Group, GOCE. J Pediatr Hematol Oncol 2017;39:e244–8. https://doi.org/10.1097/mph.0000000000000816.
23. Lee, D-S, Suh, GY, Ryu, J-A, Chung, CR, Yang, JH, Park, C-M, et al. Effect of early intervention on long-term outcomes of critically ill cancer patients admitted to ICUs. Crit Care Med 2015;43:1439–48. https://doi.org/10.1097/ccm.0000000000000989.
24. Zhou, Y, Smith, J, Keerthi, D, Li, C, Sun, Y, Mothi, SS, et al. Longitudinal clinical data improves survival prediction after hematopoietic cell transplantation using machine learning. Blood Adv 2023. https://doi.org/10.1182/bloodadvances.2023011752.
25. Gratwohl, A, Stern, M, Brand, R, Apperley, J, Baldomero, H, de Witte, T, et al. Risk score for outcome after allogeneic hematopoietic stem cell transplantation: a retrospective analysis. Cancer 2009;115:4715–26. https://doi.org/10.1002/cncr.24531.
26. Sorror, ML, Maris, MB, Storb, R, Baron, F, Sandmaier, BM, Maloney, DG, et al. Hematopoietic cell transplantation (HCT)-specific comorbidity index: a new tool for risk assessment before allogeneic HCT. Blood 2005;106:2912–19. https://doi.org/10.1182/blood-2005-05-2004.
27. Armand, P, Kim, HT, Logan, BR, Wang, Z, Alyea, EP, Kalaycio, ME, et al. Validation and refinement of the disease risk index for allogeneic stem cell transplantation. Blood 2014;123:3664–71. https://doi.org/10.1182/blood-2014-01-552984.
28. Broglie, L, Ruiz, J, Jin, Z, Kahn, JM, Bhatia, M, George, D, et al. Limitations of applying the hematopoietic cell transplantation comorbidity index in pediatric patients receiving allogeneic hematopoietic cell transplantation. Transplant Cell Ther 2021;27:74.e1. https://doi.org/10.1016/j.bbmt.2020.10.003.
29. Nakaya, A, Mori, T, Tanaka, M, Tomita, N, Nakaseko, C, Yano, S, et al. Does the hematopoietic cell transplantation specific comorbidity index (HCT-CI) predict transplantation outcomes? A prospective multicenter validation study of the Kanto Study Group for Cell Therapy. Biol Blood Marrow Transplant 2014;20:1553–9. https://doi.org/10.1016/j.bbmt.2014.06.005.
30. Raimondi, R, Tosetto, A, Oneto, R, Cavazzina, R, Rodeghiero, F, Bacigalupo, A, et al. Validation of the hematopoietic cell transplantation-specific comorbidity index: a prospective, multicenter GITMO study. Blood 2012;120:1327–33. https://doi.org/10.1182/blood-2012-03-414573.
31. Versluis, J, Labopin, M, Niederwieser, D, Socie, G, Schlenk, RF, Milpied, N, et al. Prediction of non-relapse mortality in recipients of reduced intensity conditioning allogeneic stem cell transplantation with AML in first complete remission. Leukemia 2015;29:51–7. https://doi.org/10.1038/leu.2014.164.
32. Bantis, LE, Brewer, B, Nakas, CT, Reiser, B. Statistical inference for Box–Cox based receiver operating characteristic curves. Stat Med 2024;43:6099–122. https://doi.org/10.1002/sim.10252.
33. Hanley, JA. The robustness of the "binormal" assumptions used in fitting ROC curves. Med Decis Mak 1988;8:197–203. https://doi.org/10.1177/0272989x8800800308.
34. Pepe, MS. The statistical evaluation of medical tests for classification and prediction. New York, NY: Oxford University Press; 2003. https://doi.org/10.1093/oso/9780198509844.001.0001.
35. Reiser, B, Faraggi, D, Guttman, I. Choice of sample size for testing the P(X > Y). Commun Stat Theor Methods 1992;21:559–69. https://doi.org/10.1080/03610929208830798.
36. Reiser, B, Guttman, I. Sample size choice for reliability verification in strength-stress models. Can J Stat 1989;17:253–9. https://doi.org/10.2307/3315521.
37. Owen, DB. Tables for computing bivariate normal probabilities. Ann Math Stat 1956;27:1075–90. https://doi.org/10.1214/aoms/1177728074.
Supplementary Material
This article contains supplementary material (https://doi.org/10.1515/em-2025-0012).
© 2025 Walter de Gruyter GmbH, Berlin/Boston