
Two-stage receiver operating-characteristic curve estimator for cohort studies

  • Susana Díaz-Coto, Norberto Octavio Corral-Blanco and Pablo Martínez-Camblor
Published/Copyright: August 21, 2020

Abstract

The receiver operating-characteristic (ROC) curve is a graphical statistical tool routinely used for studying classification accuracy in both diagnostic and prognostic problems. Given the different nature of these situations, ROC curve estimation has been considered separately for binary (diagnostic) and time-to-event (prognostic) outcomes, even for data coming from the same study design. In this work, the authors propose a two-stage ROC curve estimator which links both contexts through a general prediction model (first stage) and the empirical cumulative estimator of the distribution function (second stage) of the considered test (marker) on the total population. The behavior of the so-called two-stage Mixed-Subject (sMS) approach is studied both for large samples (theoretically) and for finite samples (via Monte Carlo simulations). In addition, a useful asymptotic distribution for the associated area under the curve is derived. Results show the ability of the proposed estimator to fit non-standard situations by considering flexible predictive models. Two real-world examples, one with binary and one with time-dependent outcomes, illustrate the proposed methodology in common practical circumstances. The R code used for the practical implementation of the proposed methodology and its documentation are provided as supplementary material.


Corresponding author: Pablo Martínez-Camblor, Biomedical Data Science Department, Geisel School of Medicine at Dartmouth, Hanover, NH, USA, E-mail:

Award Identifier / Grant number: FC-GRUPIN-IDI/2018/000132

Award Identifier / Grant number: MTM2017-89422-P

Acknowledgment

The authors thank the reviewers and the Associate Editor for their comments, suggestions and careful reading of the paper, which have helped us to improve the quality of the manuscript.

  1. Author contribution: All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.

  2. Research funding: This paper is supported by the grant FC-GRUPIN-IDI/2018/000132 of the Asturias Government. In addition, PM-C is supported by the Spanish Ministry of Economy, Industry and Competitiveness; State Research Agency; and FEDER funds - MTM2017-89422-P.

  3. Conflict of interest statement: The authors declare no conflicts of interest regarding this article.

Appendix

Theoretical results

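First we highlight the difference in the asymptotic behavior of the ECDF estimator for the global distribution when we consider a case-control or a cohort design. Let $H = \pi F + (1-\pi) G$, with $\pi = P(D)$ (probability of having the studied characteristic, $D = 1$), be the real distribution of the population including both positive and negative subjects. Notice that, if $\hat{H}_N$ ($N = n + m$) stands for the pooled empirical cumulative distribution function estimator (ECDF), we have the uniform weak convergence (see van der Vaart [33]),

$$\sqrt{N}\,[\hat{H}_N(x) - H(x)] \longrightarrow_{\mathcal{L}} \mathcal{B}\{H(x)\},$$

where $\{\mathcal{B}\{u\}\}_{\{0 \le u \le 1\}}$ is a standard Brownian bridge. However, if $\hat{F}_n$, $\hat{G}_m$ denote the ECDFs for $F$ and $G$, respectively, even when $n/(n+m) \to \pi$, we have that

$$\sqrt{N}\,[\pi \hat{F}_n(x) + (1-\pi)\hat{G}_m(x) - H(x)] \longrightarrow_{\mathcal{L}} \sqrt{\pi}\,\mathcal{B}_1\{F(x)\} + \sqrt{1-\pi}\,\mathcal{B}_2\{G(x)\},$$

where $\{\mathcal{B}_1\{u\}\}_{\{0 \le u \le 1\}}$ and $\{\mathcal{B}_2\{u\}\}_{\{0 \le u \le 1\}}$ are two independent standard Brownian bridges. Of course, in general $\mathcal{B}\{H(x)\} \neq \sqrt{\pi}\,\mathcal{B}_1\{F(x)\} + \sqrt{1-\pi}\,\mathcal{B}_2\{G(x)\}$.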
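The gap between the two limits can be checked pointwise. The sketch below is only an illustration under assumed distributions (positives N(1, 1), negatives N(0, 1), prevalence 0.3, none of which come from the paper). It compares the limit variance H(x)(1 − H(x)) of the pooled ECDF with the variance πF(1 − F) + (1 − π)G(1 − G) of the independent-bridges limit, and checks that they differ by π(1 − π)(F − G)².

```python
from math import erf, sqrt


def norm_cdf(x, mean=0.0, sd=1.0):
    """Normal CDF written with the error function (standard library only)."""
    return 0.5 * (1.0 + erf((x - mean) / (sd * sqrt(2.0))))


def limit_variances(x, pi, F, G):
    """Pointwise limit variances of the two sqrt(N)-scaled processes at x."""
    Fx, Gx = F(x), G(x)
    Hx = pi * Fx + (1.0 - pi) * Gx
    cohort = Hx * (1.0 - Hx)  # variance of B{H(x)}
    # variance of sqrt(pi) B1{F(x)} + sqrt(1 - pi) B2{G(x)}, B1 and B2 independent
    case_control = pi * Fx * (1.0 - Fx) + (1.0 - pi) * Gx * (1.0 - Gx)
    return cohort, case_control


# Hypothetical setting chosen for illustration only.
pi = 0.3
F = lambda x: norm_cdf(x, mean=1.0)  # marker distribution in positives (assumed)
G = lambda x: norm_cdf(x, mean=0.0)  # marker distribution in negatives (assumed)

cohort, case_control = limit_variances(0.5, pi, F, G)
gap = pi * (1.0 - pi) * (F(0.5) - G(0.5)) ** 2
```

The two variances coincide only at points where F(x) = G(x), which is why the fixed-proportion (case-control) limit cannot in general be substituted for the cohort one.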

We now formalize the proofs of the results anticipated in Section 3. We first prove a general and useful lemma.

Preliminary result

Lemma 1.

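Let $(X, D)$ be the test values and the subjects' status, respectively. Let $\mathcal{R}(\cdot)$, $S_e(\cdot)$ and $S_P(\cdot)$ be the ROC, the sensitivity and the specificity curves associated with the classification problem. Let $\hat{\mathcal{R}}(\cdot)$, $\hat{S}_e(\cdot)$ and $\hat{S}_P(\cdot)$ be their respective estimators based on the random sample $(X_N, D_N)$. If

  • A1.- $\mathcal{R}(\cdot)$ has two continuous and bounded derivatives,

  • A2.- $\sup_x |\hat{S}_P(x) - S_P(x)| \longrightarrow_N 0$ a.s. (almost surely).

Then, for each $u \in (0,1)$, there exists a sequence of real random numbers, $\{x_N^*\}_N$, satisfying $x_N^* \longrightarrow_N x^*$ (a.s.) (with $x^* = [1 - S_P]^{-1}(u)$), such that

$$[\hat{\mathcal{R}}(u) - \mathcal{R}(u)] = [\hat{S}_e(x_N^*) - S_e(x_N^*)] + r(1 - \hat{S}_P(x_N^*))\,[\hat{S}_P(x_N^*) - S_P(x_N^*)] + O([\hat{S}_P(x_N^*) - S_P(x_N^*)]^2), \tag{11}$$

where $r(\cdot)$ is the first derivative of $\mathcal{R}(\cdot)$. Besides, $O(x_N^* - x^*) = O(\hat{S}_P(x_N^*) - S_P(x_N^*))$.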

Proof

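From A1, for any sequence $\{u_N^*\}_N \subset (0,1)$, the following equality holds:

$$[\hat{\mathcal{R}}(u) - \mathcal{R}(u)] = [\hat{\mathcal{R}}(u) - \mathcal{R}(u_N^*)] + r(u)\,(u_N^* - u) + O([u_N^* - u]^2).$$

Let $x_N^* = [1 - \hat{S}_P]^{-1}(u)$ and $u_N^* = 1 - S_P(x_N^*)$; then

$$[\hat{\mathcal{R}}(u) - \mathcal{R}(u)] = [\hat{S}_e(x_N^*) - S_e(x_N^*)] + r(1 - \hat{S}_P(x_N^*))\,[\hat{S}_P(x_N^*) - S_P(x_N^*)] + O([\hat{S}_P(x_N^*) - S_P(x_N^*)]^2).$$

On the other hand, A1 implies that $S_P(\cdot)$ has two bounded derivatives. Therefore, there exists $\xi_N \in (\min\{x_N^*, x^*\}, \max\{x_N^*, x^*\})$ such that

$$s_P(\xi_N)\,|x_N^* - x^*| = |S_P(x_N^*) - S_P(x^*)| = |S_P(x_N^*) - \hat{S}_P(x_N^*)|,$$

where $s_P(\cdot)$ is the first derivative of $S_P(\cdot)$. From A2, $x_N^* \longrightarrow_N x^*$ (a.s.) and $O(x_N^* - x^*) = O(\hat{S}_P(x_N^*) - S_P(x_N^*))$.□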
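The substitution used in the proof can be spelled out as follows. This is a sketch that treats the inverse $[1 - \hat{S}_P]^{-1}$ as exact, so that $1 - \hat{S}_P(x_N^*) = u$; for a step-function estimator this holds only up to the size of one jump.

```latex
\begin{align*}
u &= 1 - \hat{S}_P(x_N^*)
  && \text{since } x_N^* = [1 - \hat{S}_P]^{-1}(u), \\
u_N^* - u &= [1 - S_P(x_N^*)] - [1 - \hat{S}_P(x_N^*)]
           = \hat{S}_P(x_N^*) - S_P(x_N^*), \\
\hat{\mathcal{R}}(u) &= \hat{S}_e(x_N^*),
  \qquad \mathcal{R}(u_N^*) = S_e(x_N^*)
  && \text{since } \mathcal{R}(v) = S_e\big([1 - S_P]^{-1}(v)\big).
\end{align*}
```

Plugging these three identities into the Taylor expansion of $\mathcal{R}$ around $u$ gives the stated decomposition term by term.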

Proof of Theorem 1 (Strong uniform consistency)

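Let $\pi = P(D)$ and $\hat{\pi}_N = \int \hat{P}_N(D|x)\,d\hat{H}_N(x)$. From M1 we have that $\hat{\pi}_N \longrightarrow_N \pi$. Besides, for any $x^*$,

$$\hat{\pi}_N[\hat{S}_e(x^*) - S_e(x^*)] = \int I_{(x^*,\infty)}(x)\,\hat{P}_N(D|x)\,d\hat{H}_N(x) - \hat{\pi}_N S_e(x^*) = \int \Delta_{se}(x^*,x)\,\hat{P}_N(D|x)\,d[\hat{H}_N(x) - H(x)] + \int \Delta_{se}(x^*,x)\,[\hat{P}_N(D|x) - P(D|x)]\,dH(x),$$

with $\Delta_{se}(x^*,x) = [I_{(x^*,\infty)}(x) - S_e(x^*)]$. The Glivenko-Cantelli theorem and M1 guarantee the uniform convergence of the first and the second summands, respectively. The triangle inequality and $\hat{\pi}_N \longrightarrow_N \pi$ lead to

$$\sup_{x^*} |\hat{S}_e(x^*) - S_e(x^*)| \longrightarrow_N 0 \quad \text{a.s.} \tag{12}$$

Arguing similarly, we have that

$$(1 - \hat{\pi}_N)[\hat{S}_P(x^*) - S_P(x^*)] = \int \Delta_{sp}(x^*,x)\,\hat{Q}_N(D|x)\,d[\hat{H}_N(x) - H(x)] + \int \Delta_{sp}(x^*,x)\,[\hat{Q}_N(D|x) - Q(D|x)]\,dH(x),$$

where $\Delta_{sp}(x^*,x) = [I_{(-\infty,x^*]}(x) - S_P(x^*)]$, $Q(D|x) = 1 - P(D|x)$ and $\hat{Q}_N(D|x) = 1 - \hat{P}_N(D|x)$. Then,

$$\sup_{x^*} |\hat{S}_P(x^*) - S_P(x^*)| \longrightarrow_N 0 \quad \text{a.s.} \tag{13}$$

Applying Lemma 1, the triangle inequality, Eq. (12) and Eq. (13), we get Eq. (7).□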
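The plug-in sensitivity and specificity used in this proof can be sketched as below: each integral against the pooled ECDF becomes an average over the whole cohort, weighted by the first-stage predicted probabilities. This is a minimal illustration and not the authors' R implementation. The function name, the argument names and the toy data are hypothetical, and the predicted probabilities are assumed to come from an already fitted model.

```python
def two_stage_sens_spec(marker, prob, threshold):
    """Plug-in sensitivity and specificity at one threshold.

    marker: marker values for every subject in the cohort
    prob:   first-stage predicted probabilities P(D | x_i) (assumed given)
    """
    n = len(marker)
    pi_hat = sum(prob) / n  # integral of P_N(D|x) dH_N(x)
    if not 0.0 < pi_hat < 1.0:
        raise ValueError("predicted prevalence must lie strictly between 0 and 1")
    # Mass of predicted positives strictly above the threshold.
    pos_mass = sum(p for x, p in zip(marker, prob) if x > threshold)
    # Mass of predicted negatives at or below the threshold.
    neg_mass = sum(1.0 - p for x, p in zip(marker, prob) if x <= threshold)
    sensitivity = pos_mass / (n * pi_hat)
    specificity = neg_mass / (n * (1.0 - pi_hat))
    return sensitivity, specificity


# Toy data (hypothetical). With 0/1 "probabilities" the estimator reduces to
# the usual empirical sensitivity and specificity.
marker = [0.1, 0.4, 0.35, 0.8]
status = [0.0, 0.0, 1.0, 1.0]
se_emp, sp_emp = two_stage_sens_spec(marker, status, threshold=0.37)

# Soft predictions, as a fitted first-stage model would produce them.
soft = [0.2, 0.5, 0.4, 0.9]
se_soft, sp_soft = two_stage_sens_spec(marker, soft, threshold=0.37)
```

Because every subject contributes through its predicted probability, subjects whose status is not directly observed still enter both curves.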

Proof of Theorem 2 (Weak uniform convergence)

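Arguing as in Theorem 1, from the Hungarian embedding (see, for instance, [33]) and M2, we know that, in a suitable probability space, there exist a sequence of Brownian bridges, $\{\mathcal{B}_N\{u\}\}_{\{0 \le u \le 1\}}$, and a standardized Gaussian variable, $\mathcal{Z}_P$, such that

$$\sqrt{N}\,[\hat{S}_e(x^*) - S_e(x^*)] = \frac{1}{\pi}\left\{\int \Delta_{se}(x^*,x)\,P(D|x)\,d\mathcal{B}_N\{H(x)\} + \int \Delta_{se}(x^*,x)\,\sigma(x)\,dH(x)\cdot\mathcal{Z}_P\right\} + o_P(1),$$

and

$$\sqrt{N}\,[\hat{S}_P(x^*) - S_P(x^*)] = \frac{1}{1-\pi}\left\{\int \Delta_{sp}(x^*,x)\,Q(D|x)\,d\mathcal{B}_N\{H(x)\} - \int \Delta_{sp}(x^*,x)\,\sigma(x)\,dH(x)\cdot\mathcal{Z}_P\right\} + o_P(1).$$

Then, from Lemma 1 and Lebesgue's theorem we have that, in this suitable probability space,

$$\sqrt{N}\,[\hat{\mathcal{R}}(u) - \mathcal{R}(u)] = \int\left\{\pi^{-1}\Delta_{se}(x^*,x)\,P(D|x) + (1-\pi)^{-1}\,r(1 - S_P(x^*))\,\Delta_{sp}(x^*,x)\,Q(D|x)\right\}d\mathcal{B}_N\{H(x)\} + \int\left\{\pi^{-1}\Delta_{se}(x^*,x) - (1-\pi)^{-1}\,r(1 - S_P(x^*))\,\Delta_{sp}(x^*,x)\right\}\sigma(x)\,dH(x)\cdot\mathcal{Z}_P + o_P(1), \tag{14}$$

where $u = 1 - S_P(x^*)$. Hence, setting $v = 1 - S_P(x)$, we have that (in a suitable probability space)

$$\sqrt{N}\,[\hat{\mathcal{R}}(u) - \mathcal{R}(u)] = \int\left\{L^{P}_{se}(u,v) + r(u)\,L^{P}_{sp}(u,v)\right\}d\mathcal{B}_N\{\pi\mathcal{R}(v) + (1-\pi)v\} + \int\left\{L_{se}(u,v) + r(u)\,L_{sp}(u,v)\right\}\sigma(1 - S_P^{-1}(v))\,d[\pi\mathcal{R}(v) + (1-\pi)v]\cdot\mathcal{Z}_P + o_P(1), \tag{15}$$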

and the proof is concluded.□

Proof of Theorem 3 (AUC’s weak convergence)

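Directly, we have that

$$[\hat{\mathcal{A}}_N - \mathcal{A}] = \int [\hat{S}_P(x^*) - S_P(x^*)]\,dS_e(x^*) - \int [\hat{S}_e(x^*) - S_e(x^*)]\,d\hat{S}_P(x^*). \tag{16}$$

From definition (2) we have that $dS_e(x^*) = \pi^{-1}P(D|x^*)\,dH(x^*)$; hence, by using Fubini's theorem and arguing as in Theorem 2,

$$\sqrt{N}\int [\hat{S}_P(x^*) - S_P(x^*)]\,dS_e(x^*) = \frac{1}{\pi(1-\pi)}\left\{\int W_{sp}(x)\,Q(D|x)\,d\mathcal{B}_N\{H(x)\} - \int W_{sp}(x)\,\sigma(x)\,dH(x)\cdot\mathcal{Z}_P + o_P(1)\right\}. \tag{17}$$

Analogously, $dS_P(x^*) = (1-\pi)^{-1}Q(D|x^*)\,dH(x^*)$, and

$$\sqrt{N}\int [\hat{S}_e(x^*) - S_e(x^*)]\,dS_P(x^*) = \frac{1}{\pi(1-\pi)}\left\{\int W_{se}(x)\,P(D|x)\,d\mathcal{B}_N\{H(x)\} - \int W_{se}(x)\,\sigma(x)\,dH(x)\cdot\mathcal{Z}_P + o_P(1)\right\}. \tag{18}$$

Slutsky's lemma guarantees that the asymptotic distribution of $\sqrt{N}\,[\hat{\mathcal{A}}_N - \mathcal{A}]$ is the distribution of

$$\frac{1}{\pi(1-\pi)}\left\{\int [W_{se}(x)\,P(D|x) - W_{sp}(x)\,Q(D|x)]\,d\mathcal{B}_N\{H(x)\} + \int [W_{sp}(x) - W_{se}(x)]\,\sigma(x)\,dH(x)\cdot\mathcal{Z}_P\right\}.$$

From the Brownian bridge properties (see, for instance, [34]) and M3, it is Gaussian with mean zero and variance $v^2$, where

$$[\pi(1-\pi)\,v]^2 = \int [(W_{se}(x) + W_{sp}(x))\,P(D|x) - W_{sp}(x)]^2\,dH(x) + \left\{\int [W_{sp}(x) - W_{se}(x)]\,\sigma(x)\,dH(x)\right\}^2,$$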

and the proof is concluded.□
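For intuition about the quantity whose limit is derived here, the area under the curve can be written as the integral of the sensitivity against the specificity, which turns the plug-in AUC into a double sum over the cohort. The sketch below assumes the conventions used in the proof of Theorem 1 (sensitivity as mass strictly above the threshold, specificity as mass at or below it) and takes the first-stage predicted probabilities as given. The function name and toy data are hypothetical, and this is not the authors' code.

```python
def two_stage_auc(marker, prob):
    """Plug-in AUC: integral of the estimated sensitivity against the
    estimated specificity, using predicted probabilities as weights."""
    n = len(marker)
    sum_p = sum(prob)   # N * pi_hat
    sum_q = n - sum_p   # N * (1 - pi_hat)
    if sum_p <= 0.0 or sum_q <= 0.0:
        raise ValueError("both predicted classes need positive total mass")
    total = 0.0
    for xi, pi_i in zip(marker, prob):
        # The estimated specificity jumps at x_i by (1 - p_i) / sum_q ...
        qi = 1.0 - pi_i
        # ... and the estimated sensitivity there is the predicted-positive
        # mass strictly above x_i, divided by sum_p.
        mass_above = sum(pj for xj, pj in zip(marker, prob) if xj > xi)
        total += qi * mass_above
    return total / (sum_p * sum_q)


# Toy data (hypothetical). With 0/1 "probabilities" this is the proportion of
# (negative, positive) pairs in which the positive has the larger marker value.
marker = [0.1, 0.4, 0.35, 0.8]
status = [0.0, 0.0, 1.0, 1.0]
auc_emp = two_stage_auc(marker, status)

soft = [0.2, 0.5, 0.4, 0.9]
auc_soft = two_stage_auc(marker, soft)
```

The double loop is quadratic in the cohort size. It is kept for readability; a sort followed by a cumulative sum would give the same value faster.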

Additional Monte Carlo results

We report here (Table 4) the full results for the competitors considered in the time-dependent scenario (Table 2). We provide mean ± sd for $\int [\hat{\mathcal{R}}(t) - \mathcal{R}(t)]^2\,dt$, where $\hat{\mathcal{R}}(\cdot)$ is the ROC curve estimate and $\mathcal{R}(\cdot)$ the real ROC curve, and for the respective AUC. We consider the empirical estimator computed from those subjects with complete information (E); the NNE-based estimator proposed in Heagerty et al. [10], computed with the R package survivalROC [35] with two different bandwidth parameters, $0.25\,N^{-1/5}$ (H1) and $0.5\,N^{-1/5}$ (H2); and the estimator proposed in Li et al. [12], computed with the package nsROC [6] with two different bandwidth parameters, $0.1\,N^{-1/5}$ (L1) and $0.5\,N^{-1/5}$ (L2).

Table 4: Time-dependent models. Supplement. Complementary results to Table 2. Same information regarding five alternative procedures.

| Model | N | π | CP | A | ROC curve: E | ROC curve: H1 | ROC curve: H2 | ROC curve: L1 | ROC curve: L2 | AUC: E | AUC: H1 | AUC: H2 | AUC: L1 | AUC: L2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| IV | 200 | 1/2 | 10% | 0.65 | 4.68 ± 4.9 | 3.62 ± 3.8 | 3.08 ± 3.6 | 4.37 ± 3.9 | 4.32 ± 3.9 | 0.657 ± 0.05 | 0.647 ± 0.05 | 0.636 ± 0.04 | 0.644 ± 0.05 | 0.649 ± 0.05 |
| | | | | 0.85 | 2.43 ± 1.9 | 1.99 ± 2.1 | 2.87 ± 3.1 | 2.67 ± 2.3 | 2.44 ± 2.1 | 0.857 ± 0.03 | 0.842 ± 0.03 | 0.822 ± 0.03 | 0.844 ± 0.03 | 0.849 ± 0.03 |
| IV | 200 | 1/2 | 25% | 0.65 | 5.25 ± 5.3 | 4.03 ± 4.3 | 3.41 ± 4.1 | 4.81 ± 4.6 | 4.77 ± 4.6 | 0.658 ± 0.05 | 0.647 ± 0.05 | 0.636 ± 0.05 | 0.641 ± 0.05 | 0.647 ± 0.05 |
| | | | | 0.85 | 2.99 ± 2.5 | 2.41 ± 2.5 | 3.21 ± 3.6 | 3.85 ± 3.5 | 3.16 ± 2.8 | 0.861 ± 0.03 | 0.841 ± 0.03 | 0.822 ± 0.03 | 0.836 ± 0.03 | 0.845 ± 0.03 |
| IV | 300 | 1/3 | 10% | 0.65 | 2.92 ± 2.7 | 2.43 ± 2.5 | 2.24 ± 2.6 | 2.93 ± 2.6 | 2.87 ± 2.6 | 0.654 ± 0.04 | 0.645 ± 0.04 | 0.636 ± 0.04 | 0.648 ± 0.04 | 0.650 ± 0.04 |
| | | | | 0.85 | 1.81 ± 1.5 | 1.66 ± 1.8 | 3.09 ± 2.8 | 1.91 ± 1.7 | 1.82 ± 1.6 | 0.854 ± 0.02 | 0.842 ± 0.03 | 0.822 ± 0.03 | 0.847 ± 0.03 | 0.850 ± 0.03 |
| IV | 300 | 1/3 | 25% | 0.65 | 3.18 ± 2.9 | 2.64 ± 2.8 | 2.40 ± 2.8 | 3.24 ± 2.9 | 3.14 ± 2.9 | 0.654 ± 0.04 | 0.646 ± 0.04 | 0.636 ± 0.04 | 0.646 ± 0.04 | 0.649 ± 0.04 |
| | | | | 0.85 | 2.09 ± 1.8 | 1.84 ± 1.9 | 3.22 ± 3.0 | 2.31 ± 2.1 | 2.09 ± 1.9 | 0.857 ± 0.03 | 0.842 ± 0.03 | 0.823 ± 0.03 | 0.844 ± 0.03 | 0.848 ± 0.03 |
| V | 200 | 1/2 | 10% | 0.65 | 3.07 ± 2.7 | 3.05 ± 2.8 | 5.03 ± 2.9 | 3.11 ± 2.8 | 3.03 ± 2.8 | 0.658 ± 0.04 | 0.662 ± 0.04 | 0.658 ± 0.04 | 0.653 ± 0.04 | 0.653 ± 0.04 |
| | | | | 0.85 | 2.09 ± 1.4 | 1.50 ± 1.3 | 2.29 ± 1.9 | 2.05 ± 1.5 | 2.01 ± 1.5 | 0.859 ± 0.03 | 0.854 ± 0.03 | 0.842 ± 0.03 | 0.853 ± 0.03 | 0.853 ± 0.03 |
| V | 200 | 1/2 | 25% | 0.65 | 3.47 ± 3.1 | 3.32 ± 3.1 | 5.27 ± 3.3 | 3.49 ± 3.1 | 3.37 ± 3.1 | 0.662 ± 0.05 | 0.661 ± 0.05 | 0.657 ± 0.05 | 0.654 ± 0.05 | 0.653 ± 0.05 |
| | | | | 0.85 | 2.54 ± 1.8 | 1.75 ± 1.6 | 2.52 ± 2.3 | 2.39 ± 1.9 | 2.31 ± 1.8 | 0.866 ± 0.03 | 0.854 ± 0.03 | 0.842 ± 0.03 | 0.853 ± 0.03 | 0.854 ± 0.03 |
| V | 300 | 1/3 | 10% | 0.65 | 2.63 ± 2.6 | 2.98 ± 2.9 | 5.69 ± 3.0 | 2.63 ± 2.7 | 2.57 ± 2.6 | 0.653 ± 0.04 | 0.661 ± 0.04 | 0.652 ± 0.04 | 0.651 ± 0.04 | 0.650 ± 0.04 |
| | | | | 0.85 | 1.80 ± 1.4 | 1.44 ± 1.4 | 3.59 ± 2.4 | 1.78 ± 1.5 | 1.77 ± 1.5 | 0.856 ± 0.03 | 0.855 ± 0.03 | 0.840 ± 0.03 | 0.851 ± 0.03 | 0.851 ± 0.03 |
| V | 300 | 1/3 | 25% | 0.65 | 2.94 ± 3.0 | 3.18 ± 3.1 | 5.86 ± 3.2 | 2.90 ± 2.9 | 2.79 ± 2.8 | 0.657 ± 0.04 | 0.662 ± 0.04 | 0.653 ± 0.04 | 0.652 ± 0.04 | 0.651 ± 0.04 |
| | | | | 0.85 | 2.06 ± 1.5 | 1.62 ± 1.5 | 3.74 ± 2.6 | 1.99 ± 1.7 | 1.96 ± 1.6 | 0.861 ± 0.03 | 0.855 ± 0.03 | 0.840 ± 0.03 | 0.852 ± 0.03 | 0.852 ± 0.03 |
| VI | 200 | 1/2 | 10% | 0.65 | 3.47 ± 3.2 | 2.94 ± 2.9 | 2.64 ± 3.0 | 3.43 ± 3.1 | 3.41 ± 3.0 | 0.657 ± 0.04 | 0.647 ± 0.04 | 0.636 ± 0.04 | 0.652 ± 0.04 | 0.654 ± 0.04 |
| | | | | 0.85 | 2.33 ± 1.9 | 2.13 ± 2.2 | 3.02 ± 3.3 | 2.40 ± 2.1 | 2.37 ± 2.1 | 0.859 ± 0.03 | 0.844 ± 0.03 | 0.824 ± 0.03 | 0.852 ± 0.03 | 0.854 ± 0.03 |
| VI | 200 | 1/2 | 25% | 0.65 | 4.30 ± 4.2 | 3.38 ± 3.5 | 2.96 ± 3.6 | 4.04 ± 3.5 | 4.01 ± 3.6 | 0.661 ± 0.05 | 0.648 ± 0.04 | 0.637 ± 0.04 | 0.650 ± 0.04 | 0.653 ± 0.04 |
| | | | | 0.85 | 2.78 ± 2.3 | 2.43 ± 2.5 | 3.24 ± 3.5 | 3.26 ± 2.9 | 2.89 ± 2.5 | 0.865 ± 0.03 | 0.844 ± 0.03 | 0.824 ± 0.03 | 0.845 ± 0.03 | 0.851 ± 0.03 |
| VI | 300 | 1/3 | 10% | 0.65 | 2.51 ± 2.2 | 2.08 ± 2.1 | 1.86 ± 2.0 | 2.47 ± 2.1 | 2.46 ± 2.1 | 0.656 ± 0.03 | 0.647 ± 0.03 | 0.638 ± 0.03 | 0.653 ± 0.03 | 0.653 ± 0.03 |
| | | | | 0.85 | 1.66 ± 1.4 | 1.51 ± 1.6 | 2.47 ± 2.4 | 1.68 ± 1.4 | 1.64 ± 1.4 | 0.858 ± 0.02 | 0.846 ± 0.02 | 0.828 ± 0.02 | 0.853 ± 0.02 | 0.855 ± 0.02 |
| VI | 300 | 1/3 | 25% | 0.65 | 3.10 ± 2.9 | 2.56 ± 2.8 | 2.29 ± 2.7 | 3.05 ± 2.8 | 3.01 ± 2.9 | 0.658 ± 0.04 | 0.647 ± 0.04 | 0.638 ± 0.04 | 0.650 ± 0.04 | 0.652 ± 0.04 |
| | | | | 0.85 | 2.08 ± 1.8 | 1.78 ± 1.8 | 2.65 ± 2.7 | 2.16 ± 1.8 | 2.01 ± 1.8 | 0.863 ± 0.02 | 0.847 ± 0.03 | 0.829 ± 0.03 | 0.850 ± 0.03 | 0.854 ± 0.03 |
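The error measure reported for the ROC curve columns is an integrated squared difference between the estimated and the true curve. A minimal sketch of how such a quantity, and bandwidths of the form c·N^(−1/5), could be computed is given below. The trapezoidal grid, the function names and the example curves are assumptions made for illustration; the paper does not state how the integral was approximated or whether the tabulated values are rescaled.

```python
def integrated_squared_error(r_hat, r_true, grid_size=1000):
    """Trapezoidal approximation of the integral over [0, 1] of
    (r_hat(t) - r_true(t))^2 on an equally spaced grid."""
    h = 1.0 / grid_size
    total = 0.0
    for k in range(grid_size + 1):
        t = k * h
        weight = 0.5 if k in (0, grid_size) else 1.0  # end points count half
        total += weight * (r_hat(t) - r_true(t)) ** 2
    return total * h


def bandwidth(constant, n):
    """Bandwidth of the form constant * n^(-1/5)."""
    return constant * n ** (-1.0 / 5.0)


# Hypothetical curves: the chance line and a copy shifted by a constant 0.1.
chance = lambda t: t
shifted = lambda t: t + 0.1

ise_zero = integrated_squared_error(chance, chance)
ise_shift = integrated_squared_error(shifted, chance)

# Bandwidths for the two sample sizes that appear in the table.
h1_200 = bandwidth(0.25, 200)
h2_300 = bandwidth(0.5, 300)
```

In a Monte Carlo study this error would be computed once per replicate and then summarized by its mean and standard deviation across replicates.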

References

1. Green, DM, Swets, JA. Signal detection theory and psychophysics. New York, NY: Wiley; 1966.

2. Zhou, X-H, Obuchowski, NA, McClish, DK. Statistical methods in diagnostic medicine. New York, NY: Wiley Blackwell; 2002. https://doi.org/10.1002/9780470317082.

3. Krzanowski, WJ, Hand, DJ. ROC curves for continuous data, volume 111 of Monographs on Statistics and Applied Probability. Boca Raton, FL: CRC Press; 2009. https://doi.org/10.1201/9781439800225.

4. Pepe, MS. The statistical evaluation of medical tests for classification and prediction. Oxford: Oxford University Press; 2004. https://doi.org/10.1093/oso/9780198509844.001.0001.

5. Gneiting, T, Vogel, P. Receiver Operating Characteristic (ROC) curves; 2018. arXiv e-prints, art. arXiv:1809.04808, September.

6. Pérez-Fernández, S, Martínez-Camblor, P, Filzmoser, P, Corral, N. nsROC: An R package for non-standard ROC curve analysis. R J 2018;10:55–77. https://doi.org/10.32614/RJ-2018-043.

7. Fluss, R, Faraggi, D, Reiser, B. Estimation of the Youden index and its associated cutoff point. Biometrical J 2005;47:458–72. https://doi.org/10.1002/bimj.200410135.

8. Demidenko, E. The p-Value you can’t buy. Am Statist 2016;70:33–8. https://doi.org/10.1080/00031305.2015.1069760.

9. Gonçalves, L, Oliveira, MR, Subtil, A, de Zea Bermudez, P. ROC curve estimation: An overview. Revstat Stat J 2014;12:1–20.

10. Heagerty, PJ, Lumley, T, Pepe, MS. Time-dependent ROC curves for censored survival data and a diagnostic marker. Biometrics 2000;56:337–44. https://doi.org/10.1111/j.0006-341x.2000.00337.x.

11. Martínez-Camblor, P, Bayón, GF, Pérez-Fernández, S. Cumulative/dynamic ROC curve estimation. J Stat Comput Simul 2016;86:3582–94. https://doi.org/10.1080/00949655.2016.1175442.

12. Li, L, Greene, T, Hu, B. A simple method to estimate the time-dependent receiver operating characteristic curve and the area under the curve with right censored data. Stat Methods Med Res 2018;27:2264–78. https://doi.org/10.1177/0962280216680239.

13. Song, X, Zhou, X-H. A semiparametric approach for the covariate specific ROC curve with survival outcome. Stat Sin 2008;18:947–65.

14. Liu, D, Cai, T, Zheng, Y. Evaluating the predictive value of biomarkers with stratified case-cohort design. Biometrics 2012;68:1219–27. https://doi.org/10.1111/j.1541-0420.2012.01787.x.

15. Chambless, LE, Diao, G. Estimation of time-dependent area under the ROC curve for long-term risk prediction. Stat Med 2006;25:3474–86. https://doi.org/10.1002/sim.2299.

16. Blanche, P, Dartigues, J-F, Jacqmin-Gadda, H. Review and comparison of ROC curve estimators for a time-dependent outcome with marker-dependent censoring. Biometrical J 2013;55:687–704. https://doi.org/10.1002/bimj.201200045.

17. Harrell, FE. Regression modeling strategies: with applications to linear models, logistic and ordinal regression, and survival analysis. Springer Series in Statistics. Cham, Switzerland: Springer International Publishing; 2015. https://doi.org/10.1007/978-3-319-19425-7.

18. Hsieh, F, Turnbull, BW. Nonparametric and semiparametric estimation of the receiver operating characteristic curve. Ann Statist 1996;24:25–40. https://doi.org/10.1214/aos/1033066197.

19. Spanos, A, Harrell, FE, Durack, DT. Differential diagnosis of acute meningitis: an analysis of the predictive value of initial observations. J Am Med Assoc 1989;262:2700–7. https://doi.org/10.1001/jama.1989.03430190084036.

20. Durrleman, S, Simon, R. Flexible regression models with cubic splines. Stat Med 1989;8:551–61. https://doi.org/10.1002/sim.4780080504.

21. Fleming, TR, Harrington, DP. Counting processes and survival analysis. Hoboken, New Jersey: John Wiley & Sons; 1991.

22. Dickson, ER, Grambsch, PM, Fleming, TR, Fisher, LD, Langworthy, A. Prognosis in primary biliary cirrhosis: model for decision making. Hepatology 1989;10:1–7. https://doi.org/10.1002/hep.1840100102.

23. Heagerty, PJ, Zheng, Y. Survival model predictive accuracy and ROC curves. Biometrics 2005;61:92–105. https://doi.org/10.1111/j.0006-341x.2005.030814.x.

24. Cox, DR. Regression models and life-tables. J R Stat Soc Ser B 1972;34:187–220. https://doi.org/10.1111/j.2517-6161.1972.tb00899.x.

25. Hurvich, CM, Simonoff, JS, Tsai, C-L. Smoothing parameter selection in nonparametric regression using an improved Akaike information criterion. J R Stat Soc Ser B 1998;60:271–93. https://doi.org/10.1111/1467-9868.00125.

26. Stone, CJ, Hansen, MH, Kooperberg, C, Truong, YK. Polynomial splines and their tensor products in extended linear modeling. Ann Stat 1997;25:1371–425. https://doi.org/10.1214/aos/1031594728.

27. Zhou, S, Shen, X, Wolfe, DA. Local asymptotics for regression splines and confidence regions. Ann Stat 1998;26:1760–82. https://doi.org/10.1214/aos/1024691356.

28. Falk, M, Wisheckel, F. Asymptotic independence of bivariate order statistics. Stat Probabil Lett 2017;125:91–8. https://doi.org/10.1016/j.spl.2017.01.020.

29. Goldstein, B, Giroir, B, Randolph, A. International pediatric sepsis consensus conference: definitions for sepsis and organ dysfunction in pediatrics. Pediatr Crit Care Med 2005;6:2–8. https://doi.org/10.1097/01.pcc.0000149131.72248.e6.

30. Martínez-Camblor, P, Pardo-Fernández, JC. Parametric estimates for the receiver operating characteristic curve generalization for non-monotone relationships. Stat Methods Med Res 2019;28:2032–48. https://doi.org/10.1177/0962280217747009.

31. Efron, B, Tibshirani, RJ. An introduction to the Bootstrap. Number 57 in monographs on statistics and applied probability. Boca Raton, FL, USA: Chapman & Hall/CRC; 1993.

32. Martínez-Camblor, P, Pérez-Fernández, S, Díaz-Coto, S. Improving the biomarker diagnostic capacity via functional transformations. J Appl Stat 2019;46:1550–66. https://doi.org/10.1080/02664763.2018.1554628.

33. van der Vaart, AW. Asymptotic statistics. Cambridge series in statistical and probabilistic mathematics. Cambridge, England: Cambridge University Press; 1998.

34. Mansuy, R, Yor, M. Aspects of Brownian Motion. Universitext. New York: Springer Berlin Heidelberg; 2008. https://doi.org/10.1007/978-3-540-49966-4.

35. Heagerty, PJ, Saha-Chaudhuri, P. survivalROC: time-dependent ROC curve estimation from censored survival data. R package version 1.0.3; 2013.


Supplementary Material

As supplementary material we provide the file sMSApproach.R, which contains two functions implementing the sMS ROC curve estimator for the binary and time-dependent cases, together with the code used for analyzing the real-world examples (the data sets are freely available online). The file sMSApproach.pdf contains instructions on the use of the functions. The files are available at https://doi.org/10.1515/ijb-2019-0097.


Received: 2019-08-27
Accepted: 2020-05-25
Published Online: 2020-08-21

© 2020 Walter de Gruyter GmbH, Berlin/Boston
