Abstract
To ensure privacy protection and alleviate computational burden, we propose a fast subsampling procedure for the Cox model with massive survival datasets from multi-centered, decentralized sources. The proposed estimator is computed based on optimal subsampling probabilities that we derived, and it requires only one round of communication to transmit subsample-based summary-level statistics between storage sites. For inference, the asymptotic properties of the proposed estimator were rigorously established. An extensive simulation study demonstrated that the proposed approach is effective. The methodology was applied to analyze a large U.S. airline dataset.
Acknowledgments
The authors would like to thank the Editor, the Associate Editor and two reviewers for their constructive and insightful comments that greatly improved the manuscript.
- Research ethics: Not applicable.
- Informed consent: Not applicable.
- Author contributions: The authors have accepted responsibility for the entire content of this manuscript and approved its submission.
- Use of Large Language Models, AI and Machine Learning Tools: None declared.
- Conflict of interest: The authors state no conflict of interest.
- Research funding: None declared.
- Data availability: Not applicable.
In this section, we give the proof details of Theorems 1 and 2. For presentation clarity, we introduce the following notation:
where a^{⊗d} denotes a power of the vector a, with a^{⊗0} = 1, a^{⊗1} = a and a^{⊗2} = aa′. Let
The derivation of theoretical properties of the distributed subsample estimator
Assumption 1.
The baseline hazard satisfies that
Assumption 2.
For k = 1, …, K, the quantity
Assumption 3.
The covariates X_ik are bounded.
Assumption 4.
For k = 1, …, K, there exist two positive definite matrices, Λ_1 and Λ_2, such that
i.e., for any
Assumptions 1–3 are three conditions commonly imposed on the Cox model [22], [23], [24]. Assumption 4 is required to establish the asymptotic normality of the distributed subsample estimator [25].
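As a reminder of the standard linear-algebra fact behind Assumption 4 and its later use in the proof of Theorem 1, the display below records the Rayleigh-quotient bounds for a symmetric positive definite matrix. The symbols v, λ_min and λ_max are our own shorthand, and reading the matrix inequality in Assumption 4 as a positive semidefinite ordering is an assumption, since the displayed condition is not reproduced here.

```latex
% For a symmetric positive definite matrix \Lambda and any vector v:
\lambda_{\min}(\Lambda)\,\|v\|^{2}
  \;\le\; v^{\prime}\Lambda v
  \;\le\; \lambda_{\max}(\Lambda)\,\|v\|^{2},
\qquad
\|\Lambda\| = \lambda_{\max}(\Lambda),
% where \|\Lambda\| denotes the spectral norm.
```

Under that reading, a matrix sandwiched between Λ_1 and Λ_2 has all eigenvalues in the interval from the smallest eigenvalue of Λ_1 to the largest eigenvalue of Λ_2.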
For the purpose of enhancing clarity in presenting our subsampling procedure, we provide the asymptotic properties of subsample-based estimators
Lemma 1.
Under Assumptions 1–3 and r_k = o(n_k), as n_k → ∞ and r_k → ∞, the kth subsample-based estimator
where
with N_ik(t) = I(Δ_ik = 1, Y_ik ≤ t), and
with
Proof of Theorem 1.
Under Assumptions 1–4 and r_k = o(n_k), the asymptotic normality presented in (A.1) indicates that the variable
where Z_k represents a normal random vector with mean zero and covariance matrix Γ_k, and
Let λ_1 > 0 be the smallest eigenvalue of the matrix Λ_1, and λ_2 be the largest eigenvalue of the matrix Λ_2, where Λ_1 and Λ_2 are given in Assumption 4. Then for any vector
This, together with ∥Ψ_k∥ ≤ ∥Λ_2∥ ≤ λ_2, implies that
Hence, as r_k → ∞, the last term of (A.5) satisfies
Therefore, as r_k → ∞, we have
As
where
Proof of Theorem 2.
Subtracting β_0 from both sides of (3.7), we can derive that
Therefore,
where the last inequality is due to
This completes the proof.
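The "subtract β_0 from both sides" step can be illustrated numerically. The sketch below assumes that (3.7), which is not reproduced in this appendix, has the information-weighted form used in aggregated estimating equation estimation [25], namely (Σ_k Ψ_k)^{-1} Σ_k Ψ_k β̃_k. The function names, the site summaries and the value of β_0 are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
def mat_vec(A, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def aggregate(site_summaries):
    """Assumed combination rule: (sum_k Psi_k)^{-1} sum_k Psi_k beta_k.

    Each site contributes only the pair (Psi_k, beta_k), so a single
    round of communication is enough to form the combined estimate.
    """
    p = len(site_summaries[0][1])
    total_psi = [[0.0] * p for _ in range(p)]
    total_rhs = [0.0] * p
    for psi_k, beta_k in site_summaries:
        weighted = mat_vec(psi_k, beta_k)
        for i in range(p):
            total_rhs[i] += weighted[i]
            for j in range(p):
                total_psi[i][j] += psi_k[i][j]
    return solve(total_psi, total_rhs)


# Hypothetical summaries from K = 2 sites (illustrative numbers only).
summaries = [
    ([[2.0, 0.5], [0.5, 1.0]], [0.9, -0.4]),
    ([[1.0, 0.0], [0.0, 3.0]], [1.2, -0.6]),
]
beta_0 = [1.0, -0.5]  # hypothetical true parameter

beta_tilde = aggregate(summaries)
error_direct = [bt - b0 for bt, b0 in zip(beta_tilde, beta_0)]

# Because the weights sum to the identity matrix, the same error is obtained
# by aggregating the per-site errors beta_k - beta_0.
shifted = [(psi, [b - b0 for b, b0 in zip(beta, beta_0)])
           for psi, beta in summaries]
error_via_identity = aggregate(shifted)
```

Under the assumed form, `error_direct` and `error_via_identity` agree up to floating-point error. This is why the error of the combined estimator can be written as a weighted sum of the per-site errors and then bounded term by term.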
References
1. Zuo, L, Zhang, H, Wang, H, Liu, L. Sampling-based estimation for massive survival data with additive hazards model. Stat Med 2021;40:441–50. https://doi.org/10.1002/sim.8783.
2. Yang, Z, Wang, H, Yan, J. Optimal subsampling for parametric accelerated failure time models with massive survival data. Stat Med 2022;41:5421–31. https://doi.org/10.1002/sim.9576.
3. Yang, Z, Wang, H, Yan, J. Subsampling approach for least squares fitting of semi-parametric accelerated failure time models to massive survival data. Stat Comput 2024;34. https://doi.org/10.1007/s11222-024-10391-y.
4. Keret, N, Gorfine, M. Analyzing big EHR data-optimal Cox regression subsampling procedure with rare events. J Am Stat Assoc 2023;118:2262–75. https://doi.org/10.1080/01621459.2023.2209349.
5. Zhang, H, Zuo, L, Wang, H, Sun, L. Approximating partial likelihood estimators via optimal subsampling. J Comput Graph Stat 2024;33:276–88. https://doi.org/10.1080/10618600.2023.2216261.
6. Yao, Y, Wang, H. A review on optimal subsampling methods for massive datasets. J Data Sci 2021;19:151–72. https://doi.org/10.6339/21-jds999.
7. Yu, J, Ai, M, Ye, Z. A review on design inspired subsampling for big data. Stat Pap 2023:1–44. https://doi.org/10.1007/s00362-022-01386-w.
8. Wang, H, Zhu, R, Ma, P. Optimal subsampling for large sample logistic regression. J Am Stat Assoc 2018;113:829–44. https://doi.org/10.1080/01621459.2017.1292914.
9. Wang, H, Yang, M, Stufken, J. Information-based optimal subdata selection for big data linear regression. J Am Stat Assoc 2019;114:393–405. https://doi.org/10.1080/01621459.2017.1408468.
10. Wang, T, Zhang, H. Optimal subsampling for multiplicative regression with massive data. Stat Neerl 2022;76:418–49. https://doi.org/10.1111/stan.12266.
11. Han, L, Tan, KM, Yang, T, Zhang, T. Local uncertainty sampling for large-scale multiclass logistic regression. Ann Stat 2020;48:1770–88. https://doi.org/10.1214/19-aos1867.
12. Wang, H, Ma, Y. Optimal subsampling for quantile regression in big data. Biometrika 2021;108:99–112. https://doi.org/10.1093/biomet/asaa043.
13. Zhang, H, Wang, H. Distributed subdata selection for big data via sampling-based approach. Comput Stat Data Anal 2021;153:107072. https://doi.org/10.1016/j.csda.2020.107072.
14. Zuo, L, Zhang, H, Wang, H, Sun, L. Optimal subsample selection for massive logistic regression with distributed data. Comput Stat 2021;36:2535–62. https://doi.org/10.1007/s00180-021-01089-0.
15. Yu, J, Wang, H, Ai, M, Zhang, H. Optimal distributed subsampling for maximum quasi-likelihood estimators with massive data. J Am Stat Assoc 2022;117:265–76. https://doi.org/10.1080/01621459.2020.1773832.
16. Cox, DR. Regression models and life-tables (with discussions). J Roy Stat Soc B 1972;34:187–220. https://doi.org/10.1111/j.2517-6161.1972.tb00899.x.
17. Han, L, Hou, J, Cho, K, Duan, R, Cai, T. Federated adaptive causal estimation (FACE) of target treatment effects. arXiv:2112.09313v3 2022.
18. Xiong, R, Koenecke, A, Powell, M, Shen, Z, Vogelstein, JT, Athey, S. Federated causal inference in heterogeneous observational data. Stat Med 2023;42:4418–39. https://doi.org/10.1002/sim.9868.
19. Zhang, HH, Lu, W. Adaptive lasso for Cox’s proportional hazards model. Biometrika 2007;94:691–703. https://doi.org/10.1093/biomet/asm037.
20. Fan, J, Li, R. Variable selection for Cox’s proportional hazards model and frailty model. Ann Stat 2002;30:74–99. https://doi.org/10.1214/aos/1015362185.
21. Ren, JJ, Zhou, M. Full likelihood inferences in the Cox model: an empirical likelihood approach. Ann Inst Stat Math 2011;63:1005–18. https://doi.org/10.1007/s10463-010-0272-y.
22. Andersen, PK, Gill, RD. Cox’s regression model for counting processes: a large sample study. Ann Stat 1982;10:1100–20. https://doi.org/10.1214/aos/1176345976.
23. Huang, J, Sun, T, Ying, Z, Yu, Y, Zhang, CH. Oracle inequalities for the lasso in the Cox model. Ann Stat 2013;41:1142–65. https://doi.org/10.1214/13-aos1098.
24. Fang, EX, Ning, Y, Liu, H. Testing and confidence intervals for high dimensional proportional hazards models. J Roy Stat Soc B 2017;79:1415–37. https://doi.org/10.1111/rssb.12224.
25. Lin, N, Xi, R. Aggregated estimating equation estimation. Stat Interface 2011;4:73–83. https://doi.org/10.4310/sii.2011.v4.n1.a8.
26. Wang, J, Zou, J, Wang, H. Sampling with replacement vs Poisson sampling: a comparative study in optimal subsampling. IEEE Trans Inf Theor 2022;68:6605–30. https://doi.org/10.1109/tit.2022.3176955.
© 2025 Walter de Gruyter GmbH, Berlin/Boston
Articles in the same Issue
- Frontmatter
- Research Articles
- Prognostic adjustment with efficient estimators to unbiasedly leverage historical data in randomized trials
- Homogeneity test and sample size of response rates for AC 1 in a stratified evaluation design
- A review of survival stacking: a method to cast survival regression analysis as a classification problem
- DsubCox: a fast subsampling algorithm for Cox model with distributed and massive survival data
- A hybrid hazard-based model using two-piece distributions
- Regression analysis of clustered current status data with informative cluster size under a transformed survival model
- Bayesian covariance regression in functional data analysis with applications to functional brain imaging
- Risk estimation and boundary detection in Bayesian disease mapping
- An improved estimator of the logarithmic odds ratio for small sample sizes using a Bayesian approach
- Short Communication
- A multivariate Bayesian learning approach for improved detection of doping in athletes using urinary steroid profiles
- Research Articles
- Guidance on individualized treatment rule estimation in high dimensions
- Weighted Euclidean balancing for a matrix exposure in estimating causal effect
- Penalized regression splines in Mixture Density Networks