DsubCox: a fast subsampling algorithm for Cox model with distributed and massive survival data

Haixiang Zhang; Yang Li; HaiYing Wang

doi:10.1515/ijb-2024-0042

Article


Published/Copyright: February 4, 2025


From the journal The International Journal of Biostatistics Volume 21 Issue 1

Abstract

To ensure privacy protection and alleviate computational burden, we propose a fast subsampling procedure for the Cox model with massive survival datasets from multi-centered, decentralized sources. The proposed estimator is computed from the optimal subsampling probabilities that we derive, and it requires only one round of communication to transmit subsample-based summary-level statistics between storage sites. For inference, the asymptotic properties of the proposed estimator are rigorously established. An extensive simulation study demonstrates that the proposed approach is effective. The methodology is applied to analyze a large U.S. airline dataset.

Keywords: distributed learning; L-optimality criterion; massive survival data; optimal subsampling

Corresponding author: Haixiang Zhang, Center for Applied Mathematics and KL-AAGDM, Tianjin University, Tianjin 300072, China, E-mail: haixiang.zhang@tju.edu.cn

Acknowledgments

The authors would like to thank the Editor, the Associate Editor and two reviewers for their constructive and insightful comments that greatly improved the manuscript.

Research ethics: Not applicable.
Informed consent: Not applicable.
Author contributions: The authors have accepted responsibility for the entire content of this manuscript and approved its submission.
Use of Large Language Models, AI and Machine Learning Tools: None declared.
Conflict of interest: The authors state no conflict of interest.
Research funding: None declared.
Data availability: Not applicable.

Appendix

In this section, we give the proof details of Theorems 1 and 2. For presentation clarity, we introduce the following notation:

$$S_k^{(d)}(t, \beta) = \frac{1}{n_k} \sum_{i=1}^{n_k} I(Y_{ik} \ge t)\, X_{ik}^{\otimes d} \exp(\beta' X_{ik}), \quad d = 0, 1 \text{ or } 2,$$

where $a^{\otimes d}$ denotes a power of the vector $a$, with $a^{\otimes 0} = 1$, $a^{\otimes 1} = a$ and $a^{\otimes 2} = aa'$. Let $\bar{X}_k(t, \beta) = S_k^{(1)}(t, \beta) / S_k^{(0)}(t, \beta)$. Throughout this paper, $\|A\| = \left(\sum_{1 \le i, j \le p} a_{ij}^2\right)^{1/2}$ for a matrix $A = (a_{ij})$.
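As a minimal sketch of this notation, the following Python code evaluates $S_k^{(d)}(t, \beta)$ and $\bar{X}_k(t, \beta)$ for a single site. The function names `s_d` and `x_bar` and the toy data are illustrative assumptions, not part of the paper; `Y` holds the observed times and `X` the covariate matrix of one site.

```python
import numpy as np

def s_d(t, beta, Y, X, d):
    """S^(d)(t, beta): mean over i of I(Y_i >= t) * X_i^{(x)d} * exp(beta'X_i)."""
    n = len(Y)
    # At-risk indicator times the relative risk, one weight per subject.
    w = (Y >= t) * np.exp(X @ beta)
    if d == 0:
        return w.sum() / n                      # scalar
    if d == 1:
        return (w[:, None] * X).sum(axis=0) / n  # vector of length p
    if d == 2:
        return (X.T * w) @ X / n                # p x p matrix
    raise ValueError("d must be 0, 1 or 2")

def x_bar(t, beta, Y, X):
    """Risk-set weighted covariate mean S^(1) / S^(0)."""
    return s_d(t, beta, Y, X, 1) / s_d(t, beta, Y, X, 0)

# Hypothetical toy site with three subjects and two covariates.
Y = np.array([1.0, 2.0, 3.0])
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
beta = np.zeros(2)
xbar_at_2 = x_bar(2.0, beta, Y, X)  # only subjects 2 and 3 are at risk at t = 2
```

With $\beta = 0$ every at-risk subject has weight one, so $\bar{X}_k(2, \beta)$ is simply the average covariate vector of the two subjects still at risk.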

The derivation of the theoretical properties of the distributed subsample estimator $\breve{\beta}_{\mathrm{DSE}}$ relies on the following assumptions.

Assumption 1.

The baseline hazard satisfies $\int_0^\tau \lambda_0(t)\, dt < \infty$, and $P(T_{ik} \ge \tau) > 0$.

Assumption 2.

For $k = 1, \ldots, K$, the quantity

$$\frac{1}{n_k} \sum_{i=1}^{n_k} \int_0^\tau \left[ \frac{S_k^{(2)}(t, \beta)}{S_k^{(0)}(t, \beta)} - \left\{ \frac{S_k^{(1)}(t, \beta)}{S_k^{(0)}(t, \beta)} \right\}^{\otimes 2} \right] dN_{ik}(t)$$

converges in probability to a positive definite matrix for all $\beta \in \Theta$, where $\Theta$ is a compact set containing the true value of $\beta$.

Assumption 3.

The covariates $X_{ik}$ are bounded.

Assumption 4.

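For $k = 1, \ldots, K$, there exist two positive definite matrices $\Lambda_1$ and $\Lambda_2$ such that

$$\Lambda_1 \le \frac{1}{n_k} \sum_{i=1}^{n_k} \int_0^\tau \left[ \frac{S_k^{(2)}(t, \beta_0)}{S_k^{(0)}(t, \beta_0)} - \left\{ \frac{S_k^{(1)}(t, \beta_0)}{S_k^{(0)}(t, \beta_0)} \right\}^{\otimes 2} \right] dN_{ik}(t) \le \Lambda_2,$$

i.e., for any $v \in \mathbb{R}^p$,

$$v' \Lambda_1 v \le \frac{1}{n_k} \sum_{i=1}^{n_k} \int_0^\tau v' \left[ \frac{S_k^{(2)}(t, \beta_0)}{S_k^{(0)}(t, \beta_0)} - \left\{ \frac{S_k^{(1)}(t, \beta_0)}{S_k^{(0)}(t, \beta_0)} \right\}^{\otimes 2} \right] v \, dN_{ik}(t) \le v' \Lambda_2 v.$$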

Assumptions 1–3 are commonly imposed conditions for the Cox model [22], [23], [24]. Assumption 4 is required to establish the asymptotic normality of the distributed subsample estimator [25].

To clarify the presentation of our subsampling procedure, we first state the asymptotic properties of the subsample-based estimators $\breve{\beta}_k$, which can be derived analogously to Proposition 1 in [5] together with Proposition 2 in [26].

Lemma 1.

Under Assumptions 1–3, if $r_k = o(n_k)$, then as $n_k \to \infty$ and $r_k \to \infty$, the $k$th subsample-based estimator $\breve{\beta}_k$ is a consistent estimator of $\beta_0$ with convergence rate $O_P(r_k^{-1/2})$, where $k = 1, \ldots, K$. In addition, we have

$$\Omega_k^{-1/2} (\breve{\beta}_k - \beta_0) \xrightarrow{d} N(0, I), \tag{A.1}$$

where $\xrightarrow{d}$ denotes convergence in distribution, $\Omega_k = \Psi_k^{-1} \Gamma_k \Psi_k^{-1}$,

$$\Psi_k = \frac{1}{n_k} \sum_{i=1}^{n_k} \int_0^\tau \left[ \frac{S_k^{(2)}(t, \beta_0)}{S_k^{(0)}(t, \beta_0)} - \left\{ \frac{S_k^{(1)}(t, \beta_0)}{S_k^{(0)}(t, \beta_0)} \right\}^{\otimes 2} \right] dN_{ik}(t) \tag{A.2}$$

with $N_{ik}(t) = I(\Delta_{ik} = 1, Y_{ik} \le t)$, and

$$\Gamma_k = \frac{1}{n_k^2} \sum_{i=1}^{n_k} \frac{1}{\pi_{ik}} \left[ \int_0^\tau \left\{ X_{ik} - \bar{X}_k(t, \beta_0) \right\} dM_{ik}(t, \beta_0) \right]^{\otimes 2} - \frac{1}{n_k^2} \sum_{i=1}^{n_k} \left[ \int_0^\tau \left\{ X_{ik} - \bar{X}_k(t, \beta_0) \right\} dM_{ik}(t, \beta_0) \right]^{\otimes 2} \tag{A.3}$$

with $M_{ik}(t, \beta) = N_{ik}(t) - \int_0^t I(Y_{ik} \ge u) \exp(\beta' X_{ik}) \lambda_0(u)\, du$.
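Because $N_{ik}(t)$ jumps only at an observed failure time, the integral in (A.2) reduces to a sum over subjects with $\Delta_{ik} = 1$ and $Y_{ik} \le \tau$. The sketch below computes this full-data matrix at a supplied `beta` for one site. The function name `psi_hat`, the argument names, and the toy data are assumptions for illustration; the weighted subsample version $\breve{\Psi}_k$ from the main text is not reproduced here.

```python
import numpy as np

def psi_hat(beta, Y, delta, X, tau=np.inf):
    """Evaluate the matrix in (A.2) at `beta` using all n subjects of one site."""
    n, p = X.shape
    risk = np.exp(X @ beta)          # exp(beta'X_i) for every subject
    psi = np.zeros((p, p))
    for i in range(n):
        if delta[i] != 1 or Y[i] > tau:
            continue                 # dN_i(t) has no jump on [0, tau]
        w = (Y >= Y[i]) * risk       # weights of the risk set at time Y_i
        s0 = w.sum() / n
        s1 = (w[:, None] * X).sum(axis=0) / n
        s2 = (X.T * w) @ X / n
        # S2/S0 - (S1/S0)^{(x)2}: weighted covariance of X over the risk set.
        psi += s2 / s0 - np.outer(s1, s1) / s0**2
    return psi / n

# Hypothetical toy site: three subjects, the last one censored.
Y = np.array([1.0, 2.0, 3.0])
delta = np.array([1, 1, 0])
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
psi = psi_hat(np.zeros(2), Y, delta, X)
```

Each summand is a weighted covariance matrix of the covariates over the current risk set, so the result is symmetric and positive semidefinite by construction.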

Proof of Theorem 1.

Under Assumptions 1–4 and $r_k = o(n_k)$, the asymptotic normality in (A.1) indicates that the estimators $\breve{\beta}_k$, $k = 1, \ldots, K$, asymptotically follow independent normal distributions with mean $\beta_0$ and covariance matrices $\Omega_k$. Based on Lemma 1, we have the following expression:

$$\Psi_k (\breve{\beta}_k - \beta_0) = Z_k + \Psi_k R_k, \tag{A.4}$$

where $Z_k$ represents a normal random vector with mean zero and covariance matrix $\Gamma_k$, and $R_k = O_P(r_k^{-1})$. Summing both sides of (A.4) over $k$ and premultiplying by $\left(\sum_{k=1}^K \Psi_k\right)^{-1}$, we get

$$\left( \sum_{k=1}^K \Psi_k \right)^{-1} \sum_{k=1}^K \Psi_k (\breve{\beta}_k - \beta_0) = \left( \sum_{k=1}^K \Psi_k \right)^{-1} \sum_{k=1}^K Z_k + \left( \sum_{k=1}^K \Psi_k \right)^{-1} \sum_{k=1}^K \Psi_k R_k. \tag{A.5}$$

Let $\lambda_1 > 0$ be the smallest eigenvalue of $\Lambda_1$ and $\lambda_2$ the largest eigenvalue of $\Lambda_2$, where $\Lambda_1$ and $\Lambda_2$ are given in Assumption 4. Then for any vector $v \in \mathbb{R}^p$, Assumption 4 gives $v' \Psi_k v \ge v' \Lambda_1 v \ge \lambda_1 \|v\|^2$. Hence, $v' \left( \frac{1}{K} \sum_{k=1}^K \Psi_k \right) v \ge \lambda_1 \|v\|^2$. In the remainder of the proof, the norm of a $p \times p$ positive definite matrix $A$ is defined as $\|A\| = \sup_{v \in \mathbb{R}^p,\, v \ne 0} \|Av\| / \|v\|$. Based on Facts A.1 and A.2 of [25], we have

$$\left\| \left( \frac{1}{K} \sum_{k=1}^K \Psi_k \right)^{-1} \right\| \le \frac{1}{\lambda_1}.$$
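One way to see this bound is the standard eigenvalue argument sketched below. It assumes the operator norm defined above and uses only the symmetry and positive definiteness of the $\Psi_k$; the symbol $A$ is shorthand introduced here, and the exact statements of Facts A.1 and A.2 in [25] are not reproduced.

```latex
A := \frac{1}{K} \sum_{k=1}^{K} \Psi_k, \qquad
v' A v \ge \lambda_1 \|v\|^2 \ \text{for all } v \in \mathbb{R}^p
\;\Longrightarrow\; \lambda_{\min}(A) \ge \lambda_1 > 0,
\\[4pt]
\|A^{-1}\| = \lambda_{\max}(A^{-1}) = \frac{1}{\lambda_{\min}(A)} \le \frac{1}{\lambda_1}.
```

The first line follows from the Rayleigh quotient characterization of the smallest eigenvalue of a symmetric matrix; the second uses the fact that the operator norm of a symmetric positive definite matrix equals its largest eigenvalue.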

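This, together with $\|\Psi_k\| \le \|\Lambda_2\| \le \lambda_2$, leads to

$$\left\| \left( \sum_{k=1}^K \Psi_k \right)^{-1} \Psi_k \right\| \le \left\| \left( \frac{1}{K} \sum_{k=1}^K \Psi_k \right)^{-1} \right\| \left\| \frac{1}{K} \Psi_k \right\| \le \frac{\lambda_2}{K \lambda_1}. \tag{A.6}$$

Hence, as $r_k \to \infty$, the last term of (A.5) satisfies

$$\left\| \left( \sum_{k=1}^K \Psi_k \right)^{-1} \sum_{k=1}^K \Psi_k R_k \right\| \le \sum_{k=1}^K \left\| \left( \sum_{k=1}^K \Psi_k \right)^{-1} \Psi_k \right\| \|R_k\| \le \frac{\lambda_2}{\lambda_1 K} \sum_{k=1}^K \|R_k\| = o_P(1).$$

Therefore, as $r_k \to \infty$, we have

$$\left( \sum_{k=1}^K \Psi_k \right)^{-1} \sum_{k=1}^K \Psi_k (\breve{\beta}_k - \beta_0) \xrightarrow{d} N\left( 0, \left( \sum_{k=1}^K \Psi_k \right)^{-1} \left( \sum_{k=1}^K \Gamma_k \right) \left( \sum_{k=1}^K \Psi_k \right)^{-1} \right). \tag{A.7}$$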

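As $\breve{\Psi}_k$ and $\breve{\Gamma}_k$ are consistent for $\Psi_k$ and $\Gamma_k$, respectively, Slutsky’s theorem together with (A.7) guarantees that

$$\breve{\Omega}_{\mathrm{DSE}}^{-1/2} (\breve{\beta}_{\mathrm{DSE}} - \beta_0) \xrightarrow{d} N(0, I),$$

where $\breve{\Omega}_{\mathrm{DSE}} = \left( \sum_{k=1}^K \breve{\Psi}_k \right)^{-1} \left( \sum_{k=1}^K \breve{\Gamma}_k \right) \left( \sum_{k=1}^K \breve{\Psi}_k \right)^{-1}$. This ends the proof.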
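The aggregation behind this result can be sketched in a few lines of Python. The sketch assumes that each site sends its triple $(\breve{\beta}_k, \breve{\Psi}_k, \breve{\Gamma}_k)$ to a central node once, and that the combined estimator takes the $\Psi$-weighted form appearing on the left-hand side of (A.7). The function name `aggregate_dse` and the toy inputs are hypothetical.

```python
import numpy as np

def aggregate_dse(betas, psis, gammas):
    """Combine per-site summaries into a pooled estimate and sandwich covariance."""
    psi_sum = np.sum(psis, axis=0)        # sum_k Psi_k
    gamma_sum = np.sum(gammas, axis=0)    # sum_k Gamma_k
    # sum_k Psi_k beta_k, then solve (sum_k Psi_k) x = that vector.
    weighted = sum(P @ b for P, b in zip(psis, betas))
    beta_dse = np.linalg.solve(psi_sum, weighted)
    psi_inv = np.linalg.inv(psi_sum)
    omega_dse = psi_inv @ gamma_sum @ psi_inv
    return beta_dse, omega_dse

# Hypothetical summaries from K = 2 sites with p = 2 covariates.
betas = [np.array([1.0, 0.0]), np.array([3.0, 2.0])]
psis = [np.eye(2), 3.0 * np.eye(2)]
gammas = [0.2 * np.eye(2), 0.2 * np.eye(2)]
beta_dse, omega_dse = aggregate_dse(betas, psis, gammas)
```

In this toy case the second site carries three times the weight of the first, so the pooled estimate lies three quarters of the way from the first site's estimate to the second's.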

Proof of Theorem 2.

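Subtracting $\beta_0$ from both sides of (3.7), we can derive that

$$\breve{\beta}_{\mathrm{DSE}} - \beta_0 = \left( \sum_{k=1}^K \breve{\Psi}_k \right)^{-1} \sum_{k=1}^K \breve{\Psi}_k (\breve{\beta}_k - \beta_0).$$

Therefore,

$$\|\breve{\beta}_{\mathrm{DSE}} - \beta_0\| \le \sum_{k=1}^K \left\| \left( \sum_{k=1}^K \breve{\Psi}_k \right)^{-1} \breve{\Psi}_k (\breve{\beta}_k - \beta_0) \right\| \le \sum_{k=1}^K \|\breve{\beta}_k - \beta_0\|,$$

where the last inequality is due to $\left\| \left( \sum_{k=1}^K \breve{\Psi}_k \right)^{-1} \breve{\Psi}_k \right\| \le 1$. Let $k_0 = \arg\max_{1 \le k \le K} \{ \|\breve{\beta}_k - \beta_0\| \}$; then we have

$$\|\breve{\beta}_{\mathrm{DSE}} - \beta_0\| \le K \|\breve{\beta}_{k_0} - \beta_0\|.$$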

This completes the proof.

References

1. Zuo, L, Zhang, H, Wang, H, Liu, L. Sampling-based estimation for massive survival data with additive hazards model. Stat Med 2021;40:441–50. https://doi.org/10.1002/sim.8783.

2. Yang, Z, Wang, H, Yan, J. Optimal subsampling for parametric accelerated failure time models with massive survival data. Stat Med 2022;41:5421–31. https://doi.org/10.1002/sim.9576.

3. Yang, Z, Wang, H, Yan, J. Subsampling approach for least squares fitting of semi-parametric accelerated failure time models to massive survival data. Stat Comput 2024;34. https://doi.org/10.1007/s11222-024-10391-y.

4. Keret, N, Gorfine, M. Analyzing big EHR data-optimal Cox regression subsampling procedure with rare events. J Am Stat Assoc 2023;118:2262–75. https://doi.org/10.1080/01621459.2023.2209349.

5. Zhang, H, Zuo, L, Wang, H, Sun, L. Approximating partial likelihood estimators via optimal subsampling. J Comput Graph Stat 2024;33:276–88. https://doi.org/10.1080/10618600.2023.2216261.

6. Yao, Y, Wang, H. A review on optimal subsampling methods for massive datasets. J Data Sci 2021;19:151–72. https://doi.org/10.6339/21-jds999.

7. Yu, J, Ai, M, Ye, Z. A review on design inspired subsampling for big data. Stat Pap 2023:1–44. https://doi.org/10.1007/s00362-022-01386-w.

8. Wang, H, Zhu, R, Ma, P. Optimal subsampling for large sample logistic regression. J Am Stat Assoc 2018;113:829–44. https://doi.org/10.1080/01621459.2017.1292914.

9. Wang, H, Yang, M, Stufken, J. Information-based optimal subdata selection for big data linear regression. J Am Stat Assoc 2019;114:393–405. https://doi.org/10.1080/01621459.2017.1408468.

10. Wang, T, Zhang, H. Optimal subsampling for multiplicative regression with massive data. Stat Neerl 2022;76:418–49. https://doi.org/10.1111/stan.12266.

11. Han, L, Tan, KM, Yang, T, Zhang, T. Local uncertainty sampling for large-scale multiclass logistic regression. Ann Stat 2020;48:1770–88. https://doi.org/10.1214/19-aos1867.

12. Wang, H, Ma, Y. Optimal subsampling for quantile regression in big data. Biometrika 2021;108:99–112. https://doi.org/10.1093/biomet/asaa043.

13. Zhang, H, Wang, H. Distributed subdata selection for big data via sampling-based approach. Comput Stat Data Anal 2021;153:107072. https://doi.org/10.1016/j.csda.2020.107072.

14. Zuo, L, Zhang, H, Wang, H, Sun, L. Optimal subsample selection for massive logistic regression with distributed data. Comput Stat 2021;36:2535–62. https://doi.org/10.1007/s00180-021-01089-0.

15. Yu, J, Wang, H, Ai, M, Zhang, H. Optimal distributed subsampling for maximum quasi-likelihood estimators with massive data. J Am Stat Assoc 2022;117:265–76. https://doi.org/10.1080/01621459.2020.1773832.

16. Cox, DR. Regression models and life-tables (with discussions). J Roy Stat Soc B 1972;34:187–220. https://doi.org/10.1111/j.2517-6161.1972.tb00899.x.

17. Han, L, Hou, J, Cho, K, Duan, R, Cai, T. Federated adaptive causal estimation (face) of target treatment effects. arXiv:2112.09313v3 2022.

18. Xiong, R, Koenecke, A, Powell, M, Shen, Z, Vogelstein, JT, Athey, S. Federated causal inference in heterogeneous observational data. Stat Med 2023;42:4418–39. https://doi.org/10.1002/sim.9868.

19. Zhang, HH, Lu, W. Adaptive lasso for cox’s proportional hazards model. Biometrika 2007;94:691–703. https://doi.org/10.1093/biomet/asm037.

20. Fan, J, Li, R. Variable selection for cox’s proportional hazards model and frailty model. Ann Stat 2002;30:74–99. https://doi.org/10.1214/aos/1015362185.

21. Ren, JJ, Zhou, M. Full likelihood inferences in the Cox model: an empirical likelihood approach. Ann Inst Stat Math 2011;63:1005–18. https://doi.org/10.1007/s10463-010-0272-y.

22. Andersen, PK, Gill, RD. Cox’s regression model for counting processes: a large sample study. Ann Stat 1982;10:1100–20. https://doi.org/10.1214/aos/1176345976.

23. Huang, J, Sun, T, Ying, Z, Yu, Y, Zhang, CH. Oracle inequalities for the lasso in the cox model. Ann Stat 2013;41:1142–65. https://doi.org/10.1214/13-aos1098.

24. Fang, EX, Ning, Y, Liu, H. Testing and confidence intervals for high dimensional proportional hazards models. J Roy Stat Soc B 2017;79:1415–37. https://doi.org/10.1111/rssb.12224.

25. Lin, N, Xi, R. Aggregated estimating equation estimation. Stat Interface 2011;4:73–83. https://doi.org/10.4310/sii.2011.v4.n1.a8.

26. Wang, J, Zou, J, Wang, H. Sampling with replacement vs poisson sampling: a comparative study in optimal subsampling. IEEE Trans Inf Theor 2022;68:6605–30. https://doi.org/10.1109/tit.2022.3176955.

Received: 2023-12-22

Accepted: 2024-09-30

Published Online: 2025-02-04
