Abstract
Statistical analysis of high-dimensional data has attracted growing attention because such data are abundant in fields such as genetics and genomics and raise many interesting problems. One such problem is the identification of a gene or genes that have significant effects on the occurrence of, or are significantly related to, a certain disease. In this paper, we discuss this problem, which can be formulated as a group test, that is, a test of a group of variables or coefficients, when the response variable is a right-censored failure time. For this problem, we develop a corrected variance reduced partial profiling (CVRPP) linear regression model and a likelihood ratio test procedure for the case where the failure time of interest follows the additive hazards model. A numerical study suggests that the proposed method works well in practical situations and performs better than the existing one. An illustrative example is provided.
1 Introduction
Statistical analysis of high-dimensional data has attracted growing attention because such data are abundant in fields such as genetics and genomics and raise many interesting problems. One such problem is the identification of a gene or genes that have significant effects on the occurrence of, or are significantly related to, a certain disease, for purposes such as predicting survival rates [1, 2, 3, 4]. In this paper, we discuss this problem, which can be formulated as a group test, that is, a test of a group of predictor variables or coefficients such as genes or genomic factors, when the response variable is a right-censored failure time. By high-dimension, we usually mean that the number of predictor variables denoted by
Let
given
or the effects of a set of the predictor variables for the high-dimensional situation. In the above,
As mentioned above, many authors have discussed the analysis of high-dimensional data or more specifically the analysis of high-dimensional data under the failure time context or concerning the test of hypotheses similar to
In the following, to test the hypothesis
In Section 4, we will present some results obtained from a simulation study conducted to evaluate the performance of the test procedure and they suggest that it works well in practical situations. An illustration is given in Section 5 and Section 6 provides some discussion and concluding remarks.
2 Notation and review
Consider a failure time study and let
Furthermore let
. Also define
(see Martinussen and Scheike [9, 10]), where
For a subset
Given
where
In addition, they suggested to estimate its asymptotic variance by
It follows that one can test
For general
3 A likelihood ratio test procedure
Now we will present a new test procedure based on the likelihood ratio principle. For this, we will first present a different partition procedure for
To describe the new partition procedure, let
For the development of the CVRPP linear regression model based on the new partition procedure, for each
and
Then as the VRPP linear regression model (2.10) in Zhong et al. [4], we have
Note that
where
Note that one can naturally estimate the covariance matrix of
Note that each component of
for testing the hypothesis
Under model (10) and for each
and
the estimators of the covariance matrix of
whose distribution can be approximated by the
Note that to implement the likelihood ratio test procedure above, one needs to choose
If no such
Two other issues related to the implementation of the proposed test procedure are the determination of
4 A simulation study
In this section, we present some results obtained from a simulation study conducted to assess the performance of the likelihood ratio test procedure proposed in the previous sections. In the study, by following Zhong et al. [4], we generated the true failure times
Tables 1 and 2 give the estimated size and empirical power of the test procedures based on
Table 1: Empirical sizes and powers of the test procedure given in Zhong et al. [4] and the proposed one with 25% right censoring percentage.
| d | (p, n) | Size (TZ) | Size (TF) | Power (TZ) | Power (TF) |
|---|---|---|---|---|---|
| {5} | (100, 100) | 0.051 | 0.040 | 0.176 | 1 |
| {5} | (200, 100) | 0.056 | 0.041 | 0.141 | 1 |
| {5} | (600, 100) | 0.046 | 0.053 | 0.108 | 1 |
| {5, 6} | (100, 100) | 0.128 | 0.043 | 0.124 | 1 |
| {5, 6} | (200, 100) | 0.117 | 0.047 | 0.139 | 1 |
| {5, 6} | (600, 100) | 0.133 | 0.049 | 0.133 | 1 |
| {6 : 10} | (100, 100) | 0.321 | 0.041 | 0.243 | 1 |
| {6 : 10} | (200, 100) | 0.309 | 0.053 | 0.207 | 1 |
| {6 : 10} | (600, 100) | 0.261 | 0.055 | 0.180 | 1 |
| {16 : 25} | (100, 100) | 0.403 | 0.047 | 0.342 | 1 |
| {16 : 25} | (200, 100) | 0.423 | 0.053 | 0.324 | 1 |
| {16 : 25} | (600, 100) | 0.375 | 0.049 | 0.233 | 1 |
| {16 : 35} | (100, 100) | 0.464 | 0.050 | 0.366 | 0.997 |
| {16 : 35} | (200, 100) | 0.458 | 0.059 | 0.329 | 0.974 |
| {16 : 35} | (600, 100) | 0.360 | 0.048 | 0.202 | 0.868 |
Table 2: Empirical sizes and powers of the test procedure given in Zhong et al. [4] and the proposed one with 35% right censoring percentage.
| d | (p, n) | Size (TZ) | Size (TF) | Power (TZ) | Power (TF) |
|---|---|---|---|---|---|
| {5} | (100, 100) | 0.058 | 0.052 | 0.142 | 1 |
| {5} | (200, 100) | 0.051 | 0.056 | 0.116 | 1 |
| {5} | (600, 100) | 0.044 | 0.056 | 0.081 | 1 |
| {5, 6} | (100, 100) | 0.155 | 0.051 | 0.126 | 1 |
| {5, 6} | (200, 100) | 0.119 | 0.055 | 0.111 | 1 |
| {5, 6} | (600, 100) | 0.116 | 0.057 | 0.131 | 1 |
| {6 : 10} | (100, 100) | 0.303 | 0.062 | 0.293 | 1 |
| {6 : 10} | (200, 100) | 0.286 | 0.058 | 0.272 | 1 |
| {6 : 10} | (600, 100) | 0.295 | 0.046 | 0.218 | 1 |
| {16 : 25} | (100, 100) | 0.437 | 0.052 | 0.356 | 1 |
| {16 : 25} | (200, 100) | 0.439 | 0.039 | 0.371 | 1 |
| {16 : 25} | (600, 100) | 0.423 | 0.050 | 0.289 | 1 |
| {16 : 35} | (100, 100) | 0.514 | 0.049 | 0.443 | 0.998 |
| {16 : 35} | (200, 100) | 0.486 | 0.055 | 0.433 | 0.981 |
| {16 : 35} | (600, 100) | 0.426 | 0.061 | 0.252 | 0.907 |
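When reading empirical sizes such as those tabulated above, it helps to keep the Monte Carlo error in mind. The following is a minimal sketch, not the authors' code, of how an empirical size or power is computed from replicated p-values and how its binomial standard error can be gauged. The function names, the example p-values and the figure of 1000 replications are illustrative assumptions; this section does not state the number of replications used.

```python
import math


def rejection_rate(p_values, alpha=0.05):
    """Fraction of replications whose p-value falls below alpha.

    Under the null hypothesis this estimates the size of the test;
    under an alternative it estimates the power.
    """
    if not p_values:
        raise ValueError("need at least one replication")
    rejections = sum(1 for p in p_values if p < alpha)
    return rejections / len(p_values)


def monte_carlo_se(rate, n_reps):
    """Binomial standard error of an estimated rejection rate."""
    return math.sqrt(rate * (1.0 - rate) / n_reps)


# Hypothetical p-values from 8 replications generated under the null.
example_p = [0.51, 0.03, 0.20, 0.77, 0.64, 0.09, 0.41, 0.95]
size_hat = rejection_rate(example_p)  # one of the eight values is below 0.05

# Assumed replication count (not given in this section): 1000.
se_at_nominal = monte_carlo_se(0.05, 1000)
```

Under the assumed 1000 replications, the standard error at the nominal level 0.05 is roughly 0.007, so an empirical size of 0.04 to 0.06 would be within about two standard errors of nominal, whereas values above 0.1 would indicate clear size inflation.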
Figure 1: Q-Q plots for the proposed test statistic $T_F$ and the test statistic $T_Z$ given in Zhong et al. [4].
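For readers who wish to reproduce this kind of diagnostic, the sketch below pairs empirical percentiles of a set of simulated test statistics with percentiles of an F reference distribution. It is an illustration only: the degrees of freedom `df1` and `df2`, the function names and the use of a Monte Carlo reference sample (rather than exact F quantiles) are assumptions made to keep the example dependent on the Python standard library alone.

```python
import random
import statistics


def f_draw(rng, df1, df2):
    """One draw from an F(df1, df2) distribution via two chi-square draws."""
    chi_num = rng.gammavariate(df1 / 2.0, 2.0)  # chi-square with df1 d.f.
    chi_den = rng.gammavariate(df2 / 2.0, 2.0)  # chi-square with df2 d.f.
    return (chi_num / df1) / (chi_den / df2)


def qq_points(test_stats, df1, df2, n_ref=20000, seed=1):
    """Return 99 (reference percentile, empirical percentile) pairs.

    The reference percentiles come from a large simulated F sample,
    so they are approximations to the exact F quantiles.
    """
    rng = random.Random(seed)
    reference = [f_draw(rng, df1, df2) for _ in range(n_ref)]
    theoretical = statistics.quantiles(reference, n=100)
    empirical = statistics.quantiles(test_stats, n=100)
    return list(zip(theoretical, empirical))


# Hypothetical usage: statistics that really do follow F(5, 50).
demo_rng = random.Random(2)
demo_stats = [f_draw(demo_rng, 5, 50) for _ in range(2000)]
points = qq_points(demo_stats, 5, 50)
```

Points lying close to the 45-degree line indicate that the reference distribution describes the statistic well; systematic departures in the upper tail are what translate into inflated or deflated empirical sizes.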
To further assess the performance of the two test statistics
Table 3: Empirical sizes and powers of the test procedure given in Zhong et al. [4] and the proposed one with the misspecified model.
| d | (p, n) | Size (TZ) | Size (TF) | Power (TZ) | Power (TF) |
|---|---|---|---|---|---|
| {5} | (100, 100) | 0.264 | 0.051 | 0.504 | 1 |
| {5} | (200, 100) | 0.256 | 0.052 | 0.523 | 1 |
| {5} | (600, 100) | 0.242 | 0.049 | 0.523 | 1 |
| {5, 6} | (100, 100) | 0.492 | 0.047 | 0.589 | 1 |
| {5, 6} | (200, 100) | 0.487 | 0.065 | 0.553 | 1 |
| {5, 6} | (600, 100) | 0.503 | 0.053 | 0.586 | 1 |
| {6 : 10} | (100, 100) | 0.672 | 0.053 | 0.583 | 1 |
| {6 : 10} | (200, 100) | 0.668 | 0.045 | 0.533 | 1 |
| {6 : 10} | (600, 100) | 0.623 | 0.054 | 0.501 | 1 |
| {16 : 25} | (100, 100) | 0.665 | 0.045 | 0.489 | 1 |
| {16 : 25} | (200, 100) | 0.616 | 0.040 | 0.433 | 1 |
| {16 : 25} | (600, 100) | 0.527 | 0.055 | 0.354 | 0.998 |
| {16 : 35} | (100, 100) | 0.524 | 0.044 | 0.460 | 0.949 |
| {16 : 35} | (200, 100) | 0.487 | 0.051 | 0.384 | 0.792 |
| {16 : 35} | (600, 100) | 0.377 | 0.064 | 0.225 | 0.386 |
In practice, one possible question of interest is the sensitivity of the test procedure proposed in the previous sections with respect to model (1). To investigate this, we performed some simulation studies and Table 3 presents the results obtained exactly in the same way as those given in Table 1 except that the failure times were generated from the following Cox’s proportional hazards model
instead of model (1), where all functions and parameters are defined as in model (1). The table leads to conclusions similar to those from Table 1; in particular, the results suggest that the proposed test procedure remains valid and does not seem to be sensitive to departures from model (1). In addition, as with Figure 1, we also obtained the corresponding quantile plots of the two test statistics against the F-distribution and the
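To make the contrast between the two data-generating mechanisms concrete, the following sketch simulates right-censored failure times under either an additive hazards model or a Cox proportional hazards model. It is a simplified illustration and not the simulation design of the paper: the constant baseline hazard, the uniform covariates, the exponential censoring times and all names and default values (`simulate_failure_times`, `baseline`, `cens_rate`) are assumptions.

```python
import math
import random


def simulate_failure_times(n, beta, model="additive", baseline=1.0,
                           cens_rate=0.5, seed=0):
    """Simulate (observed time, event indicator, covariates) triples.

    Assumes a constant baseline hazard, so each subject's failure time
    is exponential with rate
      additive: baseline + beta'z
      cox:      baseline * exp(beta'z)
    Censoring times are independent exponentials with rate cens_rate.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        # Uniform(0, 1) covariates; with non-negative beta this keeps
        # the additive hazard positive.
        z = [rng.random() for _ in beta]
        linear = sum(b * zj for b, zj in zip(beta, z))
        if model == "additive":
            hazard = baseline + linear
        elif model == "cox":
            hazard = baseline * math.exp(linear)
        else:
            raise ValueError("model must be 'additive' or 'cox'")
        if hazard <= 0:
            raise ValueError("hazard must be positive")
        failure = rng.expovariate(hazard)
        censor = rng.expovariate(cens_rate)
        observed = min(failure, censor)
        event = int(failure <= censor)  # 1 = failure observed, 0 = censored
        data.append((observed, event, z))
    return data


# Hypothetical usage: three covariates, the second with no effect.
additive_data = simulate_failure_times(2000, [0.5, 0.0, 1.0], model="additive")
cox_data = simulate_failure_times(2000, [0.5, 0.0, 1.0], model="cox")
censored_fraction = 1 - sum(e for _, e, _ in additive_data) / len(additive_data)
```

In a sketch like this, `cens_rate` would be tuned until `censored_fraction` is close to the target censoring percentage, and the test procedure would then be applied to data generated under the `"cox"` option to probe sensitivity to misspecification.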
5 An application
Now we apply the likelihood ratio test procedure proposed in the previous sections to a kidney cancer data set discussed in Sultmann et al. [18] and Zhong et al. [4]. The data set consists of 74 patients and, in addition to the censored survival times (in months), some genomic and non-genomic information was also observed. Specifically, for each patient, 4,224 microarray gene expression measurements were recorded. For non-genomic factors, in addition to gender and age, the patients were classified into three groups based on their renal cell carcinoma (RCC) type: clear cell (ccRCC), papillary (pRCC) and chromophobe cell (chRCC). One objective of interest is to evaluate the effects of the non-genomic factors on the survival rate given the genomic factors. In the following, by following others, we will focus on the 60 patients whose survival times are either observed or censored.
To perform the analysis, for each patient, we will define
and the main goal is to test the hypothesis
The
| L = | Procedure in Zhong et al. [4] | The proposed procedure |
|---|---|---|
| c(5 : 10) | 0.8732 | 0.5024 |
| c(5 : 20) | 0.8732 | 0.5057 |
| c(5 : 30) | 0.8732 | 0.5014 |
| c(5 : 50) | 0.8732 | 0.4990 |
| c(5 : 100) | 0.8732 | 0.4939 |
Table 4 gives the average
6 Discussion and concluding remarks
In the previous sections, we have considered a group test problem in high-dimensional situations where the response variable of interest is a failure time arising from the additive hazards model. As discussed above, the problem occurs quite often and in many areas, especially in genetic studies or genomics, where the identification of a group of genes significantly related to a certain disease is often of interest. Because of the dimensionality, traditional methods cannot be applied. For this problem, corresponding to the approach given in Zhong et al. [4], a new partition algorithm was presented. Furthermore, a likelihood ratio test procedure was developed, and the numerical studies suggested that the proposed approach performs well in practical situations and gives better performance than that given in Zhong et al. [4].
More research is clearly needed to investigate some issues related to the problem discussed here and to generalize the proposed test procedure to other situations. One is the goodness-of-fit test for model (1). For standard failure time data, or data with low dimensions, some procedures have been developed for checking the appropriateness of the additive hazards model (1), and one typical way is to apply some residual-based statistics [19]. For the high-dimensional situation considered here, however, there does not seem to exist an established procedure for checking model (1) or other commonly used regression models for failure time data. One can define residuals similarly, but the resulting statistics may not have similar properties in high-dimensional situations.
Note that in the preceding sections, the focus has been on the group test, and one may also be interested in estimation of the regression parameters. For this, one can employ some existing penalized methods or develop new methods based on the new partition procedure. In the proposed approach, it has been assumed that the failure time of interest follows the additive hazards model, and in practice this may not be true. In other words, it may be useful to develop similar approaches for situations where the failure time follows other models such as the proportional hazards model or the linear transformation model. However, such generalizations may not be straightforward since under these situations, the relationship (4) may not exist. Finally, the development of some theoretical justification for the proposed test procedure would also be helpful.
Funding statement: Jiang’s work was partly supported by the National Natural Science Foundation of China, Project No. 11471140.
Acknowledgements:
The authors wish to thank the Editor and two reviewers for their helpful comments and suggestions, which greatly improved the paper.
References
1. Foster JC, Liu D, Albert PS, Liu A. Identifying subgroups of enhanced predictive accuracy from longitudinal biomarker data using tree-based approaches: applications to monitoring fetal growth. J R Stat Soc Ser A 2016. doi:10.1111/rssa.12182.
2. Lin HZ, Li Y, Tan M. Estimating a unitary effect summary based on combined survival and quantitative outcomes in clinical trials. Comput Stat Data Anal 2013;66:129–39. doi:10.1016/j.csda.2013.03.028.
3. Liu Z, Bensmail H, Tan M. Efficient feature selection and multiclass classification with integrated instance and model based learning. Evol Bioinf 2012;8:1–10. doi:10.4137/EBO.S9407.
4. Zhong PS, Hu T, Li J. Tests for coefficients in high-dimensional additive hazard models. Scand J Stat 2015;42:649–64. doi:10.1111/sjos.12127.
5. Kalbfleisch JD, Prentice RL. The statistical analysis of failure time data, 2nd ed. New York: John Wiley, 2002. doi:10.1002/9781118032985.
6. Lin DY, Ying Z. Semiparametric analysis of the additive risk model. Biometrika 1994;81:61–71. doi:10.1093/biomet/81.1.61.
7. Gaiffas S, Guilloux A. High-dimensional additive hazards models and the Lasso. Electron J Stat 2012;6:522–46. doi:10.1214/12-EJS681.
8. Lin W, Lv J. High-dimensional sparse additive hazards regression. J Am Stat Assoc 2013;108:247–64. doi:10.1080/01621459.2012.746068.
9. Martinussen T, Scheike TH. Covariate selection for the semiparametric additive risk model. Scand J Stat 2009a;36:602–19. doi:10.1111/j.1467-9469.2009.00650.x.
10. Martinussen T, Scheike TH. The additive hazards model with high-dimensional regressors. Lifetime Data Anal 2009b;15:330–42. doi:10.1007/s10985-009-9111-y.
11. Goeman J, Geer VD, Houwelingen V. Testing against a high-dimensional alternative. J R Stat Soc Ser B 2006;68:477–93. doi:10.1111/j.1467-9868.2006.00551.x.
12. Zhong PS, Chen SX. Tests for high dimensional regression coefficients with factorial designs. J Am Stat Assoc 2011;106:260–74. doi:10.1198/jasa.2011.tm10284.
13. Goeman J, Houwelingen V, Finos L. Testing against a high dimensional alternative in the generalized linear model: asymptotic type I error control. Biometrika 2011;98:381–90. doi:10.1093/biomet/asr016.
14. Goeman J, Oosting J, Cleton-Jansen AM, Anninga JK, van Houwelingen HC. Testing association of a pathway with survival using gene expression data. Bioinformatics 2005;21:1950–7. doi:10.1093/bioinformatics/bti267.
15. Lan W, Wang H, Tsai CL. Testing covariates in high dimensional regression. Ann Inst Stat Math 2014;66:279–301. doi:10.1007/s10463-013-0414-0.
16. Wang S, Cui H. Generalized F-test for high dimensional linear regression coefficients. J Multivariate Anal 2013;117:134–49. doi:10.1016/j.jmva.2013.02.010.
17. Anderson TW. An introduction to multivariate statistical analysis, 2nd ed. New Jersey: John Wiley & Sons, 2003.
18. Sultmann H, Heydebreck A, Huber W, Kuner R, Bune A, Vogt M, et al. Gene expression in kidney cancer is associated with cytogenetic abnormalities, metastasis formation, and patient survival. Clin Cancer Res 2005;11:646–55. doi:10.1158/1078-0432.646.11.2.
19. Klein JP, Moeschberger ML. Survival analysis: Techniques for censored and truncated data, 2nd ed. New York: Springer, 2003. doi:10.1007/b97377.
© 2017 Walter de Gruyter GmbH, Berlin/Boston