
Group Tests for High-dimensional Failure Time Data with the Additive Hazards Models

  • Dandan Jiang and Jianguo Sun
Published/Copyright: May 9, 2017

Abstract

Statistical analysis of high-dimensional data has been attracting more and more attention due to the abundance of such data in various fields, such as genetic studies and genomics, and the many interesting questions it raises. Among them is the identification of a gene or genes that have significant effects on, or are significantly related to, the occurrence of a certain disease. In this paper, we discuss such a problem, which can be formulated as a group test, or a test of a group of variables or coefficients, when one faces a right-censored failure time response variable. For the problem, we develop a corrected variance reduced partial profiling (CVRPP) linear regression model and a likelihood ratio test procedure for the case where the failure time of interest follows the additive hazards model. A numerical study suggests that the proposed method works well in practical situations and performs better than the existing one. An illustrative example is provided.

1 Introduction

Statistical analysis of high-dimensional data has been attracting more and more attention due to the abundance of such data in various fields, such as genetic studies and genomics, and the many interesting questions it raises. Among them is the identification of a gene or genes that have significant effects on, or are significantly related to, the occurrence of a certain disease, for the purpose of, among other things, predicting survival rates [1, 2, 3, 4]. In this paper, we discuss such a problem, which can be formulated as a group test, or a test of a group of predictor variables or coefficients such as genes or genomic factors, when one faces a right-censored failure time response variable. By high-dimensional, we usually mean that the number of predictor variables, denoted by p, is much larger than the sample size n, and it is well known that traditional statistical methods cannot be applied in these situations. In other words, some new procedures that allow $p \gg n$ are required.

Let T denote a failure time variable of interest and Z a p-dimensional vector of covariates or predictor variables. For regression analysis of failure time data, one commonly used model is the additive hazards model (AHM) given in the form of the hazard function of T as

(1)  $\lambda(t \mid Z) = \lambda_0(t) + \sum_{j=1}^{p} \beta_{0j} Z_j(t)$

given Z [5, 6]. In the above, $\lambda_0(t)$ is an unknown baseline hazard function, $\beta_0 = (\beta_{01}, \ldots, \beta_{0p})'$ is a p-dimensional vector of unknown regression coefficients or parameters, and $Z = (Z_1, \ldots, Z_p)'$. Note that in contrast to other models, model (1) describes additive covariate effects, the type of effects that are often of more interest in many areas such as the social sciences [6]. Of course, one may want to consider other models, such as linear transformation models, if other types of effects are of interest, and more comments on this can be found below. Also, many authors have discussed the model above for both the traditional situation with a fixed p and the high-dimensional situation, concerning various topics of interest. However, there seems to be little literature on testing the hypothesis
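To fix ideas, the following sketch (not from the paper) simulates failure times from model (1) in the simplest special case of a constant baseline hazard and time-constant covariates, so that the total hazard is constant in t and T is exponential. The function name and the truncation at a small positive rate are our own illustrative choices.

```python
import numpy as np

def simulate_additive_hazards(Z, beta, lam0, rng):
    """Draw failure times under model (1) in the special case of a
    constant baseline hazard lam0 and time-constant covariates: the
    total hazard lam0 + Z @ beta is then constant in t, so each T is
    exponential with that rate. Illustrative sketch only."""
    rate = lam0 + Z @ beta
    # The additive form does not by itself guarantee a positive hazard;
    # this toy version simply truncates at a small positive value.
    rate = np.clip(rate, 1e-8, None)
    return rng.exponential(1.0 / rate)

rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 3))
beta = np.array([0.5, 0.0, 0.2])
T = simulate_additive_hazards(Z, beta, lam0=1.0, rng=rng)
```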

(2)  $H_0: \beta_{0,d} = 0 \quad \text{vs.} \quad H_1: \beta_{0,d} \neq 0,$

or the effects of a set of the predictor variables, for the high-dimensional situation. In the above, $d = \{d_1, \ldots, d_q\}$ is a subset of $\{1, \ldots, p\}$ and $\beta_{0,d} = (\beta_{0d_1}, \ldots, \beta_{0d_q})'$ denotes a q-dimensional sub-vector of $\beta_0$. In genomics, among other fields, it is often the case that a group of genes, rather than a single gene, may be significantly related to or responsible for a certain disease, and thus it is of interest to perform a group test of a hypothesis like $H_0$ above.

As mentioned above, many authors have discussed the analysis of high-dimensional data, or more specifically the analysis of high-dimensional data in the failure time context, including tests of hypotheses similar to $H_0$. For example, [7, 8, 9, 10] investigated the parameter estimation problem related to model (1), and [11] and [12] considered a hypothesis test problem similar to that above but with $q = p$ in the context of linear regression models. In addition, [13, 14], [15] and [16] studied similar testing problems in the context of failure time analysis and generalized linear models. Note that all of the methods above concern testing the whole set of the predictor variables or coefficients simultaneously and cannot be applied to the test of $H_0$, or of a subset of the coefficients, as they cannot provide a valid p-value. More recently, Zhong et al. [4] discussed the test problem regarding a single coefficient ($q = 1$) and developed a variance reduced partial profiling (VRPP) linear regression model for the derivation of their test statistic. Furthermore, they briefly considered the testing of the hypothesis $H_0$ and generalized the proposed test statistic to the $q > 1$ case. However, as discussed below, the generalized test procedure may not work properly, even for $q = 2$, when the predictor variables or covariates within the set d are highly correlated.

In the following, to test the hypothesis H0, we will develop a corrected variance reduced partial profiling (CVRPP) linear regression model and present a new test statistic. Of course, the proposed test procedure applies to the q=1 situation too. To present the proposed test statistic, we will first define some notation and review the VRPP linear regression model and the test statistic given in Zhong et al. [4] in Section 2. Section 3 discusses the proposed CVRPP model and the resulting likelihood ratio test along with its implementation.

In Section 4, we will present some results obtained from a simulation study conducted to evaluate the performance of the test procedure and they suggest that it works well in practical situations. An illustration is given in Section 5 and Section 6 provides some discussion and concluding remarks.

2 Notation and review

Consider a failure time study and let T and Z be defined as above. Suppose that T follows model (1) and that the main objective is to test the hypothesis $H_0$ defined in eq. (2) for a given d. Also suppose that there exists a right censoring time, denoted by C, and that the observed data have the form $\{X_i, \Delta_i, \{Z_{ij}(t)\}_{j=1}^{p};\ i = 1, \ldots, n\}$ from n independent subjects, where $X_i = \min(T_i, C_i)$ and $\Delta_i = I(T_i \le C_i)$, with $T_i$, $C_i$ and $Z_i$ defined as T, C and Z but with respect to subject i. As mentioned above, the focus will be on the situation where p is greater than n, and we will assume that the censoring time C is independent of T given the covariates $Z(t) = (Z_{ij}(t))_{n \times p}$, which will be assumed to be centralized such that $E(Z_{ij}(t)) = 0$.

Furthermore, let $\Sigma(t) = (\sigma_{jk})_{j,k=1}^{p}$ denote the $p \times p$ covariance matrix of Z, and define the counting process $N_i(t) = I(X_i \le t, \Delta_i = 1)$, the risk indicator $Y_i(t) = I(X_i \ge t)$, and the martingale

(3)  $M_i(t) = N_i(t) - \int_0^t Y_i(u)\Big\{\lambda_0(u) + \sum_{j=1}^{p} \beta_{0j} Z_{ij}(u)\Big\}\,du, \quad i = 1, \ldots, n.$

Also define $Y(t) = (Y_1(t), \ldots, Y_n(t))'$, $\bar{Y}(t) = Y(t)\,(\mathbf{1}'Y(t))^{-1}$, which is an $n \times 1$ vector, and $dA(t) = (dA_1(t), \ldots, dA_n(t))'$ for a vector $A(t) = (A_1(t), \ldots, A_n(t))'$. Then several authors have shown that we have the equation

(4)  $(I_n - \bar{Y}(t)\mathbf{1}')\,dN(t) = \tilde{Z}(t)\beta_0\,dt + (I_n - \bar{Y}(t)\mathbf{1}')\,dM(t)$

(see Martinussen and Scheike [9, 10]), where $\tilde{Z}(t) = (\tilde{Z}_1(t), \ldots, \tilde{Z}_p(t)) = (I_n - \bar{Y}(t)\mathbf{1}')\,\mathrm{diag}\{Y(t)\}\,Z(t)$, an $n \times p$ matrix. Note that the equation above has two important features. One is that the first term on the right-hand side is linear in the regression parameter $\beta_0$, and the other is that the second term on the right-hand side involves a martingale difference and thus can be treated as a random error. In other words, one can treat eq. (4) as a linear regression model when making inference.

For a subset $A \subset \{1, \ldots, p\}$, let $\tilde{Z}_A(t)$ denote the $n \times |A|$ submatrix of $\tilde{Z}(t)$ consisting of the columns corresponding to A, with |A| denoting the cardinality of A, and for the time being, assume that $d = \{1\}$. That is, for notational convenience, we are only interested in the single regression parameter $\beta_{0,1}$. To test $H_0$ in this case, Zhong et al. [4] suggested first estimating $\beta_{0,1}$ using eq. (4) and then applying the resulting Wald test statistic. For this, note that if $\tilde{Z}_1(t)$ were perpendicular to all other $\tilde{Z}_j(t)$ for $j > 1$, then one could easily estimate $\beta_{0,1}$ by multiplying both sides of eq. (4) by $\tilde{Z}_1'(t)$. On the other hand, this is clearly unlikely to hold in practice. To address this, Zhong et al. [4] proposed a relaxed orthogonalization that divides the remaining indices $\{2, \ldots, p\}$ into two parts, $S_1$ and $S_1^c$, based on the measure $Q_{kj} = E\{\int_0^\tau \tilde{Z}_{ik}(t)\tilde{Z}_{ij}(t)\,dt\}$ for any $k, j \in \{1, \ldots, p\}$, where $S_1 = \{l \in \{2, \ldots, p\} : |Q_{1l}| > \eta\}$ for a pre-specified constant $\eta$ and $S_1^c$ denotes the complement of $\{1\} \cup S_1$.
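The thresholding rule that defines $S_1$ can be sketched as follows, assuming the matrix Q has already been estimated and using 0-based column indices; the function name and the toy matrix are hypothetical.

```python
import numpy as np

def partition_single(Q, eta):
    """Split the non-target indices {2,...,p} (0-based: 1,...,p-1) into
    S1 = {l : |Q[0, l]| > eta} and its complement, with the target
    variable in column 0. Q is the estimated p x p matrix of integrated
    cross-products; names here are illustrative."""
    p = Q.shape[0]
    S1 = [l for l in range(1, p) if abs(Q[0, l]) > eta]
    S1c = [l for l in range(1, p) if abs(Q[0, l]) <= eta]
    return S1, S1c

Q = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
S1, S1c = partition_single(Q, eta=0.5)  # → S1 = [1], S1c = [2]
```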

Given $S_1$, for the estimation of $\beta_{0,1}$, Zhong et al. [4] suggested partially profiling out the effect of $\tilde{Z}_{S_1}$ by multiplying both sides of eq. (4) by $I_n - P_{\tilde{Z}_{S_1}}$, which gives

(5)  $(I_n - P_{\tilde{Z}_{S_1}})(I_n - \bar{Y}(t)\mathbf{1}')\,dN(t) = (I_n - P_{\tilde{Z}_{S_1}})\tilde{Z}_{\{1\}}(t)\beta_{0,1}\,dt + (I_n - P_{\tilde{Z}_{S_1}})\tilde{Z}_{S_1^c}(t)\beta_{0,S_1^c}\,dt + (I_n - P_{\tilde{Z}_{S_1}})(I_n - \bar{Y}(t)\mathbf{1}')\,dM(t),$

where $P_{\tilde{Z}_{S_1}} = \tilde{Z}_{S_1}(\tilde{Z}_{S_1}'\tilde{Z}_{S_1})^{-1}\tilde{Z}_{S_1}'$ is the projection matrix corresponding to $\tilde{Z}_{S_1}$. Furthermore, they proposed to replace $\beta_{0,S_1^c}$ in the second term on the right-hand side by some reasonable initial estimator $\hat{\beta}_{0,S_1^c}$ and to move that term to the left-hand side; they call the resulting equation the VRPP linear regression model. By applying the linear regression idea described above to the VRPP model and integrating, Zhong et al. [4] obtained the following variance reduced partial profiling estimator (VRPPE) of $\beta_{0,1}$:

(6)  $\hat{\beta}_{0,1} = \Big(\int_0^\tau \tilde{Z}_1'(t)(I_n - P_{\tilde{Z}_{S_1}})\tilde{Z}_1(t)\,dt\Big)^{-1}\Big(\int_0^\tau \tilde{Z}_1'(t)(I_n - P_{\tilde{Z}_{S_1}})\,dN(t) - \int_0^\tau \tilde{Z}_1'(t)(I_n - P_{\tilde{Z}_{S_1}})\tilde{Z}_{S_1^c}(t)\,dt\ \hat{\beta}_{0,S_1^c}\Big).$

In addition, they suggested estimating its asymptotic variance by $\hat{\sigma}_{\beta_1}^2/n$ with

$\hat{\sigma}_{\beta_1}^2 = \Big(n^{-1}\int_0^\tau \tilde{Z}_1'(t)(I_n - P_{\tilde{Z}_{S_1}})\tilde{Z}_1(t)\,dt\Big)^{-2}\, n^{-1}\int_0^\tau \tilde{Z}_1'(t)(I_n - P_{\tilde{Z}_{S_1}})\tilde{Z}_1(t)\,dN(t).$

It follows that one can test $H_0$ using the Wald statistic based on $\hat{\beta}_{0,1}$, which was shown to asymptotically follow a normal distribution.
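The partial profiling used above rests on the residual projection $I_n - P_{\tilde{Z}_{S_1}}$, which annihilates the columns of $\tilde{Z}_{S_1}$. A minimal numerical sketch, treating $\tilde{Z}_{S_1}$ as a plain full-column-rank matrix and using a hypothetical helper name:

```python
import numpy as np

def residual_projection(Z_S1):
    """Return I_n - P for the column space of Z_S1 (assumed full column
    rank); multiplying a vector by this matrix profiles out the S1
    columns, since (I - P) @ Z_S1 = 0."""
    n = Z_S1.shape[0]
    P = Z_S1 @ np.linalg.solve(Z_S1.T @ Z_S1, Z_S1.T)
    return np.eye(n) - P

rng = np.random.default_rng(0)
Z_S1 = rng.normal(size=(6, 2))
M = residual_projection(Z_S1)
# M @ Z_S1 is numerically zero, and M is symmetric and idempotent
```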

For a general d with $q \ll \min(p, n)$, by replacing $S_1$ with $S_d = \cup_{j=1}^{q} S_{d_j}$, Zhong et al. [4] suggested constructing a Wald test statistic similarly to the above, which was shown to asymptotically follow the $\chi^2$ distribution, where $S_{d_j}$ is defined exactly as $S_1$ but with respect to the jth component of the set d. However, a couple of serious issues can occur with this test procedure. One is that, in general, the sets d and $S_d$ may share some elements, and the number of shared elements will increase with the dimensionality p or with high correlation among the covariates. This means that the parameter estimation throws away a great deal of relevant information and can yield serious errors. This can be even more severe if the $\tilde{Z}_j$'s corresponding to the set d are highly correlated, and it can be seen below that this can happen even with $q = 2$. Another issue is that the convergence to the $\chi^2$ distribution can be quite slow and thus cannot be relied on in practical situations. In the next section, we will address these issues and propose a new test procedure.

3 A likelihood ratio test procedure

Now we will present a new test procedure based on the likelihood ratio principle. For this, we will first present a different partition procedure for {d,Sd,Sdc} given a set d and develop a CVRPP linear regression model. The proposed new test statistic for H0 will then be derived.

To describe the new partition procedure, let $Q_R$ denote the correlation matrix given by the covariance matrix $Q = (Q_{kj})_{k,j=1}^{p}$. Then define the index matrix $Q_{ord}$ such that its jth column is the permutation of $(1, \ldots, p)$ determined by sorting the correlations between $\tilde{Z}_j$ and all the $\tilde{Z}_m$'s, the jth column of $Q_R$, in decreasing order. It is apparent that the first row of $Q_{ord}$ will be $(1, \ldots, p)$. Now, for a given subset $d \subset \{1, \ldots, p\}$ and an appropriately chosen integer $l \in \{2, \ldots, p\}$, to be discussed below, define the new subset or partition $S_{d,l}$ to include all of the distinct indices in the submatrix $Q_{ord}[2{:}l, d]$ minus the elements of d. For notational convenience, we will continue to use $S_d$ to denote $S_{d,l}$ and, as before, define $S_d^c$ as the complement of $\{d, S_d\}$.
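As a sketch, the new partition can be computed directly from the correlation matrix; the function below is our own illustration (0-based indices, no tie-handling), not code from the paper.

```python
import numpy as np

def build_Sd(QR, d, l):
    """Form S_d from the correlation matrix QR (a p x p numpy array,
    0-based indices): for each j in d, take the entries of column j of
    Q_ord in rows 2,...,l (the l - 1 variables most correlated with j
    after j itself), pool them over j, and remove the members of d."""
    # Q_ord: each column sorted by |correlation| in decreasing order;
    # its first row is the variable itself since |corr| = 1 there
    order = np.argsort(-np.abs(QR), axis=0)
    Sd = set()
    for j in d:
        Sd.update(int(k) for k in order[1:l, j])
    return sorted(Sd - set(d))

QR = np.array([[1.0, 0.9, 0.1, 0.3],
               [0.9, 1.0, 0.2, 0.4],
               [0.1, 0.2, 1.0, 0.8],
               [0.3, 0.4, 0.8, 1.0]])
# d = {0, 2}, l = 2: each column contributes its strongest neighbour
print(build_Sd(QR, [0, 2], 2))  # → [1, 3]
```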

For the development of the CVRPP linear regression model based on the new partition procedure, for each $i = 1, \ldots, n$, or at the times $t = X_1, \ldots, X_n$, define

$U_i = (I_n - P_{\tilde{Z}_{S_d}(X_i)})(I_n - \bar{Y}(X_i)\mathbf{1}')\,\Delta N(X_i) - (I_n - P_{\tilde{Z}_{S_d}(X_i)})\tilde{Z}_{S_d^c}(X_i)\hat{\beta}_{0,S_d^c}\,\Delta X_i,$

$V_i = (I_n - P_{\tilde{Z}_{S_d}(X_i)})\tilde{Z}_d(X_i)\,\Delta X_i,$

and

$\epsilon_i = (I_n - P_{\tilde{Z}_{S_d}(X_i)})(I_n - \bar{Y}(X_i)\mathbf{1}')\,\Delta M(X_i).$

Then, as with the VRPP linear regression model (2.10) in Zhong et al. [4], we have

(7)  $U_i = V_i\beta_{0,d} + \epsilon_i, \quad i = 1, \ldots, n.$

Note that $\epsilon_i$ has mean zero and serves as a random error similar to that in a classical linear regression model. To further reduce the variance, or simplify the model above, let $\hat{\beta}_{0,d}$ denote the estimator of $\beta_{0,d}$ given by eq. (6) with the new partition, and define $\hat{\epsilon}_i = U_i - V_i\hat{\beta}_{0,d}$. It is apparent that we can rewrite eq. (7) as

(8)  $U_i = V_i\beta_{0,d} + \hat{\epsilon}_i + \epsilon_i^{*}, \quad i = 1, \ldots, n,$

where $\epsilon_i^{*} = \epsilon_i - \hat{\epsilon}_i$.

Note that one can naturally estimate the covariance matrix of $\epsilon_i^{*}$ by $\hat{\Sigma}_{\epsilon_i} = V_i\hat{\Sigma}_{\beta_{0,d}}V_i'$, where $\hat{\Sigma}_{\beta_{0,d}} = A^{-1}BA^{-1}$ with $A = \int_0^\tau \tilde{Z}_d'(t)(I - P_{\tilde{Z}_{S_d}})\tilde{Z}_d(t)\,dt$ and $B = \int_0^\tau \tilde{Z}_d'(t)(I - P_{\tilde{Z}_{S_d}})\tilde{Z}_d(t)\,dN(t)$. Define $D_i = \mathrm{diag}(\mathrm{diag}(\hat{\Sigma}_{\epsilon_i}))$, the diagonal matrix given by the variances of the components of $\epsilon_i^{*}$. By multiplying both sides of eq. (8) by $D_i^{-1/2}$, we obtain

(9)  $D_i^{-1/2}U_i = D_i^{-1/2}V_i\beta_{0,d} + D_i^{-1/2}\hat{\epsilon}_i + D_i^{-1/2}\epsilon_i^{*}.$

Note that each component of $D_i^{-1/2}\epsilon_i^{*}$ has mean 0 and variance 1. This suggests that one can simplify the equation above by replacing $D_i^{-1/2}\epsilon_i^{*}$ with a random error $\tilde{\epsilon}_i$ satisfying $E(\tilde{\epsilon}_i) = 0_n$ and $E(\tilde{\epsilon}_i\tilde{\epsilon}_i') = I_n$. In other words, instead of model (9), we can consider the CVRPP linear regression model

(10)  $\tilde{U}_i = D_i^{-1/2}V_i\beta_{0,d} + D_i^{-1/2}\hat{\epsilon}_i + \tilde{\epsilon}_i, \quad i = 1, \ldots, n,$

for testing the hypothesis $H_0$ in eq. (2), where $\tilde{U}_i = D_i^{-1/2}U_i$ and, for simplicity, the $\tilde{\epsilon}_i$'s are generated from the multivariate standard normal distribution $N(0_n, I_n)$.
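The rescaling that turns eq. (8) into model (10) can be sketched as follows, assuming the plug-in covariance has already been computed; the names are hypothetical and the inputs are plain numpy arrays.

```python
import numpy as np

def standardize_cvrpp(U, V, Sigma_eps, rng):
    """Rescale U_i and V_i from eq. (8) by D_i^{-1/2}, where D_i keeps
    only the diagonal of the plug-in covariance Sigma_eps, and draw the
    standard normal noise eps-tilde used in model (10). Sketch only."""
    d = np.sqrt(np.diag(Sigma_eps))           # componentwise standard deviations
    U_t = U / d                                # D^{-1/2} U_i
    V_t = V / d[:, None]                       # D^{-1/2} V_i
    eps_t = rng.standard_normal(U.shape[0])    # one draw of the eps-tilde noise
    return U_t, V_t, eps_t

Sigma = np.diag([4.0, 9.0])                    # toy plug-in covariance
U = np.array([2.0, 3.0])
V = np.array([[2.0], [3.0]])
U_t, V_t, eps_t = standardize_cvrpp(U, V, Sigma, np.random.default_rng(0))
# U_t → [1., 1.]
```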

Under model (10) and for each i, a natural test statistic for $H_0$ is given by the likelihood ratio statistic $\Lambda_n(i) = |\Sigma_\Omega(i)| / |\Sigma_\omega(i)|$, where

$\Sigma_\Omega(i) = \frac{1}{n}\big(\tilde{U}_i - D_i^{-1/2}\hat{\epsilon}_i - D_i^{-1/2}V_i\hat{\beta}_{0,d}(i)\big)'\big(\tilde{U}_i - D_i^{-1/2}\hat{\epsilon}_i - D_i^{-1/2}V_i\hat{\beta}_{0,d}(i)\big)$

and

$\Sigma_\omega(i) = \frac{1}{n}\big(\tilde{U}_i - D_i^{-1/2}\hat{\epsilon}_i\big)'\big(\tilde{U}_i - D_i^{-1/2}\hat{\epsilon}_i\big),$

the estimators of the covariance of $\tilde{\epsilon}_i$ under the alternative and null hypotheses, respectively, and $\hat{\beta}_{0,d}(i) = (V_i'D_i^{-1}V_i)^{-1}V_i'D_i^{-1/2}(\tilde{U}_i - D_i^{-1/2}\hat{\epsilon}_i)$, the estimator of $\beta_{0,d}$ based on the CVRPP model. The critical region based on $\Lambda_n(i)$ is $\{\Lambda_n(i) < \lambda_0(i)\}$, where $\lambda_0(i)$ is a suitably chosen number. In reality, for testing $H_0$, it would be natural to combine all of the $\Lambda_n(i)$'s, for a couple of reasons. One is that it is not easy to find $\lambda_0(i)$; one could, of course, apply some resampling method, but this would clearly be computationally time-consuming. Another, more important, reason is that one would have to combine all of the individual testing results, which may not be straightforward. On the other hand, if we assume $X_1 = \min(X_1, \ldots, X_n)$ without loss of generality, one can see by studying the $U_i$'s that one only needs to focus on eq. (10) with $i = 1$, or the statistic $\Lambda_n(1)$. Furthermore, it follows from Theorem 8.4.5 in Anderson [17] that one can test the hypothesis $H_0$ by using the likelihood-based statistic

(11)  $T_F = \frac{(n - q)\,(1 - \Lambda_n(1))}{q\,\Lambda_n(1)},$

whose distribution can be approximated by the F-distribution with q and $n - q$ degrees of freedom. Consequently, the critical region $\{\Lambda_n(1) < \lambda_0(1)\}$ is equivalent to $\{T_F > F_\alpha(q, n-q)\}$, where $F_\alpha(q, n-q)$ is the upper $\alpha$-quantile of the F-distribution $F(q, n-q)$.
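Given a value of $\Lambda_n(1)$, mapping it to $T_F$ of eq. (11) and to an $F(q, n-q)$ p-value is immediate; the sketch below uses scipy for the F quantiles, with a hypothetical function name.

```python
import numpy as np
from scipy import stats

def f_test(Lambda_n, n, q, alpha=0.05):
    """Compute T_F of eq. (11) from the likelihood ratio Lambda_n(1)
    and compare it with the alpha-upper quantile of F(q, n - q)."""
    TF = (n - q) * (1.0 - Lambda_n) / (q * Lambda_n)
    crit = stats.f.ppf(1.0 - alpha, q, n - q)   # F_alpha(q, n - q)
    pval = stats.f.sf(TF, q, n - q)
    return TF, crit, pval, TF > crit

TF, crit, pval, reject = f_test(0.8, n=100, q=2)
# TF = 98 * 0.2 / (2 * 0.8) = 12.25, well beyond the 5% critical value
```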

Note that to implement the likelihood ratio test procedure above, one needs to choose l in the determination of $Q_{ord}$. For this, following Zhong et al. [4], we suggest first selecting a subset $L \subset \{2, \ldots, p\}$ and then searching for the optimal l over the range of L. More specifically, the subset L should be chosen such that both $S_d$ and $S_d^c$ exist and the dimension of $S_d$ is less than 30% of the whole set. For a given L, one can choose the smallest l within L such that $D_n(S_d, l) \le 5\sqrt{\log p / n}$, where

$D_n(S_d, l) = \max_{j \in S_{d,l}^c}\ \Big\|\Big(\int_0^\tau \tilde{Z}_d'(t)(I - P_{\tilde{Z}_{S_{d,l}}})\tilde{Z}_d(t)\,dt\Big)^{-1}\int_0^\tau \tilde{Z}_d'(t)(I - P_{\tilde{Z}_{S_{d,l}}})\tilde{Z}_j(t)\,dt\Big\|.$

If no such l exists, one can use the l that minimizes $D_n(S_d, l)$ within L.
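A sketch of this selection rule, assuming the $D_n(S_d, l)$ values have already been computed for each candidate l and taking the threshold as $5\sqrt{\log p / n}$; the function name is hypothetical.

```python
import numpy as np

def choose_l(Dn, L, p, n):
    """Pick the smallest l in the candidate set L whose discrepancy
    D_n(S_d, l) is at most 5 * sqrt(log(p) / n); if none qualifies,
    fall back to the minimizer over L. Dn maps l -> D_n(S_d, l)."""
    thresh = 5.0 * np.sqrt(np.log(p) / n)
    ok = [l for l in sorted(L) if Dn[l] <= thresh]
    return ok[0] if ok else min(L, key=lambda l: Dn[l])

# with p = 600, n = 100 the threshold is about 1.26, so l = 5 qualifies
print(choose_l({5: 0.9, 6: 0.4, 7: 0.2}, [5, 6, 7], p=600, n=100))  # → 5
```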

Two other issues related to the implementation of the proposed test procedure are the determination of $\hat{\beta}_{0,S_1^c}$ in eq. (6) and the generation of the $\tilde{\epsilon}_i$'s in the CVRPP model (10). For the former, Zhong et al. [4] pointed out that any initial estimator satisfying $|\hat{\beta}_{0,S_1^c} - \beta_{0,S_1^c}| = o_p(\sqrt{\log p / n})$ can be used, and suggested a LASSO-type estimator for its fast computation. For the latter, it is apparent that the test result may depend on the values of the generated $\tilde{\epsilon}_i$'s. To address this, for a given data set, we suggest repeating the process many times, as discussed in the example below. For the simulation study in the next section, only one sample will be used.

4 A simulation study

In this section, we present some results obtained from a simulation study conducted to assess the performance of the likelihood ratio test procedure proposed in the previous sections. In the study, following Zhong et al. [4], we generated the true failure times $T_i$ based on model (1) with $\beta_0 = (2, 1, 0.5, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0.8, 1.2, 1, 0_{p-15})'$ and assumed that the $Z_i$'s follow the multivariate normal distribution with mean 0 and covariance matrix $\Sigma = (\rho^{|i-j|})$ for $1 \le i, j \le p$, subject to the constraint $Z_i'\beta_0 > 1$, where $0_{p-15}$ is a vector of $p - 15$ zeros. Furthermore, the censoring times $C_i$ were generated from the uniform distribution over $(0, c_0)$, where $c_0$ was chosen to give the required percentage of right-censored failure times. In addition to the proposed test statistic $T_F$, for comparison, we also considered the $\chi^2$ test statistic given in Zhong et al. [4], denoted by $T_Z$ below, and obtained the corresponding test results for each situation considered. The results given below are based on 1,000 replications.
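As an illustration of the covariate design only, the covariance $\Sigma = (\rho^{|i-j|})$ and the mean-zero normal covariates can be generated as follows; the constraint on $Z_i'\beta_0$ and the censoring step are omitted from this sketch, and the function names are our own.

```python
import numpy as np

rng = np.random.default_rng(2016)

def ar1_cov(p, rho):
    """The AR(1)-type covariance Sigma = (rho^{|i-j|}) of the set-up."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def gen_covariates(n, p, rho):
    """Draw n mean-zero multivariate normal covariate vectors."""
    return rng.multivariate_normal(np.zeros(p), ar1_cov(p, rho), size=n)

Z = gen_covariates(n=100, p=20, rho=0.6)  # Z.shape == (100, 20)
```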

Tables 1 and 2 give the estimated sizes and empirical powers of the test procedures based on $T_Z$ and $T_F$ for testing the hypothesis $H_0: \beta_{0,d} = 0$ with the dimension of d being $q = 1, 2, 5, 10$ or 20, $p = 100, 200$ or 600, and $n = 100$. Here we took $\rho = 0.6$ and $L = \{5, \ldots, 10\}$, and for the power estimation, we set $\beta_{0,d} = 1_q$, the q-dimensional vector with all components equal to 1. Table 1 corresponds to the situation with 25% right-censored failure times ($c_0 = 2$), while Table 2 considers the situation with 35% right-censored failure times ($c_0 = 1$). One can see from the tables that with $q = 1$, both test procedures seem to attain the nominal size of 5%, but the proposed likelihood ratio test procedure was clearly more powerful than that based on $T_Z$. For the cases with $q > 1$, it is apparent that the test procedure based on $T_Z$ cannot be applied, while the new procedure based on $T_F$ still attained the nominal size of 5%. Furthermore, the new procedure still seems to have good power, although the power appears to depend on both p and n, as expected. We also investigated other set-ups with different values of $\rho$, different sets L and other percentages of right censoring, as well as different values of p and n, and obtained similar results.

Table 1

Empirical sizes and powers of the test procedure given in Zhong et al. [4] and the proposed one with 25% right censoring percentage.

| d | (p, n) | Size, T_Z | Size, T_F | Power, T_Z | Power, T_F |
|---|---|---|---|---|---|
| {5} | (100, 100) | 0.051 | 0.040 | 0.176 | 1 |
| | (200, 100) | 0.056 | 0.041 | 0.141 | 1 |
| | (600, 100) | 0.046 | 0.053 | 0.108 | 1 |
| {5, 6} | (100, 100) | 0.128 | 0.043 | 0.124 | 1 |
| | (200, 100) | 0.117 | 0.047 | 0.139 | 1 |
| | (600, 100) | 0.133 | 0.049 | 0.133 | 1 |
| {6 : 10} | (100, 100) | 0.321 | 0.041 | 0.243 | 1 |
| | (200, 100) | 0.309 | 0.053 | 0.207 | 1 |
| | (600, 100) | 0.261 | 0.055 | 0.180 | 1 |
| {16 : 25} | (100, 100) | 0.403 | 0.047 | 0.342 | 1 |
| | (200, 100) | 0.423 | 0.053 | 0.324 | 1 |
| | (600, 100) | 0.375 | 0.049 | 0.233 | 1 |
| {16 : 35} | (100, 100) | 0.464 | 0.050 | 0.366 | 0.997 |
| | (200, 100) | 0.458 | 0.059 | 0.329 | 0.974 |
| | (600, 100) | 0.360 | 0.048 | 0.202 | 0.868 |
Table 2

Empirical sizes and powers of the test procedure given in Zhong et al. [4] and the proposed one with 35% right censoring percentage.

| d | (p, n) | Size, T_Z | Size, T_F | Power, T_Z | Power, T_F |
|---|---|---|---|---|---|
| {5} | (100, 100) | 0.058 | 0.052 | 0.142 | 1 |
| | (200, 100) | 0.051 | 0.056 | 0.116 | 1 |
| | (600, 100) | 0.044 | 0.056 | 0.081 | 1 |
| {5, 6} | (100, 100) | 0.155 | 0.051 | 0.126 | 1 |
| | (200, 100) | 0.119 | 0.055 | 0.111 | 1 |
| | (600, 100) | 0.116 | 0.057 | 0.131 | 1 |
| {6 : 10} | (100, 100) | 0.303 | 0.062 | 0.293 | 1 |
| | (200, 100) | 0.286 | 0.058 | 0.272 | 1 |
| | (600, 100) | 0.295 | 0.046 | 0.218 | 1 |
| {16 : 25} | (100, 100) | 0.437 | 0.052 | 0.356 | 1 |
| | (200, 100) | 0.439 | 0.039 | 0.371 | 1 |
| | (600, 100) | 0.423 | 0.050 | 0.289 | 1 |
| {16 : 35} | (100, 100) | 0.514 | 0.049 | 0.443 | 0.998 |
| | (200, 100) | 0.486 | 0.055 | 0.433 | 0.981 |
| | (600, 100) | 0.426 | 0.061 | 0.252 | 0.907 |
Figure 1

Q-Q plots for the proposed test statistic TF and the test statistic TZ given in Zhong et al. [4].

To further assess the performance of the two test statistics $T_F$ and $T_Z$, we also obtained quantile plots of the two test statistics against the F-distribution and the $\chi^2$-distribution, respectively, with the proper degrees of freedom. Figure 1 presents some representative plots based on the simulated data with $p = 600$, $n = 100$ and 25% right-censored data. In the figure, the plots on the left correspond to $T_F$ and the F-distribution, while the plots on the right are for $T_Z$ and the $\chi^2$-distribution. The top two plots are for the one-dimensional case $d = \{5\}$, the two plots in the middle are for the two-dimensional case $d = \{5, 6\}$, and the two plots at the bottom are for the 10-dimensional case $d = \{16 : 25\}$. They suggest that with a one-dimensional d, both the F-distribution and the $\chi^2$-distribution provide reasonable approximations to the distributions of $T_F$ and $T_Z$, respectively. When the dimension of d is greater than one, the F-distribution approximation remains appropriate, but the $\chi^2$-distribution approximation is clearly no longer valid.

Table 3

Empirical sizes and powers of the test procedure given in Zhong et al. [4] and the proposed one with the misspecified model.

| d | (p, n) | Size, T_Z | Size, T_F | Power, T_Z | Power, T_F |
|---|---|---|---|---|---|
| {5} | (100, 100) | 0.264 | 0.051 | 0.504 | 1 |
| | (200, 100) | 0.256 | 0.052 | 0.523 | 1 |
| | (600, 100) | 0.242 | 0.049 | 0.523 | 1 |
| {5, 6} | (100, 100) | 0.492 | 0.047 | 0.589 | 1 |
| | (200, 100) | 0.487 | 0.065 | 0.553 | 1 |
| | (600, 100) | 0.503 | 0.053 | 0.586 | 1 |
| {6 : 10} | (100, 100) | 0.672 | 0.053 | 0.583 | 1 |
| | (200, 100) | 0.668 | 0.045 | 0.533 | 1 |
| | (600, 100) | 0.623 | 0.054 | 0.501 | 1 |
| {16 : 25} | (100, 100) | 0.665 | 0.045 | 0.489 | 1 |
| | (200, 100) | 0.616 | 0.040 | 0.433 | 1 |
| | (600, 100) | 0.527 | 0.055 | 0.354 | 0.998 |
| {16 : 35} | (100, 100) | 0.524 | 0.044 | 0.460 | 0.949 |
| | (200, 100) | 0.487 | 0.051 | 0.384 | 0.792 |
| | (600, 100) | 0.377 | 0.064 | 0.225 | 0.386 |

In practice, one possible question of interest is the sensitivity of the test procedure proposed in the previous sections with respect to model (1). To investigate this, we performed some simulation studies, and Table 3 presents the results obtained in exactly the same way as those given in Table 1, except that the failure times were generated from the following Cox proportional hazards model

$\lambda(t \mid Z) = \lambda_0(t)\exp\Big\{\sum_{j=1}^{p} \beta_{0j} Z_j(t)\Big\}$

instead of model (1), where all the functions and parameters are defined as in model (1). One can see from the table that the results lead to conclusions similar to those from Table 1; in particular, they suggest that the proposed test procedure remains valid and does not seem to be sensitive to the specification of model (1). In addition, as with Figure 1, we also obtained the corresponding quantile plots of the two test statistics against the F-distribution and the $\chi^2$-distribution, respectively, with the proper degrees of freedom, and again they led to similar conclusions.

5 An application

Now we apply the likelihood ratio test procedure proposed in the previous sections to a set of data on kidney cancer discussed in Sultmann et al. [18] and Zhong et al. [4]. The data set consists of 74 patients, and in addition to the censored survival times (in months), some genomic and non-genomic information was also observed. Specifically, for each patient, the expression of 4,224 microarray genes was measured. For the non-genomic factors, in addition to gender and age, the patients were classified into three groups based on their renal cell carcinoma (RCC) type: clear cell (ccRCC), papillary (pRCC) and chromophobe cell (chRCC). One objective of interest is to evaluate the effects of the non-genomic factors on the survival rate given the genomic factors. In the following, as others have done, we will focus on the 60 patients whose survival times are either observed or censored.

To perform the analysis, for each patient, we define $X_1 = 1$ if the RCC type is ccRCC and $X_1 = 0$ otherwise; $X_2 = 1$ if the RCC type is pRCC and $X_2 = 0$ otherwise; and $G = 1$ if the patient is male and $G = 0$ otherwise. In addition, we use $Ag$ to denote the age of the patient and $Z_2$, a 4224-dimensional vector, to denote the gene expression measurements. With the notation above, model (1) becomes

$\lambda(t \mid Z) = \lambda_0(t) + \beta_1 X_1 + \beta_2 X_2 + \beta_3 G + \beta_4 Ag + \beta_5' Z_2,$

and the main goal is to test the hypothesis $H_0$ with $d = \{1, 2, 3, 4\}$. In the above, $Z = (X_1, X_2, G, Ag, Z_2')'$ and $\beta_5$ is a 4224-dimensional vector of unknown parameters.
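The dummy coding described above can be sketched as follows; the helper and its argument names are hypothetical, not from the paper.

```python
def code_nongenomic(rcc_type, sex, age):
    """Dummy-code the non-genomic factors as in the text: X1 = 1 for
    ccRCC, X2 = 1 for pRCC (so chRCC is the reference level), and
    G = 1 for male. Illustrative helper; names are our own."""
    X1 = 1 if rcc_type == "ccRCC" else 0
    X2 = 1 if rcc_type == "pRCC" else 0
    G = 1 if sex == "male" else 0
    return [X1, X2, G, age]

print(code_nongenomic("pRCC", "male", 62))  # → [0, 1, 1, 62]
```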

Table 4

The p-values obtained for kidney cancer data.

| L | Procedure in Zhong et al. [4] | The proposed procedure |
|---|---|---|
| {5 : 10} | 0.8732 | 0.5024 |
| {5 : 20} | 0.8732 | 0.5057 |
| {5 : 30} | 0.8732 | 0.5014 |
| {5 : 50} | 0.8732 | 0.4990 |
| {5 : 100} | 0.8732 | 0.4939 |

Table 4 gives the average p-values obtained by the proposed likelihood ratio test for testing $H_0$, based on 10,000 sets of generated $\tilde{\epsilon}_i$'s. To see the possible effect of the selection of L on the results, we considered several choices: $L = \{5:10\}$, $\{5:20\}$, $\{5:30\}$, $\{5:50\}$ and $\{5:100\}$. For comparison, we also include the p-values given by the test procedure proposed in Zhong et al. [4]. One can see from the table that both test procedures suggest that, conditional on the genomic factors, the non-genomic factors do not have any significant effects on the patients' survival. In addition, the results indicate that the selection of L does not have much effect on the test results.

6 Discussion and concluding remarks

In the previous sections, we have considered a group test problem for high-dimensional situations in which the response variable of interest is a failure time arising from the additive hazards model. As discussed above, the problem occurs quite often and in many areas, especially in genetic studies or genomics, where the identification of a group of genes significantly related to a certain disease is often of interest. Also, due to the dimensionality, traditional methods cannot be applied. For the problem, corresponding to the approach given in Zhong et al. [4], a new partition algorithm was presented. Furthermore, a likelihood ratio test procedure was developed, and the numerical studies suggested that the proposed approach performs well in practical situations and outperforms the procedure given in Zhong et al. [4].

More research is clearly needed on some issues related to the problem discussed here and on the generalization of the proposed test procedure to other situations. One is the goodness-of-fit test for model (1). For standard failure time data, or data with low dimensions, some procedures have been developed for checking the appropriateness of the additive hazards model (1), and one typical way is to apply some residual-based statistics [19]. For the high-dimensional situation considered here, however, there does not seem to exist an established procedure for checking model (1) or other commonly used regression models for failure time data. It is apparent that one can define residuals similarly, but the resulting statistics may not have similar properties in high-dimensional situations.

Note that in the preceding sections, the focus has been on the group test, and it is apparent that one may also be interested in the estimation of the regression parameters. For this, it is clear that one can employ some existing penalized methods or develop new methods based on the new partition procedure. In the proposed approach, it has been assumed that the failure time of interest follows the AHM, and in practice this may not be true. In other words, it may be useful to develop similar approaches for situations where the failure time follows other models, such as the proportional hazards model or a linear transformation model. However, such generalizations may not be straightforward, since under these situations the relationship (4) may not exist. Finally, of course, the development of theoretical justification for the proposed test procedure would also be helpful.

Funding statement: Jiang’s work was partly supported by the National Natural Science Foundation of China, Project No. 11471140.

Acknowledgements:

The authors wish to thank the Editor and two reviewers for their helpful comments and suggestions, which greatly improved the paper.

References

1. Foster JC, Liu D, Albert PS, Liu A. Identifying subgroups of enhanced predictive accuracy from longitudinal biomarker data using tree-based approaches: applications to monitoring fetal growth. J R Stat Soc Ser A 2016. doi:10.1111/rssa.12182.

2. Lin HZ, Li Y, Tan M. Estimating a unitary effect summary based on combined survival and quantitative outcomes in clinical trials. Comput Stat Data Anal 2013;66:129–39. doi:10.1016/j.csda.2013.03.028.

3. Liu Z, Bensmail H, Tan M. Efficient feature selection and multiclass classification with integrated instance and model based learning. Evol Bioinf 2012;8:1–10. doi:10.4137/EBO.S9407.

4. Zhong PS, Hu T, Li J. Tests for coefficients in high-dimensional additive hazard models. Scand J Stat 2015;42:649–64. doi:10.1111/sjos.12127.

5. Kalbfleisch JD, Prentice RL. The statistical analysis of failure time data, 2nd ed. New York: John Wiley, 2002. doi:10.1002/9781118032985.

6. Lin DY, Ying Z. Semiparametric analysis of the additive risk model. Biometrika 1994;81:61–71. doi:10.1093/biomet/81.1.61.

7. Gaiffas S, Guilloux A. High-dimensional additive hazards models and the Lasso. Electron J Stat 2012;6:522–46. doi:10.1214/12-EJS681.

8. Lin W, Lv J. High-dimensional sparse additive hazards regression. J Am Stat Assoc 2013;108:247–64. doi:10.1080/01621459.2012.746068.

9. Martinussen T, Scheike TH. Covariate selection for the semiparametric additive risk model. Scand J Stat 2009a;36:602–19. doi:10.1111/j.1467-9469.2009.00650.x.

10. Martinussen T, Scheike TH. The additive hazards model with high-dimensional regressors. Lifetime Data Anal 2009b;15:330–42. doi:10.1007/s10985-009-9111-y.

11. Goeman J, Geer VD, Houwelingen V. Testing against a high-dimensional alternative. J R Stat Soc Ser B 2006;68:477–93. doi:10.1111/j.1467-9868.2006.00551.x.

12. Zhong PS, Chen SX. Tests for high dimensional regression coefficients with factorial designs. J Am Stat Assoc 2011;106:260–74. doi:10.1198/jasa.2011.tm10284.

13. Goeman J, Houwelingen V, Finos L. Testing against a high dimensional alternative in the generalized linear model: asymptotic type I error control. Biometrika 2011;98:381–90. doi:10.1093/biomet/asr016.

14. Goeman J, Oosting J, Cleton-Jansen AM, Anninga JK, van Houwelingen HC. Testing association of a pathway with survival using gene expression data. Bioinformatics 2005;21:1950–7. doi:10.1093/bioinformatics/bti267.

15. Lan W, Wang H, Tsai CL. Testing covariates in high dimensional regression. Ann Inst Stat Math 2014;66:279–301. doi:10.1007/s10463-013-0414-0.

16. Wang S, Cui H. Generalized F-test for high dimensional linear regression coefficients. J Multivariate Anal 2013;117:134–49. doi:10.1016/j.jmva.2013.02.010.

17. Anderson TW. An introduction to multivariate statistical analysis, 2nd ed. New Jersey: John Wiley & Sons, 2003.

18. Sultmann H, Heydebreck A, Huber W, Kuner R, Bune A, Vogt M, et al. Gene expression in kidney cancer is associated with cytogenetic abnormalities, metastasis formation, and patient survival. Clin Cancer Res 2005;11:646–55. doi:10.1158/1078-0432.646.11.2.

19. Klein JP, Moeschberger ML. Survival analysis: Techniques for censored and truncated data, 2nd ed. New York: Springer, 2003. doi:10.1007/b97377.

Published Online: 2017-5-9

© 2017 Walter de Gruyter GmbH, Berlin/Boston
