
A Sequential Rejection Testing Method for High-Dimensional Regression with Correlated Variables

  • Jacopo Mandozzi and Peter Bühlmann
Published/Copyright: May 26, 2016

Abstract

We propose a general, modular method for significance testing of groups (or clusters) of variables in a high-dimensional linear model. In the presence of high correlations among the covariables, due to serious problems of identifiability, it is indispensable to focus on detecting groups of variables rather than singletons. We propose an inference method which allows one to build in hierarchical structures. It relies on repeated sample splitting and sequential rejection, and we prove that it asymptotically controls the familywise error rate. It can be implemented on any collection of clusters and leads to improved power in comparison to more standard non-sequential rejection methods. We complement the theoretical analysis with empirical results for simulated and real data.

1 Introduction

Error control of false selection or false positive statements based on p-values is a primary goal of statistical inference and an established, broadly used tool in many areas of science. It relies on standard statistical hypothesis testing and procedures which give provable guarantees in the presence of multiple, potentially very large-scale testing [1–3]. While standard in the classical low-dimensional setup, statistical significance testing in the more challenging high-dimensional setting, where the number of variables p might be much larger than the sample size n, has only recently received attention.

We consider here a linear regression model

$$Y = X\beta^0 + \varepsilon, \qquad \varepsilon \sim \mathcal{N}_n(0, \sigma^2 I), \tag{1}$$

with $n \times p$ design matrix $X$, $p \times 1$ regression vector $\beta^0$ and $n \times 1$ response $Y$. We allow for high-dimensional scenarios where $p \gg n$. We assume that the regression coefficient vector is sparse, with many coefficients of $\beta^0$ being equal to zero; that is, the active set of variables

$$S_0 = \{j;\ \beta_j^0 \neq 0\}$$

is assumed to be a small subset of $\{1,\ldots,p\}$, the index set of all variables.

A few methods for assigning p-values and constructing confidence intervals for individual parameters $\beta_j^0$ ($j = 1,\ldots,p$) have been suggested [4–10], and some of them have been compared against each other in various settings [11, 12]. The inferential statements can easily be adjusted for multiplicity, thanks to the methodology and theory in multiple testing (cf. [1]). However, and importantly for practical applications, some major issues in the presence of highly correlated variables still need further attention: typically, when $p \gg n$, none or only a few of the individual $\beta_j^0$'s turn out to be significant, which is a consequence of their near non-identifiability (even under theoretical well-posedness conditions on the design matrix $X$; cf. [13]). However, a group of (correlated) variables is often much better identifiable, but one cannot then determine anymore the relevant variables within such a group [14–16].

Thus, our main goal is testing the significance of groups of parameters: for a group or cluster $C \subseteq \{1,\ldots,p\}$ we consider the following null and alternative hypotheses, respectively:

$$H_{0,C}: \beta_j^0 = 0 \ \text{ for all } j \in C; \qquad H_{A,C}: \beta_j^0 \neq 0 \ \text{ for at least one } j \in C.$$

Given a collection $\mathcal{C}$ of clusters, we propose a general method for obtaining a collection $\mathcal{R} \subseteq \mathcal{C}$ of rejected clusters such that the familywise error rate (FWER) is strongly controlled. That is, for a given nominal level $\alpha \in (0,1)$:

$$P[\mathcal{R} \subseteq \mathcal{F}] \geq 1 - \alpha,$$

where $\mathcal{F} = \{C \in \mathcal{C} \text{ s.t. } H_{0,C} \text{ is false}\}$, i.e., $\mathcal{F}$ is the collection of false null hypotheses. Our new method has the following main features:

  1. It can be implemented on any collection of clusters $\mathcal{C}$.

  2. It is modular in the sense that it requires four basic building blocks that have to satisfy certain assumptions.

  3. Its modular conception allows for better insight into the procedure's power and improvements thereof.

We are particularly interested in using the procedure for hierarchically ordered clusters of (correlated) variables. Such a hierarchical structure can be obtained from the output of a hierarchical clustering algorithm: since it operates on the design matrix $X$ only and does not involve the response $Y$, the inference for $\beta^0$ remains correct (for fixed design or by conditioning on $X$). With such a hierarchical cluster tree, our inference method (Sections 2.5 and 4.2) first tests the cluster $C = \{1,\ldots,p\}$ containing all the variables (the top node in the tree): if the corresponding null hypothesis is rejected, we test some refined clusters, and we proceed down the cluster tree, in a sequential manner, until a cluster is not significant anymore. Figures 1 and 2, based on the results of the simulations in Section 5, provide some graphical illustrations. This procedure has the remarkable property that the resolution level of the significant clusters is automatically controlled by the sequential testing method: if the signal is strong (e.g. large absolute values of components of $\beta^0$) and the variables are not too highly correlated, one can detect small clusters or even single variables; vice versa, if the signal is not very strong or the variables are highly correlated, only larger groups can be detected as significant.

1.1 Relation to other work

Our proposed method is based on the multi sample splitting method from Meinshausen et al. [7] and the sequential rejection principle of Goeman and Solari [17]. It is a generalization of, and a power improvement over, the multi sample splitting technique for inference of single variables [7] and for hierarchically ordered clusters of variables [15]. The improvement in power is strict, in analogy to the gain in power of Holm's procedure [23] over the Bonferroni adjustment. Thus, even if the increased power might be only small for some datasets, one cannot do worse with the new procedure. The only price to pay is a slightly more complicated algorithm: we provide an implementation in the R-package hdi. We note that the mentioned improvement, based on the work by Goeman and Solari [17] and Goeman and Finos [18], is not entirely trivial to derive in the setting of the multiple sample splitting scheme. The modular set-up based on four building blocks presented here addresses the issue of how the improvement can be constructed.

1.2 Outline of the paper

In Section 2 we describe the four basic building blocks of the method and the assumptions that are sufficient to establish in Section 3 its strong FWER control. In Sections 4.1 and 4.2, respectively, we focus on the inference of two specific kinds of cluster collections: singletons and hierarchically ordered clusters. In Section 4.3 we show how logical relationships can be used to improve the power. Finally, we provide in Section 5 a comparison based on empirical results for error control and power, with a focus on minimal true detections, and we apply the new method to a real dataset.

2 A construction based on four building blocks

Our method is based on four basic building blocks that satisfy certain assumptions.

One main ingredient is multi sample splitting. For $b = 1,\ldots,B$, where $B$ is the number of repeated sample splits, the original data of sample size $n$ are split into two disjoint groups $N_{\mathrm{in}}^{(b)}$ and $N_{\mathrm{out}}^{(b)}$, i.e., a partition

$$\{1,\ldots,n\} = N_{\mathrm{in}}^{(b)} \,\dot\cup\, N_{\mathrm{out}}^{(b)}$$

is randomly chosen. The two groups are of equal size if $n$ is even, or satisfy $|N_{\mathrm{out}}^{(b)}| = |N_{\mathrm{in}}^{(b)}| + 1$ if $n$ is odd. The idea is to use the data from $N_{\mathrm{in}}^{(b)}$ to select a few variables, and the remaining data from $N_{\mathrm{out}}^{(b)}$ to perform the statistical hypothesis testing in the low-dimensional submodel with the variables selected from $N_{\mathrm{in}}^{(b)}$. The details are described in the following subsections.
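As a concrete illustration of the splitting scheme itself, the following minimal R sketch draws one such random split; the function name sample_split and the surrounding scaffolding are illustrative assumptions, not the hdi implementation.

```r
## One random partition of {1, ..., n}: N_in gets floor(n/2) indices,
## N_out the rest, so |N_out| = |N_in| + 1 whenever n is odd.
sample_split <- function(n) {
  N_in <- sort(sample(seq_len(n), floor(n / 2)))
  list(N_in = N_in, N_out = setdiff(seq_len(n), N_in))
}

set.seed(1)
splits <- lapply(seq_len(50), function(b) sample_split(100))  # B = 50 splits
```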

2.1 Screening of variables

We consider variable screening where an estimator $\hat S^{(b)} \subseteq \{1,\ldots,p\}$, based on the data corresponding to $N_{\mathrm{in}}^{(b)}$, aims at including all active variables $S_0$. A prime example is the Lasso [19]; a detailed empirical comparison of five popular screening procedures can be found in Bühlmann and Mandozzi [20]. Assume that the screening procedure satisfies the following properties for any sample split $b$:

(A1) Sparsity property: $|\hat S^{(b)}| < n/2$.

(A2) $\delta$-screening property: $P[\hat S^{(b)} \supseteq S_0] \geq 1 - \delta$, where $0 < \delta < 1$.

The sparsity property (A1) implies that for each sample split $b$ it holds that $|\hat S^{(b)}| < |N_{\mathrm{out}}^{(b)}|$, a condition which is necessary for applying classical tests as described in Section 2.2 below. The $\delta$-screening property (A2) ensures that all the relevant variables are retained with high probability (where $\delta > 0$ is typically small). We indicate in Section 3.1 that, under some assumptions, the Lasso satisfies (A1) and (A2).
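As an illustration of this building block, here is a sketch of Lasso screening with glmnet and 10-fold cross-validation on the first half-sample; the final truncation step is one simple way to enforce (A1) and is an assumption of this sketch, not necessarily how hdi enforces it.

```r
library(glmnet)

## Screen variables on the first half-sample N_in; keep at most
## floor(n/2) - 1 variables with the largest absolute coefficients.
screen_lasso <- function(X, Y, N_in) {
  fit <- cv.glmnet(X[N_in, , drop = FALSE], Y[N_in], nfolds = 10)
  beta <- as.numeric(coef(fit, s = "lambda.min"))[-1]   # drop the intercept
  S_hat <- which(beta != 0)
  k_max <- floor(nrow(X) / 2) - 1                       # (A1): |S_hat| < n/2
  if (length(S_hat) > k_max)
    S_hat <- S_hat[order(abs(beta[S_hat]), decreasing = TRUE)[seq_len(k_max)]]
  sort(S_hat)
}
```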

2.2 Testing and p-values

The idea is to perform a classical statistical test on the other half-sample $N_{\mathrm{out}}^{(b)}$ in a low-dimensional problem with variables from $\hat S^{(b)}$ only. For each sample split $b$, based on the second half of the sample corresponding to $N_{\mathrm{out}}^{(b)}$, consider a testing procedure, e.g. the classical partial F-test (see also Section 3.1), that provides correct p-values $p_{C,(b)}$ for the null hypothesis $H_{0, C \cap \hat S^{(b)}}$ for each screened set $\hat S^{(b)}$, in the sense that for each nominal level $\alpha \in (0,1)$:

(A3) Correct testing property: Under the null hypothesis $H_{0, C \cap \hat S^{(b)}}$ it holds that $P[p_{C,(b)} \leq \alpha] \leq \alpha$.

We note that the probability is with respect to the data-generating random variables corresponding to the second half $N_{\mathrm{out}}^{(b)}$, and the null hypothesis is fixed with respect to $N_{\mathrm{out}}^{(b)}$. Due to the screening property (A2), when $\delta \approx 0$, the null hypothesis $H_{0, C \cap \hat S^{(b)}}$ approximates the unconditional hypothesis $H_{0,C}$ which we aim to test. If $C \cap \hat S^{(b)} = \emptyset$, define $p_{C,(b)} = 1$. This provides a (correct) p-value $p_{C,(b)}$ for each cluster $C \in \mathcal{C}$ and each sample split $b \in \{1,\ldots,B\}$.
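A sketch of this per-split p-value follows (the helper name pval_cluster is illustrative): a classical partial F-test on the second half-sample, comparing the model with all screened variables against the model without the cluster variables, returning 1 when the cluster misses the screened set as per the convention above.

```r
pval_cluster <- function(X, Y, N_out, S_hat, C) {
  CS <- intersect(C, S_hat)
  if (length(CS) == 0) return(1)              # cluster misses S_hat
  d_full <- data.frame(Y = Y[N_out], X[N_out, S_hat, drop = FALSE])
  fit_full <- lm(Y ~ ., data = d_full)
  rest <- setdiff(S_hat, CS)
  fit_red <- if (length(rest) == 0) lm(Y ~ 1, data = d_full) else
    lm(Y ~ ., data = data.frame(Y = Y[N_out], X[N_out, rest, drop = FALSE]))
  anova(fit_red, fit_full)[2, "Pr(>F)"]       # partial F-test p-value
}
```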

2.3 Multiplicity adjustment

Consider for each sample split $b$ and each cluster $C \in \mathcal{C}$ a multiplicity adjustment procedure $m_C^{(b)}: 2^{\mathcal{C}} \to [1,\infty]$ which, for each collection $\mathcal{R}$ of rejected clusters, provides a multiplicity adjustment $m_C^{(b)}(\mathcal{R}) \geq 1$ and satisfies the following properties:

(A4) Monotonicity property: If $\mathcal{R} \subseteq \mathcal{S}$ then $m_C^{(b)}(\mathcal{R}) \geq m_C^{(b)}(\mathcal{S})$.

(A5) Single-step property: $\displaystyle\sum_{C \in \mathcal{C} \setminus \mathcal{R}} \frac{1\{C \cap \hat S^{(b)} \neq \emptyset\}}{m_C^{(b)}(\mathcal{R})} \leq 1$,

where we define $1/\infty = 0$. Such families of multiplicity adjustments for $b = 1,\ldots,B$ are often naturally induced from a global multiplicity adjustment procedure $m_C$.

2.4 Aggregation of p-values

Consider a collection of screened sets of variables $\hat S^{(b)}$, a cluster $C \in \mathcal{C}$, a collection of p-values $p_{C,(b)}$ for the null hypothesis $H_{0, C \cap \hat S^{(b)}}$ (which approximates $H_{0,C}$, see the comment after (A3)) and a collection of multiplicity adjustments $m_C^{(b)} \geq 1$ (we drop here the dependence on $\mathcal{R}$).

The goal is to aggregate the p-values $p_{C,(1)}, \ldots, p_{C,(B)}$ to a single p-value which is adjusted for multiplicity. An aggregation procedure is a monotone increasing function $\mathrm{aggr}: [0,1]^B \to [0,1]$. Assume it satisfies the following property:

(A6) Aggregation property: If $P[p_{C,(b)} \leq \alpha] \leq \alpha$ for all $\alpha \in [0,1]$, then
$$P\Big[\mathrm{aggr}\big(p_{C,(1)}\, m_C^{(1)}, \ldots, p_{C,(B)}\, m_C^{(B)}\big) \leq \alpha\Big] \leq \frac{\alpha}{B} \sum_{b=1}^{B} \frac{1\{C \cap \hat S^{(b)} \neq \emptyset\}}{m_C^{(b)}} \quad \text{for all } \alpha \in [0,1].$$

2.5 The procedure

Our procedure is based on the four building blocks above. First, we proceed with screening of the variables based on the first half-sample $N_{\mathrm{in}}^{(b)}$ (Section 2.1); e.g., in Section 5.1 we use the Lasso with regularization parameter chosen by 10-fold cross-validation (see also Section 3.1). Then, we construct the p-values based on the second half-sample $N_{\mathrm{out}}^{(b)}$ by using the partial F-test (Section 2.2; see also Section 3.1). This leads to a (correct) p-value $p_{C,(b)}$ for each cluster $C \in \mathcal{C}$ and each sample split $b \in \{1,\ldots,B\}$.

The multiplicity adjustment is done sequentially (Section 2.3). Based on a chosen significance level $\alpha \in (0,1)$ and for a collection $\mathcal{R}$ of currently rejected sets, define the successor of $\mathcal{R}$ as

$$\mathcal{N}(\mathcal{R}) = \Big\{C \in \mathcal{C} \setminus \mathcal{R} \ \text{ s.t. }\ \mathrm{aggr}\big(p_{C,(1)}\, m_C^{(1)}(\mathcal{R}), \ldots, p_{C,(B)}\, m_C^{(B)}(\mathcal{R})\big) \leq \alpha\Big\}.$$

Start from "no rejections" $\mathcal{R}_0 = \emptyset$, define $\mathcal{R}_{i+1} = \mathcal{R}_i \cup \mathcal{N}(\mathcal{R}_i)$ and $\mathcal{R}_\infty = \lim_{i \to \infty} \mathcal{R}_i$ (the limit is attained after finitely many steps, since there are only finitely many possible subsets of the variables). Concrete choices of $m_C^{(1)}(\mathcal{R}), \ldots, m_C^{(B)}(\mathcal{R})$ are discussed in Section 4.

Finally, we aggregate the p-values as indicated in Section 2.4. Concrete aggregation methods are described in Proposition 1 in Section 3.1.
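The following schematic R sketch puts the four building blocks together as a fixed-point iteration. It assumes p_mat is a $B \times |\mathcal{C}|$ matrix of per-split p-values with columns named after the clusters, m_fun(b, C, rejected) returns the adjustment $m_C^{(b)}(\mathcal{R})$, and aggr() satisfies (A6); all names are illustrative, not the hdi interface.

```r
sequential_rejection <- function(cluster_names, p_mat, m_fun, aggr, alpha = 0.05) {
  rejected <- character(0)
  repeat {
    candidates <- setdiff(cluster_names, rejected)
    newly <- Filter(function(C) {
      ## adjusted p-values p * m over the B splits, then aggregated
      p_adj <- sapply(seq_len(nrow(p_mat)),
                      function(b) p_mat[b, C] * m_fun(b, C, rejected))
      aggr(p_adj) <= alpha
    }, candidates)
    if (length(newly) == 0) break   # fixed point R_infinity reached
    rejected <- union(rejected, newly)
  }
  rejected
}
```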

3 Familywise error control

We show here that the method from Section 2.5 (strongly) controls the FWER at each step $i = 0, 1, 2, \ldots$

Theorem 1

Assume that (A1)–(A6) hold. Then for any $i \in \mathbb{N}_0$,

$$P[\mathcal{R}_i \subseteq \mathcal{F}] \geq (1-\delta)^B - \alpha,$$
where $\mathcal{F} = \{C \in \mathcal{C} \text{ s.t. } C \cap S_0 \neq \emptyset\}$ is the collection of false null hypotheses.

A proof is given in the Appendix.

3.1 Screening, testing and aggregation: their properties

We discuss here some choices for screening, testing and aggregation which we use in the implementation in the R-package hdi. The issue of sequential multiplicity adjustment is treated separately in Section 4.

For variable screening, we use the Lasso with regularization parameter chosen by 10-fold cross-validation. Theoretical justification of the sparsity and screening properties (A1) and (A2) can be derived by assuming a compatibility or restricted eigenvalue condition on the fixed design matrix $X$ and a beta-min assumption requiring that $\min_{j \in S_0} |\beta_j^0| / \sqrt{|S_0|\log(p)/n}$ is sufficiently large: we refer to Bühlmann and van de Geer ([13], Ch. 2.7 and Ch. 6) for the details.

For construction of the p-values (in the low-dimensional setting, due to variable screening in the first half of the sample) we use the partial F-test. Then, assuming fixed design X and Gaussian errors, condition (A3) holds.

For aggregation of the p-values, ensuring that (A6) holds, we have the following result for two slightly different methods.

Proposition 1

Denote by $q_\gamma(u)$ the empirical $\gamma$-quantile of the values occurring in the components of a vector $u$. The monotone increasing functions $[0,1]^B \to [0,1]$

$$\big(\tilde p^{(1)}, \ldots, \tilde p^{(B)}\big) \mapsto Q(\gamma) = \min\Big\{1,\ q_\gamma\big(\tilde p^{(1)}/\gamma, \ldots, \tilde p^{(B)}/\gamma\big)\Big\},$$
$$\big(\tilde p^{(1)}, \ldots, \tilde p^{(B)}\big) \mapsto \min\Big\{1,\ (1 - \log \gamma_{\min}) \inf_{\gamma \in (\gamma_{\min}, 1)} Q(\gamma)\Big\}$$
satisfy the aggregation property (A6) for any $\gamma, \gamma_{\min} \in (0,1)$.

A proof, which was basically given in Meinshausen et al. [7], can be found in the Appendix.
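As an illustration, here is a sketch of the two aggregation rules of Proposition 1, applied to the adjusted p-values $\tilde p^{(b)} = p^{(b)} m^{(b)}$; following Section 5.1, the infimum over $\gamma$ is approximated on a grid, which is an assumption of this sketch.

```r
## Empirical gamma-quantile aggregation, Q(gamma) in Proposition 1.
aggr_quantile <- function(p_tilde, gamma)
  min(1, quantile(p_tilde / gamma, gamma, type = 1))

## Adaptive version: infimum over a grid of gamma values, penalized
## by the factor (1 - log(gamma_min)).
aggr_adaptive <- function(p_tilde, gamma_min = 0.05,
                          grid = seq(gamma_min, 1, by = 0.025)) {
  q_vals <- vapply(grid, function(g) aggr_quantile(p_tilde, g), numeric(1))
  min(1, (1 - log(gamma_min)) * min(q_vals))
}

## e.g. aggr_adaptive(runif(50))   # aggregate 50 adjusted p-values
```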

4 Some concrete methods for multiplicity adjustment

We discuss here the issue of multiplicity adjustment, and justify assumptions (A4) and (A5) for different inference procedures.

4.1 Inference of single variables

This first example is paradigmatic for the advantages of the modular approach: a simple improvement of the multiplicity adjustment procedure allows for better power, essentially in the same way as in the low-dimensional setting in Goeman and Solari [17].

Concretely, we consider the problem of inferring single variables, i.e., the collection of clusters $\mathcal{C} = \{\{i\};\ i = 1,\ldots,p\}$. The method proposed in Meinshausen et al. [7] corresponds to the method of Theorem 1 with the aggregation procedures of Proposition 1 and the following Bonferroni-based [21, 22] multiplicity adjustment procedure:

$$m_{\{i\}}^{(b)}(\mathcal{R}) = |\hat S^{(b)}|. \tag{2}$$

As the multiplicity adjustments do not depend on the previously rejected collection of sets, the monotonicity property (A4) is trivially satisfied, while the single-step property (A5) follows from

$$\sum_{C \in \mathcal{C} \setminus \mathcal{R}} \frac{1\{C \cap \hat S^{(b)} \neq \emptyset\}}{m_C^{(b)}(\mathcal{R})} = \sum_{\{i\} \in \mathcal{C} \setminus \mathcal{R}} \frac{1\{\{i\} \subseteq \hat S^{(b)}\}}{|\hat S^{(b)}|} \leq 1.$$

The power of the method can be improved by taking, instead of (2), the following Bonferroni-Holm-based [23] multiplicity adjustment procedure:

$$m_{\{i\}}^{(b)}(\mathcal{R}) = |\hat S^{(b)} \cap (\mathcal{C} \setminus \mathcal{R})| = |\{j \in \hat S^{(b)} \text{ s.t. } \{j\} \notin \mathcal{R}\}|. \tag{3}$$

The monotonicity property (A4) is still satisfied since

$$|\hat S^{(b)} \cap (\mathcal{C} \setminus \mathcal{R})| \geq |\hat S^{(b)} \cap (\mathcal{C} \setminus \mathcal{S})|$$

for $\mathcal{R} \subseteq \mathcal{S}$, whereas

$$\sum_{C \in \mathcal{C} \setminus \mathcal{R}} \frac{1\{C \cap \hat S^{(b)} \neq \emptyset\}}{m_C^{(b)}(\mathcal{R})} = \sum_{\{i\} \in \mathcal{C} \setminus \mathcal{R}} \frac{1\{\{i\} \subseteq \hat S^{(b)}\}}{|\hat S^{(b)} \cap (\mathcal{C} \setminus \mathcal{R})|} \leq 1$$

proves the single step property (A5).
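A minimal R sketch of the two singleton adjustments side by side (function names are illustrative): eq. (2) ignores previous rejections, while eq. (3) discounts the already rejected singletons Holm-style; here rejected holds the indices of the rejected singletons.

```r
m_bonferroni <- function(S_hat, rejected) length(S_hat)                     # eq. (2)
m_holm       <- function(S_hat, rejected) length(setdiff(S_hat, rejected)) # eq. (3)
```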

4.2 Inference of hierarchically ordered clusters of variables

When dealing with the challenge of inferring hierarchically ordered clusters of variables, e.g. from the tree-structured output of a hierarchical clustering algorithm, one considers a collection of clusters $\mathcal{C} = \{C_i\}_i$ where, for any two clusters $C_i, C_{i'} \in \mathcal{C}$, either one cluster is a subset of the other or they have an empty intersection. The method proposed in Mandozzi and Bühlmann ([15], Section 2), which is based on the procedure of Meinshausen [24], corresponds to the one in Theorem 1 with the aggregation methods of Proposition 1 and the following multiplicity adjustment:

$$m_C^{(b)}(\mathcal{R}) = \begin{cases} \infty, & \text{if } \mathrm{anc}(C) \not\subseteq \mathcal{R}, \\ 1, & \text{if } \mathrm{anc}(C) \subseteq \mathcal{R} \text{ and } \hat S^{(b)} \cap C = \emptyset, \\ \dfrac{|\hat S^{(b)}|}{|\hat S^{(b)} \cap C|}, & \text{otherwise.} \end{cases} \tag{4}$$

Here, $\mathrm{anc}(C)$ denotes the ancestors of $C$ in the hierarchically ordered cluster tree. To check the monotonicity property (A4), consider $\mathcal{R} \subseteq \mathcal{S}$. For $C \in \mathcal{C}$ with $\mathrm{anc}(C) \subseteq \mathcal{R}$ it holds that $\mathrm{anc}(C) \subseteq \mathcal{S}$ and hence $m_C^{(b)}(\mathcal{R}) = m_C^{(b)}(\mathcal{S})$, while for $C \in \mathcal{C}$ with $\mathrm{anc}(C) \not\subseteq \mathcal{R}$ one has $m_C^{(b)}(\mathcal{R}) = \infty \geq m_C^{(b)}(\mathcal{S})$. The single-step property (A5) follows from

$$\sum_{C \in \mathcal{C} \setminus \mathcal{R}} \frac{1\{C \cap \hat S^{(b)} \neq \emptyset\}}{m_C^{(b)}(\mathcal{R})} = \frac{1}{|\hat S^{(b)}|} \sum_{\substack{C \in \mathcal{C} \setminus \mathcal{R} \\ \text{s.t. } \mathrm{anc}(C) \subseteq \mathcal{R}}} |\hat S^{(b)} \cap C| \leq 1,$$

where in the inequality we have used the fact that, for two sets in the sum above, one cannot be a subset of the other and hence, by definition of the hierarchy $\mathcal{C}$, they are disjoint.
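A sketch of the adjustment (4) in R follows; the maps ancestors and clusters (cluster names to ancestor names and to variable indices) are illustrative assumptions.

```r
m_hier <- function(C, S_hat, rejected, ancestors, clusters) {
  if (!all(ancestors[[C]] %in% rejected)) return(Inf)   # not yet testable
  overlap <- length(intersect(clusters[[C]], S_hat))
  if (overlap == 0) return(1)                           # p-value is 1 anyway
  length(S_hat) / overlap                               # |S_hat| / |S_hat ∩ C|
}
```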

4.2.1 The inheritance procedure in the high-dimensional setting

In Goeman and Solari ([17], Section 6.3) and Goeman and Finos [18], the authors propose various possibilities for how the sequential rejection principle can be used to improve the power of the hierarchical procedure in Meinshausen [24].

We consider here the most powerful one, the inheritance procedure of Goeman and Finos [18], which we extend to the high-dimensional setting with hierarchical cluster trees. To do so, we apply the method of Theorem 1 with the aggregation procedures of Proposition 1 and the following multiplicity adjustment:

$$m_C^{(b)}(\mathcal{R}) = \begin{cases} \infty, & \text{if } \mathrm{anc}(C) \not\subseteq \mathcal{R}, \\ 1, & \text{if } \mathrm{anc}(C) \subseteq \mathcal{R} \text{ and } \hat S^{(b)} \cap C = \emptyset, \\ \dfrac{|\hat S^{(b)}|}{|\hat S^{(b)} \cap C|} \displaystyle\prod_{D \in \mathrm{anc}(C)} n_D^{(b)}(\mathcal{R}), & \text{otherwise,} \end{cases} \tag{5}$$

where

$$n_D^{(b)}(\mathcal{R}) = 1 - \frac{1}{|\hat S^{(b)} \cap D|} \sum_{E \in \mathrm{ch}(D) \cap \mathcal{E}(\mathcal{R})} |\hat S^{(b)} \cap E|$$

and

$$\mathcal{E}(\mathcal{R}) = \{C \in \mathcal{C} \text{ s.t. } \mathrm{of}(C) \subseteq \mathcal{R}\},$$

the set of extinct branches, i.e., the set of hypotheses which have been rejected together with all their offspring, denoted by $\mathrm{of}(C)$ (as before, $\mathrm{anc}(C)$ denotes the ancestors of cluster $C$). Note that since $n_D^{(b)}(\mathcal{R}) \leq 1$, this procedure leads to a uniform improvement over the method of the previous section.
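A sketch of the inheritance factor $n_D^{(b)}(\mathcal{R})$ of eq. (5) in R; children maps a cluster name to its child names and extinct is the set $\mathcal{E}(\mathcal{R})$ (all names are illustrative assumptions).

```r
n_factor <- function(D, S_hat, extinct, children, clusters) {
  wD <- length(intersect(clusters[[D]], S_hat))
  if (wD == 0) return(1)
  w_ext <- 0
  for (E in intersect(children[[D]], extinct))   # weights of extinct children
    w_ext <- w_ext + length(intersect(clusters[[E]], S_hat))
  1 - w_ext / wD   # <= 1, decreasing as more branches become extinct
}
```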

The monotonicity property (A4) follows from the same considerations as above and

$$\mathcal{R} \subseteq \mathcal{S} \;\Rightarrow\; \mathcal{E}(\mathcal{R}) \subseteq \mathcal{E}(\mathcal{S}) \;\Rightarrow\; n_D^{(b)}(\mathcal{R}) \geq n_D^{(b)}(\mathcal{S}).$$

To check that the single step property (A5) holds note that

$$\sum_{C \in \mathcal{C} \setminus \mathcal{R}} \frac{1\{C \cap \hat S^{(b)} \neq \emptyset\}}{m_C^{(b)}(\mathcal{R})} = \sum_{\substack{C \in \mathcal{C} \setminus \mathcal{R} \\ \text{s.t. } \mathrm{anc}(C) \subseteq \mathcal{R}}} \frac{|\hat S^{(b)} \cap C|}{|\hat S^{(b)}|} \prod_{D \in \mathrm{anc}(C)} \frac{|\hat S^{(b)} \cap D|}{|\hat S^{(b)} \cap D| - \sum_{E \in \mathrm{ch}(D) \cap \mathcal{E}(\mathcal{R})} |\hat S^{(b)} \cap E|} = \frac{1}{\alpha} \sum_{C \in \mathcal{C} \setminus \mathcal{R}} \alpha_C^{(b)}(\mathcal{R}) \leq 1,$$

where $\alpha_C^{(b)}$ is as in Goeman and Finos ([18], eq. (5)) with the weights $w_C^{(b)} = |\hat S^{(b)} \cap C|$; therefore the single-step property follows directly from the considerations in Goeman and Finos [18].

4.3 Exploiting logical relationships: Shaffer improvements

Logical relationships between hypotheses can be exploited to improve the power of the sequential rejection procedure. A first example of such an improvement for hierarchically ordered clusters was given in Meinshausen [24], while in Goeman and Finos [18] the improvement is applied to the inheritance procedure. Since these improvements are based on the considerations of Shaffer [25], they are called "Shaffer improvements". For the high-dimensional setting, a possible Shaffer improvement consists of multiplying the multiplicity adjustment $m_C^{(b)}(\mathcal{R})$ by the Shaffer factor

$$s_C^{(b)}(\mathcal{R}) = \max\Big\{m_C^{(b)}(\mathcal{U}) / m_C^{(b)}(\mathcal{R}) \ \text{ s.t. }\ C \notin \mathcal{U} \supseteq \mathcal{R},\ \mathcal{U} \text{ congruent}\Big\}, \tag{6}$$

where a set $\mathcal{U} \subseteq \mathcal{C}$ is called congruent if, by the logical implications, it can be a complete set of false hypotheses (e.g., for a collection $\mathcal{C}$ of hierarchically ordered hypotheses, $\mathcal{U} \subseteq \mathcal{C}$ is congruent if for each $C \in \mathcal{U}$ it holds that $\mathrm{anc}(C) \subseteq \mathcal{U}$ and at least one offspring leaf node of $C$ is in $\mathcal{U}$).

Note that multiplication by the Shaffer factor never decreases the power of the method, since by the monotonicity property (A4), $s_C^{(b)}(\mathcal{R}) \leq 1$. Moreover, if $\mathcal{R}$ is congruent, then $s_C^{(b)}(\mathcal{R}) = 1$, and since the collection $\mathcal{F}$ of all false hypotheses is congruent, the Shaffer improvement does not affect the validity of eq. (8). Finally, for $\mathcal{R} \subseteq \mathcal{S}$,

$$m_C^{(b)}(\mathcal{R})\, s_C^{(b)}(\mathcal{R}) = \max\big\{m_C^{(b)}(\mathcal{U}) \text{ s.t. } C \notin \mathcal{U} \supseteq \mathcal{R},\ \mathcal{U} \text{ congruent}\big\} \geq \max\big\{m_C^{(b)}(\mathcal{U}) \text{ s.t. } C \notin \mathcal{U} \supseteq \mathcal{S},\ \mathcal{U} \text{ congruent}\big\} = m_C^{(b)}(\mathcal{S})\, s_C^{(b)}(\mathcal{S}),$$

and hence the Shaffer improvement does not affect the validity of eq. (7) either.

We now apply this Shaffer improvement to the inheritance procedure described in Section 4.2.1. Following the same reasoning as in Goeman and Finos ([18], Section 6), with the weights $w_C^{(b)} = |\hat S^{(b)} \cap C|$ we get the Shaffer factor

$$s_C^{(b)}(\mathcal{R}) = \begin{cases} \dfrac{w_C^{(b)} + u_C^{(b)} - v_C^{(b)}}{w_C^{(b)} + u_C^{(b)}}, & \text{if } C \notin \mathcal{R},\ \mathrm{si}(C) \subseteq \mathcal{L} \cup \mathcal{R}, \\ 1, & \text{otherwise,} \end{cases}$$

where $\mathrm{si}(C) = \mathrm{ch}(\mathrm{pa}(C)) \setminus \{C\}$ denotes the siblings of $C$, $\mathcal{L} \subseteq \mathcal{C}$ denotes the collection of leaf nodes, $u_C^{(b)} = \sum_{D \in \mathrm{si}(C)} w_D^{(b)}$ and $v_C^{(b)} = \min_{D \in \mathrm{si}(C)} w_D^{(b)}$. If $\mathcal{C}$ is a binary tree, the Shaffer factor becomes

$$s_C^{(b)}(\mathcal{R}) = \begin{cases} \dfrac{|\hat S^{(b)} \cap C|}{|\hat S^{(b)} \cap C| + |\hat S^{(b)} \cap \mathrm{si}(C)|}, & \text{if } C \notin \mathcal{R},\ \mathrm{si}(C) \in \mathcal{L} \cup \mathcal{R}, \\ 1, & \text{otherwise.} \end{cases}$$

Unlike for the inheritance procedure in (5), the Shaffer factor (6) for the procedure in (4) always equals 1. Nevertheless, a possibility for exploiting logical relationships to improve the power of procedure (4) for binary trees, which provides a Shaffer improvement very similar to the one above, is illustrated in Mandozzi and Bühlmann [15].
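A sketch of the binary-tree Shaffer factor above in R; sibling gives the single sibling of C, and leaves and rejected are sets of cluster names (all names illustrative assumptions).

```r
shaffer_binary <- function(C, S_hat, sibling, clusters, leaves, rejected) {
  sib <- sibling[[C]]
  ## the factor applies only if C is not rejected and its sibling is a
  ## leaf or already rejected; otherwise it is 1
  if ((C %in% rejected) || !(sib %in% union(leaves, rejected))) return(1)
  wC <- length(intersect(clusters[[C]], S_hat))
  wS <- length(intersect(clusters[[sib]], S_hat))
  if (wC + wS == 0) return(1)
  wC / (wC + wS)
}
```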

5 Empirical results

5.1 Implementation of the methods and considered scenarios

In this section we compare the performance of the four methods presented in Sections 4.1 and 4.2 and refined in Section 4.3, i.e., the single variable method with Bonferroni multiplicity adjustment (2), the hierarchical method with Bonferroni-based adjustment (4) along with the Shaffer improvement as in Mandozzi and Bühlmann [15], the single variable method with Bonferroni-Holm multiplicity adjustment (3), and the hierarchical method with the inheritance procedure (5) along with the Shaffer improvement (6). In the following we refer to the first two methods as the "non-sequential methods" (strictly speaking, the hierarchical method with Bonferroni-based adjustment is sequential, but previous rejections are not used there to improve subsequent multiplicity corrections) and to the latter two methods as the "sequential methods".

We consider the same implementation of the methods and the same scenarios (with exactly the same sample splits) as in Mandozzi and Bühlmann [15], although here we use only standard hierarchical clustering for the hierarchical methods. Concretely, the following choices have been made for implementation:

  1. construction of the clusters with standard hierarchical clustering (using the R-function hclust), with the distance between two covariables equal to 1 minus their absolute correlation, and using complete linkage (see the sketch after this list);

  2. screening with the Lasso [19] with regularization parameter chosen by 10-fold cross-validation;

  3. B=50 sample splits (for each scenario exactly the same splits as in Mandozzi and Bühlmann [15]);

  4. for aggregation, the infimum over $\gamma$ in the aggregated p-values of Proposition 1 is computed over a grid of $\gamma$-values between $\gamma_{\min} = 0.05$ and 1 with grid-steps of size 0.025;

  5. nominal significance level α=5%.
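As an illustration of implementation choice 1, a minimal sketch of the clustering step on a toy design matrix (the toy data and object names are assumptions of this sketch):

```r
set.seed(1)
X <- matrix(rnorm(100 * 20), nrow = 100)   # toy design with 20 covariables
d <- as.dist(1 - abs(cor(X)))              # distance: 1 - |correlation|
tree <- hclust(d, method = "complete")     # complete linkage, as in choice 1
groups <- cutree(tree, k = 4)              # e.g. cut the tree into 4 clusters
```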

The following scenarios are considered (for the details we refer to Mandozzi and Bühlmann [15]):
  1. 42 scenarios based on 7 designs;

  2. for each design we consider 6 settings by varying the number of variables p in the model and the signal-to-noise ratio, defined by

    $$\mathrm{SNR} = \frac{(\beta^0)^T X^T X \beta^0}{n\,\sigma^2}$$

    (see the sketch after this list); namely, for p=200 we use SNR=4 and SNR=8, for p=500 we use SNR=8 and SNR=16, and for p=1000 we use SNR=16 and SNR=32;

  3. 3 designs based on synthetic data (“equi correlation”, “high correlation within small blocks” and “high correlation within large blocks”) and 4 designs based on semi-real data (“Riboflavin with normal correlation”, “Breast with normal correlation”, “Riboflavin with high correlation”, “Breast with high correlation”);

  4. sparsity $s_0 = 6$ for the two "Riboflavin" designs and $s_0 = 10$ for the other five designs.
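A sketch of how the noise level $\sigma$ can be chosen to meet a target SNR as defined in item 2 above (an illustration under that definition, not the exact simulation code):

```r
snr_sigma <- function(X, beta0, target_snr) {
  ## signal = (beta0)' X'X beta0 / n; solve SNR = signal / sigma^2 for sigma
  signal <- as.numeric(t(beta0) %*% crossprod(X) %*% beta0) / nrow(X)
  sqrt(signal / target_snr)
}
```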

5.2 Familywise error rate control (FWER)

For each of the 42 scenarios described in Section 5.1 we consider exactly the same 100 independent simulation runs as in Mandozzi and Bühlmann ([15], Section 4.2.2), varying only the synthetic noise term $\varepsilon$, and count the number of runs in which at least one false selection is made. According to Theorem 1, we expect this number to be at most $100\alpha = 5$ ($\alpha = 0.05$). The results for the Bonferroni-based methods can be seen in Mandozzi and Bühlmann ([15], Table 1): FWER control holds for 40 of the 42 scenarios, and in 37 scenarios there is no false selection at all.

The results for the methods with sequential rejection are very similar, the only differences being the following: for the "high correlation within small blocks"-design with p=500 and SNR=8, the number of runs with at least one false selection increases (compared to the Bonferroni-type methods) from 7 to 9 for the single variable method and from 7 to 13 for the hierarchical method, respectively; for the same design with p=1000 and SNR=16, the number of runs with at least one false selection increases from 5 to 6 for both the single variable and the hierarchical method. For all other scenarios, including the "high correlation within large blocks"-design with p=200 and SNR=4, where the non-sequential hierarchical method slightly failed to control the FWER (6 runs with at least one false detection), the sequential methods exhibit the same FWER control as their non-sequential counterparts.

Summarizing, FWER control holds for all four methods in 39 out of 42 scenarios, and the designs where it does not fully hold are "high correlation within small blocks" and "high correlation within large blocks"; this is not surprising, since there each active predictor is highly correlated with one or more inactive predictors from $S_0^c$, and hence it is rather difficult for our screening method (the Lasso) to guarantee that $\hat S \supseteq S_0$.

5.3 Power

For measuring the power we consider four different aspects: the one-dimensional statistics defined in Mandozzi and Bühlmann ([15], Section 4.2.1) as "Performance 1" and "Performance 2" (see below), the number of minimal true detections (MTDs, i.e., smallest significant groups of variables of any cardinality containing at least one active variable, see below) and the number of singleton true detections (STDs, i.e., MTDs of cardinality 1). Concretely, a cluster C is an MTD if it satisfies all of the following:

  1. C is a significant cluster, i.e., has p-value < 5% ("Detection");

  2. there is no significant sub-cluster $D \subsetneq C$ ("Minimal");

  3. $C \in \mathcal{F}$, i.e., there is at least one active variable in C ("True");

and we define:

$$\text{Performance 1} = \frac{1}{|S_0|} \sum_{C \in \mathrm{MTD}} \frac{1}{|C|}, \qquad \text{Performance 2} = \frac{1}{|S_0|} \sum_{\substack{C \in \mathrm{MTD} \\ \text{with } |C| \leq 20}} \frac{1}{2}\Big(\frac{1}{|C|} + 1\Big).$$
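A sketch of the two performance measures in R, assuming mtds is a list of index vectors of the minimal true detections and s0 is $|S_0|$ (names illustrative):

```r
perf1 <- function(mtds, s0)
  sum(vapply(mtds, function(C) 1 / length(C), numeric(1))) / s0

perf2 <- function(mtds, s0) {
  small <- Filter(function(C) length(C) <= 20, mtds)   # only |C| <= 20 counts
  sum(vapply(small, function(C) 0.5 * (1 / length(C) + 1), numeric(1))) / s0
}
```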

For each of the 42 scenarios outlined in Section 5.1, we consider exactly the same 100 independent simulation runs obtained in Mandozzi and Bühlmann ([15], Sections 4.2.3–4) by varying the synthetic noise term $\varepsilon$ and the synthetic regression vector $\beta^0$. We then calculate the average Performance 1, Performance 2, number of MTDs and number of STDs over the 100 simulation runs. The results are shown in Table 1 (for the single variable methods each MTD is an STD, and by definition Performance 1 and 2 coincide).

Table 1:

Empirical results, averaged over 100 simulation runs, for single variable method with Bonferroni (SB) and with Bonferroni-Holm (SH), resp. hierarchical method with Bonferroni (HB) and with sequential rejection induced by the inheritance procedure (HSR).

| SNR | Design | p | #MTDs SB | SH | HB | HSR | #STDs HB | HSR | Perf 1 SB | SH | HB | HSR | Perf 2 HB | HSR |
|-----|--------|---|----------|----|----|-----|----------|-----|-----------|----|----|-----|-----------|-----|
| low | equi corr | 200 | 4.79 | 5.00 | 5.40 | 5.59 | 4.34 | 4.55 | 47.9 | 50.0 | 44.2 | 46.3 | 46.1 | 48.1 |
| low | equi corr | 500 | 3.97 | 4.13 | 4.74 | 4.84 | 3.73 | 3.84 | 39.7 | 41.3 | 37.7 | 38.7 | 38.3 | 39.3 |
| low | equi corr | 1000 | 1.77 | 1.79 | 2.54 | 2.54 | 1.73 | 1.73 | 17.7 | 17.9 | 17.4 | 17.4 | 17.6 | 17.6 |
| low | small blocks | 200 | 4.45 | 4.78 | 6.85 | 7.12 | 4.36 | 4.84 | 44.5 | 47.8 | 53.7 | 57.5 | 60.7 | 64.0 |
| low | small blocks | 500 | 3.15 | 3.33 | 5.18 | 5.27 | 3.15 | 3.42 | 31.5 | 33.3 | 38.3 | 40.1 | 44.2 | 45.5 |
| low | small blocks | 1000 | 1.31 | 1.35 | 2.53 | 2.57 | 1.31 | 1.37 | 13.1 | 13.5 | 15.1 | 15.6 | 17.2 | 17.7 |
| low | large blocks | 200 | 0.29 | 0.30 | 6.50 | 6.50 | 0.28 | 0.28 | 2.9 | 3.0 | 6.7 | 6.7 | 31.3 | 31.3 |
| low | large blocks | 500 | 0.06 | 0.06 | 2.76 | 2.76 | 0.06 | 0.06 | 0.6 | 0.6 | 1.1 | 1.1 | 1.1 | 1.1 |
| low | large blocks | 1000 | 0.00 | 0.00 | 0.60 | 0.60 | 0.00 | 0.00 | 0.0 | 0.0 | 0.1 | 0.1 | 0.1 | 0.1 |
| low | Riboflavin normal corr | 200 | 1.41 | 1.43 | 2.41 | 2.46 | 1.33 | 1.35 | 23.5 | 23.8 | 23.4 | 23.8 | 25.2 | 25.7 |
| low | Riboflavin normal corr | 500 | 0.90 | 0.90 | 1.84 | 1.85 | 0.77 | 0.79 | 15.0 | 15.0 | 13.5 | 13.8 | 14.2 | 14.5 |
| low | Riboflavin normal corr | 1000 | 0.72 | 0.73 | 1.60 | 1.63 | 0.63 | 0.66 | 12.0 | 12.2 | 10.8 | 11.2 | 11.0 | 11.4 |
| low | Breast normal corr | 200 | 4.05 | 4.16 | 5.00 | 5.11 | 3.84 | 3.94 | 40.5 | 41.6 | 39.5 | 40.6 | 41.6 | 42.9 |
| low | Breast normal corr | 500 | 3.95 | 4.02 | 5.04 | 5.11 | 3.82 | 3.87 | 39.5 | 40.2 | 38.8 | 39.3 | 39.6 | 40.2 |
| low | Breast normal corr | 1000 | 3.30 | 3.34 | 4.25 | 4.27 | 3.10 | 3.13 | 33.0 | 33.4 | 31.2 | 31.5 | 31.7 | 31.9 |
| low | Riboflavin high corr | 200 | 1.44 | 1.49 | 2.96 | 2.96 | 1.41 | 1.44 | 24.0 | 24.8 | 26.0 | 26.4 | 31.8 | 32.1 |
| low | Riboflavin high corr | 500 | 1.72 | 1.79 | 2.95 | 2.98 | 1.69 | 1.72 | 28.7 | 29.8 | 29.9 | 30.4 | 32.8 | 33.3 |
| low | Riboflavin high corr | 1000 | 1.51 | 1.51 | 2.54 | 2.56 | 1.49 | 1.52 | 25.2 | 25.2 | 25.3 | 25.8 | 25.7 | 26.1 |
| low | Breast high corr | 200 | 3.98 | 4.10 | 5.91 | 5.95 | 3.87 | 3.91 | 39.8 | 41.0 | 41.2 | 41.6 | 46.1 | 46.6 |
| low | Breast high corr | 500 | 5.13 | 5.22 | 6.51 | 6.56 | 4.87 | 4.93 | 51.3 | 52.2 | 49.9 | 50.4 | 51.7 | 52.3 |
| low | Breast high corr | 1000 | 4.73 | 4.77 | 5.95 | 5.98 | 4.64 | 4.67 | 47.3 | 47.7 | 47.0 | 47.3 | 48.3 | 48.6 |
| high | equi corr | 200 | 9.77 | 9.83 | 9.79 | 9.80 | 9.73 | 9.74 | 97.7 | 98.3 | 97.4 | 97.5 | 97.4 | 97.5 |
| high | equi corr | 500 | 7.28 | 7.38 | 7.63 | 7.67 | 7.18 | 7.24 | 72.8 | 73.8 | 72.0 | 72.5 | 72.1 | 72.6 |
| high | equi corr | 1000 | 2.81 | 2.84 | 3.50 | 3.50 | 2.78 | 2.78 | 28.1 | 28.4 | 27.9 | 27.9 | 28.1 | 28.1 |
| high | small blocks | 200 | 9.18 | 9.31 | 9.98 | 10.00 | 9.29 | 9.48 | 91.8 | 93.1 | 96.3 | 97.4 | 98.1 | 98.7 |
| high | small blocks | 500 | 6.99 | 7.03 | 8.05 | 8.14 | 7.02 | 7.15 | 69.9 | 70.3 | 73.5 | 74.7 | 76.4 | 77.4 |
| high | small blocks | 1000 | 2.26 | 2.27 | 3.40 | 3.41 | 2.26 | 2.28 | 22.6 | 22.7 | 24.3 | 24.5 | 26.3 | 26.5 |
| high | large blocks | 200 | 2.17 | 2.26 | 9.58 | 9.58 | 2.13 | 2.14 | 21.7 | 22.6 | 27.9 | 28.0 | 61.4 | 61.4 |
| high | large blocks | 500 | 1.17 | 1.20 | 5.38 | 5.38 | 1.15 | 1.15 | 11.7 | 12.0 | 12.6 | 12.6 | 13.2 | 13.2 |
| high | large blocks | 1000 | 0.43 | 0.45 | 1.11 | 1.11 | 0.43 | 0.43 | 4.3 | 4.5 | 4.4 | 4.4 | 4.4 | 4.4 |
| high | Riboflavin normal corr | 200 | 3.39 | 3.46 | 3.89 | 3.92 | 3.33 | 3.34 | 56.5 | 57.7 | 56.3 | 56.5 | 58.7 | 59.1 |
| high | Riboflavin normal corr | 500 | 2.24 | 2.25 | 2.90 | 2.90 | 2.15 | 2.15 | 37.3 | 37.5 | 36.4 | 36.4 | 36.9 | 36.9 |
| high | Riboflavin normal corr | 1000 | 0.98 | 1.00 | 1.83 | 1.83 | 0.96 | 0.96 | 16.3 | 16.7 | 16.2 | 16.2 | 16.3 | 16.3 |
| high | Breast normal corr | 200 | 8.65 | 8.70 | 8.89 | 8.93 | 8.60 | 8.65 | 86.5 | 87.0 | 86.4 | 86.8 | 87.2 | 87.7 |
| high | Breast normal corr | 500 | 6.81 | 6.86 | 7.33 | 7.35 | 6.72 | 6.74 | 68.1 | 68.6 | 67.6 | 67.8 | 68.3 | 68.5 |
| high | Breast normal corr | 1000 | 3.95 | 3.97 | 4.81 | 4.84 | 3.79 | 3.82 | 39.5 | 39.7 | 38.1 | 38.4 | 38.4 | 38.7 |
| high | Riboflavin high corr | 200 | 3.86 | 3.97 | 4.79 | 4.83 | 3.82 | 3.86 | 64.3 | 66.2 | 66.0 | 66.7 | 69.4 | 70.1 |
| high | Riboflavin high corr | 500 | 3.69 | 3.72 | 4.40 | 4.43 | 3.65 | 3.68 | 61.5 | 62.0 | 61.8 | 62.3 | 63.7 | 64.2 |
| high | Riboflavin high corr | 1000 | 2.48 | 2.51 | 3.24 | 3.27 | 2.43 | 2.45 | 41.3 | 41.8 | 40.7 | 41.0 | 41.1 | 41.4 |
| high | Breast high corr | 200 | 9.09 | 9.15 | 9.59 | 9.61 | 9.09 | 9.14 | 90.9 | 91.5 | 91.9 | 92.3 | 93.8 | 94.1 |
| high | Breast high corr | 500 | 7.75 | 7.82 | 8.38 | 8.40 | 7.71 | 7.72 | 77.5 | 78.2 | 77.8 | 78.0 | 78.9 | 79.1 |
| high | Breast high corr | 1000 | 5.85 | 5.89 | 6.73 | 6.76 | 5.72 | 5.75 | 58.5 | 58.9 | 57.5 | 57.9 | 58.1 | 58.5 |
| — | Average | — | 3.65 | 3.72 | 4.98 | 5.02 | 3.58 | 3.64 | 40.4 | 41.1 | 41.0 | 41.6 | 43.9 | 44.4 |

Considering both low and high SNR, the methods with sequential rejection improve the considered power measures in comparison to the analogous methods without sequential rejection in 207 out of 252 cases; the absolute improvement is at least 0.05 for MTDs and STDs, and at least 0.5% for Performance 1 and Performance 2, in 133 out of 252 cases. For a better interpretation of these results: an absolute improvement of 0.05 MTDs (resp. STDs) basically means that in one out of 20 runs one more MTD (resp. STD) could be detected. Averaging over all scenarios, the improvement given by the sequential rejection procedures lies between 0.04 and 0.06 for MTDs and STDs, and between 0.5% and 0.7% for Performance 1 and Performance 2. The biggest gain with sequential rejection is found in the "high correlation within small blocks"-design with p=200 and low SNR: it consists of 0.48 more STDs, 0.33 more MTDs and an absolute increase of 3.8% in Performance 1 and 3.3% in Performance 2, respectively. This basically means that in half of the runs the method with sequential rejection could find one more STD and in one third of the runs it could find one more MTD. Other particularly favorable scenarios for an improvement with sequential rejection are the "equi correlation"-design and the "Breast normal correlation"-design, both with p=200 and low SNR, and the "high correlation within small blocks"-design with high SNR and p=200, resp. p=500.

In general, the improvement given by the sequential rejection procedures decreases with increasing number p of covariables and is substantial only when the power of the method without sequential rejection is intermediate. These empirical findings are not surprising: looking at how the methods are defined, and in particular at eqs (2)–(5), we conclude that an improvement with the sequential rejection methods is only possible if the related non-sequential method provides at least one STD (and becomes more likely the more STDs are provided by the non-sequential method). Moreover, an improvement with sequential rejection is more likely to happen when the number $|\hat S|$ of screened variables is small.

For a better illustration of what kind of improvement is possible using sequential rejection, we show in Figures 1 and 2 the dendrograms (in gray) for a paradigmatic simulation run of the "equi correlation"- and the "high correlation within small blocks"-design, respectively, both with p=200 and SNR=4.

Figure 1: Dendrograms for a paradigmatic simulation run of the "equi correlation"-design with p=200 and SNR=4. The active variables are labeled in black and the truly detected non-zero variables along the hierarchy are depicted in black.

Figure 1 illustrates that sequential rejection allows the detection of a further singleton, increasing the number of STDs from 6 to 7 and the number of MTDs from 8 to 9. In Figure 2, sequential rejection allows the detection of a singleton that could previously only be detected together with another non-relevant variable in a cluster of cardinality 2, increasing the number of STDs from 4 to 5 (while the number of MTDs remains 6).

Figure 2: Dendrograms for a paradigmatic simulation run of the "high correlation within small blocks"-design with p=200 and SNR=4. The active variables are labeled in black and the truly detected non-zero variables along the hierarchy are depicted in black.

Finally, we have performed a simulation with the same scenarios (and the same sample splits) as in Mandozzi and Bühlmann ([15], Section 4.3), i.e. "small blocks"-designs and "large blocks"-designs with 8 different correlations

$$\rho \in \{0, 0.4, 0.7, 0.8, 0.85, 0.9, 0.95, 0.99\}.$$

The full results are shown in Table 2 in the Appendix. While the methods with sequential rejection control the FWER in exactly the same scenarios where it is also controlled by the non-sequential methods, they increase the average number of MTDs from 5.51 to 5.62 for the single variable method and from 8.11 to 8.18 for the hierarchical method, and the number of STDs for the hierarchical method from 5.44 to 5.55, with improvements for a single scenario of up to 0.48 MTDs and 0.56 STDs (averaged over 100 runs).

The empirical results can be summarized as follows. The methods with sequential rejection essentially control the FWER in the same way as the non-sequential methods. Regarding power, sequential rejection allows for improvements, to a similar extent for the single variable and the hierarchical procedures. As already noted in Mandozzi and Bühlmann [15] for the non-sequential methods, the hierarchical methods have similar numbers of STDs as the single variable methods but allow for substantially more MTDs. Thus, our proposed hierarchical method with the inheritance procedure can be seen as the best of the considered methods.

5.4 Real data application: Motif Regression

We consider here a problem of motif regression [26] from computational biology. We apply the four methods described above, plus the two hierarchical methods (with and without sequential rejection) using the recently proposed canonical correlation clustering of Bühlmann et al. [14], to a real dataset with n=287 and p=195, used in Meinshausen ([24], Section 4.3) and Mandozzi and Bühlmann ([15], Section 4.4). The sequential rejection methods detect exactly the same significant structures as the non-sequential methods, namely a single variable and a cluster containing 165 variables (the latter can be detected only with the hierarchical method with canonical correlation clustering). This is hardly surprising: with only one STD by the non-sequential methods, further improvements by the sequential methods are rather unlikely (see Section 5.3 for more explanation and empirical evidence).

6 Conclusions

We propose a general sequential rejection testing method for clusters and single variables in a high-dimensional linear model. In the presence of high correlations among the covariables, due to serious problems of identifiability, it is essentially mandatory to focus on detecting significant groups of variables rather than single individual covariables. Our method asymptotically controls the familywise error rate (FWER) while, as a consequence of its modular structure, allowing for transparent power improvements. We provide an implementation in the R-package hdi.

We use and study the procedure for inference of single variables but, much more importantly, for hierarchically ordered clusters of variables. With the latter, we establish a powerful scheme for meaningful inference in a high-dimensional regression model, going much beyond the consideration of single variables only. Our mathematical analysis of FWER control and power improvement is complemented by empirical results based on semi-real and simulated data confirming the theoretical results.

Appendix

Proof of Theorem 1

Proof. We show that the procedure satisfies the monotonicity and single-step conditions required by Goeman and Solari ([17], Theorem 1), i.e.

$$\mathcal{R} \subseteq \mathcal{S} \;\Rightarrow\; \mathcal{N}(\mathcal{R}) \subseteq \mathcal{N}(\mathcal{S}) \cup \mathcal{S}, \tag{7}$$
$$P[\mathcal{N}(\mathcal{F}) \subseteq \mathcal{F}] \geq (1-\delta)^B - \alpha. \tag{8}$$

Assume $\mathcal{R} \subseteq \mathcal{S}$ and $C \in \mathcal{N}(\mathcal{R})$. Then by definition

$$\mathrm{aggr}\big(p_{C,(1)}\, m_C^{(1)}(\mathcal{R}), \ldots, p_{C,(B)}\, m_C^{(B)}(\mathcal{R})\big) \leq \alpha.$$

The monotonicity property (A4) of the multiplicity adjustment and the fact that the aggregation procedure is monotone increasing imply

$$\mathrm{aggr}\big(p_{C,(1)}\, m_C^{(1)}(\mathcal{S}), \ldots, p_{C,(B)}\, m_C^{(B)}(\mathcal{S})\big) \leq \mathrm{aggr}\big(p_{C,(1)}\, m_C^{(1)}(\mathcal{R}), \ldots, p_{C,(B)}\, m_C^{(B)}(\mathcal{R})\big) \leq \alpha,$$

and hence either $C \in \mathcal{S}$ or $C \in \mathcal{N}(\mathcal{S})$, which proves (7). Consider the event

$$\mathcal{A} = \{\hat S^{(b)} \supseteq S_0,\ b = 1,\ldots,B\},$$

on which all screenings are successful. Because of the $\delta$-screening property (A2) it holds that $P(\mathcal{A}) \geq (1-\delta)^B$ and hence

$$P[\mathcal{N}(\mathcal{F}) \not\subseteq \mathcal{F}] = P[\mathcal{N}(\mathcal{F}) \not\subseteq \mathcal{F} \mid \mathcal{A}]\, P(\mathcal{A}) + P[\mathcal{N}(\mathcal{F}) \not\subseteq \mathcal{F} \mid \mathcal{A}^c]\, P(\mathcal{A}^c) \leq P[\mathcal{N}(\mathcal{F}) \not\subseteq \mathcal{F} \mid \mathcal{A}] + \big(1 - (1-\delta)^B\big).$$

Since

$$\begin{aligned}
P[\mathcal{N}(\mathcal{F}) \not\subseteq \mathcal{F} \mid \mathcal{A}]
&= P\Big[\bigcup_{C \in \mathcal{C} \setminus \mathcal{F}} \big\{\mathrm{aggr}\big(p_{C,(1)}\, m_C^{(1)}(\mathcal{F}), \ldots, p_{C,(B)}\, m_C^{(B)}(\mathcal{F})\big) \leq \alpha\big\} \,\Big|\, \mathcal{A}\Big] \\
&\leq \sum_{C \in \mathcal{C} \setminus \mathcal{F}} P\Big[\mathrm{aggr}\big(p_{C,(1)}\, m_C^{(1)}(\mathcal{F}), \ldots, p_{C,(B)}\, m_C^{(B)}(\mathcal{F})\big) \leq \alpha \,\Big|\, \mathcal{A}\Big] \\
&\overset{(A3),(A6)}{\leq} \sum_{C \in \mathcal{C} \setminus \mathcal{F}} \frac{\alpha}{B} \sum_{b=1}^{B} \frac{1\{C \cap \hat S^{(b)} \neq \emptyset\}}{m_C^{(b)}(\mathcal{F})}
\;\overset{(A5)}{\leq}\; \frac{\alpha}{B} \sum_{b=1}^{B} 1 = \alpha,
\end{aligned}$$

we can prove (8) as follows:

$$P[\mathcal{N}(\mathcal{F}) \subseteq \mathcal{F}] = 1 - P[\mathcal{N}(\mathcal{F}) \not\subseteq \mathcal{F}] \geq 1 - \Big(\alpha + \big(1 - (1-\delta)^B\big)\Big) = (1-\delta)^B - \alpha.$$

Proof of Proposition 1

Proof. The proof was basically given in the Appendix of Meinshausen et al. [7]. In the following we omit the function $\min\{1, \cdot\}$ from the definition of $Q(\gamma)$ in order to simplify the notation (this is possible since the level $\alpha$ is smaller than 1). Define for $u \in (0,1)$ the function

$$\pi(u) := \frac{1}{B} \sum_{b=1}^{B} 1\{\tilde p^{(b)} \leq u\}.$$

Then it holds

$$Q(\gamma) \leq \alpha \;\Leftrightarrow\; q_\gamma\big(\tilde p^{(1)}/\gamma, \ldots, \tilde p^{(B)}/\gamma\big) \leq \alpha \;\Leftrightarrow\; q_\gamma\big(\tilde p^{(1)}, \ldots, \tilde p^{(B)}\big) \leq \alpha\gamma \;\Leftrightarrow\; \sum_{b=1}^{B} 1\{\tilde p^{(b)} \leq \alpha\gamma\} \geq B\gamma \;\Leftrightarrow\; \pi(\alpha\gamma) \geq \gamma.$$

Thus,

$$\begin{aligned}
P(Q(\gamma) \leq \alpha) &= E\big(1\{Q(\gamma) \leq \alpha\}\big) = E\big(1\{\pi(\alpha\gamma) \geq \gamma\}\big) \leq \frac{1}{\gamma} E\big(\pi(\alpha\gamma)\big) = \frac{1}{\gamma} \frac{1}{B} \sum_{b=1}^{B} E\big(1\{\tilde p^{(b)} \leq \alpha\gamma\}\big) \\
&= \frac{1}{\gamma} \frac{1}{B} \sum_{b=1}^{B} P\big(\tilde p^{(b)} \leq \alpha\gamma\big) \leq \frac{1}{\gamma} \frac{1}{B} \sum_{b=1}^{B} \frac{\alpha\gamma}{m^{(b)}}\, 1\{C \cap \hat S^{(b)} \neq \emptyset\} = \frac{\alpha}{B} \sum_{b=1}^{B} \frac{1\{C \cap \hat S^{(b)} \neq \emptyset\}}{m^{(b)}},
\end{aligned}$$

where the first inequality is a consequence of Markov's inequality, and the last inequality is a consequence of the assumption $P(\tilde p^{(b)} \leq \alpha) = P(p^{(b)} m^{(b)} \leq \alpha) \leq \alpha/m^{(b)}$ and the definition $\tilde p^{(b)} = 1$ for $C \cap \hat S^{(b)} = \emptyset$. For a random variable $U$ taking values in $[0,1]$,

$$\sup_{\gamma \in (\gamma_{\min},1)} \frac{1\{U \leq \alpha\gamma\}}{\gamma} = \begin{cases} 0, & U \geq \alpha, \\ \alpha/U, & \alpha\gamma_{\min} \leq U < \alpha, \\ 1/\gamma_{\min}, & U \leq \alpha\gamma_{\min}, \end{cases}$$

and if $U$ has a uniform distribution on $[0,1]$, then

$$E\Big(\sup_{\gamma \in (\gamma_{\min},1)} \frac{1\{U \leq \alpha\gamma\}}{\gamma}\Big) = \int_0^{\alpha\gamma_{\min}} \gamma_{\min}^{-1}\, dx + \int_{\alpha\gamma_{\min}}^{\alpha} \alpha x^{-1}\, dx = \gamma_{\min}^{-1} x \Big|_{x=0}^{x=\alpha\gamma_{\min}} + \alpha \log x \Big|_{x=\alpha\gamma_{\min}}^{x=\alpha} = \alpha(1 - \log \gamma_{\min}).$$

We apply this using as $U$ the uniformly distributed $\tilde p^{(b)}/m^{(b)} = p^{(b)}$ for $C \cap \hat S^{(b)} \neq \emptyset$, and obtain

$$E\Big(\sup_{\gamma \in (\gamma_{\min},1)} \frac{1\{\tilde p^{(b)}/m^{(b)} \leq \alpha\gamma\}}{\gamma}\Big) \leq \alpha(1 - \log \gamma_{\min}),$$

and similarly as above

$$\begin{aligned}
P\Big(\inf_{\gamma \in (\gamma_{\min},1)} Q(\gamma) \leq \alpha\Big)
&= E\Big(\sup_{\gamma \in (\gamma_{\min},1)} 1\{\pi(\alpha\gamma) \geq \gamma\}\Big)
\leq E\Big(\sup_{\gamma \in (\gamma_{\min},1)} \frac{1}{B} \sum_{b=1}^{B} \frac{1\{\tilde p^{(b)} \leq \alpha\gamma\}}{\gamma}\Big) \\
&= E\Big(\sup_{\gamma \in (\gamma_{\min},1)} \frac{1}{B} \sum_{b=1}^{B} \frac{1\{\tilde p^{(b)} \leq \alpha\gamma\}\, 1\{C \cap \hat S^{(b)} \neq \emptyset\}}{\gamma}\Big) \\
&\leq \frac{1}{B} \sum_{b=1}^{B} E\Big(\sup_{\gamma \in (\gamma_{\min},1)} \frac{1\{\tilde p^{(b)}/m^{(b)} \leq \alpha\gamma/m^{(b)}\}\, 1\{C \cap \hat S^{(b)} \neq \emptyset\}}{\gamma}\Big) \\
&\leq (1 - \log \gamma_{\min}) \frac{\alpha}{B} \sum_{b=1}^{B} \frac{1\{C \cap \hat S^{(b)} \neq \emptyset\}}{m^{(b)}}.
\end{aligned}$$

Additional empirical results

Table 2:

Results of the simulation with the “small blocks”- and “large blocks”-design for 8 different correlations ρ, for single variable method with Bonferroni (SB) and with Bonferroni-Holm (SH), resp. hierarchical method with Bonferroni (HB) and with sequential rejection induced by the inheritance procedure (HSR).

"small blocks"-design with low SNR (SNR = 4)

| ρ | FWER SB | SH | HB | HSR | #MTDs SB | SH | HB | HSR | #STDs HB | HSR |
|---|---------|----|----|-----|----------|----|----|-----|----------|-----|
| 0 | 0 | 0 | 0 | 0 | 9.57 | 9.69 | 9.53 | 9.63 | 9.42 | 9.53 |
| 0.4 | 0 | 0 | 0 | 0 | 8.84 | 9.06 | 8.65 | 8.81 | 8.36 | 8.51 |
| 0.7 | 0 | 0 | 0 | 0 | 5.87 | 6.26 | 7.28 | 7.60 | 5.65 | 6.13 |
| 0.8 | 0 | 0 | 0 | 0 | 5.53 | 5.76 | 6.79 | 7.22 | 5.33 | 5.89 |
| 0.85 | 0.03 | 0.04 | 0.03 | 0.04 | 2.97 | 3.08 | 5.21 | 5.56 | 2.82 | 3.14 |
| 0.9 | 0.01 | 0.01 | 0.01 | 0.01 | 3.35 | 3.55 | 5.49 | 5.86 | 3.22 | 3.60 |
| 0.95 | 0.46 | 0.47 | 0.46 | 0.48 | 1.02 | 1.11 | 4.04 | 4.07 | 0.90 | 0.99 |
| 0.99 | 0.55 | 0.56 | 0.54 | 0.58 | 3.62 | 3.78 | 6.01 | 6.27 | 3.42 | 3.71 |

"large blocks"-design with low SNR (SNR = 4)

| ρ | FWER SB | SH | HB | HSR | #MTDs SB | SH | HB | HSR | #STDs HB | HSR |
|---|---------|----|----|-----|----------|----|----|-----|----------|-----|
| 0 | 0 | 0 | 0 | 0 | 8.42 | 8.68 | 8.38 | 8.50 | 7.98 | 8.11 |
| 0.4 | 0 | 0 | 0 | 0 | 7.61 | 8.09 | 8.98 | 8.98 | 7.44 | 7.48 |
| 0.7 | 0 | 0 | 0 | 0 | 0.67 | 0.71 | 5.90 | 5.91 | 0.59 | 0.59 |
| 0.8 | 0 | 0 | 0 | 0 | 0.27 | 0.27 | 6.02 | 6.02 | 0.24 | 0.24 |
| 0.85 | 0 | 0 | 0 | 0 | 0 | 0 | 3.38 | 3.38 | 0 | 0 |
| 0.9 | 0 | 0 | 0.06 | 0.06 | 0.38 | 0.39 | 7.59 | 7.60 | 0.38 | 0.38 |
| 0.95 | 0.03 | 0.03 | 0.16 | 0.16 | 0.45 | 0.45 | 8.67 | 8.68 | 0.44 | 0.44 |
| 0.99 | 0.97 | 0.97 | 1.00 | 1.00 | 1.47 | 1.48 | 5.28 | 5.27 | 1.47 | 1.48 |

"small blocks"-design with high SNR (SNR = 8)

| ρ | FWER SB | SH | HB | HSR | #MTDs SB | SH | HB | HSR | #STDs HB | HSR |
|---|---------|----|----|-----|----------|----|----|-----|----------|-----|
| 0 | 0 | 0 | 0 | 0 | 9.87 | 9.89 | 9.90 | 9.90 | 9.86 | 9.86 |
| 0.4 | 0 | 0 | 0 | 0 | 10 | 10 | 10 | 10 | 10 | 10 |
| 0.7 | 0 | 0 | 0 | 0 | 10 | 10 | 10 | 10 | 10 | 10 |
| 0.8 | 0 | 0 | 0 | 0 | 9.85 | 9.89 | 9.98 | 9.98 | 9.90 | 9.91 |
| 0.85 | 0 | 0 | 0 | 0 | 9.26 | 9.38 | 9.89 | 9.92 | 9.39 | 9.53 |
| 0.9 | 0 | 0 | 0 | 0 | 9.59 | 9.65 | 10 | 10 | 9.67 | 9.79 |
| 0.95 | 0.21 | 0.23 | 0.21 | 0.28 | 8.36 | 8.46 | 9.82 | 9.78 | 8.36 | 8.61 |
| 0.99 | 0.92 | 0.93 | 0.92 | 0.95 | 6.72 | 6.85 | 8.06 | 8.04 | 6.73 | 6.99 |

"large blocks"-design with high SNR (SNR = 8)

| ρ | FWER SB | SH | HB | HSR | #MTDs SB | SH | HB | HSR | #STDs HB | HSR |
|---|---------|----|----|-----|----------|----|----|-----|----------|-----|
| 0 | 0 | 0 | 0 | 0 | 10 | 10 | 10 | 10 | 10 | 10 |
| 0.4 | 0 | 0 | 0 | 0 | 9.98 | 9.98 | 10 | 10 | 9.99 | 9.99 |
| 0.7 | 0 | 0 | 0 | 0 | 5.12 | 5.35 | 9.60 | 9.60 | 5.10 | 5.12 |
| 0.8 | 0 | 0 | 0 | 0 | 9.23 | 9.43 | 10 | 10 | 9.14 | 9.15 |
| 0.85 | 0 | 0 | 0 | 0 | 3.86 | 4.03 | 9.98 | 9.98 | 3.84 | 3.85 |
| 0.9 | 0 | 0 | 0 | 0 | 0.06 | 0.06 | 7.17 | 7.17 | 0.06 | 0.06 |
| 0.95 | 0 | 0 | 0 | 0 | 1.26 | 1.29 | 9.99 | 9.99 | 1.27 | 1.28 |
| 0.99 | 0.33 | 0.33 | 0.99 | 0.99 | 3.26 | 3.26 | 7.92 | 7.92 | 3.26 | 3.26 |

References

1. Dudoit S, van der Laan MJ. Multiple testing procedures with applications to genomics. Berlin: Springer Science & Business Media, 2007. DOI: 10.1007/978-0-387-49317-6.

2. Efron B. Large-scale inference: empirical Bayes methods for estimation, testing, and prediction, volume 1. Cambridge: Cambridge University Press, 2010. DOI: 10.1017/CBO9780511761362.

3. Westfall PH. Resampling-based multiple testing: examples and methods for p-value adjustment, volume 279. New York: John Wiley & Sons, 1993.

4. Bühlmann P. Statistical significance in high-dimensional linear models. Bernoulli 2013;19:1212–42. DOI: 10.3150/12-BEJSP11.

5. Javanmard A, Montanari A. Confidence intervals and hypothesis testing for high-dimensional regression. J Mach Learn Res 2014;15:2869–909.

6. Lockhart R, Taylor J, Tibshirani RJ, Tibshirani R. A significance test for the lasso. Ann Stat 2014;42:413–68. DOI: 10.1214/13-AOS1175.

7. Meinshausen N, Meier L, Bühlmann P. p-Values for high-dimensional regression. J Am Stat Assoc 2009;104:1671–81. DOI: 10.1198/jasa.2009.tm08647.

8. van de Geer S, Bühlmann P, Ritov Y, Dezeure R. On asymptotically optimal confidence regions and tests for high-dimensional models. Ann Stat 2014;42:1166–202. DOI: 10.1214/14-AOS1221.

9. Wasserman L, Roeder K. High dimensional variable selection. Ann Stat 2009;37:2178–201. DOI: 10.1214/08-AOS646.

10. Zhang C-H, Zhang SS. Confidence intervals for low dimensional parameters in high dimensional linear models. J R Stat Soc Series B 2014;76:217–42. DOI: 10.1111/rssb.12026.

11. Bühlmann P, Kalisch M, Meier L. High-dimensional statistics with a view toward applications in biology. Annu Rev Stat Appl 2014;1:255–78. DOI: 10.1146/annurev-statistics-022513-115545.

12. Dezeure R, Bühlmann P, Meier L, Meinshausen N. High-dimensional inference: confidence intervals, p-values and R-software hdi. To appear in Statistical Science. Preprint arXiv:1408.4026v1. DOI: 10.1214/15-STS527.

13. Bühlmann P, van de Geer S. Statistics for high-dimensional data: methods, theory and applications. New York, NY: Springer Verlag, 2011. DOI: 10.1007/978-3-642-20192-9.

14. Bühlmann P, Rütimann P, van de Geer S, Zhang C-H. Correlated variables in regression: clustering and sparse estimation (with discussion). J Stat Plann Inference 2013;143:1835–71. DOI: 10.1016/j.jspi.2013.05.019.

15. Mandozzi J, Bühlmann P. Hierarchical testing in the high-dimensional setting with correlated variables. J Am Stat Assoc 2015. Published online. DOI: 10.1080/01621459.2015.1007209.

16. Meinshausen N. Group bound: confidence intervals for groups of variables in sparse high dimensional regression without assumptions on the design. J R Stat Soc Series B 2015;77:923–45. DOI: 10.1111/rssb.12094.

17. Goeman JJ, Solari A. The sequential rejection principle of familywise error control. Ann Stat 2010;38:3782–810. DOI: 10.1214/10-AOS829.

18. Goeman JJ, Finos L. The inheritance procedure: multiple testing of tree-structured hypotheses. Stat Appl Genet Mol Biol 2012;11:1–18. DOI: 10.1515/1544-6115.1554.

19. Tibshirani R. Regression shrinkage and selection via the Lasso. J R Stat Soc Series B 1996;58:267–88. DOI: 10.1111/j.2517-6161.1996.tb02080.x.

20. Bühlmann P, Mandozzi J. High-dimensional variable screening and bias in subsequent inference, with an empirical comparison. Comput Stat 2014;29:407–30. DOI: 10.1007/s00180-013-0436-3.

21. Bonferroni CE. Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del Regio Istituto Superiore di Scienze Economiche e Commerciali di Firenze 1936;8:3–62.

22. Dunn OJ. Multiple comparisons among means. J Am Stat Assoc 1961;56:52–64. DOI: 10.1080/01621459.1961.10482090.

23. Holm S. A simple sequentially rejective multiple test procedure. Scand J Stat 1979;6:65–70.

24. Meinshausen N. Hierarchical testing of variable importance. Biometrika 2008;95:265–78. DOI: 10.1093/biomet/asn007.

25. Shaffer JP. Modified sequentially rejective multiple test procedures. J Am Stat Assoc 1986;81:826–31. DOI: 10.1080/01621459.1986.10478341.

Published Online: 2016-5-26
Published in Print: 2016-5-1

©2016 by De Gruyter
