Design-based RCT estimators and central limit theorems for baseline subgroup and related analyses

Peter Z. Schochet

doi:10.1515/jci-2023-0056

Article Open Access

Published/Copyright: July 25, 2024

From the journal Journal of Causal Inference Volume 12 Issue 1

Abstract

There is a growing literature on design-based (DB) methods to estimate average treatment effects (ATEs) for randomized controlled trials (RCTs) for full sample analyses. This article extends these methods to estimate ATEs for discrete subgroups defined by pre-treatment variables, with an application to an RCT testing subgroup effects for a school voucher experiment in New York City. We consider ratio estimators for subgroup effects using regression methods, allowing for model covariates to improve precision, and prove a new finite population central limit theorem. We discuss extensions to blocked and clustered RCT designs, and to other common estimators with random treatment-control sample sizes or summed weights: post-stratification estimators, weighted estimators that adjust for data nonresponse, and estimators for Bernoulli trials. We also develop simple variance estimators that share features with robust estimators. Simulations show that the DB subgroup estimators yield confidence interval coverage near nominal levels, even for small subgroups.

Keywords: randomized controlled trials; subgroup analyses; design-based estimators; finite population central limit theorems

MSC 2010: 62K99; 62D99; 62E20; 62G20

1 Introduction

There is a growing literature on design-based (DB) methods to estimate overall average treatment effects (ATEs) for randomized controlled trials (RCTs). These nonparametric methods use the building blocks of experimental designs to generate consistent, asymptotically normal ATE estimators with minimal assumptions. The underpinnings of these methods were introduced by Neyman [1] and later developed in seminal works by Rubin [2,3] and Holland [4] using a potential outcomes framework.

To date, the DB literature has focused on ATE estimation for full sample analyses. In this article, we build on these methods to develop ATE estimators for discrete subgroups defined by pre-treatment (baseline) characteristics of study participants. Subgroup analyses for RCTs are common across fields as they can be used to assess treatment effect heterogeneity and inform decisions about how to best target and improve treatments [5,6]. Guidelines for the planning, analysis, and reporting of RCT subgroup analyses have been proposed in the literature to ensure statistical rigor, such as approaches to reduce the chances of finding spurious positive effects due to multiple testing [5,7,8].

As a motivating example, consider the evaluation of the New York City (NYC) School Choice Scholarships Program, an RCT where low-income public school students in grades K–4 could participate in a series of lotteries to receive a private school voucher for up to 3 years [9,10]. A subgroup analysis was pre-specified for the study to examine differences in voucher effects for African-American and Latino students. The hypothesis was that African Americans might benefit more from the vouchers as they tended to live in poorer communities and attend lower-performing public schools.

Several key aspects of this subgroup analysis motivate the theory underlying this article. First, the study sample was not randomly sampled from a broader population. Rather, the sample included only a very small percentage of NYC families who applied for a scholarship. Thus, the study results cannot be generalized to a broader voucher program that would involve all children in NYC or elsewhere. This setting suggests a finite population framework for estimating ATEs where the sample and their potential outcomes are considered fixed, and study results are assumed to pertain to the study sample only. This setting is common across disciplines, where RCTs often include volunteer samples of individuals and sites.

Second, the estimation strategy should allow for the inclusion of model baseline covariates to improve precision as power is often a concern for subgroup analyses due to small sample sizes. Third, the voucher study conducted randomization within strata, suggesting the need for a theory for blocked RCTs. Fourth, the study randomized families rather than students, suggesting a further need to consider a theory for clustered RCTs that are becoming increasingly prevalent across fields [11,12]. Finally, the study constructed weights to adjust for missing outcome data, a common strategy for RCT analyses that should be covered in the theory.

This article addresses these issues by developing DB ATE ratio estimators for subgroup-related analyses using regression models that allow for baseline covariates. We focus on ratio estimators due to the randomness of subgroup sizes in the treatment and control groups. We prove a new finite population central limit theorem (CLT) by building on the methods reported by Pashley [13] and Schochet et al. [14]. We also discuss extensions to blocked and clustered RCTs, and to other common estimators with random sample sizes or summed weights: post-stratification estimators, weighted estimators that adjust for data nonresponse, and estimators for Bernoulli trials (BTs). We provide consistent variance estimators that are compared to commonly used robust standard errors (SEs). Our simulations show that the DB subgroup ATE estimators yield confidence interval coverage near nominal levels, even for small subgroups. Finally, we demonstrate the methods using data from our motivating NYC voucher experiment.

The rest of this article proceeds as follows. Section 2 discusses the related literature. Section 3 provides the theoretical framework, ATE estimators and CLT results for the non-clustered RCT, and extensions. Section 4 discusses blocked and clustered RCTs. Section 5 presents simulation results, and Section 6 presents empirical results using the NYC voucher study. Section 7 concludes.

2 Related work

Our work builds on the growing literature on DB methods to estimate ATEs for full sample analyses [14–23]. These methods also pertain to subgroup analyses conditional on subgroup sizes observed in the treatment and control groups [21], but not to unconditional analyses that average over subgroup allocations.

Our work draws most directly on two studies. First, we draw on methods in Schochet et al. [14] who provide finite population CLTs for ratio estimators for blocked, clustered RCTs with general weights (using previous results in Scott and Wu [24], Li and Ding [23], and Pashley [13]). Our innovation is to adapt these methods by treating subgroup indicators as “weights” in the analysis. Second, we draw on results from the study by Miratrix et al. [25] who considered DB post-stratification estimators for overall effects, which share properties with baseline subgroup estimators. Miratrix et al. [25], however, do not consider asymptotic distributions, blocked or clustered RCT designs, the inclusion of other model covariates, or weights considered here.

Finally, there is a large statistical literature on DB methods for analyzing survey data with complex sample designs, including for estimating subpopulation means or totals [26–28]. However, these works do not consider RCT settings for estimating treatment-control differences in subpopulation means.

In what follows, we focus on the non-clustered RCT design without blocking and extensions to related estimators. We then discuss blocked and clustered designs.

3 DB subgroup analysis for non-clustered RCTs

We assume an RCT of $n$ individuals, with $n^1 = np$ assigned to the treatment group and $n^0 = n(1-p)$ assigned to the control group, where $p$ is the treatment assignment rate ($0 < p < 1$). Let $T_i$ equal 1 if person $i$ is randomly assigned to the treatment condition and 0 otherwise.

Let Y i ( 1 ) be the outcome of person i if assigned to the treatment group and Y i ( 0 ) be the outcome in the control condition. These potential outcomes can be continuous, binary, or discrete. We assume a finite population model where potential outcomes are fixed for the study.

For the subgroup analysis, we assume each sample member is allocated to a discrete category within a subgroup class with $K \ge 1$ levels. The subgroup classes (such as age or race/ethnicity groups) can be formed from continuous, categorical, or discrete variables measured at baseline, so are unaffected by the treatment. We consider estimation for each subgroup class in isolation. For a specific class, let $G_{ik}$ equal 1 for a member of subgroup (level) $k$ and 0 otherwise, for $k \in \{1, 2, \dots, K\}$. Let $n_k = \sum_{i=1}^{n} G_{ik}$ denote the number of persons in subgroup $k$, with $\sum_{k=1}^{K} n_k = n$. Finally, let $\pi_k = \bar{G}_k = n_k / n$ be the subgroup population share, with $\sum_{k=1}^{K} \pi_k = 1$.

We assume two conditions. The first is the stable unit treatment value assumption (SUTVA) [29]:

(C1): SUTVA: Let $Y_i(\mathbf{T})$ denote the potential outcome given the random vector of all treatment assignments, $\mathbf{T}$. Then, if $T_i = T_i'$ for person $i$, we have that $Y_i(\mathbf{T}) = Y_i(\mathbf{T}')$.

SUTVA allows us to express Y i ( T ) as Y i ( T i ) , so that a person’s potential outcomes depend only on the person’s treatment assignment and not on those of other persons in the sample. This condition is assumed to hold within and across subgroups. SUTVA also assumes that a particular treatment unit cannot receive different forms of the treatment.

Under SUTVA, the ATE parameter for subgroup k under the finite population model is,

(1) $\tau_k = \frac{\sum_{i=1}^{n} G_{ik}\,(Y_i(1) - Y_i(0))}{n_k} = \bar{Y}_k(1) - \bar{Y}_k(0),$

which is the mean treatment effect for members of subgroup k in the study sample.

Our second condition is complete randomization [21], where extensions to BTs are discussed in Section 3.3:

(C2): Complete randomization: For fixed $n^1$, if $\mathbf{t} = (t_1, \dots, t_n)$ is any vector of randomization realizations such that $\sum_{i=1}^{n} t_i = n^1$, then $\text{Prob}(\mathbf{T} = \mathbf{t}) = \binom{n}{n^1}^{-1}$.

This condition implies that potential outcomes are independent of treatment status, Y i ( 1 ) , Y i ( 0 ) ⫫ T i , which also holds for any baseline subgroup (e.g., males or females).
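As a small illustration of (C2), the following sketch enumerates every complete randomization for a toy sample. The sample size (6) and treated count (3) are hypothetical values chosen only to keep the enumeration short.

```python
from itertools import combinations
from math import comb


def complete_randomizations(n, n1):
    """Yield every 0/1 assignment vector with exactly n1 treated units."""
    for treated in combinations(range(n), n1):
        chosen = set(treated)
        yield tuple(1 if i in chosen else 0 for i in range(n))


n, n1 = 6, 3  # hypothetical toy design
assignments = list(complete_randomizations(n, n1))

# Under (C2), each assignment vector has the same probability.
prob_each = 1 / comb(n, n1)

# The marginal probability that unit 0 is treated equals p = n1 / n.
p_unit0 = sum(t[0] for t in assignments) * prob_each
```

Enumeration is only feasible for very small samples, but it makes the randomization distribution explicit.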

3.1 ATE estimators

Under the potential outcomes framework and SUTVA, the data generating process for the observed outcome measure, y i , is a result of the random assignment process:

(2) $y_i = T_i Y_i(1) + (1 - T_i) Y_i(0).$

This relation states that we can observe y i = Y i ( 1 ) for those in the treatment group and y i = Y i ( 0 ) for those in the control group, but not both.

Rearranging (2) generates the following nominal full sample regression model:

(3) $y_i = \alpha + \tau \tilde{T}_i + u_i,$

where $\tilde{T}_i = (T_i - p)$ is the centered treatment indicator; $\tau = \bar{Y}(1) - \bar{Y}(0)$ is the full sample ATE estimand; $\bar{Y}(t) = \frac{1}{n}\sum_{i=1}^{n} Y_i(t)$ is the mean potential outcome for $t \in \{1, 0\}$; $\alpha = p\bar{Y}(1) + (1-p)\bar{Y}(0)$ is the intercept (expected outcome); and the “error” term, $u_i$, is

$u_i = T_i\,(Y_i(1) - \bar{Y}(1)) + (1 - T_i)\,(Y_i(0) - \bar{Y}(0)).$

We center the treatment indicator in (3) to facilitate the theory without changing the estimator.

In contrast to usual formulations of the regression model, the residual, u i , is random solely due to T i [16,17,20]. This framework allows individual-level treatment effects, τ i = Y i ( 1 ) − Y i ( 0 ) , to vary across the sample, and is nonparametric because it makes no assumptions about the potential outcome distributions. The model does not satisfy key assumptions of the usual regression model: over the randomization distribution ( R ), u i is heteroscedastic, and E R ( u i ) , Cov R ( u i , u i ′ ) , and E R ( T ∼ i u i ) are nonzero if τ i varies across the sample.

The model in (3) also applies to each subgroup due to randomization. Thus, if we combine each subgroup model using the G ik indicators, we obtain the following pooled model:

(4) $y_i = \sum_{k=1}^{K} \tau_k G_{ik} \tilde{T}_i + \sum_{k=1}^{K} \alpha_k G_{ik} + \epsilon_i,$

where $\tau_k = \bar{Y}_k(1) - \bar{Y}_k(0)$ is the subgroup ATE estimand; $\alpha_k = p\bar{Y}_k(1) + (1-p)\bar{Y}_k(0)$ is the subgroup intercept; and $\epsilon_i = \sum_{k=1}^{K} G_{ik} u_{ik}$ is the error term with $u_{ik} = T_i\,(Y_i(1) - \bar{Y}_k(1)) + (1 - T_i)\,(Y_i(0) - \bar{Y}_k(0))$. We include model terms for all subgroups and exclude the grand intercept.

Consider the ordinary least squares (OLS) difference-in-means estimator for $\tau_k$ from (4) using data on the full sample:

(5) $\hat{\tau}_k = \bar{y}_k^1 - \bar{y}_k^0 = \frac{1}{n_k^1}\sum_{i=1}^{n} G_{ik} T_i Y_i(1) - \frac{1}{n_k^0}\sum_{i=1}^{n} G_{ik} (1 - T_i) Y_i(0),$

where $n_k^1 = \sum_{i=1}^{n} T_i G_{ik}$ and $n_k^0 = \sum_{i=1}^{n} (1 - T_i) G_{ik}$ are subgroup sizes in the treatment and control groups with sample shares, $\pi_k^1 = n_k^1 / n^1$ and $\pi_k^0 = n_k^0 / n^0$. We see that $\hat{\tau}_k$ is a ratio estimator because $n_k^1$ and $n_k^0$ are random variables (with hypergeometric distributions).
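To make the ratio structure of (5) concrete, the sketch below computes the subgroup difference in means for every complete randomization of a toy sample and records the realized treated subgroup sizes. The potential outcomes, subgroup indicator, and sample sizes are hypothetical; randomizations that leave the subgroup without treated or without control members are skipped.

```python
from itertools import combinations


def subgroup_ate_hat(t, g, y):
    """Difference in means within the subgroup flagged by g, as in (5).

    Returns None when the subgroup has no treated or no control members.
    """
    y1 = [yi for ti, gi, yi in zip(t, g, y) if gi == 1 and ti == 1]
    y0 = [yi for ti, gi, yi in zip(t, g, y) if gi == 1 and ti == 0]
    if not y1 or not y0:
        return None
    return sum(y1) / len(y1) - sum(y0) / len(y0)


# Hypothetical fixed potential outcomes; the first four persons form subgroup k.
Y1 = [5.0, 7.0, 6.0, 9.0, 3.0, 4.0, 8.0, 2.0]
Y0 = [4.0, 4.0, 5.0, 6.0, 3.0, 5.0, 6.0, 1.0]
G = [1, 1, 1, 1, 0, 0, 0, 0]
n, n1 = 8, 4

# Subgroup estimand from (1).
tau_k = sum(g * (a - b) for g, a, b in zip(G, Y1, Y0)) / sum(G)

estimates, treated_counts = [], set()
for treated in combinations(range(n), n1):
    t = [1 if i in treated else 0 for i in range(n)]
    y = [ti * a + (1 - ti) * b for ti, a, b in zip(t, Y1, Y0)]  # observed outcome, (2)
    est = subgroup_ate_hat(t, G, y)
    if est is not None:
        estimates.append(est)
        treated_counts.add(sum(ti * gi for ti, gi in zip(t, G)))

mean_estimate = sum(estimates) / len(estimates)
```

In this toy case the treated subgroup size takes several values across randomizations, and the average of the estimates over the retained randomizations matches the estimand.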

The finite population CLT in Theorem 4 in the study by Li and Ding [23] applies to τ ˆ k conditional on n k 1 and n k 0 , which are ancillary to (independent of) the potential outcomes. There is a long-standing debate on the merits of conditional inference in such settings [30]. In our DB RCT context, we view repeated sampling over the randomization distribution as applying to the full sample, which leads to random subgroup allocations and the need for unconditional inference to capture what “could” occur. In contrast, a conditional analysis measures what “did” occur and parallels a blocked subgroup design with fixed subgroup sizes. Accordingly, we focus on an unconditional CLT for τ ˆ k , but compare variance estimators using both approaches in our theory and simulations. The key difference is that an unconditional analysis leads to nonlinear ratio estimators with random numerators and denominators (subgroup sizes), which complicates the asymptotic analysis.

Our CLT is provided in Section 3.2 for a more general covariate-adjusted estimator from a working model that includes in (4) a $1 \times V$ vector of fixed, baseline covariates other than the subgroup indicators, $x_i$, with parameter vector, $\beta$:

(6) $y_i = \sum_{k=1}^{K} \tau_k G_{ik} \tilde{T}_i + \sum_{k=1}^{K} \alpha_k G_{ik} + \tilde{x}_i \beta + e_i,$

where $\tilde{x}_i = (x_i - \sum_{k=1}^{K} G_{ik} \bar{x}_k)$ are centered covariates; $\bar{x}_k = \frac{1}{n_k}\sum_{i=1}^{n} G_{ik} x_i$ are subgroup covariate means; and $e_i$ is the error term. While the covariates do not enter the true RCT model in (4) and the ATE estimands do not change, they will increase precision to the extent they are correlated with the potential outcomes. We do not need to assume that the true conditional distribution of $y_i$ given $x_i$ is linear in $x_i$. We define $\beta$ in Section 3.2.

We focus on the pooled covariate model in (6) because it is commonly used in practice. In Section 3.3, we discuss extensions to models that interact T ∼ i with x ∼ i and G ik .

Using OLS to estimate the working model in (6) yields the following covariate-adjusted estimator for τ k that is produced by standard OLS statistical packages:

(7) $\hat{\tau}_k^{x} = (\bar{y}_k^1 - \bar{y}_k^0) - (\bar{x}_k^1 - \bar{x}_k^0)\,\hat{\beta},$

where $\bar{x}_k^1$ and $\bar{x}_k^0$ are subgroup covariate means for treatments and controls, and $\hat{\beta}$ is the OLS estimator for $\beta$ (see [23] for a parallel result for full sample analyses).
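The following sketch evaluates (7) for a single covariate without a regression package. It relies on a standard partialling-out argument: because the subgroup-by-treatment terms in (6) span the subgroup-by-arm cell indicators, the OLS slope on the covariate equals the pooled within-cell slope. The function name, labels, and data are hypothetical, and every cell is assumed to be non-empty.

```python
def covariate_adjusted_ates(t, k, x, y):
    """Covariate-adjusted subgroup ATEs from (7) for one covariate.

    t: 0/1 treatment indicators, k: subgroup labels,
    x: baseline covariate, y: observed outcomes.
    """
    cells = {}
    for ti, ki, xi, yi in zip(t, k, x, y):
        cells.setdefault((ki, ti), []).append((xi, yi))
    means = {
        cell: (sum(xi for xi, _ in obs) / len(obs),
               sum(yi for _, yi in obs) / len(obs))
        for cell, obs in cells.items()
    }
    # Pooled within-cell slope (the OLS coefficient on the covariate).
    sxy = sxx = 0.0
    for cell, obs in cells.items():
        mx, my = means[cell]
        for xi, yi in obs:
            sxy += (xi - mx) * (yi - my)
            sxx += (xi - mx) ** 2
    beta_hat = sxy / sxx
    ates = {}
    for label in sorted(set(k)):
        (x1, y1), (x0, y0) = means[(label, 1)], means[(label, 0)]
        ates[label] = (y1 - y0) - (x1 - x0) * beta_hat
    return ates, beta_hat


# Hypothetical data in which outcomes are exactly linear in x within cells.
t = [1, 0, 1, 0, 1, 0, 1, 0]
k = ["a", "a", "a", "a", "b", "b", "b", "b"]
x = [1.0, 2.0, 4.0, 3.0, 0.0, 5.0, 2.0, 1.0]
intercept = {"a": 1.0, "b": 3.0}
effect = {"a": 1.5, "b": -0.5}
y = [intercept[ki] + 2.0 * xi + effect[ki] * ti for ti, ki, xi in zip(t, k, x)]

ates, beta_hat = covariate_adjusted_ates(t, k, x, y)
```

With noiseless linear outcomes the adjustment removes the covariate imbalance exactly, so the estimates recover the subgroup effects built into the toy data.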

Our DB theory is conditional on randomizations that yield $n_k^1 > 0$ and $n_k^0 > 0$ so that $\hat{\tau}_k$ and its variance can be defined [25]. These restrictions yield subgroup allocation distributions that are truncated, but these effects disappear as $n$ and $n_k$ increase (where we assume $p = n^1/n$ and $\pi_k = n_k^1/n^1$ have finite limits). To see this, consider the general case where $n_k \le n^1$ and $n_k \le n^0$ so that either $n_k^1$ or $n_k^0$ can equal 0, and define $a_k^1 = E(n_k^1 \mid 0 < n_k^1 < n_k)$ as the expected value of the associated Truncated hypergeometric$(n, n^1, n_k)$ distribution. We can express $a_k^1$ in terms of the non-truncated expectation, $n_k p$, using $a_k^1 = (n_k p f)\left[\frac{a_k^1}{n_k p f}\right]$, where $(n_k p f)$ is the mean of the Truncated binomial$(n_k, p)$ distribution with $f = \frac{1 - p^{(n_k - 1)}}{1 - (1-p)^{n_k} - p^{n_k}}$ (derivation not shown). As $n$ and $n_k$ increase, $a_k^1$ converges to $n_k p$ because $\frac{a_k^1}{n_k p f}$ converges to 1 (as a hypergeometric mean converges to a binomial mean) and $f$ also converges to 1. A similar argument holds for $n_k^0$ and when $n_k > n^1$ or $n_k > n^0$ (where there is no truncation if both hold). Relatedly, in finite samples, the restrictions are likely to have little effect on our results as they will hold with probability near 1 for typical subgroup analyses. For instance, even for a very small subgroup with $n_k = 12$, $n = 40$, and $p = 0.5$, the restrictions will hold with probability 0.9999. Thus, to simplify notation, we omit the conditioning on positive subgroup allocations to each research group and use the unconditional expectations, $n_k p$ and $n_k(1-p)$, in the analysis.
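The probability quoted above can be checked directly from the hypergeometric distribution of the treated subgroup size. The sketch below does this for the stated values (subgroup size 12, sample size 40, 20 treated) and also evaluates the truncated binomial factor $f$; the function names are illustrative only.

```python
from math import comb


def prob_both_arms_nonempty(n, n1, nk):
    """P(n_k^1 > 0 and n_k^0 > 0) under complete randomization."""
    total = comb(n, n1)
    # No subgroup member treated: all n1 treated units come from outside the subgroup.
    p_no_treated = comb(n - nk, n1) / total
    # No subgroup member in control: all nk subgroup members are treated.
    p_no_control = comb(n - nk, n1 - nk) / total if n1 >= nk else 0.0
    return 1.0 - p_no_treated - p_no_control


def truncated_binomial_factor(nk, p):
    """Factor f relating the truncated binomial mean to n_k * p."""
    return (1 - p ** (nk - 1)) / (1 - (1 - p) ** nk - p ** nk)


prob = prob_both_arms_nonempty(n=40, n1=20, nk=12)
f_half = truncated_binomial_factor(nk=12, p=0.5)
```

For an assignment rate of 0.5 the truncation is symmetric, so $f$ equals 1 and the truncated mean coincides with $n_k p$.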

3.2 Main CLT result

To consider the asymptotic properties of τ ˆ k x (which also apply to τ ˆ k without covariates), we consider a hypothetical increasing sequence of finite populations where n → ∞ . Parameters should be subscripted by n , but we omit this notation for simplicity. We assume that n 1 / n → p * as n → ∞ , so the numbers of treatments and controls both increase with n . In addition, we assume that n k / n → π k * for all k , where π k * > 0 and ∑ k = 1 K π k * = 1 . This implies that each subgroup also grows with n , where the number of subgroups, K , is assumed fixed.

Our CLT builds on Schochet et al. [14] who provided CLTs for RCT ratio estimators with general weights for clustered, blocked designs. We adapt these methods to our setting by treating the subgroup indicators, G ik , as “weights” when computing the subgroup sample means.

Before presenting our CLT, we need to define several terms. First, for $t \in \{1, 0\}$, let $\varepsilon_{ik}(t) = (Y_i(t) - \bar{Y}_k(t) - \tilde{x}_i \beta)$ denote model residuals for subgroup $k$, where $R_{ik}(t) = \frac{G_{ik}}{\pi_k}\varepsilon_{ik}(t)$ are scaled residuals using the normalized weights, $w_i / \bar{w} = G_{ik} / \pi_k$, that sum to $n$. Second, let $S^2_{R_k}(t) = \frac{1}{n-1}\sum_{i=1}^{n} R_{ik}^2(t)$ denote the variance of $R_{ik}(t)$, and let $S^2_{R_k}(1,0) = \frac{1}{n-1}\sum_{i=1}^{n} R_{ik}(1) R_{ik}(0)$ denote the treatment-control covariance. Third, we define $\bar{D}_k$ as the mean treatment-control difference in the $R_{ik}(t)$ residuals, with associated variance:

(8) $\text{Var}(\bar{D}_k) = \frac{S^2_{R_k}(1)}{n^1} + \frac{S^2_{R_k}(0)}{n^0} - \frac{S^2(\tau_k)}{n},$

where $S^2(\tau_k) = \frac{1}{n-1}\sum_{i=1}^{n} (R_{ik}(1) - R_{ik}(0))^2$ is the heterogeneity of treatment effects. Fourth, we define the variance of $G_{ik}$ as $S^2(G_k) = \frac{1}{n-1}\sum_{i=1}^{n} \frac{1}{\pi_k^2}(G_{ik} - \pi_k)^2 = \frac{n}{(n-1)}\frac{(1 - \pi_k)}{\pi_k}$. Fifth, we require the variances of each covariate, $S^2_{x_k,v} = \frac{1}{n-1}\sum_{i=1}^{n} \frac{G_{ik}}{\pi_k^2}([\tilde{x}_i]_v)^2$ for $v \in \{1, \dots, V\}$, and the full variance-covariance matrix for the covariates, $S^2_{x,k} = \frac{1}{n}\sum_{i=1}^{n} \frac{G_{ik}}{\pi_k}\tilde{x}_i' \tilde{x}_i$. Finally, we need two outcome-covariate variance-covariance matrices: $S^2_{x,Y,k}(t) = \frac{1}{n}\sum_{i=1}^{n} \frac{G_{ik}}{\pi_k}\tilde{x}_i' Y_i(t)$ and $S^2_{xY,k}(t) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{\pi_k}(G_{ik}\tilde{x}_i' Y_i(t) - \bar{\theta}_k)^2$, where $\bar{\theta}_k = \frac{1}{n}\sum_{i=1}^{n} G_{ik}\tilde{x}_i' Y_i(t)$ is the mean covariance.

We now present our CLT, proved in Supplementary Materials S1.

Theorem 1

Assume (C1), (C2), and the following conditions for t ∈ { 1 , 0 } and k ∈ { 1 , … , K } , with fixed K ≥ 1 :

(C3) Letting $g_k(t) = \max_{1 \le i \le n}\{R_{ik}^2(t)\}$, as $n \to \infty$,

$\frac{1}{(n^t)^2}\,\frac{g_k(t)}{\text{Var}(\bar{D}_k)} \to 0.$

(C4) $f^1 = n^1/n$ and $f^0 = n^0/n$ have limiting values, $p^*$ and $(1 - p^*)$, for $0 < p^* < 1$.

(C5) The subgroup shares, $n_k/n$, converge to $\pi_k^*$ for $0 < \pi_k^* < 1$ and $\sum_{k=1}^{K} \pi_k^* = 1$.

(C6) As $n \to \infty$,

$(1 - f^t)\,\frac{S^2(G_k)}{n^t} \to 0.$

(C7) Letting $h_v(t) = \max_{1 \le i \le n} \frac{G_{ik}}{\pi_k}([\tilde{x}_i]_v)^2$ for all $v \in \{1, \dots, V\}$, as $n \to \infty$,

$\frac{1}{\min(n^1, n^0)}\,\frac{h_v(t)}{S^2_{x,v}} \to 0.$

(C8) $S^2_{R_k}(t)$, $S^2_{R_k}(1,0)$, $S^2_{x_k,v}$, $S^2_{x,k}$, $S^2_{x,Y,k}(t)$, and $S^2_{xY,k}(t)$ have finite limiting values.

Then, as $n \to \infty$, $\hat{\tau}_k^{x}$ is a consistent estimator for $\tau_k$, and

$\frac{\hat{\tau}_k^{x} - (\bar{Y}_k(1) - \bar{Y}_k(0))}{\sqrt{\text{Var}(\bar{D}_k)}} \xrightarrow{d} N(0, 1),$

where $\text{Var}(\bar{D}_k)$ is defined as in (8).

Remark 1. The $\text{Var}(\bar{D}_k)$ expression in (8) is difficult to interpret because $S^2_{R_k}(t)$ and $S^2(\tau_k)$ are scaled by $\pi_k^2(n-1)$ to facilitate the theory. To address this, we apply the following relations in (8): $n^1 \pi_k^2 (n-1) = n_k p\,(n_k - \pi_k)$ and $n^0 \pi_k^2 (n-1) = n_k (1-p)(n_k - \pi_k)$, which yields,

(9) $\text{Var}(\bar{D}_k) = \phi_k \left[\frac{\Omega^2_{R_k}(1)}{n_k p} + \frac{\Omega^2_{R_k}(0)}{n_k (1-p)} - \frac{\Omega^2(\tau_k)}{n_k}\right],$

where $\Omega^2_{R_k}(t) = \frac{1}{n_k - 1}\sum_{i=1}^{n} G_{ik}\,\varepsilon_{ik}^2(t)$; $\Omega^2(\tau_k) = \frac{1}{n_k - 1}\sum_{i=1}^{n} G_{ik}\,(\varepsilon_{ik}(1) - \varepsilon_{ik}(0))^2$; and $\phi_k = (n_k - 1)/(n_k - \pi_k) \le 1$ is a correction term that reflects the single treatment indicator “shared” by each subgroup (and can be ignored as it converges to 1). The $\Omega^2_{R_k}(t)$ and $\Omega^2(\tau_k)$ terms are population variances for those in subgroup $k$, and $n_k p$ and $n_k(1-p)$ are expected subgroup sizes in the two research groups. This variance expression is more intuitive as it parallels the full sample asymptotic results in Li and Ding [23], the key difference being that (9) is based on expected subgroup sizes rather than actual ones. Note that for $\phi_k = 1$, (9) is the same as for an RCT that stratifies on subgroup $k$ to select fixed subgroup sample sizes, $n_k p$ and $n_k(1-p)$.
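The equivalence of the two variance expressions can be verified numerically. The sketch below evaluates (8) and (9) for the model without covariates (so the residuals are deviations from subgroup means) using hypothetical potential outcomes; the function name and data are illustrative only.

```python
def variance_forms(Y1, Y0, G, p):
    """Return Var(D_k) computed from (8) and from (9), with beta = 0."""
    n, nk = len(G), sum(G)
    pik = nk / n
    m1 = sum(g * a for g, a in zip(G, Y1)) / nk
    m0 = sum(g * b for g, b in zip(G, Y0)) / nk
    # G_ik * epsilon_ik(t): residuals, zeroed outside the subgroup.
    e1 = [g * (a - m1) for g, a in zip(G, Y1)]
    e0 = [g * (b - m0) for g, b in zip(G, Y0)]

    # Expression (8), based on the scaled residuals R_ik(t) = G_ik * eps / pi_k.
    s1 = sum((e / pik) ** 2 for e in e1) / (n - 1)
    s0 = sum((e / pik) ** 2 for e in e0) / (n - 1)
    s_tau = sum(((a - b) / pik) ** 2 for a, b in zip(e1, e0)) / (n - 1)
    v8 = s1 / (n * p) + s0 / (n * (1 - p)) - s_tau / n

    # Expression (9), based on subgroup variances and expected subgroup sizes.
    o1 = sum(e ** 2 for e in e1) / (nk - 1)
    o0 = sum(e ** 2 for e in e0) / (nk - 1)
    o_tau = sum((a - b) ** 2 for a, b in zip(e1, e0)) / (nk - 1)
    phi_k = (nk - 1) / (nk - pik)
    v9 = phi_k * (o1 / (nk * p) + o0 / (nk * (1 - p)) - o_tau / nk)
    return v8, v9


# Hypothetical potential outcomes; the first four persons form subgroup k.
Y1 = [5.0, 7.0, 6.0, 9.0, 3.0, 4.0, 8.0, 2.0]
Y0 = [4.0, 4.0, 5.0, 6.0, 3.0, 5.0, 6.0, 1.0]
G = [1, 1, 1, 1, 0, 0, 0, 0]
v8, v9 = variance_forms(Y1, Y0, G, p=0.5)
```

Both forms require the full set of potential outcomes, so this is a check on the algebra rather than an estimator that could be applied to observed data.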

Remark 2. The first two terms in (9) pertain to separate variances for the two research groups because we allow for heterogeneous treatment effects. The third term pertains to the treatment-control covariance, Ω R k 2 ( 1,0 ) , expressed in terms of the heterogeneity of treatment effects, Ω 2 ( τ k ) , which cannot be identified from the data but can be bounded [31].

Remark 3. (C3) and (C7) are Lindeberg-type conditions from Li and Ding [23] that control the tails of the potential outcome and covariate distributions. (C6) yields a weak law of large numbers for the observed subgroup shares so that π k t / π k → p 1 (using Theorem B in Scott and Wu [24]). This condition is used to account for the randomness in n k t , because, for instance, it allows us to express the sample mean for the treatment group as np n k 1 ∑ i = 1 n G ik T i Y i ( 1 ) np , where the bracketed term converges to 1 by (C6) and the denominator in the summation term is fixed, so we can apply the CLT results in Li and Ding [23] and Slutsky’s theorem. While (C6) is implied by (C4) and (C5), it facilitates the addition of other weights (Section 3.3). (C8) specifies limiting values of the variances and variance-covariance matrices.

Remark 4. Theorem 1 is proved in two stages by expressing the ATE estimator in (7) as,

(10) $\hat{\tau}_k^{x} = \hat{\tau}_k^{x,\beta} - (\bar{x}_k^1 - \bar{x}_k^0)(\hat{\beta} - \beta),$

where $\hat{\tau}_k^{x,\beta} = (\bar{y}_k^1 - \bar{y}_k^0) - (\bar{x}_k^1 - \bar{x}_k^0)\,\beta$ and $\beta = \left(\sum_{k=1}^{K} \pi_k S^2_{x,k}\right)^{-1}\left[\sum_{k=1}^{K} p\,\pi_k S^2_{x,Y,k}(1) + \sum_{k=1}^{K} (1-p)\,\pi_k S^2_{x,Y,k}(0)\right]$ is assumed known. This $\beta$ parameter is the (hypothetical) population OLS coefficient that would result from a regression of $[p Y_i(1) + (1-p) Y_i(0)]$ on the covariates. In the first stage, we obtain a CLT for $\hat{\tau}_k^{x,\beta}$. In the second stage, we prove that $\hat{\tau}_k^{x}$ has the same asymptotic distribution as $\hat{\tau}_k^{x,\beta}$ by showing that $(\bar{x}_k^1 - \bar{x}_k^0)(\hat{\beta} - \beta) = o_p(n^{-1/2})$, which holds under our conditions because $\bar{x}_k^1$ and $\bar{x}_k^0$ are both asymptotically normal and $\hat{\beta} - \beta \xrightarrow{p} 0$.

Remark 5. Under (C1)–(C6), Theorem 1 also applies to $\hat{\tau}_k$ for the model without covariates by setting $\beta = 0$ in (6) and in the residuals, $\varepsilon_{ik}(t)$ and $R_{ik}(t)$, which enter the variances in (8) and (9). In Supplement S2.1, we prove that $\hat{\tau}_k$ is unbiased using the approach reported by Miratrix et al. [25] and Schochet [18], where we condition on $n_k^1 > 0$ and $n_k^0 > 0$ and then average over possible subgroup allocations ($A$) to the two research groups to show that $E_R(\hat{\tau}_k) = E_A E_R(\hat{\tau}_k \mid n_k^1, n_k^0) = \tau_k$. Similarly, using the law of total variance, Supplement S2.1 shows that

(11) $\text{Var}_R(\bar{D}_k) = E_A\!\left(\frac{1}{n_k^1}\right)\Omega^2_{R_k}(1) + E_A\!\left(\frac{1}{n_k^0}\right)\Omega^2_{R_k}(0) - \frac{\Omega^2(\tau_k)}{n_k},$

where $\Omega^2_{R_k}$ and $\Omega^2$ are defined in (9) with $\beta = 0$. Note that $\lim_{n \to \infty} E_A\!\left(\frac{1}{n_k^1}\right) = \frac{1}{E_A(n_k^1)} = \frac{1}{n_k p}$ and $\lim_{n \to \infty} E_A\!\left(\frac{1}{n_k^0}\right) = \frac{1}{n_k(1-p)}$, which aligns with (9) in large samples. In finite samples, however, the variance in (11) is at least as large as in (9) because $E_A\!\left(\frac{1}{n_k^1}\right) \ge \frac{1}{n_k p}$ and $E_A\!\left(\frac{1}{n_k^0}\right) \ge \frac{1}{n_k(1-p)}$ by Jensen’s inequality. Our simulations include both sets of sample sizes (Section 5).
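The size of the Jensen gap can be computed exactly for a given design. The sketch below evaluates the expected inverse treated subgroup size over the truncated hypergeometric allocation distribution and compares it with the inverse of the expected size; the design values (sample of 40, 20 treated, subgroup of 12) are borrowed from the small-subgroup illustration in Section 3.1, and the function name is hypothetical.

```python
from math import comb


def expected_inverse_treated_size(n, n1, nk):
    """E_A[1 / n_k^1] over allocations with 0 < n_k^1 < n_k."""
    weights = {
        m: comb(nk, m) * comb(n - nk, n1 - m)
        for m in range(1, nk)
        if 0 <= n1 - m <= n - nk  # keep only feasible allocations
    }
    total = sum(weights.values())
    return sum(w / m for m, w in weights.items()) / total


n, n1, nk = 40, 20, 12
p = n1 / n
e_inv = expected_inverse_treated_size(n, n1, nk)
inv_expected = 1 / (nk * p)
gap_ratio = e_inv / inv_expected
```

A ratio above 1 indicates how much the first term of (11) exceeds its counterpart in (9) for this design.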

The following corollary to Theorem 1, proved in Supplement S1, provides the joint asymptotic distribution of the subgroup estimators, ( τ ˆ 1 x , … , τ ˆ K x ) .

Corollary 1

Under the conditions of Theorem 1, as $n \to \infty$, the ATE estimators, $\hat{\tau}_k^{x}$ and $\hat{\tau}_{k'}^{x}$ for two distinct subgroups $k \ne k'$, are asymptotically independent, for $k, k' \in \{1, \dots, K\}$. Further, the joint asymptotic distribution of the $K$ subgroup ATE estimators, $(\hat{\tau}_1^{x}, \dots, \hat{\tau}_K^{x})$, is multivariate normal.

This corollary is important for real-world applications because it supports the use of standard F-tests (or chi-square tests) to test the null hypothesis of equal subgroup effects.
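One way such a test could be implemented is sketched below, assuming independent, approximately normal subgroup estimators with known variances. The statistic is the usual precision-weighted sum of squared deviations from the pooled effect, referred to a chi-square distribution with $K - 1$ degrees of freedom. The estimates and variances are hypothetical, and the closed-form p-value shown applies only to the two-subgroup case.

```python
from math import erfc, sqrt


def equal_effects_statistic(estimates, variances):
    """Chi-square statistic for the null of equal subgroup effects.

    Assumes independent estimators; returns (statistic, degrees of freedom).
    """
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    return q, len(estimates) - 1


# Hypothetical subgroup ATE estimates and variance estimates.
q, df = equal_effects_statistic([2.0, 0.5], [0.25, 0.25])

# For df == 1 the chi-square tail probability has a closed form.
p_value = erfc(sqrt(q / 2.0))
```

With more than two subgroups the statistic would be compared with a chi-square critical value for the corresponding degrees of freedom.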

3.3 Extensions to related estimators

This section outlines extensions of our CLT result to post-stratification estimators, models that interact T ∼ i with x ∼ i and G ik , BTs, and the use of nonresponse weights to adjust for missing outcome data.

Post-stratification estimators. Miratrix et al. [25] considered variance estimation for a DB post-stratification ATE estimator that obtains overall effects for the model without covariates by averaging $\hat{\tau}_k$ across subgroups. With model covariates, we can express this estimator as, $\hat{\tau}_{PS}^{x} = \frac{1}{n}\sum_{k=1}^{K} n_k \hat{\tau}_k^{x}$. Corollary 1 from above can then be applied to yield a new CLT for this estimator: $\frac{\hat{\tau}_{PS}^{x} - (\bar{Y}(1) - \bar{Y}(0))}{\sqrt{\text{Var}(\bar{D}_{PS})}} \xrightarrow{d} N(0, 1)$, where $\text{Var}(\bar{D}_{PS}) = \frac{1}{n^2}\sum_{k=1}^{K} n_k^2\,\text{Var}(\bar{D}_k)$.
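The post-stratification estimator and its variance are simple size-weighted combinations of the subgroup quantities, as the following sketch shows. The subgroup sizes, estimates, and variances are hypothetical inputs.

```python
def post_stratified(estimates, variances, sizes):
    """Post-stratification ATE estimate and its variance.

    estimates: subgroup ATE estimates, variances: Var(D_k) values,
    sizes: subgroup sizes n_k (summing to n).
    """
    n = sum(sizes)
    tau_ps = sum(nk * est for nk, est in zip(sizes, estimates)) / n
    var_ps = sum(nk ** 2 * v for nk, v in zip(sizes, variances)) / n ** 2
    return tau_ps, var_ps


# Hypothetical inputs for two subgroups.
tau_ps, var_ps = post_stratified(
    estimates=[2.0, 0.5], variances=[0.25, 0.25], sizes=[60, 40]
)
```

The variance formula omits covariance terms because the subgroup estimators are asymptotically independent.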

Interacted models. Theorem 1 can be extended to a model that replaces $\tilde{x}_i \beta$ in (6) with the interaction terms, $\sum_{k=1}^{K} G_{ik}(1 - T_i)\,\tilde{x}_i \beta_k^0$ and $\sum_{k=1}^{K} G_{ik} T_i\,\tilde{x}_i \beta_k^1$, which allows covariate effects to differ by subgroup and treatment status. The ATE estimator for this model is, $\hat{\tau}_k^{x,GT} = [\bar{y}_k^1 - (\bar{x}_k^1 - \bar{x}_k)\hat{\beta}_k^1] - [\bar{y}_k^0 - (\bar{x}_k^0 - \bar{x}_k)\hat{\beta}_k^0]$. Theorem 1 can then be applied by redefining the residuals as, $R_{ik}(t) = \frac{G_{ik}}{\pi_k}(Y_i(t) - \bar{Y}_k(t) - (x_i - \bar{x}_k)\beta_k^t)$, where $\beta_k^t = (S^2_{x,k})^{-1} S^2_{x,Y,k}(t)$. The proof (not shown) follows using the same arguments as in Remark 4 by replacing (10) with $\hat{\tau}_k^{x,GT} = \hat{\tau}_k^{x,GT,\beta} - (\bar{x}_k^1 - \bar{x}_k)(\hat{\beta}_k^1 - \beta_k^1) + (\bar{x}_k^0 - \bar{x}_k)(\hat{\beta}_k^0 - \beta_k^0)$, where $\hat{\tau}_k^{x,GT,\beta}$ uses the known $\beta_k^t$ in place of $\hat{\beta}_k^t$, and noting that under our regularity conditions, $\hat{\beta}_k^t - \beta_k^t \xrightarrow{p} 0$ and $(\bar{x}_k^t - \bar{x}_k)(\hat{\beta}_k^t - \beta_k^t) = o_p(n^{-1/2})$ for $t \in \{1, 0\}$, so that the combined correction term is also $o_p(n^{-1/2})$.

BTs. Our results also extend to BTs where each sample member is independently randomized to the treatment group with probability p , leading to random treatment-control sizes. This design pertains, e.g., to an RCT with rolling study intake. First, we can show that our result in Remark 5 on the unbiasedness of τ ˆ k for the model without covariates also applies to BTs. To see this, consider the full sample analysis. Then, Bernoulli sampling has the same properties as a two-stage design that first randomly selects n 1 from a truncated binomial distribution, and then selects a simple random sample of size n 1 to the treatment group [32]. Thus, this setting parallels the one in Remark 5 that calculates sample moments by first conditioning on subgroup sizes. The only difference is that E A is now taken over a truncated binomial rather than truncated hypergeometric distribution. The same argument applies to the subgroup analysis as for the full sample analysis.

Second, we can adapt the CLT in Theorem 1 to BTs by using expected rather than actual sample sizes in the theorem conditions (i.e., by replacing $n^1$ and $n^0$ with $np$ and $n(1-p)$). To see this, consider again the full sample analysis, omitting the $k = 1$ subscript for simplicity. Let $p^1$ be the observed treatment share, and express the observed mean outcomes as, $\bar{y}^1 = \bar{y}_{BT}^1 g^1$ and $\bar{y}^0 = \bar{y}_{BT}^0 g^0$, where $\bar{y}_{BT}^1 = \frac{1}{np}\sum_{i=1}^{n} T_i y_i$ and $\bar{y}_{BT}^0 = \frac{1}{n(1-p)}\sum_{i=1}^{n} (1 - T_i) y_i$ are divided by expected sample sizes rather than actual ones, and $g^1 = \frac{p}{p^1}$ and $g^0 = \frac{(1-p)}{(1-p^1)}$. Note that $g^t \xrightarrow{p} 1$ for $t \in \{1, 0\}$, so $\bar{y}^t$ and $\bar{y}_{BT}^t$ have the same asymptotic distributions by Slutsky’s theorem. Then, under our conditions, Theorem 4 in the study by Li and Ding [23] provides a CLT for $(\bar{y}_{BT}^1 - \bar{y}_{BT}^0)$, and the same proof as in Supplement S1.2 for Theorem 1 extends this CLT to $(\bar{y}^1 - \bar{y}^0)$. A similar approach yields a CLT for the covariate-adjusted ATE estimator using the variance in (9) by applying $\bar{x}^t = \bar{x}_{BT}^t g^t$ and noting that the asymptotic properties of $\hat{\beta}$ do not change from Theorem 1. The same argument applies to the subgroup analysis.

Data nonresponse weights. Theorem 1 also applies, with additional assumptions, to a “subgroup” analysis that adjusts for missing outcome data using respondents only with nonresponse weights, w i r . To show this, let R i ( T i ) denote an indicator of potential data response in the treatment or control condition, where R i ( T i ) = 1 for a respondent and 0 for a nonrespondent. Further, let r i = R i ( T i ) denote the observed response indicator. If baseline covariates are available for the full sample, a common approach is to set w i rt = 1 / e t ( x i ) as the weight for a respondent in research group t ∈ { 1,0 } , where e t ( x i ) = Pr ( r i = 1 | x i , T i = t ) is the propensity score [33].

We invoke two missing data assumptions for each subgroup: (i) data are missing at random for each research group conditional on covariates [34]: Y i ( 1 ) , Y i ( 0 ) ⫫ r i | T i = t , G ik = 1 , x i ; and (ii) data response and treatment status are independent conditional on the covariates: r i ⫫ T i | G ik = 1 , x i . The first ignorability (selection-on-observables) condition – which is commonly invoked for observational studies using inverse probability weighting methods [35,36] – ensures that weighted ATE estimators using the respondent sample will consistently estimate τ k . The second condition implies that the response, r i = R i ( T i ) = R i , will be identical in the treatment and control conditions, so that e t ( x i ) = e ( x i ) and w i rt = w i r are independent of t . Stated differently, this condition implies that respondents and their weights are randomly allocated to the two research groups. Thus, the summed weights, ∑ i : T i = t w i r , which enter the denominators of the weighted differences-in-means estimators become random, which parallels the subgroup analysis from above with random subgroup sizes in the two research groups.

Accordingly, we can apply Theorem 1, assuming known weights, where the respondent sample and weighted least squares are used to obtain the ATE estimator, τ ˆ k r x , using (6). We assume e ( x i ) is known and converges to e * ( x i ) as n → ∞ , where 0 < e * ( x i ) ≤ 1 for all x i in its finite population support so that (C6) holds. For the proof, we replace the weights, w i = G ik , in the theorem with w i = G ik r i w i r to define the variables and regularity conditions. The resulting variance for the CLT has the same form as (9) but is based on expected subgroup respondent sizes (Supplement S2.2). Developing a finite population CLT that allows for estimated nonresponse weights rather than known weights is a topic for future research.
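For the model without covariates, the weighted estimator reduces to a difference in weighted means among subgroup respondents, with the summed weights in each arm as denominators. The sketch below illustrates this with known response propensities; all data values and the function name are hypothetical, and outcomes for nonrespondents are left missing.

```python
def weighted_subgroup_ate(t, g, r, w, y):
    """Weighted difference in means among respondents in one subgroup.

    t: 0/1 treatment, g: 0/1 subgroup indicator, r: 0/1 response indicator,
    w: nonresponse weights, y: outcomes (None for nonrespondents).
    """
    def weighted_mean(arm):
        rows = [(wi, yi) for ti, gi, ri, wi, yi in zip(t, g, r, w, y)
                if gi == 1 and ri == 1 and ti == arm]
        total_weight = sum(wi for wi, _ in rows)  # random summed weights
        return sum(wi * yi for wi, yi in rows) / total_weight

    return weighted_mean(1) - weighted_mean(0)


# Hypothetical data with known response propensities e(x_i).
t = [1, 1, 1, 0, 0, 0]
g = [1, 1, 1, 1, 1, 1]
r = [1, 1, 0, 1, 1, 0]
propensity = [0.5, 1.0, 0.5, 0.5, 1.0, 1.0]
w = [1.0 / e for e in propensity]
y = [4.0, 6.0, None, 1.0, 3.0, None]

estimate = weighted_subgroup_ate(t, g, r, w, y)
```

The summed weights play the same role as the random subgroup sizes in (5), which is why the same ratio-estimator theory applies.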

3.4 Variance estimation

To obtain consistent variance estimators for (9) (and model variants), we can either use expected subgroup sizes, n k p and n k ( 1 − p ) , or actual ones, n k 1 and n k 0 , as for the conditional analysis. Our simulations find very similar results using either approach (Section 5). This occurs because the difference between the hypergeometric random variable, π k t , and its expected value, π k , decreases exponentially with n t [37,38]. For instance, Figure 1a shows that for modest n t , there is a high probability that | π k t − π k | / π k ≤ c for small c (defined as 10 or 20%).
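The concentration of the observed subgroup share around its expected value can be checked numerically. The sketch below assumes that π k t is the share of subgroup k among the n t units assigned to research group t, so that the subgroup count in that group is hypergeometric; the function name and the sample sizes are illustrative and are not taken from Figure 1a.

```python
from math import comb


def prob_share_within(n, n_k, n_t, c):
    """P(|pi_k^t - pi_k| / pi_k <= c) under complete randomization.

    n: total sample size, n_k: subgroup size, n_t: size of research group t.
    The subgroup count in group t is Hypergeometric(n, n_k, n_t).
    """
    pi_k = n_k / n
    total = comb(n, n_t)
    prob = 0.0
    lowest = max(0, n_t - (n - n_k))
    highest = min(n_k, n_t)
    for count in range(lowest, highest + 1):
        pi_kt = count / n_t
        if abs(pi_kt - pi_k) <= c * pi_k + 1e-12:  # small tolerance for ties
            prob += comb(n_k, count) * comb(n - n_k, n_t - count) / total
    return prob


# Illustrative comparison: same shares, two sample sizes, c = 20%.
p_small = prob_share_within(n=100, n_k=50, n_t=50, c=0.2)
p_large = prob_share_within(n=400, n_k=200, n_t=200, c=0.2)
```

Under these assumptions the probability moves toward 1 as the sample grows, which is the pattern the text attributes to Figure 1a.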

Figure 1

(a) Probabilities for the differences, ( π k t – π k ) , relative to π k , and (b) SE ratios using actual to expected subgroup sizes, as a function of δ 1 / 5 n k for δ 1 = ( n k 1 – n k p ) . Note: See text for definitions and formulas. (b) assumes n = 100 , φ = 1.1 , and ϑ = 0.05 . SE = Standard error.

Further, Figure 1b shows that for small c , the ratios of SEs using actual to expected subgroup sizes in (9) are close to 1, leading to similar confidence interval coverage. For example, for n = 100 , p = 0.5 , and π k = 0.5 , the ratios range only from 1 to 1.026 as c ranges from 0 to 0.2 (assuming Ω R k 2 ( 1 ) = φ Ω R k 2 ( 0 ) and Ω 2 ( τ k ) = ϑ Ω R k 2 ( 0 ) with plausible values, φ = 1.1 and ϑ = 0.05 ). In expectation, the SE ratios are greater than 1 for all values of φ and ϑ (Remark 5 above), but the differences are small for typical subgroup sizes used in practice.

To further examine the pattern of the SE ratios in Figure 1b, suppose first that φ = 1 . Then, all ratios are at least 1 when p = 0.5 . However, if p < 0.5 , the ratios are greater than 1 if δ 1 = n k 1 − n k p < 0 or δ 1 > n k ( 1 − 2 p ) , but are less than 1 otherwise, and vice versa for p > 0.5 . As a function of δ 1 , the ratios are convex and symmetric around their minimum value at δ 1 = 0.5 n k ( 1 − 2 p ) when n k 1 = n k 0 . This symmetry is lost when φ ≠ 1 , but the same overall patterns apply (Figure 1b).
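The SE ratio pattern described above can be reproduced with a short calculation. The sketch assumes the variance takes the form Ω R k 2 ( 1 ) / n k 1 + Ω R k 2 ( 0 ) / n k 0 − Ω 2 ( τ k ) / n k , expressed in units of Ω R k 2 ( 0 ) so that only φ and ϑ matter; that form is an assumption here because (9) is stated earlier in the article, and the function name is hypothetical.

```python
from math import sqrt


def se_ratio(n_k, p, delta, phi=1.0, theta=0.0):
    """Ratio of the SE using actual subgroup sizes to the SE using expected sizes.

    delta = n_k^1 - n_k * p is the deviation of the treatment subgroup size
    from its expectation. Variances are in units of Omega^2_{R_k}(0).
    """
    n1_exp, n0_exp = n_k * p, n_k * (1 - p)
    n1_act, n0_act = n1_exp + delta, n0_exp - delta
    het = theta / n_k  # assumed form of the heterogeneity term
    var_act = phi / n1_act + 1.0 / n0_act - het
    var_exp = phi / n1_exp + 1.0 / n0_exp - het
    return sqrt(var_act / var_exp)


# n = 100 and pi_k = 0.5 give n_k = 50; delta = -5 is a 20% shortfall in n_k^1.
ratio_text_case = se_ratio(n_k=50, p=0.5, delta=-5, phi=1.1, theta=0.05)  # about 1.026

# With phi = 1 and p < 0.5, the ratio falls below 1 for 0 < delta < n_k * (1 - 2p).
ratio_inside = se_ratio(n_k=50, p=0.4, delta=5)
ratio_outside = se_ratio(n_k=50, p=0.4, delta=-5)
```

Under the assumed variance form, the first call gives a value close to the upper end of the range quoted in the text, and the last two calls illustrate the sign pattern for p < 0.5.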

Using expected sizes, a consistent (upper bound) plug-in variance estimator for (9) based on estimated subgroup regression residuals is as follows:

(12) $$\widehat{\mathrm{Var}}(\bar{D}_k) = \frac{s_{R_k}^2(1)}{n_k p} + \frac{s_{R_k}^2(0)}{n_k (1 - p)},$$

where

$$s_{R_k}^2(1) = \frac{1}{n_k^1 - V p \pi_k^1 - 1} \sum_{i=1}^{n} T_i G_{ik} \left( y_i - \hat{\alpha}_k^x - (1 - p) \hat{\tau}_k^x - \tilde{x}_i \hat{\beta} \right)^2$$

and

$$s_{R_k}^2(0) = \frac{1}{n_k^0 - V (1 - p) \pi_k^0 - 1} \sum_{i=1}^{n} (1 - T_i) G_{ik} \left( y_i - \hat{\alpha}_k^x + p \hat{\tau}_k^x - \tilde{x}_i \hat{\beta} \right)^2 .$$

Here we set ϕ k = 1 , which can be relaxed by subtracting π k t in the denominators of s R k 2 ( t ) rather than 1. In (12), the losses in degrees of freedom (df) due to the V covariates are split proportionately across the K subgroups and two research conditions. Note that the same estimator results using a non-centered model in (6) that replaces the G ik T ∼ i and x ∼ i terms with G ik T i and x i . Hypothesis testing can be conducted using t-tests with df = ( n k − V π k − 2 ) or z-tests. Using actual sizes, we can instead use n k 1 and n k 0 in (12) rather than n k p and n k ( 1 − p ) .
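A minimal sketch of the plug-in estimator in (12) follows, written so that it accepts residuals from any fitted version of (6). It assumes π k t is the subgroup share among units in research group t; the function names and toy data are hypothetical, and the example uses the model without covariates ( V = 0 ), where the residuals reduce to deviations from the subgroup-by-arm means.

```python
def db_variance(resid, t, g, p, V=0, use_expected=True):
    """Plug-in variance in (12) for one subgroup, with phi_k set to 1."""
    n = len(t)
    n1 = sum(t)
    n0 = n - n1
    n_k = sum(g)
    n_k1 = sum(gi * ti for gi, ti in zip(g, t))
    n_k0 = n_k - n_k1
    pi_k1, pi_k0 = n_k1 / n1, n_k0 / n0  # assumed: pi_k^t = n_k^t / n^t
    ss1 = sum(ti * gi * ri ** 2 for ri, ti, gi in zip(resid, t, g))
    ss0 = sum((1 - ti) * gi * ri ** 2 for ri, ti, gi in zip(resid, t, g))
    s2_1 = ss1 / (n_k1 - V * p * pi_k1 - 1)
    s2_0 = ss0 / (n_k0 - V * (1 - p) * pi_k0 - 1)
    if use_expected:
        d1, d0 = n_k * p, n_k * (1 - p)
    else:
        d1, d0 = n_k1, n_k0
    return s2_1 / d1 + s2_0 / d0


def arm_mean_residuals(y, t, g):
    """Residuals for the no-covariate model: deviations from subgroup-by-arm means."""
    res = [0.0] * len(y)
    for arm in (0, 1):
        idx = [i for i in range(len(y)) if g[i] == 1 and t[i] == arm]
        m = sum(y[i] for i in idx) / len(idx)
        for i in idx:
            res[i] = y[i] - m
    return res


# Hypothetical toy data: six subgroup members and two non-members.
y = [3.0, 5.0, 4.0, 1.0, 2.0, 3.0, 9.0, 7.0]
t = [1, 1, 1, 0, 0, 0, 1, 0]
g = [1, 1, 1, 1, 1, 1, 0, 0]
resid = arm_mean_residuals(y, t, g)
var_expected = db_variance(resid, t, g, p=0.5, use_expected=True)
var_actual = db_variance(resid, t, g, p=0.5, use_expected=False)
```

In this balanced toy case the actual and expected subgroup sizes coincide, so the two variance estimates are equal; they differ whenever n k 1 departs from n k p .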

As shown in Supplement S3, (12) is asymptotically equivalent to the robust Huber–White (HW) variance estimator [39,40], as has been shown for full sample estimators [16,20,41]. In finite samples, however, the DB variances will typically be larger for the model without covariates due to larger df corrections. We compare the two estimators in our simulations, along with other SE variants.

4 Blocked and clustered designs

The above CLT results extend directly to blocked RCTs where randomization is performed separately within strata (e.g., sites, demographic groups, or time cohorts), and to clustered RCTs where groups (e.g., schools, hospitals, or communities) are randomized rather than individuals.

4.1 Blocked RCTs

In blocked designs, the sample is first divided into subpopulations, and a mini-experiment is conducted in each one. Note that we do not consider blocks formed by subgroups slated for ATE estimation as the theory for the full sample analysis applies in this case (as n k 1 and n k 0 are fixed).

For the blocked design, we use similar notation as above with the addition of the subscript b = ( 1 , 2 , … , B ) to indicate blocks. For instance, T ib is the treatment indicator, p b is the block treatment assignment rate, n b is the number of persons in block b , G ibk is the subgroup indicator, n bk is the size of subgroup k , π bk = n bk / n b is the subgroup share, and Y ib ( t ) is the potential outcome. Further, we define S ib as a 1/0 indicator of block membership and q b = n b / n as the block population share. We assume SUTVA and complete randomization within each block, where vectors of possible treatment assignments are mutually independent across blocks.

With this notation, we can now define the ATE estimand for blocks containing members of subgroup k as τ bk = Y ̅ bk ( 1 ) − Y ̅ bk ( 0 ) , and the pooled ATE estimand across such blocks as,

(13) $$\tau_k = \frac{\sum_{b:\pi_{bk} > 0}^{B} w_{bk} \tau_{bk}}{\sum_{b:\pi_{bk} > 0}^{B} w_{bk}},$$

where w bk is the block weight that can differ across subgroups. We set w bk = n bk , but other options exist [42]. We allow n bk = n b and n bk = 0 .

Consider OLS estimation of the following extension of (6) to blocked RCTs:

(14) $$y_{ib} = \sum_{k=1}^{K} \tau_{bk} S_{ib} G_{ibk} \tilde{T}_{ib} + \sum_{k=1}^{K} \alpha_{bk} S_{ib} G_{ibk} + \tilde{x}_{ib} \beta + \eta_{ib},$$

where T ∼ ib = T ib − p b and x ∼ ib = x ib − ∑ k = 1 K S ib G ibk x ̅ bk are block-centered variables; x ̅ bk = 1 n bk ∑ i = 1 n b G ibk x ib are covariate means; and η ib is the error term. The OLS ATE estimator for τ bk in (14) is,

(15) $$\hat{\tau}_{bk}^{x} = (\bar{y}_{bk}^{1} - \bar{y}_{bk}^{0}) - (\bar{x}_{bk}^{1} - \bar{x}_{bk}^{0}) \hat{\beta},$$

where y ̅ bk t and x ̅ bk t are observed treatment and control group means.

Because a mini-experiment is conducted in each block, we can apply Theorem 1 to τ ˆ bk x as n → ∞ for fixed B . This yields the following finite population CLT for the blocked design.

Theorem 2

Assume (C1)–(C4) and (C6)–(C8) for each included block, and the following conditions for b ∈ { 1 , … , B } and k ∈ { 1 , … , K } , for fixed B ≥ 1 and K ≥ 1 :

(C4a) The block shares, n b / n → q b * as n → ∞ , where q b * > 0 and ∑ b = 1 B q b * = 1 .

(C5a) The subgroup shares, n bk / n b → π bk * as n → ∞ , with 0 ≤ π bk * ≤ 1 and ∑ k = 1 K π bk * = 1 .

Then, as n → ∞ for fixed B and K , τ ˆ bk x is a consistent estimator for τ bk , and

$$\frac{\hat{\tau}_{bk}^{x} - (\bar{Y}_{bk}(1) - \bar{Y}_{bk}(0))}{\sqrt{\mathrm{Var}(\bar{D}_{bk})}} \xrightarrow{d} N(0, 1),$$

where Var ( D ̅ bk ) is defined as in (8) or (9) at the block level.

The proof (not shown) parallels the one for Theorem 1, applied to each block, by redefining the residual as, R ibk ( t ) = G ibk π bk ( Y ib ( t ) − Y ̅ bk ( t ) − ( x ib − x ̅ bk ) β ) , while invoking (C4a) that allows n b to grow with n , and (C5a) that amends (C5) so that π bk can equal 0 or 1. Note that Liu and Yang [43] considered asymptotics for full sample estimators as B → ∞ .

A variance estimator for τ ˆ bk x in (15), V a ˆ r ( D ̅ bk ) , can be obtained using (12), where s R bk 2 ( t ) is now calculated using residuals from the fitted model in (14). The df adjustments are ( n bk 1 − V q b p b π bk 1 − 1 ) for s R bk 2 ( 1 ) and ( n bk 0 − V q b ( 1 − p b ) π bk 0 − 1 ) for s R bk 2 ( 0 ) .

Next we provide a corollary to Theorem 2 on the pooled subgroup estimator across blocks, τ ˆ k , Pooled x = 1 n k ∑ b : n bk > 0 B n bk τ ˆ bk x , where each block is weighted by its subgroup size.

Corollary 2

Under the conditions of Theorem 2, as n → ∞ for fixed B and K , τ ˆ k , Pooled x is a consistent estimator for τ k , Pooled = 1 n k ∑ b : n bk > 0 B n bk τ bk , and

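$$\frac{1}{\sqrt{\mathrm{Var}(\bar{D}_k)}} \left( \hat{\tau}_{k,\mathrm{Pooled}}^{x} - \tau_{k,\mathrm{Pooled}} \right) \xrightarrow{d} N(0, 1),$$

where $\mathrm{Var}(\bar{D}_k) = \frac{1}{(n_k)^2} \sum_{b: n_{bk} > 0}^{B} n_{bk}^2 \, \mathrm{Var}(\bar{D}_{bk})$.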

This result follows because the τ ˆ bk x estimators are asymptotically independent across blocks, which can be shown using the same arguments as in the proof of Corollary 1 in Supplement S1. We can estimate Var ( D ̅ k ) using V a ˆ r ( D ̅ bk ) for each included block. Hypothesis testing for τ ˆ k x can be conducted using t-tests with df = ( n k − V π k − 2 B ) or z-tests.
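The pooling step in Corollary 2 amounts to a size-weighted average of the block estimates, with a variance that sums the squared-weight block variances. The sketch below takes the block-level estimates and variance estimates as given inputs; the function name and the numbers are hypothetical.

```python
def pooled_subgroup_estimate(n_bk, tau_bk, var_bk):
    """Pool block-level subgroup estimates, weighting each block by n_bk.

    Blocks with n_bk == 0 contain no members of subgroup k and are skipped.
    Returns the pooled estimate and its variance under independence across blocks.
    """
    included = [b for b, size in enumerate(n_bk) if size > 0]
    n_k = sum(n_bk[b] for b in included)
    tau = sum(n_bk[b] * tau_bk[b] for b in included) / n_k
    var = sum(n_bk[b] ** 2 * var_bk[b] for b in included) / n_k ** 2
    return tau, var


# Hypothetical inputs: three blocks, the third with no subgroup members.
tau_pooled, var_pooled = pooled_subgroup_estimate(
    n_bk=[10, 30, 0],
    tau_bk=[1.0, 2.0, None],
    var_bk=[0.04, 0.01, None],
)
```

The independence of treatment assignments across blocks is what allows the block variances to be added without covariance terms.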

Finally, a future research topic is to develop a CLT for a restricted model that controls for block main effects but excludes block-by-treatment interactions. An example of such a model is to replace the first set of interactions in (14) with ∑ k = 1 K τ k , R G ibk T ∼ ib . In this case, the OLS ATE estimator for subgroup k is τ ˆ k , R = 1 ∑ b w bk , R ∑ b w bk , R τ ˆ bk x , where w bk , R = n bk p bk 1 ( 1 − p bk 1 ) and p bk 1 = n bk 1 / n bk . Thus, τ ˆ k , R uses a form of precision weighting to weight the block-specific estimators. It is inconsistent but uses fewer parameters. Full sample CLTs for this estimator are considered in [14,21], which become more complex in the subgroup context.
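To see how the precision weights of the restricted model differ from the subgroup-size weights used for the pooled estimator, the following sketch compares the two weighting schemes on hypothetical block data. The function name and all numbers are illustrative only.

```python
def restricted_weights(n_bk, n_bk1):
    """Normalized weights w_bk,R = n_bk * p_bk1 * (1 - p_bk1), with p_bk1 = n_bk1 / n_bk."""
    raw = []
    for size, treated in zip(n_bk, n_bk1):
        p_bk1 = treated / size
        raw.append(size * p_bk1 * (1 - p_bk1))
    total = sum(raw)
    return [w / total for w in raw]


# Hypothetical blocks: equal subgroup sizes, different treatment shares.
n_bk, n_bk1 = [40, 40], [20, 8]
tau_bk = [1.0, 3.0]

w_r = restricted_weights(n_bk, n_bk1)
tau_restricted = sum(w * tau for w, tau in zip(w_r, tau_bk))
tau_size_weighted = sum(n * tau for n, tau in zip(n_bk, tau_bk)) / sum(n_bk)
```

When treatment shares differ across blocks and block effects are heterogeneous, the two weighted averages diverge, because the block with the more balanced allocation receives more weight under the restricted model.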

4.2 Clustered RCTs

In clustered RCTs, groups rather than individuals are the unit of randomization. Consider a clustered, non-blocked RCT with m total clusters, where m 1 = mp clusters are assigned to the treatment group and m 0 = m ( 1 − p ) are assigned to the control group. All persons in the same cluster have the same treatment assignment. Let m k denote the number of clusters in subgroup k , where m k 1 and m k 0 are observed counts. We assume that individual-level data are available for analysis, although our results also pertain to data averaged to the cluster level.

We index clusters by j . Thus, we have that T j = 1 for treatment clusters and 0 for control clusters, n jk is the number of subgroup k members in cluster j , Y ij ( t ) is the potential outcome for person i in cluster j , and so on. We also assume SUTVA and complete randomization as generalized to clustered RCTs [14].

Consider an individual-level subgroup ( G ijk = 1 ) where π jk > 0 for all j . In this case, m k 1 = m 1 and m k 0 = m 0 are fixed, and the ATE estimand for subgroup k under the clustered RCT is

(16) $$\tau_k = \frac{\sum_{j=1}^{m} \sum_{i=1}^{n_j} G_{ijk} (Y_{ij}(1) - Y_{ij}(0))}{n_k} = \frac{\sum_{j=1}^{m} n_{jk} (\bar{Y}_{jk}(1) - \bar{Y}_{jk}(0))}{n_k} = \bar{\bar{Y}}_k(1) - \bar{\bar{Y}}_k(0),$$

where Y ̅ jk ( t ) = 1 n jk ∑ i = 1 n j G ijk Y ij ( t ) is the mean cluster-level outcome and Y ̿ k ( t ) is the grand mean. Here clusters are weighted by their subgroup sizes, w jk = n jk , but other options exist, such as weighting clusters equally to estimate subgroup effects per cluster rather than per person.

Applying OLS to (6) with clustered data yields the following subgroup ATE estimator:

(17) $$\hat{\tau}_{k,\mathrm{clus}}^{x} = (\bar{\bar{y}}_{k}^{1} - \bar{\bar{y}}_{k}^{0}) - (\bar{\bar{x}}_{k}^{1} - \bar{\bar{x}}_{k}^{0}) \hat{\beta},$$

where y ̿ k t = 1 n k t ∑ j : T j = t ∑ i = 1 n j G ijk Y ij ( t ) is the mean observed outcome, and similarly for x ̿ k t .

We see that (17) is a ratio estimator because n k 1 and n k 0 are random under the clustered design (if clusters are weighted unequally). However, this is also the case for the full sample estimator, because n 1 and n 0 (i.e., the summed weights) are also random. Thus, as m → ∞ , the full sample CLT results reported by Schochet et al. [14] for the clustered (and blocked) RCT can be applied to τ ˆ k , clus x . This approach is outlined in Supplement S3.2 along with a consistent variance estimator using a version of (12) based on estimated cluster-level residuals. Parallel to the HW analysis, Supplement S3.2 also shows that this variance estimator is asymptotically equivalent to the cluster-robust SE estimator developed by Liang and Zeger [44].
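The ratio structure of (17) is easiest to see without covariates, where the estimator is a difference in person-weighted means and each denominator is a sum of cluster subgroup sizes that depends on which clusters were assigned to each arm. The sketch below omits the covariate adjustment term; the function name and data are hypothetical.

```python
def clustered_subgroup_diff(clusters):
    """Difference in person-weighted subgroup means across treatment and control clusters.

    clusters: list of (T_j, outcomes of subgroup-k members in cluster j).
    """
    total = {0: 0.0, 1: 0.0}
    count = {0: 0, 1: 0}
    for t_j, ys in clusters:
        total[t_j] += sum(ys)
        count[t_j] += len(ys)  # n_jk; the arm totals n_k^t vary with the assignment
    return total[1] / count[1] - total[0] / count[0]


# Hypothetical data: two treatment clusters and two control clusters.
clusters = [
    (1, [4.0, 6.0, 5.0]),
    (1, [7.0]),
    (0, [3.0, 5.0]),
    (0, [2.0, 2.0, 4.0, 4.0]),
]
tau_hat_clus = clustered_subgroup_diff(clusters)
```

Swapping the assignment of a large and a small cluster changes both the numerator and the denominator of each arm mean, which is why the summed weights are treated as random.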

Finally, Supplement S3.3 outlines cluster RCT results for a subgroup analysis defined by a cluster-level characteristic ( G jk = 1 ), such as a school, hospital, or community feature, rather than an individual-level characteristic. In this setting, m k 1 and m k 0 become random, which parallels the subgroup analysis for the non-clustered RCT. Note that a similar formulation also applies for the individual-level subgroup analysis when π jk = 0 for some clusters.

5 Simulation analysis

We conducted simulations to examine the finite sample statistical properties of our DB subgroup ATE estimators. The focus is on the non-clustered RCT because prior full sample simulation results for the clustered RCT also pertain to individual-level subgroup analyses [14,45], as discussed above. For the simulations, we applied the variance estimator in (12) using expected and actual subgroup sizes, for models with and without covariates. We set ϕ k = 1 for most specifications, but also adjusted for ϕ k for some runs. We also ran simulations using the HW estimator and several variants of (12) (Supplement S4).

5.1 Simulation setup

The following model was used to generate potential outcomes for K = 2 subgroups and V = 2 pre-treatment covariates:

$$Y_i(0) = G_{i1} + 2 G_{i2} + 0.4 G_{i1} x_{i1} + 0.8 G_{i1} x_{i2} + 0.7 G_{i2} x_{i1} + 0.5 G_{i2} x_{i2} + e_i$$

(18) $$Y_i(1) = Y_i(0) + G_{i1} \theta_{i1} + G_{i2} \theta_{i2},$$

where e i is iid N ( 0,1 ) random error; x i 1 and x i 2 are iid N ( 0,1 ) covariates; and θ i 1 and θ i 2 are iid N ( 0,0.5 ) and N ( 0,0.4 ) random errors that capture treatment effect heterogeneity.

We generated five draws of potential outcomes using (18) to help guard against unusual draws and report average results. For each draw, we conducted 10,000 replications, randomly assigning units to either the treatment or control group using p = 0.5 (or p = 0.4 or 0.6 for some runs), and only kept randomizations that met our minimum subgroup size criteria for variance estimation. For each replication, we estimated the model in (6) and stored the results. We ran simulations for total sample sizes of n = 40 , 100, and 200 and Subgroup 1 shares of π 1 = 0.25 , 0.50 , and 0.75 . To allow for skewed distributions, we also generated model errors and covariates for selected runs using a chi-squared distribution with the same means and variances as above.
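The simulation design above can be sketched in a simplified form. This is not the article's simulation code: it uses one draw of potential outcomes, 2,000 replications, the model without covariates, actual subgroup sizes, and a normal cutoff of 1.96 in place of the t-distribution cutoff. It also assumes that the second arguments of N ( 0,0.5 ) and N ( 0,0.4 ) are variances, and all function names are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, variance


def draw_potential_outcomes(n, pi_1, rng):
    """One finite population from (18): Subgroup 1 flags, Y(0), Y(1)."""
    n_1 = round(n * pi_1)
    g1 = [1] * n_1 + [0] * (n - n_1)
    y0, y1 = [], []
    for g in g1:
        x1, x2, e = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        if g:  # Subgroup 1
            base = 1 + 0.4 * x1 + 0.8 * x2 + e
            theta = rng.gauss(0, sqrt(0.5))  # assumption: 0.5 is a variance
        else:  # Subgroup 2
            base = 2 + 0.7 * x1 + 0.5 * x2 + e
            theta = rng.gauss(0, sqrt(0.4))  # assumption: 0.4 is a variance
        y0.append(base)
        y1.append(base + theta)
    return g1, y0, y1


def simulate(n=100, pi_1=0.5, p=0.5, reps=2000, seed=1):
    """Bias and 95% interval coverage for the Subgroup 1 difference in means."""
    rng = random.Random(seed)
    g1, y0, y1 = draw_potential_outcomes(n, pi_1, rng)
    idx1 = [i for i in range(n) if g1[i]]
    tau_1 = mean(y1[i] - y0[i] for i in idx1)  # finite-population estimand
    n_treat = round(n * p)
    estimates, covered = [], 0
    for _ in range(reps):
        treated = set(rng.sample(range(n), n_treat))  # complete randomization
        yt = [y1[i] for i in idx1 if i in treated]
        yc = [y0[i] for i in idx1 if i not in treated]
        if len(yt) < 2 or len(yc) < 2:  # minimum size needed for variances
            continue
        tau_hat = mean(yt) - mean(yc)
        se = sqrt(variance(yt) / len(yt) + variance(yc) / len(yc))
        estimates.append(tau_hat)
        covered += abs(tau_hat - tau_1) <= 1.96 * se
    return mean(estimates) - tau_1, covered / len(estimates)


bias, coverage = simulate()
```

The potential outcomes are held fixed across replications, so the only source of randomness in each replication is the treatment assignment, as in the design-based framework.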

In Supplement S4, we discuss variants of (12) used in our simulations. These include applying the df correction for hypothesis testing in Bell and McCaffrey [46]; subtracting a lower bound on the 1 n k Ω 2 ( τ k ) heterogeneity term; multiplying by ( 1 − R TXk 2 ) − 1 , where R TXk 2 is the R 2 from a regression of G ik T ∼ i on x ∼ i and the other terms in (6); and using the finite sample variance in (11).

5.2 Simulation results

Table 1 and Supplement Tables S1–S4 present the simulation results. Of the 300,000 draws of n k 1 and n k 0 used in Table 1, all yielded values of n k 1 > 0 and n k 0 > 0 , so these restrictions have little effect on our theory. Focusing on Subgroup 1, we find negligible biases for all specifications with and without baseline covariates. Confidence interval coverage is close to 95% using t-distribution cutoff values, even with relatively small subgroup samples, but with slight over-coverage across specifications. Accordingly, Type 1 errors tend to be slightly below the nominal 5% level (Tables 1 and D.1). It is interesting that these results differ from those found for the clustered RCT where Type 1 errors tend to be inflated [14,45].

Table 1

Simulation results for the subgroup ATE estimators

Model specification	Bias of ATE estimator^a	Confidence interval coverage	True SE^a,b	Mean estimated SE
Model without covariates
Sample size: n = 40 , π 1 = 0.50
Design-based (DB), actual subgroup sizes, ϕ k = 1	−0.002	0.954	0.646	0.640
DB, expected sizes, ϕ k = 1	−0.002	0.952	0.646	0.633
DB, actual sizes, adjust for ϕ k	−0.002	0.948	0.646	0.621
HW	−0.002	0.953	0.646	0.639
Sample size: n = 100 , π 1 = 0.50
DB, actual sizes, ϕ k = 1	0.000	0.958	0.376	0.385
DB, expected sizes, ϕ k = 1	0.000	0.957	0.376	0.383
DB, actual sizes, adjust for ϕ k	0.000	0.956	0.376	0.381
HW	0.000	0.958	0.376	0.385
Sample size: n = 100 , π 1 = 0.25
DB, actual sizes, ϕ k = 1	0.000	0.953	0.626	0.628
DB, expected sizes, ϕ k = 1	0.000	0.951	0.626	0.618
DB, actual sizes, adjust for ϕ k	0.000	0.948	0.626	0.613
HW	0.000	0.948	0.626	0.613
Model with two covariates
Sample size: n = 40 , π 1 = 0.50
DB, actual sizes, ϕ k = 1	0.002	0.950	0.501	0.482
DB, expected sizes, ϕ k = 1	0.002	0.948	0.501	0.476
DB, actual sizes, adjust for ϕ k	0.002	0.944	0.501	0.467
HW	0.002	0.953	0.501	0.494
Sample size: n = 100 , π 1 = 0.50
DB, actual sizes, ϕ k = 1	0.000	0.961	0.298	0.305
DB, expected sizes, ϕ k = 1	0.000	0.960	0.298	0.303
DB, actual sizes, adjust for ϕ k	0.000	0.959	0.298	0.302
HW	0.000	0.962	0.298	0.308
Sample size: n = 100 , π 1 = 0.25
DB, actual sizes, ϕ k = 1	0.000	0.956	0.486	0.482
DB, expected sizes, ϕ k = 1	0.000	0.954	0.486	0.475
DB, actual sizes, adjust for ϕ k	0.000	0.951	0.486	0.470
HW	0.000	0.952	0.486	0.470

Note: See text for simulation details. The calculations assume two subgroups with a focus on results for Subgroup 1, a treatment assignment rate of p = 0.50 , and normally distributed covariates and errors. For each specification, the figures are based on 10,000 simulations for each of 5 potential outcome draws, and the findings average across the 5 draws. Ordinary least squares (OLS) methods are used for ATE estimation using the model in (6), and design-based SEs are obtained using (12). Huber–White estimates are obtained using the lm_robust procedure in R.

ATE = Average treatment effect; DB = Design-based; HW = Huber–White.

^aBiases and true SEs are the same for all specifications within each sample size category because they use the same data and OLS model for ATE estimation.

^bTrue SEs are measured as the standard deviation of the estimated treatment effects across simulations.

Estimated SEs are close to “true” values, as measured by the standard deviation of the ATE estimates across replications. Consistent with the theory on SE ratios in Section 3.4, the SEs are slightly larger using actual subgroup sizes than expected ones, leading to slightly lower confidence interval coverage using the expected sizes. Also consistent with the theory, the SEs are slightly smaller for the HW estimator for the model without covariates, and for specifications that adjust for ϕ k < 1 . Type 1 errors for F-tests to gauge differences in Subgroup 1 and 2 effects are close to 5% but tend to be somewhat liberal (Tables S1 and S2). We find similar results using data generated from a chi-squared distribution (Table S3) and using p = 0.4 or 0.6 (Table S4). Finally, applying variants of the variance formula in (12) as detailed in Supplement S4 does not change the overall findings or improve performance (Table S3).

6 Empirical application using the motivating NYC voucher experiment

To demonstrate our DB subgroup ATE estimators, we used baseline and outcome data from the NYC School Choice Scholarships Foundation Program (SCSF) [9]. SCSF was funded by philanthropists to provide scholarships to public school students in grades K–4 from low-income families to attend any participating NYC private school. In spring 1997, more than 20,000 students applied to receive a voucher. SCSF then used random lotteries to offer 3-year vouchers of up to $1,400 annually to 1,000 eligible families in the treatment group. Of the remaining families not offered the voucher, 960 were randomly selected to the control group.

SCSF assisted the treatment group in finding private-school placements. More than 78% of treatment families used a voucher, for 2.6 years on average, where 98% of users attended parochial schools. Here we focus on estimating ATEs (i.e., intention-to-treat effects on the voucher offer) for two race/ethnicity subgroups as defined in the original study [9]: African Americans and Latinos, each of whom comprises about 47% of the sample. The study authors hypothesized that African Americans might benefit more from the vouchers as they tended to live in more disadvantaged communities with lower-performing public schools.

Following the original study [9], the primary outcomes for our analysis are composite national percentile rankings in math and reading from the study-administered Iowa Test of Basic Skills (ITBS). We focus on first follow-up year test scores, where the response rate was 78% for treatments and 71% for controls. Our goal is not to replicate study results but to illustrate our subgroup ATE estimators.

The voucher study was a blocked RCT. Applicants from schools with average test scores below the city median were assigned a higher probability of winning a scholarship, and blocks were also formed by lottery date and family size (with 30 blocks in total). The design is also partly clustered because families were randomized, where all eligible children within a family could receive a scholarship; 30% of families had at least two children in the evaluation.

We used (14) for ATE estimation and (12) for variance estimation for each block, where blocks were weighted by their subgroup sizes to obtain the overall subgroup effects. To adjust for clustering, we averaged data to the family level. Following [9], we used weights to adjust for missing follow-up test scores. We ran models without covariates and those that included baseline ITBS scores to increase precision, though they were not collected for the entire kindergarten cohort. Following the original study, other demographic covariates were not included in the models due to the large number of blocks.

Table 2 presents the subgroup findings that mirror those from the original study. We find that the offer of a voucher had no effect on test scores overall or for Latinos across specifications. The effects on African Americans are also not statistically significant at the 5% level for the model without baseline test scores. However, the effects on African Americans become positive and statistically significant for the model with baseline scores, which excludes the kindergarteners but nonetheless yields SEs that are reduced by about 12%. These effects are 4.7 percentile ranking points, which translates into a 0.26 standard deviation increase, with a significant F-test for the subgroup interaction effect (p-value = 0.028). The effects for African Americans remain significant when the sample is restricted to students with baseline test scores but the scores are not controlled for in the model.

Table 2

Estimated ATEs on composite test scores for the NYC voucher experiment

Model specification	Overall sample	African American	Latino
Model excludes baseline test scores
DB, actual subgroup sizes	0.25 (1.06)	2.54 (1.45)	−0.86 (1.58)
DB, expected subgroup sizes	0.25 (1.06)	2.54 (1.45)	−0.86 (1.58)
HW	0.25 (1.03)	2.54 (1.42)	−0.86 (1.49)
DB: actual sizes using sample with baseline test scores	0.88 (1.30)	4.47* (1.73)	−1.11 (1.76)
Model includes baseline test scores
DB, actual subgroup sizes	1.70 (1.01)	4.70* (1.27)	0.50 (1.44)
DB, expected subgroup sizes	1.70 (1.01)	4.70* (1.27)	0.50 (1.44)
HW	1.70 (0.98)	4.70* (1.24)	0.50 (1.39)
Student sample size (without/with baseline test scores)	2,012/1,434	902/643	964/682

Note: SEs are in parentheses. See text for ATE and SE formulas. All estimates are weighted to adjust for follow-up test score nonresponse.

ATE = Average treatment effect.

*Statistically significant at the 5% level, two-tailed test.
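
The SE reduction and effect-size conversion quoted in the text can be checked against the African American column of Table 2. The implied outcome standard deviation of roughly 18 percentile points is a back-calculation from the reported 0.26 figure, not a value stated in the article, and the last ratio is a simple estimate-to-SE ratio for the model with baseline scores:

$$\frac{1.45 - 1.27}{1.45} \approx 0.124, \qquad \frac{4.70}{0.26} \approx 18.1, \qquad \frac{4.70}{1.27} \approx 3.70 .$$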

We find across specifications that the DB SEs are nearly identical using actual and expected sample sizes. Further, consistent with theory, the DB SEs are slightly larger than the HW SEs, but both yield the same study conclusions: the vouchers did not improve test scores overall, but there is evidence they had a positive effect on African American students in grades 1–4. A detailed reanalysis of the original study data, however, cautions that the results for African Americans are sensitive to alternative race/ethnicity definitions and should be interpreted carefully [10].

7 Conclusion

This article considered DB RCT methods for ATE estimation for discrete subgroups defined by pre-treatment sample characteristics. Our subgroup estimators derive from the Neyman–Rubin–Holland model that underlies experiments and are based on simple least squares regression methods. We considered ratio estimators due to the randomness of observed subgroup sample sizes in the treatment and control groups that were not conditioned on for the asymptotic analysis. The DB approach is appealing in that it applies to continuous, binary, and discrete outcomes, and is nonparametric in that it makes no assumptions about the distribution of potential outcomes or the model functional form.

We developed a new finite population, unconditional CLT for our subgroup ATE estimators under the non-clustered RCT, allowing for baseline covariates to improve precision. The main difference between our CLT and prior full sample ones is that the asymptotic variance for the subgroup estimator is based on expected subgroup sizes rather than actual ones. Another difference is that the subgroup variance includes a finite sample adjustment ( ϕ k ) that reflects the single treatment indicator shared by the subgroups. To apply the estimators in practice, we discussed simple consistent variance estimators using regression residuals that are asymptotically equivalent to robust variance estimators, but with finite sample degrees-of-freedom adjustments that derive directly from the experimental design. Our re-analysis of the NYC Voucher experiment demonstrated the simplicity of the methods, while maintaining statistical rigor.

A contribution of this work is that it provides a unified DB framework for subgroup analyses across a range of RCT designs. We discussed extensions of the asymptotic theory to blocked and clustered designs. We also discussed extensions to other commonly used estimators with random treatment-control sample sizes or summed weights: post-stratification estimators that average subgroup estimators to obtain overall effects, weighted estimators to adjust for data nonresponse, and estimators from BTs.

Our simulations for the non-clustered RCT show that the subgroup ATE estimators yield low bias and confidence interval coverage near nominal levels, although with slight over-coverage. This is somewhat surprising as the simulation literature on DB and robust variance estimators for clustered RCTs – that also applies to the subgroup context – shows the opposite issue of under-coverage [14,45].

Our simulations find very similar results using either actual or expected subgroup sample sizes for variance estimation. As demonstrated in several ways, this occurs because the difference between the observed subgroup proportions, π k 1 and π k 0 , and their expected value, π k , decreases exponentially with the overall sample size. This finding justifies the typical approach of using actual subgroup sample sizes for variance estimation, which blurs the distinction between a subgroup analysis conditional on the observed treatment-control subgroup sizes and an unconditional subgroup analysis considered here. Thus, a conditional analysis may be preferred due to its simplicity and parallel structure to the full sample DB analysis.

The free RCT-YES software (www.rct-yes.com), funded by the U.S. Department of Education, estimates ATEs for both full sample and baseline subgroup analyses using the DB methods discussed in this article, and runs in either R or Stata. The software applies actual sample sizes for the variance formulas for subgroup analyses and allows for general weights. The software also allows for multi-armed trials with multiple treatment conditions.

Acknowledgements

The author would like to thank the two reviewers for very helpful suggestions and comments.

Funding information: Author states no funding involved.
Author contribution: The author confirms the sole responsibility for the conception of the study, presented results and manuscript preparation.
Conflict of interest: Author states no conflict of interest.
Data availability statement: The NYC Voucher data for the empirical analysis were obtained under a restricted data use license agreement with Mathematica. Per license requirements, these data cannot be shared with journal readers. However, to the best of my knowledge, these data can be obtained, and I would be happy to provide the SAS and R programs used for the analysis.

References

[1] Neyman J. On the application of probability theory to agricultural experiments: essay on principles. Sect 9, Translated Stat Sci. 1990;5:465–72. doi:10.1214/ss/1177012031

[2] Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol. 1974;66:688–701. doi:10.1037/h0037350

[3] Rubin DB. Assignment to treatment group on the basis of a covariate. J Educ Stat. 1977;2:1–26. doi:10.3102/10769986002001001

[4] Holland PW. Statistics and causal inference. J Am Stat Assoc. 1986;81:945–60. doi:10.1080/01621459.1986.10478354

[5] Rothwell PM. Subgroup analyses in randomized controlled trials: importance, indications, and interpretation. Lancet. 2005;365:176–86. doi:10.1016/S0140-6736(05)17709-5

[6] Schochet PZ, Puma M, Deke J. Understanding variation in treatment effects in education impact evaluations: an overview of quantitative methods (NCEE 2014-4017). Washington, DC: National Center for Education Evaluation and Regional Assistance; 2014.

[7] Wang R, Lagakos SW, Ware JH, Hunter DJ, Drazen JM. Statistics in medicine-reporting of subgroup analyses in clinical trials. N Engl J Med. 2007;357(21):2189–94. doi:10.1056/NEJMsr077003

[8] Schulz KF, Altman DG, Moher D. CONSORT 2010 statement: updated guidelines for reporting parallel group randomised trials. J Pharmacol Pharmacother. 2010;1:100–7. doi:10.4103/0976-500X.72352

[9] Mayer D, Peterson P, Myers D, Tuttle C, Howell W. School choice in New York City: an evaluation of the school choice scholarships program. Mathematica Policy Research, Washington, DC; 2002.

[10] Krueger AB, Zhu P. Another look at the New York City school voucher experiment. Am Behav Scientist. 2004;47(5):658–98. doi:10.1177/0002764203260152

[11] Bland JM. Cluster randomised trials in the medical literature: two bibliometric surveys. BMC Med Res Methodol. 2004;4:21.10.1186/1471-2288-4-21Search in Google Scholar PubMed PubMed Central

[12] Schochet PZ. Statistical power for random assignment evaluations of education programs. J Educ Behav Stat. 2008;33:62–87.10.3102/1076998607302714Search in Google Scholar

[13] Pashley NE. Note on the delta method for finite population inference with applications to causal inference. Working Paper: Harvard University Statistics Department, Cambridge MA; 2019.Search in Google Scholar

[14] Schochet PZ, Pashley NE, Miratrix LW, Kautz T. Design-based ratio estimators and central limit theorems for clustered, blocked RCTs. J Am Stat Assoc. 2022;117(540):2135–46.10.1080/01621459.2021.1906685Search in Google Scholar

[15] Yang L, Tsiatis A. Efficiency study of estimators for a treatment effect in a pretest-posttest trial. Am Statistician. 2001;55:314–21.10.1198/000313001753272466Search in Google Scholar

[16] Freedman D. On regression adjustments to experimental data. Adv Appl Math. 2008;40:180–93.10.1016/j.aam.2006.12.003Search in Google Scholar

[17] Schochet PZ. Is regression adjustment supported by the Neyman model for causal inference? J Stat Plan Inference. 2010;140:246–59.10.1016/j.jspi.2009.07.008Search in Google Scholar

[18] Schochet PZ. Statistical theory for the RCT-YES software: design-based causal inference for RCTs: second edition (NCEE 2016–4011). Washington, DC: National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of Education; 2016.Search in Google Scholar

[19] Aronow PM, Middleton JA. A class of unbiased estimators of the average treatment effect in randomized experiments. J Causal Inference. 2013;1:135–54. doi:10.1515/jci-2012-0009

[20] Lin W. Agnostic notes on regression adjustments to experimental data: reexamining Freedman’s critique. Ann Appl Stat. 2013;7:295–318. doi:10.1214/12-AOAS583

[21] Imbens G, Rubin D. Causal inference for statistics, social, and biomedical sciences: an introduction. Cambridge, UK: Cambridge University Press; 2015. doi:10.1017/CBO9781139025751

[22] Middleton JA, Aronow PM. Unbiased estimation of the average treatment effect in cluster-randomized experiments. Statistics Politics Policy. 2015;6:39–75. doi:10.1515/spp-2013-0002

[23] Li X, Ding P. General forms of finite population central limit theorems with applications to causal inference. J Am Stat Assoc. 2017;112:1759–69. doi:10.1080/01621459.2017.1295865

[24] Scott A, Wu CF. On the asymptotic distribution of ratio and regression estimators. J Am Stat Assoc. 1981;76(373):98–102. doi:10.1080/01621459.1981.10477612

[25] Miratrix LW, Sekhon JS, Yu B. Adjusting treatment effect estimates by post-stratification in randomized experiments. J R Stat Soc Ser B. 2013;75(2):369–96. doi:10.1111/j.1467-9868.2012.01048.x

[26] Cochran W. Sampling techniques. New York: John Wiley and Sons; 1977.

[27] Lohr SL. Sampling: design and analysis. 2nd edn. Pacific Grove, CA: Duxbury Press; 2009.

[28] Thompson S. Sampling. Hoboken, NJ: John Wiley & Sons; 2012.

[29] Rubin DB. Which ifs have causal answers? Discussion of Holland’s “statistics and causal inference”. J Am Stat Assoc. 1986;81:961–2. doi:10.1080/01621459.1986.10478355

[30] Fraser DAS. Ancillaries and conditional inference. Stat Sci. 2004;19(2):333–69. doi:10.1214/088342304000000323

[31] Aronow PM, Green DP, Lee DKK. Sharp bounds on the variance in randomized experiments. Ann Stat. 2014;42:850–71. doi:10.1214/13-AOS1200

[32] Wright T. On some properties of variable size simple random sampling and a limit theorem. Commun Stat Theory Methods. 1988;17(9):2997–3016. doi:10.1080/03610928808829785

[33] Rosenbaum P, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70:41–55. doi:10.1093/biomet/70.1.41

[34] Rubin DB. Multiple imputation for nonresponse in surveys. New York: John Wiley and Sons; 1987. doi:10.1002/9780470316696

[35] Horvitz DG, Thompson DJ. A generalization of sampling without replacement from a finite universe. J Am Stat Assoc. 1952;47:663–85. doi:10.1080/01621459.1952.10483446

[36] Lunceford JK, Davidian M. Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Stat Med. 2004;23(19):2937–60. doi:10.1002/sim.1903

[37] Serfling RJ. Probability inequalities for the sum in sampling without replacement. Ann Stat. 1974;2:39–48. doi:10.1214/aos/1176342611

[38] Greene E, Wellner JA. Exponential bounds for the hypergeometric distribution. Bernoulli. 2017;23(3):1911–50. doi:10.3150/15-BEJ800

[39] Huber PJ. The behavior of maximum likelihood estimates under nonstandard conditions. Proc Fifth Berkeley Symp Math Stat Probab. 1967;1:221–33.

[40] White H. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica. 1980;48(4):817–38. doi:10.2307/1912934

[41] Su F, Ding P. Model-assisted analyses of cluster-randomized experiments. J R Stat Soc Ser B. 2021;83(5):994–1015. doi:10.1111/rssb.12468

[42] Pashley NE, Miratrix LW. Insights on variance estimation for blocked and matched pairs designs. J Educ Behav Stat. 2021;46(3):271–96. doi:10.3102/1076998620946272

[43] Liu H, Yang Y. Regression-adjusted average treatment effect estimates in stratified randomized experiments. Biometrika. 2020;107(4):935–48. doi:10.1093/biomet/asaa038

[44] Liang K, Zeger S. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73:13–22. doi:10.1093/biomet/73.1.13

[45] Cameron AC, Miller DL. A practitioner’s guide to cluster-robust inference. J Hum Resour. 2015;50:317–72. doi:10.3368/jhr.50.2.317

[46] Bell R, McCaffrey D. Bias reduction in standard errors for linear regression with multi-stage samples. Surv Methodol. 2002;28:169–81.

Received: 2023-08-25

Revised: 2024-02-17

Accepted: 2024-05-13

Published Online: 2024-07-25

This work is licensed under the Creative Commons Attribution 4.0 International License.

https://doi.org/10.1515/jci-2023-0056