Article Open Access

Doubly weighted M-estimation for nonrandom assignment and missing outcomes

  • Akanksha Negi
Published/Copyright: February 2, 2024

Abstract

This article proposes a class of M-estimators that double weight for the joint problems of nonrandom treatment assignment and missing outcomes. Identification of the main parameter of interest is achieved under unconfoundedness and missing at random assumptions with respect to the treatment and sample selection problems, respectively. Given the parametric framework, the asymptotic theory of the proposed estimator is outlined in two parts: first, when the parameter solves an unconditional problem, and second, when it solves a stronger conditional problem. The two parts help to summarize the misspecification scenarios permissible under the given framework and the role played by double weighting in each. As illustrative examples, the article also discusses the estimation of causal parameters like average and quantile treatment effects. With respect to the average treatment effect, this article shows that the proposed estimator is doubly robust. Finally, a detailed application to Calónico and Smith’s (The women of the national supported work demonstration. J Labor Econom. 2017;35(S1):S65–S97.) reconstructed sample from the National Supported Work training program is used to demonstrate the estimator’s performance in empirical settings.

MSC 2010: 62D20

1 Introduction

When interest lies in causal inference, the prevalence of missing data poses a major identification challenge. In observational studies, causal effects estimation is complicated due to a nonrandom selection of individuals into programs or interventions (nonrandom treatment assignment). If, in addition, the observed outcome of interest is missing due to attrition or nonresponse, then this creates a double identification challenge for the estimation of treatment effects. This article proposes a class of doubly weighted M-estimators that are consistent and asymptotically normal for the joint problems of nonrandom treatment assignment and missing outcomes.

Despite the ubiquity of missing data problems in observational and experimental studies, the traditional inverse probability weighted (IPW) literature has only considered treatment and sample selection problems in isolation. In this article, I extend the literature to incorporate both issues simultaneously in a general framework. The main parameter of interest is defined to solve a population objective function. To correct for the two selection issues at hand, the article proposes weighting by both the propensity score and the missing outcome probability to identify the true parameter. The two key assumptions are unconfoundedness and missing at random, which represent ignorable assignment and missing data mechanisms, respectively [1]. Estimation follows in two steps: first the probabilities are estimated using binary response maximum likelihood, and second, the estimated probabilities are plugged in as weights to solve some objective function.

In the missing data IPW literature, Ding and Li [2] and Cao et al. [3] considered estimation of the population mean in the presence of missing outcomes. Robins and Rotnitzky [4] considered the estimation of regression parameters when the outcome is censored. Chen et al. [5] and Graham et al. [6] considered estimation of parameters indexing moment conditions, whereas Wooldridge [7] focused on the parameter solving an optimization problem. See the study by Seaman and White [8] for a review of IPW in the missing data context. While most of this literature draws parallels between missing data and missing potential outcomes, none have considered the problem of treatment selection. On the other hand, the causal inference literature uses propensity score weighting for identifying treatment effects but does not address any traditional missing data issues [9–13].

A few articles have looked at the double selection problem. For instance, Huber [14] presented a systematic treatment of the forms of attrition (i.e., whether selection is on observables or unobservables) that yield true average treatment effects (ATEs) in an experiment. His results under the “selection-on-observables” assumption are nested within this article as a special case. In another article, Huber [15] used an instrument for sample selection along with a conditionally exogenous treatment. He then proposed a weighted estimator that nests the sample selection probability as an additional covariate in the propensity score. This is different from the weighting scheme employed in this article, which neither involves any instruments nor nests probabilities. Rather, it exploits the sequential relationship between the two selection mechanisms; individuals may be more or less likely to attrit after selecting into a particular treatment group beyond what is dictated by their covariates. Other articles that take a selection-on-unobservables view of one or both problems include [16,17]. Typically, the discussion in these articles is limited to identification results for the ATE along with an incomplete characterization of the estimator in question. One exception is Huber [15], who also considered the unconditional quantile treatment effect (QTE) for the selected subpopulation.

This article makes the following contributions. First, it attempts to provide a comprehensive treatment of the two problems under a selection-on-observables framework where the parameter of interest minimizes an objective function. Consequently, the double-weighting solution applies to a wide range of procedures that can be framed as an M-estimation problem. The asymptotic theory of the proposed estimator is characterized in two parts. The first part describes cases where a conditional feature of interest is potentially misspecified, which is formalized in terms of a weak identification assumption. This part requires the weights to be correctly specified in order to achieve identification. In contrast, the second part assumes a conditional feature of interest to be correctly specified and allows the weights to be wrong. Together, they summarize two important cases of misspecification that can arise in this double-weighting setup and help us determine what can be consistently estimated under each setting. For instance, with respect to the estimation of QTEs, this article shows how one can estimate conditional QTE (CQTE) or a linear approximation to CQTE depending on whether the conditional quantile function is assumed to be correctly specified or not.

This article also shows that certain quasi-log-likelihood and mean function combinations deliver a “doubly robust” (DR) estimator of ATE under the given framework. This is different from the augmented-IPW style of estimators that are well known for being DR in the missing data literature ([18–20], see [21] for a review). DR estimators involve models for the conditional mean and propensity score and are consistent if at least one of the two models is correctly specified. The extant literature on DR estimators has either focused on sample selection [2,6] or treatment selection [7,13], but not both at the same time [22]. A related contribution is to propose a DR estimator for ATE, distinct from the augmented-IPW class, when both problems are present. An advantage of this estimator is that it is less sensitive to extreme values of the weights and also ensures that the range of the estimated mean function aligns with the nature of the outcomes being studied [13]. Simulations show that the doubly weighted ATE and QTE estimates have the lowest finite sample bias compared to alternatives that ignore one or both problems. More recently, Bia et al. [23] have adapted the double machine learning framework of Chernozhukov et al. [24] to accommodate both treatment and sample selection problems in the presence of high-dimensional covariates.

This article also adds to the existing IPW literature on the “efficiency puzzle,” which finds that estimated nuisance parameters often yield a more efficient estimator of the main parameter of interest. This result appears while characterizing the variance of the proposed estimator under the first half of the asymptotic theory. Estimating the weights helps to exploit the correlation between the first- and second-step moment conditions obtained from the binary response and M-estimation problems, respectively, which the known-weights estimator fails to do. However, this ceases to be true in the second half, where there are no efficiency improvements from using estimated weights. This puzzle has been well studied in [7,25–27], and more recently in the studies by Lok [28] and Su et al. [29]. This article also discusses conditions under which weighting may be inefficient compared to not weighting at all.

Finally, the proposed method is applied to estimate the average and distributional impacts of the National Supported Work (NSW) training program on earnings for the Aid to Families with Dependent Children (AFDC) target group. The sample was obtained from Calónico and Smith [30], who recreated Lalonde’s within-study analysis for this women’s sample. The presence of experimental and non-experimental comparison groups in the data helps to evaluate whether the doubly weighted estimator brings us close to the experimental benchmark relative to other alternatives. I find that the empirical bias for the doubly weighted estimate is much smaller than that for the unweighted estimate.

The rest of this article proceeds as follows: Section 2 describes the basic potential outcomes framework and provides a short description of the population models with an introduction to the naive unweighted estimator; Section 3 discusses the treatment assignment and missing outcome mechanisms, which leads us directly to the identification lemma; Section 4 develops the first half of the asymptotic theory for the doubly weighted estimator with a focus on misspecification of a conditional feature of interest and correct weights; in contrast, Section 5 considers the second half where a conditional model of interest is correctly specified but the weights may be misspecified; Section 6 studies the estimation of average and QTEs within the proposed framework; Section 7 provides supporting Monte Carlo evidence under two cases of misspecification: the correct conditional model with misspecified weights and misspecified conditional model with correct weights; Section 8 applies the proposed method to the NSW job training program; and Section 9 concludes.

2 Potential outcomes and the population models

Let Y(1) and Y(0) denote potential outcomes for the treatment and control states, respectively, and let W be an indicator for the binary treatment. Then,

(1) Y = Y(0)(1 − W) + Y(1)W.

Also, let X be a fixed vector of covariates, which includes an intercept. Some feature of the distribution of (Y(g), X) is assumed to depend on a finite vector θ_g. Let q(Y(g), X, θ_g) be an objective function; examples include the smooth least squares objective, q(·) = (Y(g) − X′θ_g)², or the non-smooth quantile regression objective, q(·) = c_τ(Y(g) − X′θ_g), where c_τ(u) = (τ − 1{u < 0})u is the asymmetric loss function for a random variable u.

Assumption 1

(Identification of θ_g0) The parameter vector θ_g0 ∈ Θ_g is a unique solution to the population minimization problem, min_{θ_g ∈ Θ_g} E[q(Y(g), X, θ_g)], for each g = 0, 1.

Assumption 1 defines the parameter of interest as the one that uniquely solves the population optimization problem. If q(·) involves a misspecified conditional feature like a conditional mean, variance, or even the full conditional distribution, Assumption 1 guarantees a unique pseudo-true solution [31]. In that case, determining whether the pseudo-truth, θ_g0, is a meaningful parameter will depend on the conditional feature being studied and the estimation method used. For example, least squares can provide us with the best linear approximation to the true conditional mean, even if one misspecifies the true function. Similarly, Angrist et al. [32] established the approximation properties of quantile regression, where one can still estimate the best linear approximation to the true conditional quantile function under misspecification. In both cases, θ_g0 indexes linear projections (LPs) to different conditional models. On the other hand, when q(·) involves a correctly specified model, θ_g0 has a straightforward interpretation.

2.1 Nonrandom treatment assignment and missing outcomes

Let S be a binary indicator that denotes whether the outcome is observed or missing. Then,

(2) Y = Y(0)(1 − W) + Y(1)W, if S = 1; missing, if S = 0.

Given that the main objective of this article is to consistently estimate θ_g0, a common empirical strategy is to use only complete cases for estimation. This means solving the treatment and control group problems,

(3) min_{θ_1 ∈ Θ_1} Σ_{i=1}^N S_i W_i q(Y_i, X_i, θ_1) and min_{θ_0 ∈ Θ_0} Σ_{i=1}^N S_i (1 − W_i) q(Y_i, X_i, θ_0),

where q(Y, X, θ_g) = W q(Y(1), X, θ_1) + (1 − W) q(Y(0), X, θ_0). The estimator that solves (3) is called the unweighted estimator and is denoted by θ̂_g^u. This will be consistent if it identifies θ_g0 in the population. For example, consider

Y(g) = X′θ_g + U(g), g = 0, 1, where E[X′U(g)] = 0.

In this case, even if the treatment is randomly assigned, missingness in the outcome may still be correlated with treatment, observable factors, or both. Hence, the population first-order condition for the selected sample, E[S W X′U(g)] = 0, need not hold even though E[X′U(g)] = 0. Therefore, θ_g0 cannot be identified from the complete cases alone.
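The failure above can be made concrete with a small simulation. This is an illustrative sketch (the data-generating process and all variable names are hypothetical, not from the article): the conditional mean of Y(1) is quadratic, so the linear model is only a linear projection with E[X′U(1)] = 0; MAR missingness that depends on X1 tilts the complete-case sample away from the population, and weighting by the true probabilities restores the population projection.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

# Hypothetical DGP: the true conditional mean of Y(1) is quadratic, so the
# linear model is only a linear projection (LP) with E[X'U(1)] = 0.
# The population LP of Y(1) on (1, X1) has intercept 2 and slope 1.
X1 = rng.standard_normal(N)
Y1 = 1.0 + X1 + X1**2 + rng.standard_normal(N)

# Randomized treatment (p(x) = 0.5) and MAR missingness that depends on X1,
# with r(x) bounded away from zero.
W = rng.binomial(1, 0.5, N)
r = 0.2 + 0.6 / (1.0 + np.exp(-X1))
S = rng.binomial(1, r)

X = np.column_stack([np.ones(N), X1])

def wls(Xm, y, w):
    # weighted least squares via rescaling by sqrt(w)
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(Xm * sw[:, None], y * sw, rcond=None)
    return beta

cc = (S == 1) & (W == 1)                           # complete treated cases
theta_u = wls(X[cc], Y1[cc], np.ones(cc.sum()))    # unweighted (complete-case)
theta_w = wls(X[cc], Y1[cc], 1.0 / (r[cc] * 0.5))  # weighted by 1/(r(X) p(X))

print(theta_u)  # slope noticeably above 1: the LP is not identified
print(theta_w)  # close to the population LP (2, 1)
```

Only the complete cases enter either fit; the difference is purely the weighting, which undoes the tilt that missingness induces in the distribution of X1.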

3 Identification of the parameter of interest

For identification of the main parameter, I make the following assumption.

Assumption 2

(Strong ignorability) Assume (Y(0), Y(1)) ⫫ W | X.

  1. The vector of covariates, X , is always observed for the entire sample.

  2. For all x ∈ 𝒳, define p(x) = P(W = 1 | X = x) such that κ < p(x) < 1 for a constant κ > 0.

This indicates that conditioning on covariates is enough to parse out any systematic differences between the treatment and control groups, also known as unconfoundedness. Part (i) requires that we observe these covariates for all individuals. Part (ii) is an overlap condition that ensures that we observe units in both the treatment and control groups for each value of x in the population. Previous literature has found several situations where unconfoundedness is a tenable assumption. This is especially true when pre-treatment values of the outcome variable are available. For example, Lalonde [33] and Hotz et al. [34] have shown that controlling for pre-training earnings alone reduces significant bias between non-experimental and experimental estimates. The literature assessing teacher impact on student achievement has reported similar findings with pre-test scores [3537].

Assumption 3

(Missing at Random) Assume (Y(0), Y(1)) ⫫ S | X, W. (i) In addition to X, W is always observed for the entire sample. (ii) For each (x, w) ∈ (𝒳, 𝒲), define r(x, w) = P(S = 1 | X = x, W = w) such that r(x, w) > η for a constant η > 0 and w = 0, 1.

Assumption 4

(Random Sampling) {(Y_i, X_i, W_i, S_i); i = 1, 2, …, N} are i.i.d. draws from an infinite population.

Assumption 3 is known as missing at random or MAR and represents an ignorable missing data mechanism. It implies that missingness depends only on observables and not on the missing values of the variable itself [38]. This includes missing completely at random as a special case [1]. The need to condition on W in addition to X helps to deal with the possibility that treatment itself can alter the probability of observing the outcome. This is especially useful in explaining cases of differential nonresponse. Parts (i) and (ii) have similar interpretations as before. Finally, Assumption 4 is a standard random sampling assumption.

Lemma 1

(Identification) Given Assumptions 1–3, assume (i) q(Y(g), X, θ_g) is a real-valued function for all (Y(g), X) and (ii) E[q(Y(g), X, θ_g)] < ∞ for all θ_g ∈ Θ_g, g = 0, 1; then

E[ (SW / (r(X, W) p(X))) q(Y, X, θ_1) ] = E[q(Y(1), X, θ_1)]

and

E[ (S(1 − W) / (r(X, W)(1 − p(X)))) q(Y, X, θ_0) ] = E[q(Y(0), X, θ_0)].

Define ω_1 = SW / (r(X, W) p(X)) and ω_0 = S(1 − W) / (r(X, W)(1 − p(X))) for notational simplicity. Lemma 1 implies that solving the population problem with weights ω_g is equivalent to solving the original M-estimation problem given in Assumption 1. The proof uses two applications of the law of iterated expectations (LIE) with unconfoundedness and MAR to arrive at the above result.

Remark

Given that treatment selection is widely viewed as a form of a missing data problem, one argument is to simply combine the two selection problems into one. Imagine a single binary indicator, D = SW. Arguably, existing IPW results could then be applied directly using D. While this may be convenient, such an approach fails to acknowledge that (i) S and W often represent two different selection problems, and (ii) combining them into one may lead to a loss in efficiency. Hence, a more rigorous treatment necessitates considering each issue separately.

4 Asymptotic theory when the conditional feature of interest is misspecified

Given that r(X, W) and p(X) are unknown, the following assumptions posit that we have a correctly specified model for the propensity score and missing outcome probability. Since both W and S are binary responses, estimation of γ_0 and δ_0 using maximum likelihood estimation (MLE) will be asymptotically efficient under correct specification of these functions. Consistency and asymptotic normality of γ̂ and δ̂ follow from Theorems 2.5 and 3.3 of [39].

Assumption 5

(Correct parametric specification of probability models) Assume that

  1. there exists a known parametric function G(X, γ) for p(X), where γ ∈ Γ and 0 < G(X, γ) < 1. Similarly, there exists a known parametric function R(X, W, δ) for r(X, W), where δ ∈ Δ and R(X, W, δ) > 0;

  2. there exist γ_0 ∈ Γ and δ_0 ∈ Δ s.t. p(X) = G(X, γ_0) and r(X, W) = R(X, W, δ_0).

The “doubly weighted” estimator is then defined as follows:

(4) θ̂_g = arg min_{θ_g ∈ Θ_g} Σ_{i=1}^N ω̂_ig q(Y_i, X_i, θ_g),

where

ω̂_i1 ≡ S_i W_i / (R(X_i, W_i, δ̂) G(X_i, γ̂)) and ω̂_i0 ≡ S_i (1 − W_i) / (R(X_i, W_i, δ̂)(1 − G(X_i, γ̂)))

are the estimated weights for solving the treatment and control group problems, respectively. Let d_i and b_i denote the scores of the binary response log-likelihood problems for estimating the propensity score and missing outcome probability models, evaluated at the probability limits γ_0 and δ_0, respectively. Also, let h(θ_g0) ≡ h_g denote the score of q(·) at the true parameter value and assume that it exists with probability one.
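The two-step procedure can be sketched as follows. This is a minimal, self-contained illustration under assumed conditions (the DGP and the `logit_mle` helper are hypothetical): both binary-response models are fit by Newton's method in step one, and the estimated probabilities are plugged in as weights for a least squares problem in step two. The conditional mean is deliberately quadratic, so the linear model is only a linear projection and correct weights are needed for identification.

```python
import numpy as np

def logit_mle(Xm, y, iters=50):
    """Binary-response MLE (logit) via Newton's method."""
    beta = np.zeros(Xm.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Xm @ beta))
        grad = Xm.T @ (y - p)
        hess = (Xm * (p * (1 - p))[:, None]).T @ Xm
        beta = beta + np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(1)
N = 200_000
X1 = rng.standard_normal(N)
X = np.column_stack([np.ones(N), X1])

# Correctly specified logit models for p(X) and r(X, W) (Assumption 5).
W = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.2 + 0.6 * X1))))
XW = np.column_stack([X, W])
S = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.5 + 0.5 * X1 + 0.4 * W))))

# Quadratic conditional mean: the linear model is only an LP of Y(1) on
# (1, X1), with population coefficients (2, 1).
Y = 1.0 + X1 + X1**2 + rng.standard_normal(N)  # Y(1), observed when S = W = 1

# Step 1: estimate gamma and delta by binary-response MLE.
ghat = logit_mle(X, W)
dhat = logit_mle(XW, S)
Ghat = 1.0 / (1.0 + np.exp(-X @ ghat))
Rhat = 1.0 / (1.0 + np.exp(-XW @ dhat))

# Step 2: plug in the estimated weights and solve the weighted problem.
cc = (S == 1) & (W == 1)
sw = np.sqrt(1.0 / (Rhat[cc] * Ghat[cc]))
theta1, *_ = np.linalg.lstsq(X[cc] * sw[:, None], Y[cc] * sw, rcond=None)
print(theta1)  # close to the population LP (2, 1)
```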

Theorem 1

(Asymptotic normality) Under Assumptions 1–5 and conditions (1)–(13) in the Appendix, √N(θ̂_g − θ_g0) →_d N(0, H_g^{−1} Ω_g H_g^{−1}), where Ω_g = E(l_ig l_ig′) − E(l_ig b_i′)E(b_i b_i′)^{−1}E(b_i l_ig′) − E(l_ig d_i′)E(d_i d_i′)^{−1}E(d_i l_ig′) for each g = 0, 1, and l_ig ≡ ω_ig h_ig is the score of the weighted objective function evaluated at θ_g0.

The asymptotic variance expression derived above offers some interesting insights. First, the middle term, Ω_g, represents the variance of the residual from the population regression of the weighted score, l_ig, on the two binary response scores, b_i and d_i. Note that the covariance term between the two MLE scores is zero since they are conditionally independent.

Second, the expression for Ω_g has an efficiency implication for θ̂_g. When one is only willing to assume identification of θ_g0 in the unconditional sense of Assumption 1, it is potentially more efficient to estimate the two weights even when they are known. To show this formally, assume that p(X) and r(X, W) are known and let θ̃_g be the doubly weighted estimator that uses the known weights, ω_g. Then,

Corollary 1

(Efficiency gain with estimated weights) Under the assumptions of Theorem 1,

Avar[√N(θ̃_g − θ_g0)] − Avar[√N(θ̂_g − θ_g0)] = H_g^{−1} Σ_g H_g^{−1} − H_g^{−1} Ω_g H_g^{−1} = H_g^{−1}(Σ_g − Ω_g) H_g^{−1}

is positive semi-definite, where Σ_g = E(l_ig l_ig′).

In other words, we do no worse, asymptotically, by estimating the weights even when we actually know them. This result has also been called the “efficiency puzzle” and has been studied in Wooldridge [7] and others [26–29]. It is understood to arise from the suboptimal use of moment conditions in two-step procedures.

5 The conditional feature of interest is correctly specified

This section discusses the second half of the asymptotic theory for the doubly weighted estimator where identification of the parameter in question is formalized using a strong identification assumption.

Assumption 6

(Strong conditional identification of θ_g0) The parameter vector θ_g0 ∈ Θ_g is the unique solution to the population minimization problem, min_{θ_g ∈ Θ_g} E[q(Y(g), X, θ_g) | X], for each g = 0, 1.

Assumption 6 describes situations where a conditional feature of interest is correctly specified. This can be seen as strengthening the identification assumption in Section 4 since LIE implies that θ g 0 will also be a solution to the unconditional M-estimation problem.

An implication of this identification argument is that θ_g0 solves the conditional score of the objective function, i.e., E[h_g | X] = 0. For instance, the conditional score will be zero when estimating a correctly specified conditional mean function using either least squares or quasi maximum likelihood in the linear exponential family. It would also hold for a correctly specified conditional quantile function estimated either using quantile regression or quasi maximum likelihood in the tick exponential family [40].

Under this conditional identification assumption, correct specification of the probability weights is not required for the doubly weighted estimator to be consistent for θ g 0 . In other words, R ( , , δ ) and G ( , γ ) are allowed to be misspecified. Formally,

Assumption 7

(Parametric specification of probability models) Assume that

  1. First part of Assumption 5 holds.

  2. There exist γ* ∈ Γ and δ* ∈ Δ such that plim(γ̂) = γ* and plim(δ̂) = δ*, respectively.

Under this setting, the weights are given by

ω*_1 = SW / (R(X, W, δ*) G(X, γ*)) and ω*_0 = S(1 − W) / (R(X, W, δ*)(1 − G(X, γ*))),

where γ̂ and δ̂ solve the same binary response problems as before but converge to probability limits given by the pseudo-true values γ* and δ*, respectively [31]. The identification argument in this case can be briefly explained as follows:

(5) θ̂_g = arg min_{θ_g} Σ_{i=1}^N ω̂_ig q(Y_i, X_i, θ_g) →_p arg min_{θ_g} E[ω*_g q(Y(g), X, θ_g)] = arg min_{θ_g} E[ξ_g(X) q(Y(g), X, θ_g)] = arg min_{θ_g} E[ξ_g(X) E(q(Y(g), X, θ_g) | X)] = θ_g0,

where ξ_g(X) > 0 is a function of the weights. If θ_g0 is a solution to the conditional problem E[q(·) | X], it will also solve equation (5), since multiplication by ξ_g(X) does not affect the conditional minimization problem. Therefore, solving the doubly weighted objective function identifies the parameter even if the weights are misspecified. Theorem 2 establishes asymptotic results under this case.
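The robustness of this argument to misspecified weights can be illustrated numerically. In the sketch below (hypothetical DGP; `bad_R` and `bad_G` are deliberately wrong but positive functions of the observables), the conditional mean is correctly specified and linear, so weighting by incorrect probabilities still recovers θ_g0.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 100_000
X1 = rng.standard_normal(N)
X = np.column_stack([np.ones(N), X1])

# Correctly specified conditional mean: E[Y(1)|X] = 2 + X1 (Assumption 6).
# Heteroskedastic noise is fine; it does not affect consistency.
Y = 2.0 + X1 + rng.standard_normal(N) * (1.0 + 0.5 * np.abs(X1))

# True (unknown to the analyst) selection mechanisms.
W = rng.binomial(1, 1.0 / (1.0 + np.exp(-0.5 * X1)))
S = rng.binomial(1, 0.3 + 0.5 / (1.0 + np.exp(-X1)))

cc = (S == 1) & (W == 1)

# Deliberately wrong weights: any positive functions of (X, W) will do.
bad_R = 0.5 + 0.1 * X1**2     # not the true r(X, W)
bad_G = np.full(N, 0.4)       # not the true p(X)
sw = np.sqrt(1.0 / (bad_R[cc] * bad_G[cc]))

theta, *_ = np.linalg.lstsq(X[cc] * sw[:, None], Y[cc] * sw, rcond=None)
print(theta)  # close to (2, 1) despite the misspecified weights
```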

Theorem 2

(Asymptotic normality under strong identification) Under Assumptions 2–4, 6, and 7, and conditions (1)–(13) in the Appendix, √N(θ̂_g − θ_g0) →_d N(0, H_g^{−1} Ω_g H_g^{−1}), where Ω_g = E(l_ig l_ig′), with H_g and l_ig defined as in Theorem 1 and asymptotic weights given by ω*_ig.

Unlike the previous section, Ω_g is now simply the variance of the weighted score of q(·) without the first-stage adjustment for the estimated probabilities. This is because, under Assumption 6, the correlation between the score functions of the first- and second-step estimating equations is zero, i.e., E(l_ig b_i′) = E(l_ig d_i′) = 0. In other words, when θ_g0 is correctly specified for a conditional feature of interest and an appropriate estimation method is used, the moment conditions are used optimally.

A simpler expression for Ω_g also means that we can no longer exploit the correlation between scores to obtain a more efficient estimator of θ_g0. Again, let θ̃_g be the estimator that uses the true weights, ω_g. Then, we have the following result.

Corollary 2

(No gain with estimated weights under strong identification) Under the assumptions of Theorem 2, Avar[√N(θ̃_g − θ_g0)] = Avar[√N(θ̂_g − θ_g0)] = H_g^{−1} Ω_g H_g^{−1}.

A special case is when ω*_g is just a constant, since R(·,·) and G(·,·) are allowed to be any positive functions of X and W. This implies that the unweighted estimator, θ̂_g^u, which does not weight at all, is also consistent for θ_g0 under the results of Theorem A.2. In this case, one may turn to asymptotic efficiency to guide the choice between weighting and not weighting at all. The following result says that if the objective function satisfies the generalized conditional information matrix equality (GCIME), the unweighted estimator is asymptotically more efficient than any weighted counterpart (correct weights or not).

Corollary 3

(Efficiency gain with unweighted estimator under GCIME) Under assumptions of Theorem 2, if we additionally suppose that the objective function satisfies GCIME in the population, which is defined as:

(6) E[h(Y(g), X, θ_g0) h(Y(g), X, θ_g0)′ | X] = σ²_0g ∇_{θ_g} E[h(Y(g), X, θ_g0) | X] = σ²_0g A(X, θ_g0),

then Avar[√N(θ̂_g − θ_g0)] = H_g^{−1} Ω_g H_g^{−1} and Avar[√N(θ̂_g^u − θ_g0)] = (H_g^u)^{−1} Ω_g^u (H_g^u)^{−1}, and

Avar[√N(θ̂_g − θ_g0)] − Avar[√N(θ̂_g^u − θ_g0)]

is positive semi-definite.

The proof of this theorem follows from noting that the difference in the two asymptotic variances can be expressed as the expected outer product of the population residuals from the regression of B_i on D_i, which are weighted versions of the square root of the matrix A_i (see Appendix B for details). Hence, the difference is positive semi-definite.

GCIME holds in a variety of estimation contexts. In the case of full maximum likelihood, GCIME holds for q(Y(g), X, θ_g) = −ln f_g(Y | X, θ_g), where f_g(·) is the true conditional density, with σ²_0g = 1. For estimating conditional mean parameters using quasi maximum likelihood estimation in the linear exponential family, GCIME holds if Var(Y(g) | X) = σ²_0g v[m(X, θ_g0)]. In other words, GCIME will be satisfied if Var(Y(g) | X) satisfies the generalized linear model assumption, irrespective of whether the higher-order moments of the conditional distribution correspond with the chosen quasi-log likelihood or not. For estimation using nonlinear least squares, GCIME holds for q(Y(g), X, θ_g) = [Y(g) − m(X, θ_g)]² under homoskedasticity. In all these cases, the unweighted estimator will be more efficient than any weighted counterpart. Otherwise, the two may not be easy to rank.

6 Estimation of treatment effects

I now use the asymptotic results discussed in Sections 4 and 5 for the estimation of ATE and QTEs, which can be expressed as functions of the doubly weighted estimator, θ̂_g.

6.1 Doubly robust estimation of ATE

Let m(X, θ_g) be a parametric model for the conditional mean, E[Y(g) | X]. Define

(7) Δ_ate = E[m(X, θ_1) − m(X, θ_0)].

6.1.1 First half: Correct conditional mean

If the mean model is correctly specified, one could consistently estimate θ_g0 using nonlinear least squares, where the estimator solves a weighted nonlinear least squares problem, i.e.,

θ̂_g = arg min_{θ_g} Σ_{i=1}^N ω̂_ig (Y_i − m(X_i, θ_g))².

Since we are operating under Assumption 6, results from Section 5 dictate that the doubly weighted estimator is consistent for θ_g0 irrespective of correct or incorrect weights. This forms the “first part” of the DR result for ATE estimation. One could also replace q(·) with a quasi-log likelihood and still consistently estimate the conditional mean parameters, θ_g0, for g = 0, 1.

6.1.2 Second half: Correct weights

A consistent estimator of the ATE can generally not be obtained using equation (7) if the conditional mean function is misspecified. In the generalized linear model literature, however, certain combinations of quasi-log likelihood and link functions lead to first-order conditions such that

E[Y(g)] = E[h(X′θ_g0)]

even though h(X′θ_g0) is misspecified for the true conditional mean. By allowing misspecification in the mean model, we are operating under Assumption 1, which implies that weighting is crucial for identification of the pseudo-true parameter, θ_g0 [31]. This forms the “second half” of the DR result for ATE estimation and draws on the theory outlined in the first half (Section 4) under the weak identification assumption.

In particular, the estimation strategy is to choose the mean model corresponding to the canonical link, with mean function h(·) a strictly increasing function on the real line, and the quasi-log likelihood associated with a linear exponential family. This choice will depend on the range and nature of Y. Then, θ_g0 solves the following population first-order conditions:

(8) E[ ω_g h′(X′θ_g0) X′ (Y − h(X′θ_g0)) / v[h(X′θ_g0)] ] = 0,

where v[h(·)] is the variance function associated with the mean. With the chosen canonical link, equation (8) simplifies to

(9) E[ω_g X′(Y − h(X′θ_g0))] = 0.

Since X includes an intercept, the first-order conditions give us

(10) E[Y(g)] = E[h(X′θ_g0)].

The doubly weighted estimator, θ̂_g, then solves the sample analogue of equation (9), i.e.,

Σ_{i=1}^N ω̂_ig X_i′(Y_i − h(X_i′θ̂_g)) = 0.

For Y with unrestricted support, the normal quasi-log likelihood and identity link function (h(X′θ_g) = X′θ_g) deliver the mean-fitting property. Other combinations of quasi-log likelihood and canonical link functions can be found in Table 2 of the study by Negi and Wooldridge [41].
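The mean-fitting property can be checked numerically. This sketch (hypothetical DGP, not from the article) fits a weighted Bernoulli quasi-log likelihood with the canonical logit link by Newton's method; the logistic-linear mean is deliberately misspecified, yet with correct weights the full-sample average of the fitted means recovers E[Y(1)], as in equation (10).

```python
import numpy as np

rng = np.random.default_rng(4)
N = 200_000
X1 = rng.standard_normal(N)
X = np.column_stack([np.ones(N), X1])
sig = lambda z: 1.0 / (1.0 + np.exp(-z))

# Binary Y(1) whose true response probability is NOT logistic-linear in X1,
# so the mean model sig(X'theta) is misspecified.
Y1 = rng.binomial(1, sig(-0.5 + X1 + 0.5 * X1**2))

# Known (correct) selection probabilities, for illustration.
p = sig(0.4 * X1)
W = rng.binomial(1, p)
r = 0.3 + 0.5 * sig(X1)
S = rng.binomial(1, r)

cc = (S == 1) & (W == 1)
w = 1.0 / (r[cc] * p[cc])
Xc, yc = X[cc], Y1[cc]

# Weighted Bernoulli QMLE with the canonical (logit) link via Newton's
# method; its first-order condition is sum_i w_i X_i'(y_i - sig(X_i theta)) = 0.
theta = np.zeros(2)
for _ in range(50):
    m = sig(Xc @ theta)
    grad = Xc.T @ (w * (yc - m))
    hess = (Xc * (w * m * (1 - m))[:, None]).T @ Xc
    theta += np.linalg.solve(hess, grad)

# Mean-fitting: the full-sample average of the (misspecified) fitted mean
# recovers the population mean of Y(1).
print(np.mean(sig(X @ theta)), np.mean(Y1))
```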

Summary

DR estimation of ATE with double weighting

Case 1: Correct conditional mean and misspecified (or correct) weights: In this case, we are operating under Assumption 6 along with Assumption 7 (or Assumption 5 in the case of correct weights). Results in Section 5 will apply. The first half of this section discusses estimation of ATE under this scenario.

Case 2: Correct weights and misspecified mean: In this case, we are operating under Assumption 1. Results in Section 4 apply. The second half of this section discusses estimation of ATE under this scenario.

Combining the two halves, Δ̂_ate = (1/N) Σ_{i=1}^N {m(X_i, θ̂_1) − m(X_i, θ̂_0)} gives us a DR estimator of Δ_ate.

A similar result is discussed in Theorem 14.2 of the study by Ding [42], which shows that a weighted least squares fit of Y_i on (1, W_i, X_i, W_i X_i) with inverse propensity score weights produces a DR estimator of the ATE. This is identical to the inverse propensity weighted linear regression adjustment estimator discussed in the study by Imbens and Wooldridge [43], which is well known to be DR under the treatment selection problem.
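Both halves of the DR argument can be exercised in one simulation sketch (hypothetical DGP; the `dw_ate` helper is illustrative, not from the article): the ATE is estimated once with a misspecified linear mean but correct weights (identity link, normal quasi-log likelihood), and once with a correct quadratic mean but deliberately wrong constant weights. Both should be close to the true ATE of 2.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200_000
x = rng.standard_normal(N)
sig = lambda z: 1.0 / (1.0 + np.exp(-z))

# Potential outcomes; the true ATE is E[Y(1) - Y(0)] = 1 + E[x^2] = 2.
Y1 = 1.0 + x + x**2 + rng.standard_normal(N)
Y0 = x + rng.standard_normal(N)

p = sig(0.4 * x)                      # true propensity score
W = rng.binomial(1, p)
r = 0.3 + 0.5 * sig(x + 0.3 * W)      # true missing-outcome probability
S = rng.binomial(1, r)
Y = np.where(W == 1, Y1, Y0)

def wls(Xm, y, w):
    sw = np.sqrt(w)
    b, *_ = np.linalg.lstsq(Xm * sw[:, None], y * sw, rcond=None)
    return b

def dw_ate(Xm, correct_weights):
    """Doubly weighted regression-adjustment estimate of the ATE."""
    if correct_weights:
        w1, w0 = 1.0 / (r * p), 1.0 / (r * (1.0 - p))
    else:
        w1 = w0 = np.full(N, 2.5)     # arbitrary wrong (constant) weights
    t1 = (S == 1) & (W == 1)
    t0 = (S == 1) & (W == 0)
    b1 = wls(Xm[t1], Y[t1], w1[t1])
    b0 = wls(Xm[t0], Y[t0], w0[t0])
    return np.mean(Xm @ b1 - Xm @ b0)  # average over the FULL sample

Xlin = np.column_stack([np.ones(N), x])          # misspecified mean for g = 1
Xquad = np.column_stack([np.ones(N), x, x**2])   # correct mean for both arms

ate_a = dw_ate(Xlin, correct_weights=True)     # wrong mean, correct weights
ate_b = dw_ate(Xquad, correct_weights=False)   # correct mean, wrong weights
print(ate_a, ate_b)  # both close to 2
```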

6.2 Estimation of QTEs

In this section, I use double weighting to illustrate the estimation of three different quantile parameters, namely, the unconditional quantile treatment effect (UQTE), the CQTE, and a weighted linear approximation to the CQTE, each of which may be of interest to the researcher depending on whether quantiles of the conditional or unconditional outcome distribution are of interest.

$\mathrm{CQTE}_\tau$: Let $q_\tau(X, \theta_g(\tau))$ be a correctly specified parametric model for the conditional quantile function. Then,

$\mathrm{CQTE}_\tau(X) = q_\tau(X, \theta_1(\tau)) - q_\tau(X, \theta_0(\tau)).$

In this case, there are two methods that ensure consistent estimation of $\theta_{g0}(\tau)$. The first is quantile regression [44], where

(11) $\hat{\theta}_g(\tau) = \arg\min_{\theta_g(\tau) \in \Theta_g} \sum_{i=1}^{N} \hat{\omega}_{ig}\, c_\tau(Y_i - q_\tau(X_i, \theta_g(\tau))).$

Since we are operating under Assumption 6, the results in Section 5 dictate that weighting the objective function, irrespective of whether the weights are correctly specified, yields a consistent estimator of $\theta_g(\tau)$. One could also use quasi-maximum likelihood in the special "tick-exponential" family of distributions to consistently estimate the conditional quantile parameters. As shown by Komunjer [40], the quantile regression of Koenker and Bassett [44] is a special case of this class of quasi-maximum likelihood estimators.
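A direct, if naive, way to compute the weighted quantile regression estimator in equation (11) is to minimize the weighted check-function objective numerically. The sketch below is illustrative only: a derivative-free optimizer with an OLS warm start stands in for the linear-programming solvers used by dedicated quantile regression software.

```python
import numpy as np
from scipy.optimize import minimize

def check_loss(u, tau):
    """Koenker-Bassett check function c_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (u < 0))

def weighted_qreg(X, y, w, tau, theta_init):
    """Minimize the w-weighted check-function objective numerically."""
    obj = lambda th: np.sum(w * check_loss(y - X @ th, tau))
    return minimize(obj, theta_init, method="Nelder-Mead",
                    options={"maxiter": 2000, "xatol": 1e-8, "fatol": 1e-8}).x

# toy check: equally weighted median regression recovers theta approximately
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
theta_true = np.array([1.0, 2.0])
y = X @ theta_true + rng.normal(size=500)     # error has median zero
start = np.linalg.lstsq(X, y, rcond=None)[0]  # OLS warm start
theta_hat = weighted_qreg(X, y, np.ones(500), 0.5, start)
```

With the doubly weighted estimator, `w` would hold the estimated weights $\hat{\omega}_{ig}$ for the complete cases in group $g$.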

LP to $\mathrm{CQTE}_\tau$: In the event that $q_\tau(X, \theta_g(\tau))$ is misspecified as being linear, one can still interpret it as providing the best linear approximation to the true conditional quantile function. This property of quantile regression is analogous to the approximation property of linear regression under conditional mean misspecification and was established in the study by Angrist et al. [32]. Given such misspecification in the conditional quantile function, the first half requires the weights to be correct in order to consistently estimate the LP parameters, $\theta_g(\tau)$, i.e.,

(12) $\hat{\theta}_g(\tau) = \arg\min_{\theta_g \in \Theta_g} \sum_{i=1}^{N} \hat{\omega}_{ig}\, c_\tau(Y_i - X_i \theta_g(\tau))$

will be consistent for the pseudo-true parameter, $\theta_{g0}(\tau)$, which indexes the population LP to the true conditional quantile function. Then,

$\widehat{\mathrm{LP}}[\mathrm{CQTE}_\tau(X)] = X[\hat{\theta}_1(\tau) - \hat{\theta}_0(\tau)]$

is interpreted as providing the best LP to the true CQTE.

Unconditional $\mathrm{QTE}_\tau$ ($\mathrm{UQTE}_\tau$): Let $\theta_g(\tau)$ be the $\tau$th unconditional quantile of the potential outcome, $Y(g)$. Then,

$\widehat{\mathrm{UQTE}}_\tau = \hat{\theta}_1(\tau) - \hat{\theta}_0(\tau),$

where

$\hat{\theta}_g(\tau) = \arg\min_{\theta_g(\tau) \in \Theta_g} \sum_{i=1}^{N} \hat{\omega}_{ig}\, c_\tau(Y_i - \theta_g(\tau)).$

In this case, weighting is crucial for consistent estimation of $\theta_g(\tau)$ since no quantile model is being specified, and we are operating under Assumption 1.
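Because no covariates enter the objective here, the minimizer of the weighted check function has a closed form: the weighted sample quantile, i.e., the smallest outcome value at which the normalized cumulative weight reaches $\tau$. A minimal sketch (the function name is hypothetical):

```python
import numpy as np

def weighted_quantile(y, w, tau):
    """Minimizer of sum_i w_i * c_tau(y_i - theta): the weighted
    inverse-CDF quantile of the observed outcomes."""
    order = np.argsort(y)
    y_s, w_s = y[order], w[order]
    cdf = np.cumsum(w_s) / np.sum(w_s)            # normalized cumulative weight
    return y_s[np.searchsorted(cdf, tau)]         # first y with cdf >= tau

# with equal weights this reduces to the usual empirical quantile
q_equal = weighted_quantile(np.array([4.0, 1.0, 3.0, 2.0]), np.ones(4), 0.5)
# upweighting the largest observation shifts the weighted median toward it
q_tilted = weighted_quantile(np.array([1.0, 2.0, 3.0, 4.0]),
                             np.array([1.0, 1.0, 1.0, 5.0]), 0.5)
```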

7 Simulations

This section compares the empirical distributions of average and QTE estimators using unweighted, propensity score weighted (ps-weighted), and doubly weighted ($d$-weighted) estimators. For estimating the ATE, data are simulated from a probit model, $Y(g) = \mathbf{1}(X\theta_{g0} + U(g) > 0)$, where $X$ includes an intercept and two covariates. For estimating the QTE parameters, I use an exponential data generating process where the potential outcomes are generated as $Y(g) = \exp[X\theta_{g0} + U(g)]$ for $g = 0, 1$. For each setting, the covariates and latent errors are drawn from two independent bivariate normal distributions. The treatment assignment and missing outcome mechanisms satisfy the unconfoundedness and MAR assumptions, with a 41% probability of being treated and a 38% probability of being observed in the population, respectively. The empirical distributions of the unweighted, ps-weighted, and $d$-weighted estimators are then obtained for a sample of size 5,000 using 1,000 replication draws from a population of 1 million observations. Additional details can be found in Section S.1 of the Supplementary Material.
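The probit design for the ATE simulations can be sketched as follows. The coefficient values below are illustrative placeholders, not the ones used in Section S.1 of the Supplementary Material, and the treatment and selection indices are likewise assumed forms:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(123)
N = 5_000
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])  # intercept + 2 covariates
theta1 = np.array([0.4, 0.5, -0.3])   # hypothetical theta_{10}
theta0 = np.array([0.0, 0.5, -0.3])   # hypothetical theta_{00}
# binary potential outcomes from the probit DGP Y(g) = 1(X theta_{g0} + U(g) > 0)
Y1 = (X @ theta1 + rng.normal(size=N) > 0).astype(float)
Y0 = (X @ theta0 + rng.normal(size=N) > 0).astype(float)
W = rng.binomial(1, norm.cdf(X @ np.array([-0.25, 0.3, 0.2])))  # unconfounded assignment
S = rng.binomial(1, norm.cdf(X @ np.array([-0.3, 0.2, -0.2])))  # MAR selection indicator
Y_obs = np.where(S == 1, np.where(W == 1, Y1, Y0), np.nan)      # Y missing when S = 0
```

Since both $W$ and $S$ depend only on $X$, the design satisfies unconfoundedness and MAR by construction.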

The discussion of the results is centered around two main misspecification scenarios: (i) when some conditional model (the conditional mean or conditional quantile function) is misspecified and (ii) when the weights are misspecified (enumerated in Tables S.1.1 and S.1.2). These correspond to the scenarios outlined in the first and second halves of the asymptotic theory.

7.1 ATE: Results

Case 1 in Figure 1 considers a misspecified mean function but correct probability weights. This is the principal case covered in Section 4 where weighting is crucial. As one can see, the empirical distribution of the doubly weighted estimator is centered on the true ATE, whereas the distribution for the unweighted estimator is shifted to the right.

Figure 1

Empirical distribution of estimated ATE for N = 5,000 . Notes: This figure plots the empirical distributions of the unweighted, ps-weighted, and d -weighted ATE estimates using 1,000 Monte Carlo simulation draws of sample size 5,000. The average treated sample size is N 1 = 5,000 × 0.41 × 0.38 = 779 , and the average control sample size is N 0 = 5,000 × ( 1 0.41 ) × 0.38 = 1 , 121 . The true ATE = 0.096, and the population is generated using a million observations. The unweighted estimator does not weight the observed data. The ps-weighted estimator weights to correct only for nonrandom assignment, and the d -weighted estimator weights by both the treatment and missing outcomes probabilities.

Case 2 of the same figure depicts the scenario of a correctly specified conditional mean function but misspecified weights. Here, weighting does not matter for consistent estimation of the ATE since the mean function is correctly specified. All three empirical distributions (unweighted, ps-weighted, and $d$-weighted) coincide and are centered on the true ATE.

7.2 QTE: Results

Figure 2 considers the case when the conditional quantile function is misspecified but the weights are correct. Using results obtained in the study by Angrist et al. [32], I interpret the solution to the doubly weighted problem in equation (12) as providing a consistent weighted linear approximation to the true conditional quantile function, which is then used to estimate an LP to the true CQTE. Figure 2 plots the bias of the estimated LP from the three estimators, relative to the true LP, as a function of $X_1$. The doubly weighted estimator performs the best.

Figure 2

Relative bias of the estimated LP (CQTE) as a function of X 1 when the conditional quantile function is misspecified but weights are correct. Notes: This figure plots the bias in the unweighted, ps-weighted, and d -weighted LPs to CQTE relative to the true population LP for N = 5,000 . The average treated sample size is N 1 = 5,000 × 0.41 × 0.38 = 779 , and the average control sample size is N 0 = 5,000 × ( 1 0.41 ) × 0.38 = 1,121 . The unweighted estimator does not weight the observed data. The ps-weighted estimator weights to correct only for nonrandom assignment, and the d -weighted estimator weights by the treatment and missing outcomes probabilities.

Next, I consider the case when the conditional quantile function is correctly specified but the weights are wrong. Since in this case one can consistently estimate the CQTE irrespective of the weights, Figure 3 plots the CQTE curve as a function of X 1 . In this case, as theory suggests, all three estimators of the CQTE function, i.e., unweighted, ps-weighted, and d -weighted, coincide with the true CQTE. Section S.1.2 of the Supplementary Material provides details about plotting the CQTE curve.

Figure 3

Estimated CQTE with true CQTE as a function of X 1 when conditional quantile function is correct but weights are misspecified. Notes: This figure plots the average d -weighted CQTE function with the true CQTE along X 1 for 1,000 Monte Carlo simulation draws of sample size N = 5,000 . Along with these two graphs, the figure also plots the individual function across the 1,000 simulation draws. The average treated sample is N 1 = 5,000 × 0.41 × 0.38 = 779 and average control sample is N 0 = 5,000 × ( 1 0.41 ) × 0.38 = 1,121 .

Finally, Figure 4 plots the empirical distribution of the UQTE for the three estimators. We find that the unweighted and $d$-weighted estimators have comparable finite-sample bias. Propensity score weighting performs the worst in both cases. All results correspond to the twenty-fifth quantile of the outcome distribution. The results for other quantiles can be found in Section S.5 of the Supplementary Material.

Figure 4

Empirical distribution of estimated UQTE. Notes: This figure plots the empirical distributions of the unweighted, ps-weighted, and d -weighted UQTE estimates using 1,000 Monte Carlo simulation draws of sample size 5,000. The average treated sample is N 1 = 5,000 × 0.41 × 0.38 = 779 , and the average control sample is N 0 = 5,000 × ( 1 0.41 ) × 0.38 = 1,121 . The unweighted estimator does not weight the observed data. The ps-weighted estimator weights to correct only for nonrandom assignment, and the d -weighted estimator weights by both the treatment and missing outcomes propensity score models to deal with nonrandom assignment and missing outcome problems.

8 Returns to job training

In this section, I apply the proposed estimator to the AFDC sample of women from the NSW training program compiled by Calónico and Smith [30] (CS, hereafter). NSW was a transitional, subsidized work experience program implemented as a randomized experiment in the United States between 1975 and 1979. CS replicate Lalonde's [33] within-study analysis for the AFDC women in the program, where the purpose of such an analysis is to evaluate how training estimates obtained from non-experimental identification strategies (assuming unconfoundedness) compare to experimental estimates. To compute the non-experimental estimates, CS combine the NSW experimental sample with two non-experimental comparison groups drawn from the Panel Study of Income Dynamics (PSID), called PSID-1 and PSID-2. I utilize this within-study feature to estimate how close the $d$-weighted estimates are to the experimental benchmark compared with the ps-weighted and unweighted estimates.

To construct these empirical bias measures, I first augment the CS sample to allow for women who had missing earnings information in 1979. This renders 26% of the experimental and 11% of the PSID samples missing. I then combine the experimental treatment group of NSW with three distinct comparison groups present in the CS dataset, namely, the experimental control group, and the two PSID samples, to compute the unweighted, ps-weighted, and d -weighted training estimates, respectively. The difference between the non-experimental estimate, obtained from using the d -weighted estimator, and the experimental estimate provides the first measure of estimated bias associated with the proposed strategy. Combining the experimental control group with the non-experimental comparison group gives a second measure of estimated bias [45]. I report both these bias measures for the average returns to training estimates.

Given the growing importance of estimating distributional impacts of job training programs, I also estimate returns to training at every tenth quantile of the 1979 earnings distribution. The role of double weighting is strong for estimating marginal quantiles since it serves to remove biases arising from the two selection problems.

8.1 Results

First, to evaluate whether women with missing earnings in 1979 were significantly different from those who were observed, Table 1 reports the mean and standard deviation of the woman's age, years of schooling, pre-training earnings, and other characteristics across the observed and missing samples. In terms of age, the women who were observed in the experimentally treated group of NSW and the PSID-1 sample were, on average, older than those who were missing. The observed women in PSID-1 were also more likely to be married. For the PSID-2 sample, women who were observed had, on average, more children and higher pre-training earnings. All these differences are statistically significant, indicating that covariates differed systematically between the missing and observed PSID women (see the non-experimental columns in Table 1). For the experimental group, we do not find the covariates to be systematically different between those who were observed and those who were missing (see the experimental columns in Table 1).

Table 1

Covariate means and p -values from the test of equality of two means for the observed and missing samples

Covariates Experimental Non-experimental
Control Treatment PSID-1 PSID-2
Missing Observed P(T > t) Missing Observed P(T > t) Missing Observed P(T > t) Missing Observed P(T > t)
Age, years 33.36 33.74 0.51 32.15 33.77 0.01 34.00 37.07 0.01 33.32 34.54 0.62
(7.30) (7.15) (7.39) (7.40) (10.50) (10.57) (10.81) (9.34)
Years of education 10.29 10.26 0.85 10.29 10.31 0.89 11.44 11.30 0.60 11.05 10.49 0.18
(1.93) (2.03) (2.05) (1.88) (2.17) (2.77) (1.73) (2.13)
Proportion of high school dropouts 0.70 0.68 0.57 0.69 0.70 0.77 0.43 0.45 0.73 0.55 0.59 0.68
(0.46) (0.47) (0.46) (0.46) (0.50) (0.50) (0.51) (0.49)
Proportion married 0.05 0.04 0.61 0.03 0.02 0.75 0.00 0.02 0.00 0.00 0.01 0.16
(0.21) (0.19) (0.16) (0.15) (0.00) (0.14) (0.00) (0.10)
Proportion Black 0.81 0.82 0.81 0.83 0.84 0.87 0.74 0.65 0.10 0.91 0.86 0.50
(0.39) (0.39) (0.38) (0.37) (0.44) (0.48) (0.29) (0.35)
Proportion Hispanic 0.12 0.13 0.87 0.13 0.12 0.64 0.01 0.02 0.82 0.05 0.02 0.62
(0.33) (0.33) (0.33) (0.32) (0.11) (0.12) (0.21) (0.15)
Number of children in 1975 2.33 2.23 0.34 2.14 2.19 0.69 1.54 1.71 0.33 2.41 2.97 0.05
(1.29) (1.34) (1.32) (1.29) (1.45) (1.78) (1.14) (1.79)
Real earnings in 1975 621.54 879.28 0.12 610.77 861.65 0.11 6927.95 7510.92 0.50 896.56 2211.45 0.02
(1,523.00) (2,194.93) (1,677.36) (2,005.53) (7,330.74) (7,541.41) (2,315.12) (3,567.50)
Observations 795 795 796 796 729 729 204 204

Notes: Along with the covariate means and standard deviation (in parentheses), the table also reports p -values from the test of equality for two means between the observed and missing samples. Real earnings in 1975 are expressed in terms of 1982 dollars.

Nonrandom assignment is an issue for the non-experimental samples because the comparison groups were drawn from the PSID after imposing only a partial version of the full NSW eligibility criteria. Table 2 provides descriptive statistics for the covariates by treatment status. As expected, the treatment and control groups of NSW are not observably different. In contrast, the women in the PSID-1 and PSID-2 groups are statistically different from the treatment group.

Table 2

Covariate means and p -values from the test of equality of two means, by treatment status

Covariates Treatment Control P(T > t) PSID-1 P(T > t) PSID-2 P(T > t)
Age, years 33.37 33.64 0.46 36.73 0.00 34.41 0.11
(7.42) (7.19) (10.60) (9.48)
Years of education 10.30 10.27 0.72 11.32 0.00 10.55 0.07
(1.92) (2.00) (2.71) (2.09)
Proportion of high school dropouts 0.70 0.69 0.73 0.45 0.00 0.59 0.00
(0.46) (0.46) (0.50) (0.49)
Proportion married 0.02 0.04 0.03 0.02 0.05 0.01 0.08
(0.15) (0.20) (0.13) (0.10)
Proportion Black 0.84 0.82 0.29 0.66 0.00 0.87 0.13
(0.37) (0.39) (0.47) (0.34)
Proportion Hispanic 0.12 0.13 0.59 0.02 0.00 0.02 0.00
(0.32) (0.33) (0.12) (0.16)
Number of children in 1975 2.17 2.26 0.21 1.70 0.00 2.91 0.00
(1.30) (1.32) (1.75) (1.73)
Real earnings in 1975 799.88 811.19 0.91 7446.15 0.00 2069.65 0.00
(1931.92) (2041.32) (7515.59) (3474.10)
Observations 796 795 729 204

Notes: Along with the covariate means and standard deviation (in parentheses), the table also reports p -values from the test of equality for two means. Column 4 tests for differences between the NSW treatment and control groups, and columns 6 and 8 report the same using PSID-1 and PSID-2 comparison groups, respectively. Real earnings in 1975 are expressed in terms of 1982 dollars.

Table 3 reports the $d$-weighted, ps-weighted, and unweighted average returns to training estimates using three different comparison groups: NSW control, PSID-1, and PSID-2. The unweighted (unadjusted and adjusted) experimental estimates given in row 1 are the same as the estimates reported by CS in Table 3 of their article. Overall, the doubly weighted experimental estimates are more stable than the single-weighted or unweighted estimates across the different regression specifications, ranging between $824 and $828. Moreover, the $d$-weighted estimator is DR for the ATE, which makes it more reliable than the other two.

Table 3

Unweighted and weighted earnings comparisons and estimated training effects using NSW and PSID comparison groups

| Comparison group | Unadjusted: Unweighted | Unadjusted: PS-weighted | Unadjusted: D-weighted | Adjusted (1): Unweighted | Adjusted (1): PS-weighted | Adjusted (1): D-weighted | Adjusted (2): Unweighted | Adjusted (2): PS-weighted | Adjusted (2): D-weighted |
|---|---|---|---|---|---|---|---|---|---|
| Post-training earnings estimates | | | | | | | | | |
| NSW (N = 1,185) | 821 (307.22) | 848 (304.04) | 824 (304.61) | 845 (303.60) | 852 (302.94) | 828 (303.53) | 864 (303.47) | 850 (302.96) | 826 (303.58) |
| PSID-1 (N = 1,016) | 799 (444.84) | 827 (503.00) | 803 (503.26) | 298 (428.60) | 909 (497.76) | 907 (501.54) | 335 (440.18) | 905 (518.54) | 904 (522.97) |
| PSID-2 (N = 720) | 31 (713.88) | 569 (1041.81) | 566 (1027.12) | 492 (664.46) | 1,040 (961.74) | 996 (953.80) | 698 (784.28) | 1,082 (1264.18) | 1,049 (1217.46) |
| Bias estimates using NSW control | | | | | | | | | |
| PSID-1 (N = 1,001) | -1,620 (431.75) | 169 (561.74) | 156 (553.07) | 493 (427.93) | 40 (499.91) | 21 (501.44) | 568 (434.59) | 38 (504.19) | 21 (507.02) |
| PSID-2 (N = 705) | 853 (707.87) | 228 (1041.44) | 212 (1025.87) | 109 (663.80) | 207 (962.85) | 200 (954.61) | 378 (759.75) | 17 (1195.47) | 24 (1156.39) |

Adjusted covariates (across the two adjusted specifications): pre-training earnings (1975), age, age², education, high school dropout, Black, Hispanic, marital status, number of children (1975).

Notes: This table reports unadjusted and adjusted post-training earnings differences between the NSW treatment and three different comparison groups, namely, NSW control, PSID-1 and PSID-2. The first row reports experimental training estimates that combine the NSW treatment and control group, whereas the second and third rows report non-experimental estimates computed from using the PSID-1 and PSID-2 groups, respectively. Each of the non-experimental estimates should be compared to the experimental benchmark. The second panel of the table reports bias estimates computed from combining the NSW control with PSID-1 and PSID-2 comparison groups, respectively. These represent a second measure of bias which should be compared to zero. Bootstrapped standard errors are given in parentheses and have been constructed using 10,000 replications. All values are in 1982 dollars. The samples used for estimating the training and bias estimates have been trimmed to ensure common support in the distribution of weights for the treatment and comparison groups. For more detail, see Section S.3 of the Supplementary Material.

For computing the ps-weighted and d-weighted non-experimental estimates, I first trim the sample to ensure common support between the treatment and comparison groups. Appendix S.3 describes the estimation of the two probability weights along with the sample trimming criteria. Trimming reduces the sample size from 1,248 to 1,016 observations for the PSID-1 estimates and from 782 to 720 observations for the PSID-2 estimates. A pattern that is consistent across the two sets of non-experimental estimates is that weighting gets us much closer to the benchmark than not weighting at all. For instance, the unweighted simple difference-in-means estimate of training using the PSID-1 comparison group is $799, whereas the weighted estimates are $827 and $803. Similarly, the unweighted PSID-1 estimate that controls for all covariates is $335, whereas the weighted estimates are $905 and $904.
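As a concrete illustration of the double weighting behind these estimates, the following numpy sketch computes a Hájek-normalized doubly weighted ATE on simulated data. It is a minimal sketch under illustrative assumptions: the true treatment propensity and missing-outcome probability are used in place of the estimated flexible-logit weights from the application, and all variable names and data-generating choices are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

# Simulated covariate, nonrandom treatment, and MAR missingness (illustrative DGP)
X = rng.normal(size=N)
p = 1 / (1 + np.exp(-0.5 * X))            # treatment propensity G(X)
W = rng.binomial(1, p)                    # treatment indicator
r = 1 / (1 + np.exp(-(0.5 + 0.5 * X)))    # missing-outcome probability R(X, W)
S = rng.binomial(1, r)                    # selection indicator (outcome observed)

Y1 = 1.0 + X + rng.normal(size=N)         # potential outcome under treatment (true ATE = 1)
Y0 = X + rng.normal(size=N)
Y = np.where(S == 1, W * Y1 + (1 - W) * Y0, np.nan)

# Double weights: inverse of (missingness probability) x (treatment probability)
w1 = S * W / (r * p)
w0 = S * (1 - W) / (r * (1 - p))
Yobs = np.nan_to_num(Y)                   # missing entries carry zero weight anyway

# Hajek (normalized) doubly weighted means and their difference
mu1 = np.sum(w1 * Yobs) / np.sum(w1)
mu0 = np.sum(w0 * Yobs) / np.sum(w0)
ate_hat = mu1 - mu0                       # should be close to the true ATE of 1
```

In the application, the two weights are instead fitted by binary-response MLE (flexible logits) before being plugged into the weighted means.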

The second panel of Table 3 reports the bias in training estimates from combining the experimental control group with the PSID comparison groups. A similar pattern is seen here with weighted bias estimates being much closer to zero than the unweighted estimates. These results suggest that the argument for weighting is strong when using a non-experimental comparison group where nonrandom assignment and missing outcomes are significant problems. Note that the large standard errors for the non-experimental estimates can be attributed to the small sample sizes and to the large residual variance of earnings in the PSID-1 and PSID-2 populations.
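The bootstrapped standard errors reported in Table 3 can be sketched as a nonparametric (pairs) bootstrap of the weighted difference in means. This is a minimal sketch on toy data with far fewer replications than the 10,000 used in the paper; the function and variable names are illustrative, and for simplicity the weights are held fixed across replications rather than re-estimated.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000

# Toy analysis sample: outcome, treatment, and precomputed double weights
y = rng.normal(size=n)
w = rng.binomial(1, 0.5, size=n)
wt = rng.uniform(0.5, 2.0, size=n)        # stand-in for the estimated double weights

def d_weighted_diff(y, w, wt):
    """Hajek-weighted treatment-control difference in means."""
    return (np.sum(wt * w * y) / np.sum(wt * w)
            - np.sum(wt * (1 - w) * y) / np.sum(wt * (1 - w)))

# Pairs bootstrap: resample units with replacement, re-estimate, take the SD
B = 500                                   # the paper uses 10,000 replications
stats = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)
    stats[b] = d_weighted_diff(y[idx], w[idx], wt[idx])

se_boot = stats.std(ddof=1)
```

In practice, the first-stage logits should be re-estimated within each bootstrap replication so that the standard error also reflects the estimation of the weights.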

Figure 5 plots the relative bias in UQTE estimates at every tenth quantile of the 1979 earnings distribution. Much like the average training estimates, we see that the weighted estimates consistently lie below the unweighted estimates for most quantiles, irrespective of whether we use the PSID-1 or PSID-2 nonexperimental group.

Figure 5

Relative estimated bias in UQTE estimates at different quantiles of the 1979 earnings distribution. (a) PSID-1 control group and (b) PSID-2 control group. Notes: This graph plots the bias in the unweighted, ps-weighted, and d-weighted UQTE estimates relative to the true experimental estimates across different quantiles of the 1979 earnings distribution. Panel (a) plots the relative bias estimates using the PSID-1 comparison group and panel (b) plots the same using the PSID-2 comparison group. The treatment and missing outcome propensity score models have been estimated as flexible logits, and the samples used for constructing these estimates have been trimmed to ensure common support across the two groups. The treatment propensity score has been estimated using the full experimental sample along with either the PSID-1 or PSID-2 comparison group. The UQTE estimates for $\tau < 0.46$ are omitted from the graph since these are zero.
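A UQTE at quantile $\tau$ can be computed as the difference between weighted quantiles of the treated and comparison outcome distributions. The sketch below uses only the treatment propensity weight on simulated data (a hypothetical simplification: the d-weighted version would additionally divide by the missing-outcome probability, and all names are illustrative).

```python
import numpy as np

def weighted_quantile(y, weights, tau):
    """Quantile of y under the given nonnegative weights."""
    order = np.argsort(y)
    y, weights = y[order], weights[order]
    cum = np.cumsum(weights) / np.sum(weights)
    return y[np.searchsorted(cum, tau)]

rng = np.random.default_rng(2)
n = 100_000
X = rng.normal(size=n)
p = 1 / (1 + np.exp(-0.5 * X))            # treatment propensity
W = rng.binomial(1, p)
# Constant location shift of 1, so every QTE equals 1 in this toy DGP
Y = W * (1.0 + rng.normal(size=n)) + (1 - W) * rng.normal(size=n)

# ps-weighted UQTE at the median
w1 = W / p
w0 = (1 - W) / (1 - p)
uqte_50 = weighted_quantile(Y[W == 1], w1[W == 1], 0.5) \
        - weighted_quantile(Y[W == 0], w0[W == 0], 0.5)
```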

9 Conclusion

In empirical research, the problems of nonrandom assignment and missing outcomes threaten the identification of treatment effects. This article proposes a class of M-estimators that double weight by the propensity score and the missing-outcome probability to correct for the two problems within a selection-on-observables framework. The asymptotic theory of the proposed estimator is characterized in two parts: the first allows misspecification of some conditional feature of interest, while the second allows for misspecified weights. Together, the two parts completely characterize the kinds of misspecification scenarios permissible under the given framework.

As illustrative examples, the article uses results from the two parts to discuss estimation of causal parameters like the average and quantile treatment effects. In the case of the ATE, the proposed estimator is shown to be DR: it remains consistent if either the mean function or the weights are misspecified (but not both). For the case of QTEs, one obtains either the CQTE, if the conditional quantile function is correctly specified, or a linear approximation to it. This is demonstrated in the simulations, where the double-weighted ATE and QTE estimates have the lowest bias when compared to naive alternatives (the unweighted and ps-weighted estimators). Finally, an application of the procedure to Calónico and Smith's (2017) reconstructed NSW sample helps to quantify the degree of distortion created by the two problems in the returns-to-training estimates through a comparison with the experimental benchmark.

Even though missing outcomes are a common concern in empirical analysis, it is equally common to encounter missing data on the covariates. A particularly important future extension can be to allow for missing data on both. In this case, using a generalized method of moments framework, which incorporates information on complete and incomplete cases, could provide efficiency gains over just using the observed data. A different possibility would be to relax the identifying restrictions to allow for selection on unobservables and possibly explore estimation of local ATE.


Webpage: www.anegi.net

Acknowledgements

I am grateful to Jeffrey M. Wooldridge, Steven Haider, Ben Zou, and Kenneth Frank. Special thanks to Tim Vogelsang, Wendun Wang, Alyssa Carlson, Christian Cox, Tymon Słoczyński, and seminar & conference participants, for insightful comments and suggestions on this article.

  1. Funding information: No funding to declare.

  2. Conflict of interest: The author states that there is no conflict of interest.

  3. Ethical approval: The conducted research is not related to either human or animals use.

  4. Data availability statement: The datasets generated during and/or analyzed during the current study are available from the corresponding author on request.

Appendix A Regularity conditions for asymptotic theory

Let the population problem be denoted by $Q_0(\theta_g) \equiv E[\omega_g q(Y, X, \theta_g)]$ and its sample analog by $Q_N(\theta_g) \equiv \frac{1}{N \hat{\rho}_g} \sum_{i=1}^{N} \hat{\omega}_{ig}\, q(Y_i, X_i, \theta_g)$, where $\hat{\rho}_g = N_g / N$ and $N \hat{\rho}_g \to \infty$ as $\hat{\rho}_g \xrightarrow{p} \rho_g$.

  1. $\Theta_g$ is compact for $g = 0, 1$.

  2. $G(X, \gamma)$ and $R(X, W, \delta)$ satisfy Assumption 5 and are continuous for each $\gamma$ and $\delta$ on the support of $X$ and $(X, W)$, respectively.

  3. $q(Y(g), X, \theta_g)$ is continuous at each $\theta_g \in \Theta_g$ with probability one.

  4. $E[\sup_{\theta_g \in \Theta_g} |q(Y(g), X, \theta_g)|] < \infty$.

  5. $\theta_{g0} \in \operatorname{int}(\Theta_g)$.

  6. $q(Y(g), X, \theta_g)$ is continuously differentiable on $\operatorname{int}(\Theta_g)$ with probability one.

  7. $\frac{1}{N} \sum_{i=1}^{N} \hat{\omega}_{ig}\, h(Y_i(g), X_i, \hat{\theta}_g) = o_P(N^{-1/2})$.

  8. $E[\sup_{\theta_g \in \Theta_g} \|h(Y(g), X, \theta_g)\|^2] < \infty$.

  9. $G(\cdot, \gamma)$ and $R(\cdot, \delta)$ are both twice continuously differentiable on $\operatorname{int}(\Gamma)$ and $\operatorname{int}(\Delta)$, respectively.

  10. $E[\sup_{\delta \in \Delta} \|b(X, W, S, \delta)\|^2] < \infty$ and $E[\sup_{\gamma \in \Gamma} \|d(X, W, \gamma)\|^2] < \infty$.

  11. $E[\omega_g h(Y(g), X, \theta_g)]$ is continuously differentiable on $\operatorname{int}(\Theta_g)$.

  12. $H_g \equiv \nabla_{\theta_g} E[\omega_g h(Y(g), X, \theta_{g0})]$ is non-singular.

  13. $\{v_N(\theta_g) : N \geq 1\}$ is stochastically equicontinuous, where

    (A1) $$v_N(\theta_g) \equiv \frac{1}{\sqrt{N}} \sum_{i=1}^{N} \{\hat{\omega}_{ig} h_{ig}(\theta_g) - E[\hat{\omega}_{ig} h_{ig}(\theta_g)]\}.$$

A.1 Consistency of the doubly weighted estimator

Given the two-step nature of the estimation problem, wherein the first step uses binary response MLE for estimating the probability weights and the second step solves an objective function using the first-step weights, the asymptotic theory utilizes results for two-step estimators with a non-smooth objective function to establish the large sample properties of θ ˆ g . The following theorem fills in the primitive regularity conditions for applying the uniform law of large numbers.

Theorem A.1

(Consistency under weak identification) Under Assumptions 1–4 and conditions (1)–(4), $\hat{\theta}_g \xrightarrow{p} \theta_{g0}$ for each $g = 0, 1$.

The proof follows from verifying the conditions in Lemma 2.4 of the study by Newey and McFadden [39].

Proof

It has already been established that

$$E[\omega_g q(Y, X, \theta_g)] = E[\omega_g q(Y(g), X, \theta_g)] = E[q(Y(g), X, \theta_g)]$$

for both $g = 0, 1$. By (iii), $\omega_g(\gamma, \delta)$ is continuous in $\gamma$ and $\delta$ and is bounded in absolute value by Assumption 5. Moreover, $\omega_g(\cdot, \gamma, \delta)\, q(\cdot, \theta_g)$ is continuous with probability one. Then, along with (v), the dominated convergence theorem (DCT), and boundedness of $\omega_g(\cdot, \cdot)$, we obtain

(A2) $$\sup_{(\theta_g, \gamma, \delta) \in \Theta_g \times \tilde{\Gamma} \times \tilde{\Delta}} \left| \frac{1}{N} \sum_{i=1}^{N} \omega_{ig}(\gamma, \delta)\, q(Y_i(g), X_i, \theta_g) - E[\omega_g(\gamma, \delta)\, q(Y(g), X, \theta_g)] \right| \xrightarrow{p} 0$$

using Lemma 2.4 in the study by Newey and McFadden [39], since $\tilde{\Gamma}$ and $\tilde{\Delta}$ are compact neighborhoods around $\gamma_0$ and $\delta_0$. By the triangle inequality,

(A3) $$\sup_{\theta_g \in \Theta_g} \left| \frac{1}{N} \sum_{i=1}^{N} \hat{\omega}_{ig}\, q(Y_i(g), X_i, \theta_g) - E[\omega_g q(Y(g), X, \theta_g)] \right| \leq \sup_{\theta_g \in \Theta_g} \left| \frac{1}{N} \sum_{i=1}^{N} \hat{\omega}_{ig}\, q(Y_i(g), X_i, \theta_g) - E[\hat{\omega}_g q(Y(g), X, \theta_g)] \right| + \sup_{\theta_g \in \Theta_g} \left| E[\hat{\omega}_g q(Y(g), X, \theta_g)] - E[\omega_g q(Y(g), X, \theta_g)] \right|.$$

The right-hand side of (A3) is $o_P(1)$ due to $\hat{\gamma} \xrightarrow{p} \gamma_0$, $\hat{\delta} \xrightarrow{p} \delta_0$, and uniform continuity of $E[\omega_g q(Y(g), X, \theta_g)]$ on $\Theta_g \times \tilde{\Gamma} \times \tilde{\Delta}$. Then, consistency of $\hat{\theta}_g$ for $\theta_{g0}$ follows from Theorem 2.1 of the study by Newey and McFadden [39].□

Theorem A.2

(Consistency under strong identification) Under Assumptions 2–4, 6, and 7, and regularity conditions (1)–(4), $\hat{\theta}_g \xrightarrow{p} \theta_{g0}$ as $N \to \infty$.

Proof

We first establish that $\theta_{g0}$ solves

$$\min_{\theta_g \in \Theta_g}\; E[\omega_g^* q(Y(g), X, \theta_g)].$$

The proof of uniform convergence is similar to the proof of Theorem A.1, with $\omega_g$ replaced by $\omega_g^*$. Then, consistency of $\hat{\theta}_g$ for $\theta_{g0}$ follows from Theorem 2.1 in [39]. To show that $\theta_{g0}$ is still a solution to the double-weighted population problem with misspecified weights, consider the following argument:

(A4) $$\hat{\theta}_g = \arg\min_{\theta_g} \sum_{i=1}^{N} \hat{\omega}_{ig}\, q(Y_i, X_i, \theta_g) \xrightarrow{p} \arg\min_{\theta_g} E[\omega_g^* q(Y, X, \theta_g)] = \arg\min_{\theta_g} E[\omega_g^* q(Y(g), X, \theta_g)].$$

Consider $E[\omega_1^* q(Y(1), X, \theta_1)]$. Then, using the law of iterated expectations (LIE),

$$E[\omega_1^* q(Y(1), X, \theta_1)] = E[E(\omega_1^* q(Y(1), X, \theta_1) \mid X, W, Y(1))] = E\left[\frac{W}{R(X, W, \delta^*)\, G(X, \gamma^*)}\, q(Y(1), X, \theta_1)\, P(S = 1 \mid X, W, Y(1))\right] = E\left[\frac{r(X, W)\, W}{R(X, W, \delta^*)\, G(X, \gamma^*)}\, q(Y(1), X, \theta_1)\right],$$

where the second equality applies the inner expectation to $S$ and the third equality uses the fact that $P(S = 1 \mid X, W, Y(1)) = P(S = 1 \mid X, W)$ because of MAR. Applying the LIE again,

(A5) $$E\left[\frac{r(X, W)\, W}{R(X, W, \delta^*)\, G(X, \gamma^*)}\, q(Y(1), X, \theta_1)\right] = E\left[E\left(\frac{r(X, W)\, W}{R(X, W, \delta^*)\, G(X, \gamma^*)}\, q(Y(1), X, \theta_1) \,\middle|\, X, Y(1)\right)\right] = E\left[\frac{q(Y(1), X, \theta_1)}{G(X, \gamma^*)}\, E\left(\frac{r(X, W)\, W}{R(X, W, \delta^*)} \,\middle|\, X\right)\right] = E[\xi_1(X)\, q(Y(1), X, \theta_1)].$$

Here, the second equality uses unconfoundedness and the third equality recognizes that $\frac{1}{G(X, \gamma^*)} E\left\{\frac{r(X, W)\, W}{R(X, W, \delta^*)} \,\middle|\, X\right\}$ is a function of $X$, which I denote by $\xi_1(X)$. One can show $E[\omega_0^* q(Y, X, \theta_0)] = E[\xi_0(X)\, q(Y(0), X, \theta_0)]$ analogously. Then,

(A6) $$\arg\min_{\theta_g} E[\omega_g^* q(Y(g), X, \theta_g)] = \arg\min_{\theta_g} E[\xi_g(X)\, q(Y(g), X, \theta_g)] = \arg\min_{\theta_g} E[\xi_g(X)\, E(q(Y(g), X, \theta_g) \mid X)],$$

where the second equality holds due to the LIE.□

Appendix B Proofs

Proof of Lemma 1

Let us first consider the argument for $\theta_{10}$. By the LIE and using the fact that $q(Y, X, \theta_g) = W q(Y(1), X, \theta_1) + (1 - W) q(Y(0), X, \theta_0)$, we can write

$$E[\omega_1 q(Y, X, \theta_1)] = E\left[E\left(\frac{S W}{r(X, W)\, p(X)}\, q(Y(1), X, \theta_1) \,\middle|\, Y(1), X, W\right)\right] = E\left[\frac{W}{r(X, W)\, p(X)}\, q(Y(1), X, \theta_1)\, P(S = 1 \mid Y(1), X, W)\right] = E\left[\frac{W}{r(X, W)\, p(X)}\, q(Y(1), X, \theta_1)\, P(S = 1 \mid X, W)\right] = E\left[\frac{W}{p(X)}\, q(Y(1), X, \theta_1)\right],$$

where the third equality follows from MAR and the fourth follows from part (ii) of Assumption 3. Using another application of the LIE along with unconfoundedness, we obtain

$$E\left[\frac{W}{p(X)}\, q(Y(1), X, \theta_1)\right] = E[q(Y(1), X, \theta_1)].$$

The proof for $\theta_{00}$ follows analogously.□


Proof of Theorem 2

Explicit dependence on data is suppressed for notational simplicity. Then, expanding $\hat{\omega}_{ig}$ around $\omega_{ig}$,

$$N^{-1/2} \sum_{i=1}^{N} \hat{\omega}_{ig} h_{ig} = N^{-1/2} \sum_{i=1}^{N} \left\{ \omega_{ig} h_{ig} - \tilde{\omega}_{ig} h_{ig} b_i(\tilde{\delta})' (\hat{\delta} - \delta_0) - \tilde{\omega}_{ig} h_{ig} d_i(\tilde{\gamma})' (\hat{\gamma} - \gamma_0) \right\} + o_P(1) = N^{-1/2} \sum_{i=1}^{N} \omega_{ig} h_{ig} - \left[ N^{-1} \sum_{i=1}^{N} \tilde{\omega}_{ig} h_{ig} b_i(\tilde{\delta})' \right] \sqrt{N} (\hat{\delta} - \delta_0) - \left[ N^{-1} \sum_{i=1}^{N} \tilde{\omega}_{ig} h_{ig} d_i(\tilde{\gamma})' \right] \sqrt{N} (\hat{\gamma} - \gamma_0) + o_P(1),$$

where $\tilde{\delta}$ lies between $\hat{\delta}$ and $\delta_0$ and $\tilde{\gamma}$ lies between $\hat{\gamma}$ and $\gamma_0$. Now let $(\theta_g^*, \delta^*) = \arg\sup_{\theta_g \in \Theta_g, \delta \in \Delta} |h(\theta_g)' b(\delta)|$. Then,

(A7) $$\left( E[h(\theta_g^*)' b(\delta^*)] \right)^2 \leq E[\|h(\theta_g^*)\|^2]\, E[\|b(\delta^*)\|^2] \leq E\left[ \sup_{\theta_g \in \Theta_g} \|h(\theta_g)\|^2 \right] E\left[ \sup_{\delta \in \Delta} \|b(\delta)\|^2 \right] < \infty,$$

where the first inequality holds by the Cauchy–Schwarz inequality, the second holds by the definition of the supremum, and the third holds by conditions (iv) and (vi). Then,

$$E\left[ \sup_{\theta_g \in \Theta_g, \delta \in \Delta} |h(\theta_g)' b(\delta)| \right] \leq \left( E\left[ \sup_{\theta_g \in \Theta_g, \delta \in \Delta} |h(\theta_g)' b(\delta)|^2 \right] \right)^{1/2} < \infty,$$

where the first inequality holds trivially and the second holds because of (A7). An analogous argument shows $E[\sup_{\theta_g \in \Theta_g, \gamma \in \Gamma} |h(\theta_g)' d(\gamma)|] < \infty$. Using the fact that $\omega_g(\gamma, \delta)$ is continuous and bounded, along with continuity of $l(\theta_g)$ (condition (ii)) and of $b(\delta)$ and $d(\gamma)$ (condition (iii) of Theorem A.1), we obtain

(A8) $$\frac{1}{N} \sum_{i=1}^{N} \tilde{\omega}_{ig} h_{ig} b_i(\tilde{\delta})' = E[\omega_{ig} h_{ig} b_i'] + o_P(1), \qquad \frac{1}{N} \sum_{i=1}^{N} \tilde{\omega}_{ig} h_{ig} d_i(\tilde{\gamma})' = E[\omega_{ig} h_{ig} d_i'] + o_P(1)$$

using Lemma 4.3 in [39], as $\tilde{\gamma} \xrightarrow{p} \gamma_0$ and $\tilde{\delta} \xrightarrow{p} \delta_0$. Rewriting equation (7) using influence-function representations for $\hat{\gamma}$ and $\hat{\delta}$ along with (A8), we obtain

(A9) $$N^{-1/2} \sum_{i=1}^{N} \hat{\omega}_{ig} h_{ig} = N^{-1/2} \sum_{i=1}^{N} \left\{ l_{ig} - E[l_{ig} b_i'] E[b_i b_i']^{-1} b_i - E[l_{ig} d_i'] E[d_i d_i']^{-1} d_i \right\} + o_P(1) \equiv N^{-1/2} \sum_{i=1}^{N} u_{ig} + o_P(1) \xrightarrow{d} N(0, \Omega_g),$$

where $u_{ig} \equiv l_{ig} - E[l_{ig} b_i'] E[b_i b_i']^{-1} b_i - E[l_{ig} d_i'] E[d_i d_i']^{-1} d_i$. Since $E(u_{ig}) = 0$,

$$\Omega_g = E(l_{ig} l_{ig}') - E(l_{ig} b_i') E(b_i b_i')^{-1} E(b_i l_{ig}') - E(l_{ig} d_i') E(d_i d_i')^{-1} E(d_i l_{ig}').$$

The next part of the proof uses the theory of empirical processes to obtain asymptotic normality of the doubly weighted estimator. Using the definition in (A1) along with the fact that $E[\hat{\omega}_{ig} h_i(\theta_g)] \xrightarrow{p} E[\omega_{ig} h_i(\theta_g)]$ (by continuity of $\omega(\gamma, \delta)\, h(\theta_g)$, condition (iv)) and the DCT as $(\hat{\gamma}, \hat{\delta}) \xrightarrow{p} (\gamma_0, \delta_0)$, rewrite

(A10) $$v_N(\theta_g) = v_N^*(\theta_g) + o_P(1),$$

where $v_N^*(\theta_g) \equiv \frac{1}{\sqrt{N}} \sum_{i=1}^{N} \{\hat{\omega}_{ig} h_i(\theta_g) - E[\omega_{ig} h_i(\theta_g)]\}$. Let

$$\bar{m}_N(\theta_g) = \frac{1}{N} \sum_{i=1}^{N} \hat{\omega}_{ig} h_i(\theta_g) \quad \text{and} \quad m_N(\theta_g) = E[\omega_{ig} h_i(\theta_g)].$$

Then, performing element-by-element mean value expansions of $m_N(\hat{\theta}_g)$ around $\theta_{g0}$, we obtain

$$0 = \sqrt{N}\, m_N(\theta_{g0}) = \sqrt{N}\, m_N(\hat{\theta}_g) - \nabla_{\theta_g} m_N(\tilde{\theta}_g)\, \sqrt{N} (\hat{\theta}_g - \theta_{g0}),$$

where $\tilde{\theta}_g$ lies between $\hat{\theta}_g$ and $\theta_{g0}$. Since the population first-order condition is zero at the truth,

$$0 = \nabla_{\theta_g} E[\omega_g q(Y(g), X, \theta_{g0})] = E[\omega_g h(Y(g), X, \theta_{g0})] \equiv m_N(\theta_{g0}).$$

The second equality follows from dominance condition (iv) and an application of Lemma 3.6 in the study by Newey and McFadden [39]. Then, by the continuity of $\nabla_{\theta_g} E[\omega_{ig} h_i(\theta_g)]$ (condition (vi)),

$$\nabla_{\theta_g} m_N(\tilde{\theta}_g) \xrightarrow{p} H_g.$$

By the continuous mapping theorem and condition (viii),

(A11) $$\sqrt{N} (\hat{\theta}_g - \theta_{g0}) = (H_g^{-1} + o_P(1))\, \sqrt{N}\, m_N(\hat{\theta}_g).$$

Consider

$$\sqrt{N}\, m_N(\hat{\theta}_g) = \sqrt{N}\, \bar{m}_N(\hat{\theta}_g) - v_N^*(\hat{\theta}_g) = -[v_N^*(\hat{\theta}_g) - v_N^*(\theta_{g0})] - v_N^*(\theta_{g0}) + \sqrt{N}\, \bar{m}_N(\hat{\theta}_g) = -v_N^*(\theta_{g0}) + o_P(1),$$

where $v_N^*(\hat{\theta}_g) - v_N^*(\theta_{g0}) = o_P(1)$ by the asymptotic equivalence in (A10) and stochastic equicontinuity (condition (ix)). Moreover, $\sqrt{N}\, \bar{m}_N(\hat{\theta}_g) = o_P(1)$ by condition (iii). Therefore,

$$v_N^*(\theta_{g0}) = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} \hat{\omega}_{ig} h_{ig} \xrightarrow{d} N(0, \Omega_g)$$

by (A9). Then, using (A11) along with Slutsky's theorem, $\sqrt{N} (\hat{\theta}_g - \theta_{g0}) \xrightarrow{d} N(0, H_g^{-1} \Omega_g H_g^{-1})$.□
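To make the sandwich form $H_g^{-1} \Omega_g H_g^{-1}$ concrete, consider the simplest illustrative case $q(y, \theta) = (y - \theta)^2 / 2$, so that $h(y, \theta) = \theta - y$ and $H_g = E[\omega_g]$. The numpy sketch below is a hypothetical simplification with known weights, so the plug-in $\Omega_g$ carries no first-stage adjustment for estimated weights; all names and the data-generating setup are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 50_000
y = rng.normal(loc=2.0, scale=1.0, size=N)
omega = rng.uniform(0.5, 2.0, size=N)      # stand-in for known double weights

# Weighted M-estimator of the mean: minimizes sum of omega * (y - theta)^2 / 2
theta_hat = np.sum(omega * y) / np.sum(omega)

# Plug-in sandwich pieces: H = mean(omega), Omega = mean(omega^2 * h^2)
H = omega.mean()
Omega = np.mean(omega**2 * (theta_hat - y) ** 2)

# Standard error from Avar = H^{-1} Omega H^{-1}, scaled by 1/N
se = np.sqrt(Omega / H**2 / N)
```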

Proof of Corollary 1

Consider

$$\Sigma_g - \Omega_g = E(l_{ig} l_{ig}') - \left\{ E(l_{ig} l_{ig}') - E(l_{ig} b_i') E(b_i b_i')^{-1} E(b_i l_{ig}') - E(l_{ig} d_i') E(d_i d_i')^{-1} E(d_i l_{ig}') \right\} = E(l_{ig} b_i') E(b_i b_i')^{-1} E(b_i l_{ig}') + E(l_{ig} d_i') E(d_i d_i')^{-1} E(d_i l_{ig}').$$

Since each component matrix in the above expression is positive semi-definite, the sum of the two matrices is also positive semi-definite.□

Proof of Theorem 4

The proof follows in the manner of Theorem 1, where we replace $\omega_g$ by $\omega_g^*$. Also, $\Omega_g$ now denotes the variance of the score of the objective function, $l_{ig}$, without the first-stage adjustment for the estimated weights. This is because $E(l_{ig} b_i') = E(l_{ig} d_i') = 0$, since the conditional score is zero under strong identification of $\theta_{g0}$, i.e., $E[h(Y(g), X, \theta_{g0}) \mid X] = 0$.□

Proof of Corollary 2

The proof follows from Theorem 2. Note that the asymptotic variance of the estimator that uses known weights is

$$\mathrm{Avar}[\sqrt{N} (\tilde{\theta}_g - \theta_{g0})] = H_g^{-1} \Omega_g H_g^{-1},$$

where now $\Omega_g = E(l_{ig} l_{ig}')$. The result follows immediately.□

Proof of Corollary 3 (Efficiency gain with unweighted estimator under GCIME)

Using two applications of the LIE and invoking MAR and unconfoundedness, I can rewrite

$$E\left[ \frac{S_i W_i}{R(X_i, W_i, \delta)\, G(X_i, \gamma)}\, q(Y_i(1), X_i, \theta_{10}) \right] = E\left[ \frac{r(X_i, 1)\, p(X_i)}{R(X_i, 1, \delta)\, G(X_i, \gamma)}\, q(Y_i(1), X_i, \theta_{10}) \right].$$

Using another application of the LIE, I can rewrite the above as follows:

$$E\left[ \frac{r(X_i, 1)\, p(X_i)}{R(X_i, 1, \delta)\, G(X_i, \gamma)}\, E\{q(Y_i(1), X_i, \theta_{10}) \mid X_i\} \right].$$

Then,

$$H_1 = E\left[ \frac{r(X_i, 1)\, p(X_i)}{R(X_i, 1, \delta)\, G(X_i, \gamma)}\, \nabla_{\theta_1} E\{h(Y_i(1), X_i, \theta_{10}) \mid X_i\} \right] = E\left[ \frac{r(X_i, 1)\, p(X_i)}{R(X_i, 1, \delta)\, G(X_i, \gamma)}\, A(X_i, \theta_{10}) \right].$$

Similarly, use the LIE to express $\Omega_1$ as

$$\Omega_1 = E\left[ \frac{r(X_i, 1)\, p(X_i)}{R^2(X_i, 1, \delta)\, G^2(X_i, \gamma)}\, E\{h(Y_i(1), X_i, \theta_{10})\, h(Y_i(1), X_i, \theta_{10})' \mid X_i\} \right] = \sigma_{01}^2\, E\left[ \frac{r(X_i, 1)\, p(X_i)}{R^2(X_i, 1, \delta)\, G^2(X_i, \gamma)}\, A(X_i, \theta_{10}) \right].$$

For the unweighted estimator, the variance simplifies, and this happens precisely because of the GCIME. To see this, consider $H_1^u$. Then, using the LIE, we can rewrite

$$H_1^u = E[r(X_i, 1)\, p(X_i)\, \nabla_{\theta_1} E\{h(Y_i(1), X_i, \theta_{10}) \mid X_i\}] = E[r(X_i, 1)\, p(X_i)\, A(X_i, \theta_{10})],$$

and similarly, we can rewrite $\Omega_1^u$ using the LIE as

$$\Omega_1^u = E[r(X_i, 1)\, p(X_i)\, E\{h(Y_i(1), X_i, \theta_{10})\, h(Y_i(1), X_i, \theta_{10})' \mid X_i\}] = \sigma_{01}^2\, E[r(X_i, 1)\, p(X_i)\, A(X_i, \theta_{10})].$$

Therefore, the asymptotic variance simplifies to

$$\mathrm{Avar}[\sqrt{N} (\hat{\theta}_1^u - \theta_{10})] = \sigma_{01}^2 \left( E[r(X_i, 1)\, p(X_i)\, A(X_i, \theta_{10})] \right)^{-1}.$$

To show that the difference of the two asymptotic variances is positive semi-definite, consider the following:

$$[\mathrm{Avar}\{\sqrt{N} (\hat{\theta}_1^u - \theta_{10})\}]^{-1} - [\mathrm{Avar}\{\sqrt{N} (\hat{\theta}_1 - \theta_{10})\}]^{-1} = \frac{1}{\sigma_{01}^2} \left[ E(r_{i1} p_i A_i) - E\left( \frac{r_{i1} p_i}{R_{i1} G_i} A_i \right) E\left( \frac{r_{i1} p_i}{R_{i1}^2 G_i^2} A_i \right)^{-1} E\left( \frac{r_{i1} p_i}{R_{i1} G_i} A_i \right) \right].$$

Let $B_i = r_{i1}^{1/2} p_i^{1/2} A_i^{1/2}$ and $D_i = (r_{i1}^{1/2} / R_{i1})(p_i^{1/2} / G_i) A_i^{1/2}$; then the difference equals $\frac{1}{\sigma_{01}^2} \{E(B_i B_i') - E(B_i D_i') E(D_i D_i')^{-1} E(D_i B_i')\}$.

The quantity inside the brackets is nothing but the variance of the residuals from the population regression of $B_i$ on $D_i$. Hence, the difference is positive semi-definite. The results for $g = 0$ can be proven analogously.□

References

[1] Ding P, Li F. Causal inference: a missing data perspective. Stat Sci. 2018;33(2):214–37. doi: 10.1214/18-STS645.

[2] Cao W, Tsiatis AA, Davidian M. Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data. Biometrika. 2009;96(3):723–34. doi: 10.1093/biomet/asp033.

[3] Vansteelandt S, Carpenter J, Kenward MG. Analysis of incomplete data using inverse probability weighting and doubly robust estimators. Methodology. 2010. doi: 10.1027/1614-2241/a000005.

[4] Robins JM, Rotnitzky A. Semiparametric efficiency in multivariate regression models with missing data. J Amer Stat Assoc. 1995;90(429):122–9. doi: 10.1080/01621459.1995.10476494.

[5] Chen X, Hong H, Tarozzi A. Semiparametric efficiency in GMM models with auxiliary data. Ann Stat. 2008;36(2):808–43. doi: 10.1214/009053607000000947.

[6] Graham BS, de Xavier Pinto CC, Egel D. Inverse probability tilting for moment condition models with missing data. Rev Econom Stud. 2012;79(3):1053–79. doi: 10.1093/restud/rdr047.

[7] Wooldridge JM. Inverse probability weighted estimation for general missing data problems. J Econom. 2007;141(2):1281–301. doi: 10.1016/j.jeconom.2007.02.002.

[8] Seaman SR, White IR. Review of inverse probability weighting for dealing with missing data. Stat Meth Med Res. 2013;22(3):278–95. doi: 10.1177/0962280210395740.

[9] Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70(1):41–55. doi: 10.1093/biomet/70.1.41.

[10] Hahn J. On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica. 1998;66:315–31. doi: 10.2307/2998560.

[11] Hirano K, Imbens GW. Estimation of causal effects using propensity score weighting: an application to data on right heart catheterization. Health Services Outcomes Res Methodology. 2001;2(3–4):259–78. doi: 10.1023/A:1020371312283.

[12] Firpo S. Efficient semiparametric estimation of quantile treatment effects. Econometrica. 2007;75(1):259–76. doi: 10.1111/j.1468-0262.2007.00738.x.

[13] Słoczyński T, Wooldridge JM. A general double robustness result for estimating average treatment effects. Econom Theory. 2018;34(1):112–33. doi: 10.1017/S0266466617000056.

[14] Huber M. Identification of average treatment effects in social experiments under alternative forms of attrition. J Educat Behav Stat. 2012;37(3):443–74. doi: 10.3102/1076998611411917.

[15] Huber M. Treatment evaluation in the presence of sample selection. Econom Rev. 2014;33(8):869–905. doi: 10.1080/07474938.2013.806197.

[16] Frölich M, Huber M. Treatment evaluation with multiple outcome periods under endogeneity and attrition. J Amer Stat Assoc. 2014;109(508):1697–711. doi: 10.1080/01621459.2014.896804.

[17] Fricke H, Frölich M, Huber M, Lechner M. Endogeneity and non-response bias in treatment evaluation: nonparametric identification of causal effects by instruments. J Appl Econ. 2020;35(5):481–504. doi: 10.1002/jae.2764.

[18] Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. J Amer Stat Assoc. 1994;89(427):846–66. doi: 10.1080/01621459.1994.10476818.

[19] Scharfstein D, Rotnitzky A, Robins J. Comments and rejoinder. J Amer Stat Assoc. 1999;94(448):1121–46. doi: 10.1080/01621459.1999.10473869.

[20] Robins JM, Rotnitzky A, van der Laan M. On profile likelihood: comment. J Amer Stat Assoc. 2000;95(450):477–82. doi: 10.1080/01621459.2000.10474224.

[21] Kang JD, Schafer JL. Demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data. Stat Sci. 2007;22(4):523–39. doi: 10.1214/07-STS227.

[22] Bang H, Robins JM. Doubly robust estimation in missing data and causal inference models. Biometrics. 2005;61(4):962–73. doi: 10.1111/j.1541-0420.2005.00377.x.

[23] Bia M, Huber M, Lafférs L. Double machine learning for sample selection models. 2020. arXiv:2012.00745.

[24] Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey W, et al. Double/debiased machine learning for treatment and structural parameters. Econom J. 2018;21(1):C1–C68. doi: 10.1111/ectj.12097.

[25] Henmi M, Eguchi S. A paradox concerning nuisance parameters and projected estimating functions. Biometrika. 2004;91(4):929–41. doi: 10.1093/biomet/91.4.929.

[26] Hitomi K, Nishiyama Y, Okui R. A puzzling phenomenon in semiparametric estimation problems with infinite-dimensional nuisance parameters. Econom Theory. 2008;24(6):1717–28. doi: 10.1017/S0266466608080699.

[27] Prokhorov A, Schmidt P. GMM redundancy results for general missing data problems. J Econom. 2009;151(1):47–55. doi: 10.1016/j.jeconom.2009.03.010.

[28] Lok JJ. How estimating nuisance parameters can reduce the variance (with consistent variance estimation). 2021. arXiv:2109.02690.

[29] Su F, Mou W, Ding P, Wainwright M. When is the estimated propensity score better? High-dimensional analysis and bias correction. 2023. arXiv:2303.17102.

[30] Calónico S, Smith J. The women of the National Supported Work demonstration. J Labor Econom. 2017;35(S1):S65–97. doi: 10.1086/692397.

[31] White H. Maximum likelihood estimation of misspecified models. Econometrica. 1982;50:1–25. doi: 10.2307/1912526.

[32] Angrist J, Chernozhukov V, Fernández-Val I. Quantile regression under misspecification, with an application to the U.S. wage structure. Econometrica. 2006;74(2):539–63. doi: 10.1111/j.1468-0262.2006.00671.x.

[33] Lalonde RJ. Evaluating the econometric evaluations of training programs with experimental data. Amer Econom Rev. 1986;76:604–20.

[34] Hotz VJ, Imbens GW, Klerman JA. Evaluating the differential effects of alternative welfare-to-work training components: a reanalysis of the California GAIN program. J Labor Econom. 2006;24(3):521–66. doi: 10.1086/505050.

[35] Chetty R, Friedman JN, Rockoff JE. Measuring the impacts of teachers I: evaluating bias in teacher value-added estimates. Amer Econom Rev. 2014;104(9):2593–632. doi: 10.1257/aer.104.9.2593.

[36] Kane TJ, Staiger DO. Estimating teacher impacts on student achievement: an experimental evaluation. NBER Working Paper No. 14607. National Bureau of Economic Research; 2008. doi: 10.3386/w14607.

[37] Shadish WR, Clark MH, Steiner PM. Can nonrandomized experiments yield accurate answers? A randomized experiment comparing random and nonrandom assignments. J Amer Stat Assoc. 2008;103(484):1334–44. doi: 10.1198/016214508000000733.

[38] Little RJ, Rubin DB. Statistical analysis with missing data. Vol. 793. Hoboken, New Jersey: John Wiley & Sons; 2019. doi: 10.1002/9781119482260.

[39] Newey WK, McFadden D. Large sample estimation and hypothesis testing. In: Handbook of Econometrics. Vol. 4. Amsterdam: Elsevier; 1994. p. 2111–245. doi: 10.1016/S1573-4412(05)80005-4.

[40] Komunjer I. Quasi-maximum likelihood estimation for conditional quantiles. J Econometrics. 2005;128(1):137–64. doi: 10.1016/j.jeconom.2004.08.010.

[41] Negi A, Wooldridge JM. Revisiting regression adjustment in experiments with heterogeneous treatment effects. Econometric Rev. 2021;40(5):504–34. doi: 10.1080/07474938.2020.1824732.

[42] Ding P. A first course in causal inference. 2023. arXiv:2305.18793.

[43] Imbens GW, Wooldridge JM. Recent developments in the econometrics of program evaluation. J Econom Literature. 2009;47(1):5–86. doi: 10.1257/jel.47.1.5.

[44] Koenker R, Bassett G. Regression quantiles. Econometrica. 1978;46(1):33–50. doi: 10.2307/1913643.

[45] Heckman J, Ichimura H, Smith J, Todd P. Characterizing selection bias using experimental data. Econometrica. 1998;66(5):1017–98. doi: 10.2307/2999630.

Received: 2023-03-30
Revised: 2023-10-20
Accepted: 2023-10-21
Published Online: 2024-02-02

© 2024 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
