Estimation of network treatment effects with non-ignorable missing confounders

Zhaohan Sun; Yeying Zhu; Joel A. Dubin

doi:10.1515/jci-2022-0079

Article Open Access

Estimation of network treatment effects with non-ignorable missing confounders

Zhaohan Sun , Yeying Zhu and Joel A. Dubin

Published/Copyright: August 10, 2024

Published by

Become an author with De Gruyter Brill

Submit Manuscript Author Information Explore this Subject

From the journal Journal of Causal Inference Volume 12 Issue 1

Abstract

In causal inference, interference takes place when the intervention on one unit affects the outcome of other units. Most of the previous methods for estimating network causal effects assume that the covariate information is complete, which may lead to biased estimates when missingness exists. In this study, we consider the partial and direct interference setting. Specifically, the whole population can be divided into different clusters. Within each cluster, the outcome of each unit is dependent on the intervention received by other units, but not dependent on the confounders or outcomes of other units within the same cluster or of those in different clusters. We also assume that the confounders are subject to non-ignorable missingness, and a confounder is considered as missing if any component of it is missing. We propose three consistent estimators for the direct, indirect, total, and overall effect of the intervention on the outcome, and derive the asymptotic results accordingly. A comprehensive study is carried out as well to investigate the finite sample properties of the proposed estimators. We illustrate the proposed methods by analyzing the dataset collected from an acid rain program, which was launched to reduce air pollution in the United States by encouraging the scrubber’s installation on power plants, where the records of some operating characteristics of the power generating facilities are subject to missingness.

Keywords: causal inference; missing data; partial interference; air pollution

MSC 2010: 62P12

1 Introduction

In the causal inference literature, the problem of inferring the treatment effect when data are subject to missingness has drawn a great amount of attention in recent years. According to Rubin [1], there are two types of missingness: ignorable and non-ignorable missingness. Ignorable missingness refers to missingness that is independent of the missing values, and non-ignorable missingness refers to the missingness that is dependent on the missing values. The inference for non-ignorable missingness is more challenging than ignorable missingness because the full data distribution is not fully identifiable without any assumptions, sometimes very restrictive. Following Yang et al. [2], we consider the outcome independent missingness assumption, which is plausible when the covariates are collected at the beginning of the study, and the outcome is collected long after the covariates are measured.

In most of the aforementioned work in handling missing data in the literature, methods rely on the stable unit treatment value assumption (SUTVA) [3]. SUTVA states that (i) the potential outcome of each unit is unaffected by the treatment assignment of any other unit, and (ii) there are no different versions of each treatment level. However, the first assumption, which is known as the no interference assumption [4], can be violated in some scenarios. For example, in 1990, the acid rain program (ARP) was launched to reduce ambient PM2.5 (atmospheric particulate matter (PM) that has a diametre of less than 2.5 μ m ) by assigning power plants to install scrubber facilities [5]. The monitored reduction of SO 2 emission data at the location of one power plant not only depends on its own scrubber installation but may also be affected by the intervention on power plants upwind. Another example comes from the nationally representative US Population Assessment of Tobacco and Health (PATH) Study [6], where researchers were interested in evaluating the influence of Electronic Nicotine Delivery Systems (ENDS) and pharmaceutical cessation aids on persistent abstinence from cigarette smoking and reduced cigarette consumption, and in this study, it has been revealed that one individual’s marital satisfaction and family members smoking status can affect this individual’s smoking cessation, that is, the smoking cessation of one individual may be affected by the intervention of other family members. More examples can also be found in biomedical research, health sciences, and social networking studies.

Various identification and estimation methods have been proposed in the scenario when interference exists, but the confounders and outcome are fully observed. Generally, there are two types of interference: full interference and partial interference. The full interference happens when the outcome of a unit is affected by the intervention on other units that interfere with this unit. The network interference structure can be represented by an adjacency matrix: the entries of which take the value on { 0 , 1 } (e.g., if the unit i is affected by the intervention on the individual j , then the entry of the matrix in i th row and j th column equals one; otherwise, the value of the entry equals zero). The partial interference is a special case of the full interference where the adjacency matrix follows block diagonal structure, the entry of the matrix is equal to zero if the unit of the corresponding row and the unit of the corresponding column are not in the same block, that is, the interference may happen between units in the same block but not between units in different blocks. In this article, we consider the latter type of interference and focus on the semiparametric estimation of four network treatment effects: the direct effect, indirect effect, total effect, and overall effect (Tchetgen Tchetgen and VanderWeele [7], Hudgens and Halloran [8], Liu et al. [9], Papadogeorgou et al. [10]). The definitions and illustrations of these treatment effects are presented in Section 2.

There is limited work discussing the estimation of network treatment effect when confounders are subject to non-ignorable missingness. Sun and Liu [11] proposed a doubly robust estimator when data are subject to non-ignorable missingness, where the data are assumed to be independent and the interference was ignored in data application. Unlike Sun and Liu, we study the partial interference setting and propose three pairs of semiparametric estimators: inverse probability weighting (IPW), regression, and doubly robust (DR) estimators for the four types of network treatment effects. The proposed IPW estimator requires correct specification of the propensity score for treatment selection and the missingness mechanism. The regression estimator requires a correctly specified outcome regression model, and the doubly robust estimator is shown to be consistent if either set of the IPW models (propensity score of missingness and propensity score of treatment) or the regression model, but not necessarily both, are correctly specified. The IPW and DR estimators require the product of unit-level propensity scores, where the existence of extreme probabilities may lead to high-variant estimators. To circumvent the problem of varying cluster sizes, we propose self-normalized IPW and DR estimators, which can be viewed as stabilized versions of and have smaller variances than their respective conventional estimators.

The rest of this article is organized as follows. We start by introducing the notation and assumptions in Section 2. In Section 3, we propose three pairs of semiparametric network treatment effect estimators. The performance of the proposed methods is further illustrated via simulation studies in Section 4. We analyze real data from the ARP in Section 5 and leave the discussion of our methodology and potential extensions as future work in Section 6.

2 Notation and assumptions

Following Perez-Heydrich et al. [12] and Tchetgen Tchetgen and VanderWeele [7], we consider that K clusters are randomly sampled from an infinite superpopulation of clusters. Assume the total number of units is N, and each group i has N i units, where 1 ≤ i ≤ K and 1 ≤ N i ≤ N − K + 1 . Let X i j = ( X 1 i j , X 2 i j , … X p i j ) denote p -dimensional confounders of unit j in group i , the values of which may be subject to missingness, and let X i = ( X i 1 , X i 2 , … X i N i ) be the confounders of all units in group i . Let R i j = 1 if X i j is complete and R i j = 0 if X i j is missing. In this article, we consider the single missingness pattern, that is, R i j = 0 if any components of the confounders of unit j in group i are missing. We leave the extension to multiple missingness patterns as a future research direction. Let Y i j and A i j denote the observed outcome and treatment status for unit j in group i , respectively, and Y i = ( Y i 1 , Y i 2 , … , Y i N i ) and A i = ( A i 1 , A i 2 , … , A i N i ) denote the vectors of observed outcome and treatment indicators, respectively, for all units in group i . Assume A i ( − j ) = ( A i 1 , … , A i j − 1 , A i j + 1 , … A i N i ) is the vector of treatment status for all units in group i except for individual j . Let a i j , a i ( − j ) , and a i denote possible values of A i j , A i ( − j ) , and A i , respectively. Suppose A ( n ) is the set of vectors of all possible treatment assignments of length n . Then, there are 2 N i possible treatment assignments in group i , and a i ∈ A ( N i ) .

We consider the scenario when only X is missing, while the other variables are fully observed. Suppose the treatment is assigned with an α -strategy where every unit in a group is assigned to the treatment with average treatment allocation probability α . In a randomized trial, the probability of being treated is fully determined by the treatment allocation probability. However, in observational studies, whether an individual receives the treatment assignment is not only determined by the randomized treatment allocation strategy but dependent on his/her choice of participation in the study. To avoid the additional modelling of participation (probability of the participation status given confounders), we will model the probability of the treatment indicator conditional on confounders directly. Let Y i j ( a i ) and Y i j ( a i j , a i − j ) denote the potential outcome for unit j in group i under treatment allocation a i , where a i = ( a i 1 , … a i , j − 1 , a i j , a i , j + 1 , … a i , N i ) , and let denote Y i ( a i ) as the vector of potential outcomes for all units in group i . Let Y ¯ i ( a , α ) = N i − 1 ∑ j = 1 N i ∑ a i ( − j ) ∈ A ( N i − 1 ) Y i j ( a , a i ( − j ) ) π ( a i ( − j ) ; α ) denote the average potential outcome for group i , where π ( a i ( − j ) ; α ) = Pr ( A i ( − j ) = a i ( − j ) ∣ A i j = a i j , X i ) is the probability of A i ( − j ) = a i ( − j ) in group i . Let Y ¯ i ( α ) = N i − 1 ∑ j = 1 N i ∑ a i ∈ A ( N i ) Y i j ( a i ) π ( a i ; α ) denote the marginal average potential outcome for group i .

There are different treatment allocation strategies that can be deployed. For example, in the cholera vaccine study [8], π ( a i ; α ) = ∏ j α a i j ( 1 − α ) 1 − a i j and π ( a i ( − j ) ; α ) = ∏ j ′ ≠ j α a i j ′ ( 1 − α ) 1 − a i j ′ , that is, the treatment allocation strategy does not depend on individuals’ characteristics. Once the individuals choose to participate in the trial, they are classified into different groups. For each group, the individuals are assigned with the vaccine randomly with probability α . On the other hand, in the ARP study, the intervention of scrubber installation is encouraged by federal regulations. However, the adoption of intervention is also affected by the characteristics of power plants such as size and heat input. To account for the influence of confounders on the treatment allocation probability, Papadogeorgou et al. [10] models P α , x ( a i j ) as logit { P α , x ( a i j = 1 ∣ X i j ) } = ξ i α + δ X i j , where ξ i ( α ) satisfying ( N i − 1 ) ∑ i = 1 N i expit ( ξ i ( α ) + δ X i j ) = α is a parameter to be estimated, and δ is some pre-fixed value. P α , x ( a i ) represents treatment allocation for all units in group i , while P α , x ( a i ( − j ) ) denotes the treatment assignment in group i excluding unit j . More details can be found in Section 5. We adopt the same modelling strategy in this article.

The average potential outcome and the marginal average potential outcome are defined as μ a α = E ( Y ¯ i ( a , α ) ) and μ α = E ( Y ¯ i ( α ) ) , respectively. Then, the direct effect (or the population average treatment effect) is defined as DE ¯ ( α ) = E ( Y ¯ i ( 1 , α ) − Y ¯ i ( 0 , α ) ) = μ 1 α − μ 0 α . For the ARP study, the direct effect represents the difference in the amount of PM2.5 emissions when a power generating facility is equipped with scrubbers compared to when scrubbers are not installed. For policies α 0 and α 1 , the indirect effect is defined as IE ¯ ( α 0 , α 1 ) = E ( Y ¯ i ( 0 , α 1 ) − Y ¯ i ( 0 , α 0 ) ) = μ 0 α 1 − μ 0 α 0 , where only the units that are not equipped with scrubbers are considered. The total effect is denoted by TE ¯ ( α 1 , α 1 ) = E ( Y ¯ i ( 1 , α 1 ) − Y ¯ i ( 0 , α 0 ) ) = μ 1 α 1 − μ 0 α 0 , which corresponds to the combination of both the direct effect and the indirect effect. The overall effect is defined as OE ¯ ( α 0 , α 1 ) = E ( Y ¯ i ( α 1 ) − Y ¯ i ( α 0 ) ) = μ α 1 − μ α 0 , which represents the difference in the amount of PM2.5 emissions for units under one coverage probability of scrubbers’ installation compared to units with another level of coverage probability. In Section 3, we focus on estimating the direct effect. The estimation of the other network causal effects can be approached in a similar way, and we present the results of four types of network causal effects in the data application. For the purpose of estimation with incomplete confounders, we assume the type of interference to be direct interference (see more details in Ogburn et al. [13]), where the interference happens only through the effect of the treatment assignment on other units in the same clusters. We leave more sophisticated interference types such as contagion interference and allocation interference as a future research direction. Throughout this article, we also make the following assumptions:

Causal consistency assumption: A subject’s potential outcome under their observed treatment assignment is equal to the outcome that will actually be observed, that is, Y i j = ∑ a i ∈ A ( N i ) 1 ( A i = a i ) Y i j ( a i ) .
Exchangeability assumption: For each group, the treatment vector A i is assumed to be conditionally independent with potential outcomes given confounders X i , that is, A i ⊥ ⊥ Y i ( a i ) ∣ X i .
Positivity assumption: Pr ( A i ∣ X i ) > 0 and Pr ( R i j = 1 ∣ A i , X i ) > 0 for all A i and X i .
Outcome-independent missingness assumption: R i ⊥ ⊥ Y i ∣ A i , X i .

It has been shown in Ding and Geng [14] that without any assumptions, the joint distribution of ( A i , Y i , X i ) is not identifiable, and they provided the lower and upper bounds for the causal effects, which may be wide. Yang et al. [2] also considered non-parametric identification and non-parametric estimation of the causal effect with an auxiliary variable Z . Here, we propose the IPW estimator, regression, and DR estimator for the defined network causal effects, and discuss the intuition behind the construction of these estimators.

In this article, we consider the non-ignorable missing confounders. The missingness mechanism is ignorable if the absence of the confounders is only dependent on the observed data, that is, R i ⊥ ⊥ X i ∣ A i , Y i , and the confounders are non-ignorable if the missingness depends on the missing values. Under non-ignorable missingness, the full data distribution is not identifiable without additional assumptions. Therefore, for identification purposes, we assume that the missingness mechanism for confounders X i satisfies the group-level outcome independent missingness assumption, which is modified from the outcome-independent missingness assumption in [2], that is, R i ⊥ ⊥ Y i ∣ A i , X i . The associated causal diagram is shown in Figure 1. The assumption is plausible if the confounders are measured long before the outcome data are collected. For example, as mentioned in Yang et al. [2], the potentially exposed children and their neighbourhoods were more carefully measured than those that were not under the risk of exposure in the water crisis study in Flint, Michigan U.S., which implies that the missingness may depend on both the measured confounders and the exposure status. In addition, the health status of the children was tested long after the confounders (e.g., age) were collected. Thus, the missingness of confounders is independent of the outcome conditional on all the other relevant information including observed confounders and exposure status. Miao and Tchetgen Tchetgen [15] proposed using an additional auxiliary variable in the missing data setting and proposed various semiparametric and DR estimators. In our case, the outcome Y serves as the auxiliary variable for the confounder missing indicator R . Our estimators here are counterpart estimators of those.

$Figure 1 The causal diagram for the group-level outcome independent missingness assumption, where the dashed line represents the conditional independence between Y i {Y}_{i} and R i {R}_{i} , 1 ≤ i ≤ N i 1\le i\le {N}_{i} .$

Figure 1

The causal diagram for the group-level outcome independent missingness assumption, where the dashed line represents the conditional independence between Y i and R i , 1 ≤ i ≤ N i .

3 Estimation

3.1 Inverse probability weighting

In this section, we propose an IPW estimator for network treatment effects when confounders are subject to non-ignorable missingness mechanism. The idea of constructing the IPW estimator is to weigh each individual with the inverse of probability of receiving the treatment, such that the association between confounders and treatment assignment can be removed. However, since X is not fully observed, Pr ( A ∣ X ) is not estimable without further adjustment. Besides, even if the data are fully observed, we cannot directly apply the IPW as the data are not independent within the same group. To address this, we utilize the inverse of the group-level joint propensity score of treatment assignment and missingness mechanism as the weight for each subject, that is, 1 ∕ Pr ( A i , R i j ∣ X i ) , i = 1 , 2 , … K , j = 1 , 2 , … N i , and replace the number of individuals in each group by the sum of inverse joint propensity scores within the group.

To obtain the group-joint propensity score modelling for the treatment and missingness, we assume P ( R i j = 1 ∣ A i j , X i j ) is correctly specified as P ( R i j = 1 ∣ A i j , X i j ; γ ) , and P ( A i j = 1 ∣ R i j = 1 , X i j ) is correctly specified as P ( A i j = 1 ∣ R i j = 1 , X i j ; δ ) . The unknown parameter γ can be estimated using generalized methods of moments, and δ can be estimated via maximum likelihood approach. After obtaining the estimates of P ( R i j = 1 ∣ A i j , X i j ) and P ( A i j = 1 ∣ R i j = 1 , X i j ) , the joint density of treatment A and missingness R conditional on confounders X can be parameterized as follows [16]:

(1) Pr ( a , r ∣ x ) = ψ ( a , a 0 , r , r 0 ∣ x ) Pr ( r ∣ a 0 , x ) Pr ( a ∣ r 0 , x ) ∑ r = 0 1 ∑ a = 0 1 ψ ( a , a 0 , r , r 0 ∣ x ) Pr ( r ∣ a 0 , x ) Pr ( a ∣ r 0 , x ) ,

where r 0 = 1 , a 0 = 1, and

ψ ( a , a 0 , r , r 0 ∣ x ) = Pr ( r ∣ a , x ) Pr ( r 0 ∣ a 0 , x ) Pr ( r ∣ a 0 , x ) Pr ( r 0 ∣ a , x ) .

For a complete proof of equation (1), please refer to Web Appendix A of Chen [16]. We then construct the IPW estimator for the population average potential outcome with estimated parameters as follows:

(2) μ ˆ a α ipw = 1 K ∑ i = 1 K Y ˆ i IPW ( a , α ) = 1 K ∑ i = 1 K ∑ j = 1 N i w ˆ i j a α Y i j ∑ j = 1 N i w ˆ i j a α ;

and the marginal average potential outcome is defined as follows:

(3) μ ˆ α ipw = 1 K ∑ i = 1 K Y ˆ i IPW ( α ) = 1 K ∑ i = 1 K ∑ j = 1 N i w ˆ i j α Y i j ∑ j = 1 N i w ˆ i j α ,

where

(4) w ˆ i j a α = ∑ j = 1 N i 1 ( A i j = a ) 1 ( R i j = 1 ) P α , x ( A i ( − j ) ) Pr ˆ ( A i , R i j ∣ X i ; δ ˆ , γ ˆ ) ,

(5) w ˆ i j α = ∑ j = 1 N i 1 ( R i j = 1 ) P α , x ( A i ) Pr ˆ ( A i , R i j ∣ X i ; δ ˆ , γ ˆ ) ,

where Y ˆ i IPW ( a , α ) = ∑ j = 1 N i w ˆ i j a α Y i j ∕ ∑ j = 1 N i w ˆ i j a α , and Y i IPW ( α ) = ∑ j = 1 N i w ˆ i j α Y i j ∕ ∑ j = 1 N i w ˆ i j α . Assume the estimation equations for δ and γ are ∑ i = 1 K ψ δ ( O i ; δ ) = 0 and ∑ i = 1 K ψ γ ( O i ; γ ) = 0 , respectively, where O i = ( X i , Y i , A i , R i ) . Let ψ a IPW ( O i ; μ 1 α , γ , δ ) = Y ˆ i IPW ( 1 , α ) − μ 1 α , and ψ a IPW ( O i ; μ 0 α , γ , δ ) = Y ˆ i IPW ( 0 , α ) − μ 0 α . Then the estimated parameter θ ˆ IPW = ( δ ˆ , γ ˆ , μ ˆ 1 α IPW , μ ˆ 0 α IPW ) is a solution of the following estimation equations:

(6) ψ K ( θ ) = ∑ i = 1 K ψ IPW ( O i ; θ ) = 0 ,

where ψ IPW ( O i ; θ ) = { ψ δ ( O i ; δ ) , ψ δ ( O i ; γ ) , ψ a IPW ( O i ; μ 1 α , γ , δ ) , ψ a IPW ( O i ; μ 0 α , γ , δ ) } T . The true values of unknown parameter θ = ( δ , γ , μ 1 α , μ 0 α ) is the solution to ψ ( θ ) = ∫ ψ IPW ( O i ; θ ) d F ( o i ) = 0 , where F denotes the cumulative function of O i . Note that the random observations O i are i.i.d., and it can be shown that, for the last two of the true parameters θ , μ a α = E ( ∑ a i ( − j ) ∈ A ( N i − 1 ) Y i j ( a , a i ( − j ) ) π ( a i ( − j ) ; α ) ∕ N i ) , a = 0 , 1 . By the strong law of large numbers, and the uniqueness of the solution to ψ a IPW ( O i ; μ ) , we have that Y ˆ IPW ( a ; α ) is consistent for μ a α . The asymptotic normality follows by the M-estimation theory [17]. We show an example below of using equations (2) and (3) to obtain the IPW estimators. In this article, we utilize the direct interference pattern, where the interference happens between the units only through their treatment assignment. The treatment assignment is dependent on its own characteristics and the group random effect. For the missingness mechanism, we assume that missingness of each unit is only dependent on its own treatment assignment and characteristics. The outcome of each unit is dependent on its own characteristics and the treatment assignment of the whole group. For example, we may assume:

logit { Pr ( A i j = 1 ∣ X i j = x i j , R i j = 1 , b i ; δ ) } = δ 0 + δ 1 x i j + b i ,
logit { Pr ( R i j = 1 ∣ A i j = a i j , X i j = x i j ; γ ) } = γ 0 + γ 1 a i j + γ 2 x i j , f ( y i j ∣ a i , x i , r i j = 1 ) = N ( β 0 + β 1 + β 2 ⋅ ( a i j , f a ( a i ( − j ) ) ) T + β 3 x i j , ζ i ) , where N ( μ , σ 2 ) is a normal density function with mean μ and variance σ 2 ,
logit { P α , x ( a i j ) } = ξ i ( α ) + δ x i j ,

where f a ( ⋅ ) is a summary function that represents the other units effect on the same cluster, ζ i ∼ N ( 0 , σ ζ 2 ) and b i ∼ N ( 0 , σ b 2 ) represent the random effects that introduce the dependency between units in the same cluster. Assume p ˆ a r ( X i ) = Pr ( A i j ′ , R i j ∣ X i j , b i , δ ˆ , γ ˆ ) . The estimate of the joint probability of Pr ( A i , R i j ∣ X i ) can be obtained by

(7) Pr ( A i , R i j ∣ X i , δ ˆ , γ ˆ ) = ∫ − ∞ ∞ ∏ j ′ = 1 N i p ˆ a r ( X i ) A i j ( 1 − p ˆ a r ( X i ) ) 1 − A i j f ( b i ) d b i ,

and p ˆ a r ( X i ) is obtained by equation (1). The consistency and asymptotic normality of the IPW estimator are presented in Theorem 1.

Theorem 1

If P ( A i j = 1 ∣ R i j = 1 , X i j ; δ ) and P ( R i j = 1 ∣ A i j , X i j ; γ ) are correctly specified, then IPW estimator μ ˆ a α IPW is consistent for μ a α , and

K ( θ ˆ IPW − θ ) ⟶ d N ( 0 , Σ IPW ) ,

as K goes to infinity, where Σ IPW = U ( θ ) − 1 V ( θ ) { U ( θ ) − 1 } T , U ( θ ) = E { − ∂ ψ IPW ( O i ; θ ) ∕ ∂ θ T } , and V ( θ ) = E { ψ IPW ( O i ; θ ) ψ IPW ( O i ; θ ) T } .

The proof of Theorem 1 is given in the Supplementary materials.

3.2 Regression

In this section, we first introduce the idea of constructing the estimation equation for the regression estimator. Notice that by exchangibility assumption, we can estimate the causal effect by regressing Y on A and X , and then marginalize over X if confounders are fully observed. However, if confounders are subject to non-ignorable missingness, we cannot directly estimate the causal effect because we do not know the distribution of the confounders in the whole population. To recover the full data distribution, we need to obtain the joint probability density of X and Y conditional on A and R = 0 for each group, which requires estimation of the odds ratio model and the group level joint probability density of X and Y conditional on A and R = 1 . First, we model f ( Y i j ∣ A i , X i j , R i j = 1 ) as f ( Y i j ∣ A i , X i j , R i j = 1 ; β ) , and model f ( X i j ∣ A i , R i j = 1 ) as f ( X i j ∣ A i , R i j = 1 ; β ) . We then define the odds ratio function O R ( X , Y ∣ A ) as follows:

(8) O R ( X i j , Y i j ∣ A i ) = log f ( X i j , Y i j ∣ A i , R i j = 0 ) f ( X i j = x 0 , Y i j ∣ A i , R i j = 1 ) f ( X i j , Y i j ∣ A i , R i j = 1 ) f ( X i j = x 0 , Y i j ∣ A i , R i j = 0 ) = log P ( R i j = 0 ∣ A i , X i j , Y i j ) P ( R i j = 1 ∣ A , X i j = x 0 , Y i j = 0 ) P ( R i j = 1 ∣ A , X i j , Y i j ) P ( R i j = 0 ∣ A i , X i j = x 0 , Y i j = 0 ) = log P ( R i j = 0 ∣ A i , X i j ) P ( R i j = 1 ∣ A i , X i j = x 0 ) P ( R i j = 1 ∣ A i , X i j ) P ( R i j = 0 ∣ A i , X i j = x 0 ) = log f ( X i j ∣ A i , R i j = 0 ) f ( X i j = x 0 ∣ A i , R i j = 1 ) f ( X i j ∣ A i , R i j = 1 ) f ( X i j = x 0 ∣ A i , R i j = 0 ) ,

where 1 ≤ i ≤ K , 1 ≤ j ≤ N i , and x 0 is an arbitrary fixed constant. For simplicity, we let x 0 = 0 in the subsequent sections. Since the last equation does not depend on Y , we can simplify the notation O R ( X i j , Y i j ∣ A i ) as O R ( X i j ∣ A i ) , which we model as O R ( X i j ∣ A i ; ζ ) . Thus, we can parameterize the joint probability density of X and Y conditional on A and R = 0 as follows:

(9) f ( X i j , Y i j ∣ A i , R i j = 0 ; β , ζ ) = exp { O R ( X i j ∣ A i ; ζ ) } f ( X i j , Y i j ∣ A i , R i j = 1 ; β ) E [ exp { O R ( X i j ∣ A ; ζ ) } ∣ A i , R i j = 1 ; β ] .

The parameter ζ can be estimated from the equation ∑ i = 1 K ψ ζ ( O i ; ζ ) = 0 , where ψ ζ ( O i ; ζ ) has the following expression:

(10) ( 1 − R i ) T { l ( A i , Y i ) − E { l ( A i , Y i ) ∣ A i , R i = 0 ; β ˆ , ζ } } ,

where l ( A , Y ) are pre-defined vectorized differentiable functions, and

(11) E { l ( A i , Y i j ) ∣ A i , R i j = 0 ; β ˆ , ζ } = ∫ exp { O R ( X i j ∣ A i ; ζ ) } f ( X i j , Y i j ∣ A i , R i j = 1 ; β ˆ ) l ( A i , Y i j ) d y E [ exp { O R ( X i j ∣ A i ; ζ ) } ∣ A i , R i j = 1 ; β ˆ ] .

Once we have f ( X i j , Y i j ∣ A i , R i j = 0 ; β , ζ ) , we can obtain f ( X i j , Y i j ∣ A i ; β , ζ ) = ∑ r = 0 , 1 f ( X i j , Y i j ∣ A i , R i j = r i j ; β , ζ ) P ( R = r i j ∣ A i ) , where P ( R i j = r i j ∣ A i j ) can be directly estimated by maximum likelihood estimation (MLE) because R and A are fully observed. Note that f ( Y i j ∣ A i , X i j ; β , ζ ) ∝ f ( X i j , Y i j ∣ A i ; β , ζ ) ; thus, we can obtain the regression estimators. More specifically, let g i j ( a i , x i ) = E ( Y i j ∣ A i = a i , X i = x i , R i j = 1 ) , then the regression estimators for the average potential outcome and the marginal average potential outcome have the following expressions:

(12) μ ˆ a α reg = 1 K ∑ i = 1 K Y ˆ i reg ( a , α ) = 1 K ∑ i = 1 K 1 N i ∑ j = 1 N i ∑ a i ( − j ) 1 ( R i j = 1 ) g ˆ i j ( a i , X i ; β ˆ ) P α , x ( a i ( − j ) ) + 1 N i ∑ j = 1 N i ∑ a i ( − j ) 1 ( R i j = 0 ) P α , x ( a i ( − j ) ) E { g ˆ i j ( a i , X i ; β ˆ ) ∣ A i = a i , R i j = 0 ; β ˆ , ζ ˆ } ,

(13) μ ˆ α reg = 1 K ∑ i = 1 K Y ˆ i reg ( α ) = 1 K ∑ i = 1 K 1 N i ∑ j = 1 N i ∑ a i 1 ( R i j = 1 ) g ˆ i j ( a i , X i ; β ˆ ) P α , x ( a i ) + 1 N i ∑ j = 1 N i ∑ a i 1 ( R i j = 0 ) P α , x ( a i ) E { g ˆ i j ( a i , X i ; β ˆ ) ∣ A i = a i , R i j = 0 ; β ˆ , ζ ˆ } ,

where β ˆ is the MLE of β and ζ ˆ is an estimator of ζ by equation (10). Assume the estimation equation for β is ∑ i = 1 K ψ β ( O i ; β ) = 0 . Let ψ a reg ( O i ; μ 1 α , ζ , β ) = Y ˆ i reg ( 1 , α ) − μ 1 α , and ψ a reg ( O i ; μ 0 α , ζ , β ) = Y ˆ i reg ( 0 , α ) − μ 0 α denote the estimation equations for μ 1 α and μ 0 α , respectively. Then the estimated parameter θ ˆ reg = ( β ˆ , ζ ˆ , μ ˆ 1 α reg , μ ˆ 0 α reg ) is a solution of the following estimation equations:

(14) ∑ i = 1 K ψ reg ( O i ; θ ) = 0 .

The consistency and normality of the regression estimator are presented in Theorem 2.

Theorem 2

If the baseline outcome regression model f ( Y i j ∣ A i , X i j R i j = 1 ; β ) and the distribution of observed confounders f ( X i j ∣ A i , R i j = 1 ) are correctly specified, then the regression estimator μ ˆ a α reg is consistent for μ a α , and

K ( θ ˆ reg − θ ) ⟶ d N ( 0 , Σ reg ) ,

as K goes to infinity, where Σ reg = U ( θ ) − 1 V ( θ ) { U ( θ ) − 1 } T , U ( θ ) = E { − ∂ ψ reg ( O i ; θ ) ∕ ∂ θ T } , and V ( θ ) = E { ψ reg ( O i ; θ ) ψ reg ( O i ; θ ) T } .

The proof of Theorem 2 is given in the Supplementary materials.

3.3 Doubly robust estimation

In this section, we aim to propose a DR estimator μ ˆ a α in the sense that it is consistent if either the propensity score models or the baseline regression models, but not necessarily both, are correctly specified. The typical DR estimator involves two parts, where the first part is the regression term, and the second part is the inverse probability weighted residuals of the regression estimator. When confounders are missing not at random, both the propensity score of missingness and the outcome regression model contain the odds ratio model which can be seen in the definition in equation (8). Therefore, the specification of propensity score models and the regression models are not variational independent. Hence, the specification of the propensity score models and the regression model cannot be fully seperated. More specifically, in our setting, the specification of both the propensity score of missingness and the probability f ( x i j , y i j ∣ a i , r i j = 0 ) contains the odds ratio O R ( x i j ∣ a i ) , that is, O R ( x i j ∣ a i ) lies in the intersection of the IPW estimator and the regression estimator. Thus, to construct the DR estimator, we assume O R ( x ∣ a ; ζ ) is always correctly specified, and the proposed DR estimator is consistent if either the IPW models, Pr ( a i j = 1 ∣ r i j = 1 , x i j ; δ ) and Pr ( r i j = 1 ∣ a i j = 1 , x i j ; γ ′ ) , are correctly specified or the outcome regression model conditional on the observed values, f ( y i j ∣ a i , x i j , r i j = 1 , β ) and f ( x i j ∣ a i , r i j = 1 , β ) , are correctly specified, where γ = ( ζ , γ ′ ) . ζ is obtained by solving the equation ∑ i = 1 K ψ ζ ( O i ; ζ ) = 0 , where ψ ζ ( O i ; ζ ) = v T ( A i , R i , X i ; δ , ξ ) ϕ ( A i , Y i ; β ˆ , ξ ˆ ) ,

(15) v ( A i , R i , X i ; δ , ξ ) = R i 1 Pr ( R i 1 = 1 ∣ A i 1 , X i 1 , δ ˆ , ξ ˆ ) − 1 R i 2 Pr ( R i 2 = 1 ∣ A i 2 , X i 2 , δ ˆ , ξ ˆ ) − 1 ⋯ R i N i Pr ( R i N i = 1 ∣ A i N i , X i N i , δ ˆ , ξ ˆ ) − 1 , and

(16) ϕ ( A i , Y i ; β ˆ , ξ ˆ ) = l ( A i 1 , Y i 1 ) − E { l ( A i 1 , Y i 1 ) ∣ A i , R i 1 = 0 ; β ˆ , ξ ˆ } l ( A i 2 , Y i 2 ) − E { l ( A i 2 , Y i 2 ) ∣ A i , R i 2 = 0 ; β ˆ , ξ ˆ } ⋯ l ( A i N i , Y i N i ) − E { l ( A i N i , Y i N i ) ∣ A i , R i N i = 0 ; β ˆ , ξ ˆ } .

Since the confounders are not fully observed, we replace the weights in the residuals of the regression estimator in the traditional DR estimator by the joint probability of A and R conditional on X , and construction of the regression estimator term is similar as that in Section 3.2. Hence, the DR estimator has the following representation:

(17) μ ˆ a α d r = 1 K ∑ i = 1 K ∑ j = 1 N i E { h ˆ i j a ( A i , X i , Y i ) ∣ A i = a i , R i j = 0 ; δ ˆ , γ ˆ , β ˆ } + 1 ( R i j = 1 ) Pr ( R i j ∣ A i = a i , X i ; γ ) × { h ˆ i j a ( A i , X i , Y i ) − E { h ˆ i j a ( A i , X i , Y i ) ∣ A i = a i , R i j = 0 ; δ ˆ , γ ˆ , β ˆ } } ] ]

and

(18) μ ˆ α d r = 1 K ∑ i = 1 K ∑ j = 1 N i E { h ˆ i j ( A i , X i , Y i ) ∣ A i = a i , R i j = 0 ; δ ˆ , γ ˆ , β ˆ } + 1 ( R i j = 1 ) Pr ( R i j ∣ A i , X i ; γ ) { h ˆ i j ( A i , X i , Y i ) − E { h ˆ i j ( A i , X i , Y i ) ∣ A i , R i j = 0 ; δ ˆ , γ ˆ , β ˆ } } ] ] ,

where h i j a ( A i , X i , Y i ; δ , γ , β ) , h i j ( A i , X i , Y i ; δ , γ , β ) have the following expressions:

(19) h i j a ( A i , X i , Y i ; δ , γ , β ) = w i j a α ( Y i j − g i j ( A i , X i ; β ) ) + 1 N i ∑ a i ( − j ) P α , x ( a i ( − j ) ) g i j ( a , X i , β ) ,

(20) h i j ( A i , X i , Y i ; δ , γ , β ) = w i j α ( Y i j − g i j ( A i , X i ; β ) ) + 1 N i ∑ a i P α , x ( a i ) g i j ( a , X i , β ) ,

(21) w i j a α = 1 ( A i j = a ) P α , x ( A i ) Pr ( A i ∣ X i ; δ , γ ) ∑ j = 1 N i 1 ( A i j = a ) P α , x ( A i ) Pr ( A i ∣ X i ; δ , γ ) ,

and

(22) w i j α = P α , x ( A i ) Pr ( A i ∣ X i ; δ , γ ) ∑ j = 1 N i P α , x ( A i ) Pr ( A i ∣ X i ; δ , γ ) .

The basic idea of the construction of h i j a ( A i , X i , Y i ; δ , γ , β ) lies in that equation (19) is a doubly robust estimator of μ a α , in the sense that the expectation of h a ( A , X , Y ) converges to μ a α when data are fully observed, that is, E { ∑ j = 1 N i h i j a ( A i , X i , Y i ; δ , γ , β ) } = μ a α , if either the IPW models or regression models are correctly specified. When the propensity score models are correctly specified,

E ∑ j = 1 N i h i j a ( A i , X i , Y i ; δ , γ , β ) = E ∑ j = 1 N i w i j a α Y i j + ∑ a i ( − j ) P α , x ( A i ( − j ) ) g i j ( a , X i , β ) − w i j a α g i j ( A i , X i ; β ) = μ a α ipw .

Similarly, when the baseline regression models have been correctly specified, it is obvious to see that the expectation of the first part of equation (19) is equal to zero, and the expectation of the second part is the regression estimator, that is, E { ∑ j = 1 N i h i j a ( A i , X i , Y i ; δ , γ , β ) } = μ a α reg . Therefore, to obtain the doubly robust estimator when confounders are subject to non-ignorable missingness, we can replace the observed outcomes and regression estimator in the conventional residuals weighted DR estimator by h i j a ( A i , X i , Y i ; δ , γ , β ) and E { h i j ( X i , Y i ; δ , γ , β ) ∣ A i , R i j = 0 } , respectively. Hence, the constructed estimator can be viewed as a new residuals weighted DR estimator, where the weights are the inverse of the estimated propensity score of missingness, and the residuals come from the constructed h i j a ( A i , X i , Y i ; δ , γ , β ) instead of the regression estimator. By allowing either one of the set of models to be correctly specified, the doubly robustness property can also be achieved accordingly. For estimating γ , δ , β , the estimators can be obtained by maximum likelihood approach using observed data when R = 1 . While our current study primarily focuses on the assumption of homogeneous parameters, β , γ , and δ , it is worth considering the potential implications of allowing these parameters to vary across clusters. In such a scenario, the parameters of the group average potential outcome would exhibit variation across different clusters. The estimation equation for γ , δ , β can be written as ∑ i = 1 K ψ α ( O i ; γ ) = 0 , ∑ i = 1 K ψ δ ( O i ; δ ) = 0 , and ∑ i = 1 K ψ β ( O i ; β ) = 0 , respectively. Let ψ a d r ( O i ; μ 1 α , ζ , β ) = Y ˆ i d r ( 1 , α ) − μ 1 α , and ψ a d r ( O i ; μ 0 α , ζ , β ) = Y ˆ i d r ( 0 , α ) − μ 0 α denote the estimation equations for μ 1 α and μ 0 α , respectively. Then the estimated parameter θ ˆ d r = ( α ˆ , δ ˆ , β ˆ , ζ ˆ , μ ˆ 1 α d r , μ ˆ 0 α d r ) is a solution of the following estimation equations:

(23) ∑ i = 1 K ψ d r ( O i ; θ ) = 0 .

The consistency and asymptotic normality of the proposed DR estimator are summarized in the following theorem.

Theorem 3

Assume O R ( X i j ∣ A i ; ζ ) is correctly specified, if either (a) Pr ( R i j = 1 ∣ A i j = 1 , X i ; δ ) and Pr ( A i j = 1 ∣ R i j = 1 , X i j ; γ ) or (ii) f ( Y i j ∣ A i , X i j , R i j = 1 ; β ) and f ( X i j ∣ A i , R i j = 1 ) are correctly specified; then the DR estimator μ ˆ a α d r is consistent for μ a , where μ ˆ a α d r is obtained from equation (17), and

K ( θ ˆ d r − θ ) ⟶ d N ( 0 , Σ d r ) ,

as K goes to infinity, where Σ d r = U ( θ ) − 1 V ( θ ) { U ( θ ) − 1 } T , U ( θ ) = E { − ∂ ψ d r ( O i ; θ ) ∕ ∂ θ T } , and V ( θ ) = E { ψ d r ( O i ; θ ) ψ d r ( O i ; θ ) T } .

The proof of Theorem 3 is given in the Supplementary materials.

4 Simulation

In this section, we conduct a simulation study to illustrate the proposed IPW, regression, and DR estimator. First, under setting 1 (S1), we generate a population of size N = 10,000 , and randomly classify the whole population into K = 100 mutually exclusive groups with N i = 100 individuals in each group. Our model generation process contains the generation of ( X , A , R , Y ) in order. First, we generate the confounders X , followed by generating A using the specified model for the propensity score of treatment. Next, we generate the missingness indicator using a model for the missingness mechanism with A and X . Finally, we generate the outcome Y using an outcome regression model that incorporates A and X . More specifically, for each individual, the covariate X 1 is generated from Bernoulli distribution with probability 0.5, and the covariate X 2 is generated from the standard normal distribution. To generate treatment and missing indicators, we assume two logistic models. We assume a mixed-effect logistic model logit { Pr ( A i j = 1 ∣ R i j = 1 , X i j = x i j , b i ) } = 0.1 + 0.2 x 1 i j − 0.1 x 2 i j + b i for the propensity score of treatment, where b i is group level random effect terms generated from N ( 0 , σ 2 ) , i.e., the normal distribution with mean 0 and variance σ 2 . We assume a logistic model logit { Pr ( R i j = 1 ∣ A i j = a i j , X i j = x i j ) } = 1 + a i j + 1.2 x 1 i j − 0.5 x 2 i j for the propensity score model of missingness. Then the joint propensity score for treatment and missingness with random effects Pr ( a i j , r i j = 1 ∣ x i j , b i ) is calculated according to equation (3) by Chen [16], and the missingness indicator R i j and the treatment indicator A i j are generated from the joint propensity score, which is given by the following equation:

Pr ( A i , R i j = 1 ∣ X i ) = ∫ ∏ j ′ = 1 n i { h i j ′ 11 ( b i ) } A i j { h i j ′ 01 ( b i ) } ( 1 − A i j ) f b ( b i ; σ 2 ) d b i ,

where h i j a r ( b i ) = Pr ( a i j = a , r i j = r ∣ x i j , b i ) . Finally, we generate the outcome Y i j from Y i j = 1 + 2 x 1 i j − 3 x 2 i j + 0.5 x 1 i j x 2 i j + 2 a i j + 3 p i ( a i ) + ε i j , where p i ( a i ) is the proportion of units in group i that receive the treatment, and { ε i j } 1 ≤ i ≤ K , 1 ≤ j ≤ N i are i.i.d. random noise terms generated by N ( 0 , σ 2 ) . We consider four settings for the number of observations and the variance of the group effect:

N = 10,000 , K = 50 , and σ 2 = 0.25 ;
N = 10,000 , K = 50 , and σ 2 = 0.16 ;
N = 12,000 , K = 200 , and σ 2 = 0.25 ;
N = 12,000 , K = 200 , and σ 2 = 0.16 .

Second, letting α = 0.5 , the propensity score models are correctly specified as Pr ( a i j = 1 ∣ r i j = 1 , x i j ; δ ) = logit − 1 ( δ 1 + δ 2 x 1 i j + δ 3 x 2 i j + b i ) and Pr ( r = 1 ∣ a , x ; γ ) = logit − 1 ( γ 1 + γ 2 a + γ 1 x 1 i j + γ 2 x 2 i j ) . The parameter δ = ( δ 1 , δ 2 , δ 3 ) and γ = ( γ 1 , γ 2 , γ 3 , γ 4 ) are estimated with lme4 [18] package in R [19]. The regression model is correctly specified as E ( y i j ∣ a i , x i , r i j = 1 ) = β 0 + β 1 a i j + β 2 p i ( a i ) + β 3 x 1 i j + β 4 x 2 i j + β 5 x 1 i j x 2 i j . The parameters β are estimated by MLE. Then, the IPW estimator, regression estimator, and doubly robust estimators (DR-TT: when all models are correctly specified), i.e., μ ˆ a α ipw , μ ˆ a α reg , μ ˆ a α d r , are calculated according to equations (2), (12), and (17), respectively.

Third, in the scenario when there exist model misspecification, DR estimator DR-TF is calculated when the outcome model is misspecified as E ( y i j ∣ a i , x i , r i j = 1 ) = β 0 + β 1 a i j + β 3 x 1 i j + β 4 x 2 i j + β 5 x 1 i j x 2 i j , and the propensity score models are correctly specified.

The DR estimator DR-FT is calculated when the propensity score of treatment is incorrectly specified as Pr ( a i j = 1 ∣ r i j = 1 , x i j ; δ ) = logit − 1 ( δ 1 + δ 2 x 1 i j + b i ( 1 ) ) , and the the outcome model is correctly specified.

Finally, the DR estimator DR-FF is calculated when the propensity score for treatment is incorrectly specified as Pr ( a i j = 1 ∣ r i j = 1 , x i j ; δ ) = logit − 1 ( δ 1 + δ 2 x 1 i j + b i ( 1 ) ) , and the outcome model is misspecified as E ( y i j ∣ a i , x i , r i j = 1 ) = β 0 + β 1 a i j + β 3 x 1 i j + β 4 x 2 i j + β 5 x 1 i j x 2 i j .

The simulation study is repeated 1,000 times, and the estimators of α , γ , β , μ 1 α , μ 0 α , and D E ¯ ( α ) are summarized in the following tables.

The bias and empirical coverages of the proposed estimators are shown in Tables 1, 2, and Figure 2, where the 95% Wald-type confidence intervals are constructed according to the asymptotic distribution of the proposed estimators in Theorems 1–3. When both the IPW models and the regression model are correctly specified, the IPW, regression, and DR estimators all perform well in terms of having small bias and variances, and the DR robust estimator DR-TT has the smallest variance among all the estimators, which supports our theoretical result that the associated DR estimator achieves the semiparametric efficiency bound when confounders are fully observed and all the working models are correctly specified. When the outcome regression model is correctly specified, the regression estimator has the smallest variance among all the proposed estimators. When both the model of propensity score for treatment and the regression model are misspecified, the bias of μ ˆ 0 α d r has the same magnitude with the other estimators when either the IPW or the regression model, but not both, is correctly specified. As shown in Table 2, all three estimators achieve nominal levels when corresponding models are correctly specified, which indicates the standard error formulas proposed in Theorem 1, 2, and 3 are valid; when both propensity score and regression models are misspecified, the coverage probability of the DR estimator decreases to zero.

Table 1

Bias, empirical standard error (se), and standard deviation [in brackets] for IPW estimators, regression estimators, and doubly robust estimators under S1,S2, S3, and S4

		IPW	REG	DR-TT	DR-TF	DR-FT	DR-FF
S1	μ 1 , 0.5	0.005[0.051]	‒ 0.037 [0.013]	‒ 0.062 [0.073]	0.043[0.072]	0.081[0.046]	‒ 0.102 [0.191]
	μ 0 , 0.5	0.003[0.059]	‒ 0.003 [0.047]	‒ 0.023 [0.082]	0.023[0.080]	0.030[0.067]	0.054[0.072]
	DE ¯ ( 0.5 )	0.002[0.082]	‒ 0.033 [0.130]	‒ 0.040 [0.072]	0.020 [0.068]	0.050[0.091]	‒ 0.156 [0.104]
	s e ( μ 1 , 0.5 )	0.152	0.030	0.078	0.103	0.102	0.181
	s e ( μ 0 , 0.5 )	0.110	0.032	0.102	0.110	0.085	0.138
	s e ( DE ¯ ( 0.5 ) )	0.114	0.069	0.080	0.075	0.113	0.166
S2	μ 1 , 0.5	0.035[0.109]	0.034[0.059]	‒ 0.018 [0.042]	0.017[0.046]	‒ 0.023 [0.047]	0.060[0.121]
	μ 0 , 0.5	0.018[0.122]	‒ 0.008 [0.057]	‒ 0.005 [0.042]	‒ 0.007 [0.042]	0.041[0.061]	‒ 0.796 [0.092]
	DE ¯ ( 0.5 )	0.016[0.118]	0.042[0.077]	0.020[0.052]	0.019 [0.058]	‒ 0.019 [0.080]	0.860[0.142]
	s e ( μ 1 , 0.5 )	0.302	0.079	0.054	0.052	0.041	0.084
	s e ( μ 0 , 0.5 )	0.111	0.073	0.057	0.057	0.069	0.094
	s e ( DE ¯ ( 0.5 ) )	0.234	0.069	0.067	0.064	0.088	0.171
S3	μ 1 , 0.5	‒ 0.005 [0.180]	0.019[0.046]	‒ 0.009 [0.041]	0.037[0.027]	‒ 0.004 [0.031]	‒ 0.105 [0.081]
	μ 0 , 0.5	‒ 0.012 [0.041]	0.021[0.042]	‒ 0.014 [0.029]	‒ 0.020 [0.028]	0.028[0.032]	0.049[0.081]
	DE ¯ ( 0.5 )	0.007[0.099]	‒ 0.002 [0.010]	0.006[0.035]	0.057 [0.032]	‒ 0.03 [0.041]	‒ 0.155 [0.072]
	s e ( μ 1 , 0.5 )	0.154	0.036	0.041	0.037	0.029	0.090
	s e ( μ 0 , 0.5 )	0.045	0.038	0.042	0.037	0.018	0.110
	s e ( DE ¯ ( 0.5 ) )	0.137	0.089	0.041	0.051	0.033	0.080
S4	μ 1 , 0.5	0.110[0.162]	0.022[0.046]	‒ 0.030 [0.018]	0.010[0.022]	‒ 0.011 [0.032]	0.053[0.085]
	μ 0 , 0.5	‒ 0.018 [0.035]	0.013[0.038]	‒ 0.002 [0.017]	‒ 0.001 [0.017]	0.028[0.034]	0.053[0.085]
	DE ¯ ( 0.5 )	0.128[0.185]	0.008[0.019]	‒ 0.028 [0.024]	0.011 [0.033]	‒ 0.040 [0.042]	0.160[0.074]
	s e ( μ 1 , 0.5 )	0.225	0.048	0.037	0.027	0.037	0.103
	s e ( μ 0 , 0.5 )	0.149	0.048	0.037	0.027	0.028	0.120
	s e ( DE ¯ ( 0.5 ) )	0.249	0.015	0.041	0.040	0.038	0.090

Table 2

Coverage probability (%) of IPW estimators, regression estimators, and doubly robust estimators under S1, S2, S3, and S4

		IPW	REG	DR-TT	DR-TF	DR-FT	DR-FF
S1	μ 1 , 0.5	97.0	91.0	94.7	92.8	98.5	0
	μ 0 , 0.5	96.0	91.0	91.2	94.0	95.5	0
	DE ¯ ( 0.5 )	95.5	89.5	90.5	92.5	93.0	0
S2	μ 1 , 0.5	98.5	89.5	98.5	93.5	90.5	0
	μ 0 , 0.5	95.5	92.5	91.5	95.5	90.5	0
	DE ¯ ( 0.5 )	98.5	92.0	93.0	92.0	88.5	0
S3	μ 1 , 0.5	95.5	93.0	96.2	94.8	96.8	59.8
	μ 0 , 0.5	96.0	93.0	95.0	93.6	94.8	69.8
	DE ¯ ( 0.5 )	97.0	95.5	97.2	96.0	89.2	56.6
S4	μ 1 , 0.5	93.0	90.0	97.0	92.0	92.5	50.6
	μ 0 , 0.5	97.0	96.0	98.5	96.5	91.5	57.4
	DE ¯ ( 0.5 )	92.0	92.5	96.5	92.5	85.5	56.0

$Figure 2 Bias of the (1) IPW, (2) regression, and DR estimators for μ 1 , 0.5 {\mu }_{1,0.5} and μ 0 , 0.5 {\mu }_{0,0.5} under four scenarios: (3) both propensity and outcome regression models are correctly specified; (4) propensity score models are correctly specified; (5) when only outcome model is correctly specified; (6) when neither the outcome regression model nor the IPW models are correctly specified. (The boxplot of μ ˆ 0 , 0.5 d r {\hat{\mu }}_{0,0.5}^{dr} in the second scenario is dropped because the absolute value of the bias is greater than 0.2.).$

Figure 2

Bias of the (1) IPW, (2) regression, and DR estimators for μ 1 , 0.5 and μ 0 , 0.5 under four scenarios: (3) both propensity and outcome regression models are correctly specified; (4) propensity score models are correctly specified; (5) when only outcome model is correctly specified; (6) when neither the outcome regression model nor the IPW models are correctly specified. (The boxplot of μ ˆ 0 , 0.5 d r in the second scenario is dropped because the absolute value of the bias is greater than 0.2.).

5 Application

The particulate matter 2.5 (PM2.5) refers to tiny particles or droplets in the air that can affect people’s short-term or even long-term health conditions such as respiratory issues, increased mortality from lung cancer, and heart disease. A primary strategy to achieve the reduction of ambient PM2.5 is the installation of flue-gas desulfurization equipment or controls (“scrubbers”) to reduce the sulfur dioxide ( SO 2 ), nitrogen dioxide ( NO x ), and carbon dioxide ( CO 2 ) emissions, which are three main air pollutants that mediate the changes of PM2.5. In 1990, the U.S. Clean Air Act (CAA) amendments launched the ARP to reduce the emissions of the ambient SO 2 , NO x , and CO 2 by regulating those power plants to install scrubbers on coal-fired electricity-generating units (EGUs).

In this section, we apply the proposed estimators on a dataset from the U.S. EPA’s Air Quality System (AQS) to estimate the causal effect of installing the scrubbers on the reduction of ambient NO x emissions. The dataset contains monthly emissions data from 1218 EGUs in the United States in 2004. The 1218 EGUs are classified into 40 clusters through the linkage algorithm by Zigler et al. [5]. Among 1218 EGUs, 913 EGUs installed the scrubbers and 205 EGUs did not install the scrubbers. Six important characteristics of the EGUs and the atmosphere are included in the data analysis: the heat input rate, the capacity of the EGU, the number of scrubbers installed, the sulfur content, the operation time, and the average temperature in the previous year. As these characteristics affect both the emissions of NO x and the installation of scrubbers, they are treated as confounders of the causal relationship between the outcome and the treatment. For the treatment (“scrubbers” installation) coverage probability, the average of the proportion of treated units among all the groups is 66.79%. For the missingness in confounders, there are 15.76% missingness in sulfur content, 2.25% missingness in operation time, 21.69% missingness in heat input rate, 0.99% in capacity, and 24.90% missingness in total. The missingness can be caused by multiple reasons such as monitor errors (e.g., the operation time exceeds a certain time), the failure of recording, and the changes of filter. Since the units with fewer NO x emissions are less likely to disclose the baseline characteristics, and records of the baseline characteristics were measured long before the emissions of NO x took place, it is plausible that the outcome-independence assumption holds.

We assume a linear fixed-effect regression model for the outcome, and assume different mixed-effects logistic regression models for the propensity score of treatment and missingness, respectively. To account for the dependency of the treatment allocation strategy on the baseline characteristics, following Papadogeorgou et al. [10], we assume P α , x ( a i ) as the logistic regression model: logit { P α , x ( a i j ) } = ξ i α + ∑ j = 1 6 β i X i j , where ξ is estimated by solving the following equation,

1 N i ∑ j = 1 N i expit ( ξ i α + β j X i j ) = α .

Since in the dataset there are over 80% of units that lying in the clusters with average proportion of units with scrubbers installed ranges from 0.3 to 0.8, we consider values of α varying from 0.3 to 0.8.

Figure 3 presents the results of IPW, regression, and DR estimators for the direct effect DE( α ) across different values of α . When α = 0.3 , DE( α ) of the IPW, regression, and DR estimators are − 29.45 , − 49.25 , and − 34.72 , respectively. When α = 0.8 , DE( α ) of the IPW, regression and DR estimators are − 19.55 , − 21.99 , and − 17.44 , respectively. The estimated DE( α ) are negative for all three estimators across different α values, indicating that the intervention of installing scrubbers has a positive effect on reducing the emissions of the tons of NO x , and there would be approximately 30 tons of NO x emissions fewer per unit among the units with scrubbers installed than that among the units without scrubbers installed. As α increases, DE( α ) has an increasing trend, which implies that the intervention at one EGU is beneficial for the reduction of the emissions of NO x , but the effect is smaller when the proportion of the units within the same cluster that have scrubbers installed becomes larger.

$Figure 3 Estimates and 95% Wald-type confidence intervals of D E ¯ ( α ) \overline{DE}\left(\alpha ) of scrubber installation for (1) IPW, (2) regression, and (3) DR estimator with α ∈ ( 0.3 , 0.8 ) \alpha \in \left(0.3,0.8) , where the shadow area represents the pointwise confidence intervals.$

Figure 3

Estimates and 95% Wald-type confidence intervals of D E ¯ ( α ) of scrubber installation for (1) IPW, (2) regression, and (3) DR estimator with α ∈ ( 0.3 , 0.8 ) , where the shadow area represents the pointwise confidence intervals.

Let α be the average treatment coverage probability among 40 clusters, i.e., α 0 = 0.67 ; in this setting, IPW, regression, and DR estimates for the indirect ( I E ¯ ( 0.6 , α 1 ) ), total effect ( T E ¯ ( 0.6 , α 1 ) ), and overall effect ( O E ¯ ( α 1 ) ) are given in Figure 4. The indirect effect has a decreasing trend when α 1 − α 0 increases, and it is positive when α 1 < 0.67 and negative when α 1 > 0.67 . For example, when α 1 = 0.46 , the DR estimate is 15.14, and 95% confidence interval (CI) is (11.19, 19.76), which suggests that there would be 15.14 more tons of NO x emissions in units without scrubbers installed within groups with 46% coverage compared to that with groups with 67% coverage of scrubber installation; when α 1 = 0.75 , the estimate is − 6.84 for DR estimator, and the corresponding 95% CI is (-8.82, -5.40), which implies that we would expect 3.90 tons of NO x emissions fewer among units without scrubbers installed within groups with 75% coverage compared to groups with average coverage probability. It is also worth noting that the confidence intervals would decrease as the difference between α 0 and α 1 decreases, which indicates that the decrease in the variance of the estimator of μ 0 α and μ 1 α cannot offset the increase in the correlation between the two estimators, because both the estimators are dependent on the treatment allocation function π ( a i ( − j ) ; α ) . The total effect of the proposed estimators, which combine both the direct and the indirect effects, have a slightly decreasing trend when α increases. For example, when α = 0.5 , the IPW, regression, and DR estimates are − 18.15 , − 25.70 , and − 16.91 , and the corresponding 95% CIs are ( − 24.38 , − 14.43 ), ( − 30.27 , − 21.64 ), and ( − 18.78 , − 7.66 ), respectively; when α = 0.75 , the IPW, regression, and DR estimates are − 25.56 , − 28.35 , and − 22.29 , and the corresponding 95% CIs are ( − 29.44 , − 21.71 ), ( − 31.56 , − 24.85 ), and ( − 27.28 , − 19.83 ), respectively. Therefore, all three estimators indicate that there would be fewer NO x emissions per unit with scrubbers within groups with higher coverage probability compared to the unit without scrubbers installed in groups with lower coverage probability. The estimates of the overall effect of scrubber installation, which quantify the difference in the tons of NO x emissions under two treatment allocation strategies, suggest that there would be fewer NO x emissions within groups with higher average percentage of the coverage of scrubber installation.

$Figure 4 Estimates and 95% Wald-type confidence intervals of I E ¯ ( α ) \overline{IE}\left(\alpha ) , T E ¯ ( α ) \overline{TE}\left(\alpha ) , and O E ¯ ( α ) \overline{OE}\left(\alpha ) of scrubber installation for (1) IPW, (2) regression, and (3) DR estimator with α ∈ ( 0.3 , 0.8 ) \alpha \in \left(0.3,0.8) , where the shadow area represents the pointwise confidence intervals.$

Figure 4

Estimates and 95% Wald-type confidence intervals of I E ¯ ( α ) , T E ¯ ( α ) , and O E ¯ ( α ) of scrubber installation for (1) IPW, (2) regression, and (3) DR estimator with α ∈ ( 0.3 , 0.8 ) , where the shadow area represents the pointwise confidence intervals.

6 Discussion

In this article, we constructed three consistent estimators: IPW, regression, and DR estimators for four types of network causal effects: the direct, indirect, total, and overall effects when the confounders are missing not at random. Under the outcome-independent missingness assumption, the IPW and regression estimators are consistent and asymptotically normal if the corresponding models are correctly specified, and the consistency of the DR estimator requires that either the joint modelling of the propensity score of treatment and missingness or the outcome regression model, but not necessarily both, are correctly specified.

The designed doubly robust estimator is based on the conventional DR estimator proposed by Kang and Schafer [20] in the general setting without interference. In the setting when interference exists, the methodology avoids the assumption of SUTVA and can recover the causal effect in the whole population with observed data. In the real application, we compared the performance of three proposed estimators and showed that the intervention of installing scrubbers has a positive effect on reducing the emissions of NO x which, in turn, may potentially reduce the ambient PM2.5 and the concentrations of ozone.

One limitation of the proposed methods is that it can cause increased computational burden when the number of units in each group increases, and it may not be suitable to implement the estimators when the confounders are high-dimensional. Adapting the proposed methods to handle high-dimensional data is an interesting direction for future research. Another limitation lies in that the estimators do not work well for the cases when the proportion of missingness is large. In this study, we only consider the setting under the partial interference only; In the scenarios with more general interference patterns, where the outcome of one unit can be affected by the confounders or outcomes of other units, relying solely on linear models for the regression estimator and regression component in the doubly robust estimator may not be sufficient. In such scenarios, alternative methods, such as deep neural networks, may be necessary to obtain consistent estimates of the average potential outcome. These methods can capture the complex dependencies and non-linear relationships that may arise in the presence of complex interference patterns. In this article, we assume a binary missingness indicator for each individual. As pointed out by reviewers, this assumption may not always reflect reality. We leave the investigation of more general missingness patterns for future research.

Funding information: Dr. Zhu’s research was supported by Grant Number RGPIN-2017-04064 from the Natural Sciences and Engineering Research Council of Canada. Dr. Dubin’s research was supported by Grant Number RGPIN-2020-04382 from the Natural Sciences and Engineering Research Council of Canada.
Author contributions: ZS proposed the methodology and conducted all simulation and data analyses, and did most of the writing of the article, while all authors (ZS, YZ, JAD) contributed modifications to the methods, suggestions for simulation runs and the data analysis, as well as proofread and edited the paper.
Conflict of interest: The author states no conflict of interest.
Data availability statement: The datasets analyzed during the current study are available at https://campd.epa.gov/data/custom-data-download.

References

[1] Rubin DB. Inference and missing data. Biometrika. 1976;63:581–92. 10.1093/biomet/63.3.581Search in Google Scholar

[2] Yang S, Wang L, Ding P. Causal inference with confounders missing not at random. Biometrika. 2019;106(4):875–88. 10.1093/biomet/asz048Search in Google Scholar

[3] Rubin DB. Randomization analysis of experimental data: The Fisher randomization test comment. J Amer Stat Assoc. 1980;75(371):591–3. 10.2307/2287653Search in Google Scholar

[4] Cox DR. Planning of experiments. New York: Wiley; 1958. Search in Google Scholar

[5] Zigler C, Kim C, Choirat C, Hansen J, Wang Y, Hund L, et al. Causal inference methods for estimating long-term health effects of air quality regulations. Res Report (Health Effects Institute). 2016;187:5–49. 10.1289/isee.2016.4081Search in Google Scholar

[6] Hyland A, Ambrose BK, Conway KP, Borek N, Lambert E, Carusi C, et al. Design and methods of the population assessment of Tobacco and health (PATH) study. Tobacco Control. 2017;26(4):371–8. 10.1136/tobaccocontrol-2016-052934Search in Google Scholar PubMed PubMed Central

[7] Tchetgen Tchetgen EJ, VanderWeele TJ. On causal inference in the presence of interference. Stat. Methods Med Res. 2012;21:55–75. 10.1177/0962280210386779Search in Google Scholar PubMed PubMed Central

[8] Hudgens MG, Halloran ME. Toward causal inference with interference. J Amer Stat Assoc. 2008;103:832–42. 10.1198/016214508000000292Search in Google Scholar PubMed PubMed Central

[9] Liu L, Hudgens MG, Saul B, Clemens JD, Ali M, Emch ME. Doubly robust estimation in observational studies with partial interference. Stat. 2019;8(1):e214. 10.1002/sta4.214Search in Google Scholar PubMed PubMed Central

[10] Papadogeorgou G, Mealli F, Zigler CM. Causal inference with interfering units for cluster and population level treatment allocation programs. Biometrics. 2019;75(3):778–87. 10.1111/biom.13049Search in Google Scholar PubMed PubMed Central

[11] Sun Z, Liu L. Semiparametric inference of causal effect with nonignorable missing confounders. Stat Sinica. 2021;31:1–20. 10.5705/ss.202018.0466Search in Google Scholar

[12] Perez-Heydrich C, Hudgens MG, Halloran ME, Clemens JD, Ali M, Emch ME. Assessing effects of cholera vaccination in the presence of interference. Biometrics. 2014;70(3):731–41. 10.1111/biom.12184Search in Google Scholar PubMed PubMed Central

[13] Ogburn EL, Sofrygin O, Diaz I, Van Der Laan MJ. Causal inference for social network data. 2017. arXiv: http://arXiv.org/abs/arXiv:170508527. Search in Google Scholar

[14] Ding P, Geng Z. Identifiability of subgroup causal effects in randomized experiments with nonignorable missing covariates. Stat Med. 2014;33:1121–33. 10.1002/sim.6014Search in Google Scholar PubMed

[15] Miao W, Tchetgen Tchetgen EJ. On varieties of doubly robust estimators under missingness not at random with a shadow variable. Biometrika. 2016;103:475–82. 10.1093/biomet/asw016Search in Google Scholar PubMed PubMed Central

[16] Chen YH. A semiparametric odds ratio model for measuring association. Biometrics. 2007;63(2):413–21. 10.1111/j.1541-0420.2006.00701.xSearch in Google Scholar PubMed

[17] van der Vaart A. Asymptotic statistics. Cambridge: Cambridge University Press; 1998. 10.1017/CBO9780511802256Search in Google Scholar

[18] Bates D, Mächler M, Bolker B, Walker S. Fitting linear mixed-effects models using lme4. J Stat Software. 2015;67(1):1–48. 10.18637/jss.v067.i01Search in Google Scholar

[19] R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria; 2020. https://www.R-project.org/. Search in Google Scholar

[20] Kang JDY, Schafer JL. Demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data. Stat Sci. 2007;22:523–39. 10.1214/07-STS227Search in Google Scholar PubMed PubMed Central

Received: 2022-12-01

Revised: 2023-12-15

Accepted: 2024-01-31

Published Online: 2024-08-10

This work is licensed under the Creative Commons Attribution 4.0 International License.

Supplementary material

Articles in the same Issue

https://doi.org/10.1515/jci-2022-0079

Keywords for this article

causal inference; missing data; partial interference; air pollution

Creative Commons

BY 4.0

Estimation of network treatment effects with non-ignorable missing confounders

Article

Abstract

1 Introduction

2 Notation and assumptions

3 Estimation

3.1 Inverse probability weighting

Theorem 1

3.2 Regression

Theorem 2

3.3 Doubly robust estimation

Theorem 3

4 Simulation

5 Application

6 Discussion

References

Supplementary Material

Articles in the same Issue

Articles in the same Issue

Articles in the same Issue