
Direct, indirect, and interaction effects based on principal stratification with a binary mediator

Myoung-jae Lee
Published/Copyright: June 21, 2024

Abstract

Given a binary treatment and a binary mediator, mediation analysis decomposes the total effect of the treatment on an outcome variable into various sub-effects, and two-, three-, and four-way decompositions have appeared in the literature. Using "principal stratification" based on the potential mediator types, we consider sub-treatment effects for "mediative never-takers, compliers, defiers, and always takers." Although it is difficult in general to prefer any one decomposition over the others, within this approach a particular three-way decomposition turns out to be well suited, and we advocate its use. We present identification conditions for the effects using conditional means, followed by simple estimators that are applicable to any form of outcome variable (binary, count, continuous, etc.). We also provide simulation and empirical studies.

MSC 2010: 62D20

1 Introduction

Given a binary treatment D , a binary mediator M , and an outcome/response variable Y , the causal chain of interest in mediation analysis is

D → Y and D → M → Y.

The total effect of D on Y consists of the direct effect of D on Y and the indirect effect through M . This is an important issue in many disciplines of science, as reviewed by MacKinnon et al. [1], Pearl [2], Imai et al. [3], TenHave and Joffe [4], Preacher [5], VanderWeele [6,7], and Nguyen et al. [8], among others.

The total effect of D on Y can be found in various ways, such as matching, regression adjustment and inverse probability weighting (see, e.g., Lee and Lee [9] and Choi and Lee [10] for reviews on treatment effect estimators). Traditionally, the total effect has been decomposed into sub-effects with linear structural-form (SF) models for M as a function of D and Y as a function of ( D , M ) , e.g., with some α and β parameters,

(1.1) M = α_1 + α_d D + ε and Y = β_1 + β_d D + β_m M + β_{dm} DM + U,

where (ε, U) are the error terms with E(ε | D) = 0 and E(U | D, M) = 0, α_d is the effect of D on M, β_d is the direct effect of D on Y, β_m is the direct effect of M on Y, and β_{dm} is the interaction effect of DM on Y. Here, β_m α_d is the "indirect effect" of D on Y through M, which is the "effect of M on Y" times the "effect of D on M." The standard Baron-Kenny approach (Baron and Kenny [11]) does not include the interaction part β_{dm} DM, which turns out to be a major complicating term for total-effect decomposition.
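For concreteness, the following is a minimal sketch of the traditional approach to (1.1), assuming a pandas DataFrame df with columns Y, D, and M (the data-frame, column, and function names are illustrative, not from the paper): two OLS fits give α̂_d, β̂_d, β̂_m, and β̂_{dm}, and the Baron-Kenny-style indirect effect is β̂_m α̂_d.

```python
# Minimal sketch of estimating the linear SFs in (1.1) by OLS (illustrative names/data).
import pandas as pd
import statsmodels.formula.api as smf

def linear_sf_decomposition(df: pd.DataFrame) -> dict:
    m_fit = smf.ols("M ~ D", data=df).fit()              # M = a1 + ad*D + e
    y_fit = smf.ols("Y ~ D + M + D:M", data=df).fit()    # Y = b1 + bd*D + bm*M + bdm*D*M + U
    ad = m_fit.params["D"]
    bd, bm, bdm = y_fit.params["D"], y_fit.params["M"], y_fit.params["D:M"]
    return {"alpha_d": ad, "beta_d": bd, "beta_m": bm, "beta_dm": bdm,
            "indirect (beta_m * alpha_d)": bm * ad}
```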

Linear SFs such as (1.1) are intuitive and easy to understand, but they are subject to specification errors and are not grounded in the modern counterfactual causal framework. Once we adopt nonparametric approaches to avoid SF misspecification and use potential versions of (M, Y), total-effect decomposition is no longer as straightforward as with (1.1). Note that when exogenous covariates X need to be controlled, an easy way to do so, generalizing (1.1), is

(1.2) M = α_1(X) + α_d(X) D + ε,  Y = β_1(X) + β_d(X) D + β_m(X) M + β_{dm}(X) DM + U,

where the α(X)'s and β(X)'s are functions of X. Specifying them as linear, the ordinary least squares (OLS) estimator can be applied to (1.2). This approach turns out to be the main estimator proposed in this article, although (1.2) will be derived formally using the potential versions of M and Y.

Consider the two potential versions M_d, d = 0, 1, of M corresponding to D = 0, 1, and the four potential responses Y_{dm} of Y for D = 0, 1 and M = 0, 1. Also, define the potential responses Y_d, d = 0, 1, corresponding to D = 0, 1 "when M is allowed to take its natural course given D = d":

Y_d ≡ Y_{d,M_d}.

Then, the mean/average total effect is

(1.3) Total effect: E(Y_1 − Y_0) = E(Y_{1,M_1} − Y_{0,M_0}).

In the literature, two-way decompositions of the total effect appeared first, as can be seen in Pearl [12] and Robins [13], among others:

(1.4) (a): E(Y_{1,M_1} − Y_{1,M_0}) + E{Y_{1,M_0} − Y_{0,M_0}},  (b): E{Y_{1,M_1} − Y_{0,M_1}} + E(Y_{0,M_1} − Y_{0,M_0}).

The terms in {⋅} are called the "natural direct effects" of D on Y, and the terms in (⋅) are the "natural indirect effects." In contrast, the "controlled direct effect" with M = m is E(Y_{1,m} − Y_{0,m}), but there is no generally agreed-on definition of a "controlled indirect effect." One glaring problem with (1.4) is that the decomposition is not unique: which one to take between (a) and (b)? Another problem is that the presence of the interaction effect is not clear in (1.4).

VanderWeele [14] proposed a three-way decomposition separating the aforementioned effects, and VanderWeele [15] proposed a four-way decomposition, which includes the existing two- and three-way decompositions in the literature as special cases by merging different terms in the four-way decomposition.

In this article, we define four subject types with (M_0, M_1):

(1.5)
Mediative Never Takers (NT): M_0 = 0, M_1 = 0;
Mediative ComPliers (CP): M_0 = 0, M_1 = 1;
Mediative DeFiers (DF): M_0 = 1, M_1 = 0;
Mediative Always Takers (AT): M_0 = 1, M_1 = 1.

When D is endogenous with a binary instrumental variable δ for D, Imbens and Angrist [16] and Angrist et al. [17] classified the subjects using the potential treatments (D_0, D_1) corresponding to δ = 0, 1, and the classification is analogous to (1.5), e.g., compliers are those with (D_0 = 0, D_1 = 1). Frangakis and Rubin [18] called the classification based on (D_0, D_1) "principal stratification," and doing analogously, we call (1.5) "mediative principal stratification" based on (M_0, M_1). We address only binary M in this article; allowing non-binary M is not straightforward and is thus left for future research.

With many decompositions of the total effect in the literature, the question is which one to use. There seems to be no single best answer to this question, but once the mediator types are taken into account, the following three-way decomposition seems well suited:

E(Y_{10} − Y_{00}) + E{(Y_{01} − Y_{00})(M_1 − M_0)} + E{(Y_{11} − Y_{01} − Y_{10} + Y_{00}) M_1}.  (M3M)

"M3M" stands for "Main 3-way Mediator-based decomposition." M3M appeared, in different notation, in VanderWeele [15], who also referred to VanderWeele and Tchetgen Tchetgen [19]. The following explains the three terms/effects in M3M.

First, the direct effect Y_{10} − Y_{00} occurs when D changes while the presence of M is nullified by m = 0 in Y_{dm}. Second, the indirect effect through M is the product of the "direct effect Y_{01} − Y_{00} of M on Y (with D nullified by d = 0 in Y_{dm})" times the "effect M_1 − M_0 of D on M," which occurs only to CP and DF (with opposite signs), as only they have M_1 ≠ M_0. Third, the interaction effect of DM is the "net effect of DM," which is the "gross effect of DM" minus the "direct/partial" effects of D and M (Choi and Lee [20]):

(1.6) ΔY^± ≡ Y_{11} − Y_{01} − Y_{10} + Y_{00} = Y_{11} − Y_{00} − (Y_{01} − Y_{00}) − (Y_{10} − Y_{00}) = (gross effect of DM) − (direct effect of M) − (direct effect of D).

This effect occurs only to CP and AT because only they have DM = D M_1 = 1, which is why M_1 (= 1 only for CP and AT) is attached to ΔY^± in M3M. Note that the interaction effect can be viewed either as the direct effect of D moderated by the level of M, or as the direct effect of M moderated by the level of D.

Our approach based on “mediative principal stratification” looks at which effects are associated with which type, so that each type’s contributions to the total effect suggest the appropriate decomposition of the total effect, as was seen just above. Our approach will further show that, despite the apparent “symmetry” in (1.4)(a) and (b), (1.4)(b) is better than (1.4)(a) because (1.4)(a) does not identify any effect in M3M, whereas (1.4)(b) identifies the indirect effect and the sum of the direct and interaction effects.

In the remainder of this article, Section 2 details M3M. Section 3 addresses identifying all sub-effects in M3M and maps out our estimation strategy. Section 4 presents the effect estimators. Sections 5 and 6 provide simulation and empirical studies. Finally, Section 7 concludes this article. The Appendix contains most proofs.

2 Main three-way mediator-based decomposition

In this section, first, we formally put forth M3M as a preferred decomposition and explain why. Second, we compare M3M to other decompositions in the literature. Third, we illustrate various decompositions, using simple SFs as in (1.1).

2.1 Mediator types and decompositions

Our starting point is the definitional equations for (Y_0, Y_1) in (1.3):

(2.1) (Y_0 ≡) Y_{0,M_0} = Y_{00} + (Y_{01} − Y_{00}) M_0,  (Y_1 ≡) Y_{1,M_1} = Y_{10} + (Y_{11} − Y_{10}) M_1,  Y_{0,M_1} = Y_{00} + (Y_{01} − Y_{00}) M_1,  Y_{1,M_0} = Y_{10} + (Y_{11} − Y_{10}) M_0;

these equalities can be verified by setting M_0 = 0, 1 and M_1 = 0, 1. We substitute these into the total effect (1.3) to obtain M3M, and we then explain why M3M is advocated.
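Since the substitution is purely algebraic, a quick symbolic check of the pointwise identity behind M3M can be done as follows (a minimal sketch of my own, not part of the paper):

```python
# Symbolic check that Y_{1,M1} - Y_{0,M0} equals the sum of the three M3M terms pointwise.
import sympy as sp

Y00, Y01, Y10, Y11, M0, M1 = sp.symbols("Y00 Y01 Y10 Y11 M0 M1")
lhs = (Y10 + (Y11 - Y10) * M1) - (Y00 + (Y01 - Y00) * M0)   # (2.1) plugged into (1.3)
direct = Y10 - Y00
indirect = (Y01 - Y00) * (M1 - M0)
interaction = (Y11 - Y01 - Y10 + Y00) * M1
assert sp.simplify(lhs - (direct + indirect + interaction)) == 0
```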

Proposition 1

The total effect (1.3) can be rewritten as our preferred three-way mediator-based decomposition M3M using the mediator types in (1.5), where (i) the interaction effect is for M_1 = 1 (CP or AT), (ii) the indirect effect is for M_1 − M_0 = 1 or −1 (CP or DF), and (iii) the direct effect is for every subject.

Our approach using mediator types considers which effects are associated with which mediator types. For this, we examine the last term (interaction effect) in M3M first, followed by the indirect effect (the second term) and the direct effect (the first term).

First, the interaction effect (i.e., the effect of DM) requires DM = D M_1 = 1, which occurs only to the mediator types AT and CP because only they have M_1 = 1. Hence, M_1 in ΔY^± M_1 works as a "qualification indicator" for the interaction effect. Although we assume that M is binary to take advantage of the mediator types in (1.5), if M were non-binary, M_1 would also work as the level of the "causal intensity" of D. Interpreting ΔY^± in (1.6) as the net effect of DM (i.e., the interaction effect) follows from the definition that whatever is left in Y_{11} − Y_{00} after the direct/partial effects of D and M are subtracted is the interaction effect.

Second, the indirect effect requires M_0 ≠ M_1, which occurs only to CP and DF because only they have M_0 ≠ M_1. Hence, M_1 − M_0 works as a qualification indicator for the indirect effect, much as M_1 does in the interaction effect ΔY^± M_1. M_1 − M_0 also works as the effect magnitude of D on M, so that when it is multiplied by the effect Y_{01} − Y_{00} of M on Y, the indirect effect is obtained. For example, if D is education for health knowledge, Y is health, and M is exercising, then the CP's are those who exercise due to D = 1, with the indirect effect (Y_{01} − Y_{00})(M_1 − M_0) = Y_{01} − Y_{00}, whereas the DF's are those who stop exercising due to D = 1, with the indirect effect Y_{00} − Y_{01} because of M_1 − M_0 = −1. Under Y_{01} − Y_{00} > 0, the indirect effect is positive for CP but negative for DF because DF stops exercising; DF may not exist in this example, though.

Third, since no qualification indicator involving M appears in the first term Y_{10} − Y_{00} of M3M, the direct effect Y_{10} − Y_{00} is associated with all types.

2.2 Comparisons to other decompositions

The most general decomposition yet is the four-way one in VanderWeele [15]:

(2.2)
(i) interact no, mediate no: direct effect for all types, E(Y_{10} − Y_{00});
(ii) interact no, mediate yes: indirect effect for CP, DF, E{(Y_{01} − Y_{00})(M_1 − M_0)};
(iii) interact yes, mediate no: interaction effect for AT, DF, E(ΔY^± M_0);
(iv) interact yes, mediate yes: interaction effect for CP, DF, E{ΔY^± (M_1 − M_0)},

using our notation. The type classifications are new, not in VanderWeele [15].

In (i), the direct effect of D needs neither interaction nor mediation, which is relevant for all types, as no M appears there. In (ii), which is the indirect effect for CP and DF, the effect needs mediation (M_1 − M_0) but not interaction (no ΔY^±). In (iii), which is the interaction effect (with ΔY^±) for AT and DF because only they have M_0 = 1, no mediation is needed (i.e., no M_1 − M_0). In (iv), which is the interaction effect for CP and DF because only they have M_1 ≠ M_0, the effect needs both interaction (ΔY^±) and mediation (M_1 − M_0). Note that VanderWeele [15] called (iii) the "reference interaction," as M_0 appears, and (iv) the "mediated interaction," as M_1 − M_0 appears. In the following, we compare (2.2) to other decompositions.

First, the four-way decomposition reduces to M3M, when the two interaction effects (iii) and (iv) in (2.2) are merged to yield the last term of M3M:

E(ΔY^± M_0) + E{ΔY^± (M_1 − M_0)} = E(ΔY^± M_1).

M3M is preferred to the four-way decomposition because the interaction effect in M3M is only for CP and AT, whereas (2.2)(iv) includes DF in the interaction effect even though DF has DM = D M_1 = 0. Also, (2.2)(iv) can be taken as part of either the interaction effect (due to ΔY^±) or the indirect effect (due to M_1 − M_0), but M3M merges (2.2)(iv) into (iii) to obtain the interaction effect only for CP and AT, which is appropriate because only they have DM = D M_1 = 1.

Second, instead of merging (iii) and (iv) as was done just now, merge (ii) and (iv):

E{(Y_{01} − Y_{00})(M_1 − M_0)} + E{(Y_{11} − Y_{10} − Y_{01} + Y_{00})(M_1 − M_0)} = E{(Y_{11} − Y_{10})(M_1 − M_0)}.

This then yields a three-way decomposition similar to, yet different from, M3M:

(2.3) E(Y_{10} − Y_{00}) + E{(Y_{11} − Y_{10})(M_1 − M_0)} + E(ΔY^± M_0).

The difference is that the middle indirect effect here has Y_{11} − Y_{10}, not Y_{01} − Y_{00} as in M3M, and the last interaction effect here is for AT and DF because M_0 = 1 only for AT and DF, even though the interaction effect should be zero for DF as they have DM = D M_1 = 0. Hence, (2.3) illustrates an inappropriate three-way decomposition.

Third, VanderWeele [14] proposed yet another three-way decomposition:

(2.4) E(Y_{1,M_0} − Y_{0,M_0}) + E(Y_{0,M_1} − Y_{0,M_0}) + E{ΔY^± (M_1 − M_0)},

using our notation; the three terms are the direct, indirect, and interaction effects. In view of the mediator types, however, the third term is inappropriate, because it is zero for AT due to M_1 − M_0 = 0, even though the interaction effect does occur for AT in (2.2)(iii).

Fourth, substitute (2.1) into (1.4)(a):

(2.5) E[{Y_{10} + (Y_{11} − Y_{10}) M_1} − {Y_{10} + (Y_{11} − Y_{10}) M_0} + {Y_{10} + (Y_{11} − Y_{10}) M_0} − {Y_{00} + (Y_{01} − Y_{00}) M_0}] = E{(Y_{11} − Y_{10})(M_1 − M_0)} + E(Y_{10} − Y_{00} + ΔY^± M_0).

Hence, (1.4)(a) consists of two terms: the first is an indirect effect differing from (2.2)(ii) because Y_{11} − Y_{10} appears instead of Y_{01} − Y_{00}, and the second is the sum of the direct effect in (2.2)(i) and the interaction effect for AT and DF in (2.2)(iii), with the interaction effect for CP in (2.2)(iv) omitted. Hence, (1.4)(a) is inappropriate.

Now, substitute (2.1) into (1.4)(b):

(2.6) E[{Y_{10} + (Y_{11} − Y_{10}) M_1} − {Y_{00} + (Y_{01} − Y_{00}) M_1} + {Y_{00} + (Y_{01} − Y_{00}) M_1} − {Y_{00} + (Y_{01} − Y_{00}) M_0}] = E(Y_{10} − Y_{00} + ΔY^± M_1) + E{(Y_{01} − Y_{00})(M_1 − M_0)}.

This is the same as M3M, except that the direct and interaction effects of M3M are merged into the first term. Hence, despite the apparent "symmetry" between (1.4)(a) and (1.4)(b), (1.4)(b) is a better decomposition than (1.4)(a), as long as we are aware that the direct effect in (1.4)(b) is inclusive of the interaction effect.

If the control and treatment are switched, then (1.4)(b) can be written as

E{Y_{0,M_0} − Y_{1,M_0}} + E(Y_{1,M_0} − Y_{1,M_1}) = −(1.4)(a),

which raises the question of whether (1.4)(b) becomes inappropriate with the switch. The answer is that our approach is not "symmetric" due to the definitions of CP and DF: the interaction effect now becomes relevant for DF and AT, as they have M_0 = 1 (i.e., M = 1 when actively treated). This makes (2.3), not M3M, the appropriate three-way decomposition, and then the last display becomes the appropriate two-way decomposition (with the minus sign due to the treatment reversal).

2.3 Illustration of decompositions

To illustrate various decompositions, we use the following potential variable models:

(2.7) M_d = 1[0 < α_1 + α_d d + X α_x − e],  Y_{dm} = β_1 + β_d d + β_m m + β_{dm} dm + X β_x + U,

where 1[A] ≡ 1 if A holds and 0 otherwise, (e, U) are error terms independent of X, and e ~ Uni(0,1), with Uni(0,1) standing for the uniform distribution on (0,1). The models are tightly specified, but they help in understanding the aforementioned decompositions. This non-essential subsection may be skipped.

Assuming 0 < α_1 + α_d d + X α_x < 1 for all X, it holds that

(2.8) E(M_d | X) = P(e < α_1 + α_d d + X α_x | X) = α_1 + α_d d + X α_x, so that M_d = α_1 + α_d d + X α_x + ε_d, where ε_d ≡ M_d − α_1 − α_d d − X α_x {E(ε_d | X) = 0 by construction}.

The linear M_d model in (2.8) is a "reduced form (RF)," as opposed to its SF in (2.7). We use the linear RF for M_d in (2.8) and the linear SF for Y_{dm} in (2.7) in the following.

The Y_{dm} model gives Y_{01} − Y_{00} = β_m and Y_{11} − Y_{10} = β_m + β_{dm}, which yield

E(Y_{0,M_0}) = E{Y_{00} + (Y_{01} − Y_{00}) M_0} = β_1 + E(X) β_x + β_m {α_1 + E(X) α_x},  E(Y_{1,M_1}) = β_1 + β_d + E(X) β_x + (β_m + β_{dm}) {α_1 + α_d + E(X) α_x},

where (2.1) is used. Subtracting the former from the latter renders the total effect:

(2.9) E(Y_{1,M_1} − Y_{0,M_0}) = β_d + β_m α_d + β_{dm} {α_1 + α_d + E(X) α_x}.

Different decompositions shuffle (2.9) in different ways as follows; note that

ΔY^± = Y_{11} − Y_{10} − (Y_{01} − Y_{00}) = β_m + β_{dm} − β_m = β_{dm}.

First, the four-way decomposition, which is the sum of the four terms in (2.2), is

E(Y_{10} − Y_{00}) + E{(Y_{01} − Y_{00})(M_1 − M_0)} + E(ΔY^± M_0) + E{ΔY^± (M_1 − M_0)} = β_d + β_m α_d + β_{dm} {α_1 + E(X) α_x} + β_{dm} α_d.

Here, the last term of (2.9) is split into β_{dm} {α_1 + E(X) α_x} and β_{dm} α_d.

Second, the three-way decomposition in M3M is the same as (2.9):

E(Y_{10} − Y_{00}) + E{(Y_{01} − Y_{00})(M_1 − M_0)} + E{(Y_{11} − Y_{10} − Y_{01} + Y_{00}) M_1} = β_d + β_m α_d + β_{dm} {α_1 + α_d + E(X) α_x},

which are the direct, indirect, and interaction effects, respectively.

Third, for (1.4)(a), substitute (2.7) and (2.8) into (2.5):

E{(Y_{11} − Y_{10})(M_1 − M_0)} + E(Y_{10} − Y_{00} + ΔY^± M_0) = (β_m + β_{dm}) α_d + [β_d + β_{dm} {α_1 + E(X) α_x}].

This is the same as the total effect in (2.9), but both terms here include the interaction effect β_{dm}, revealing again why the two-way decomposition (1.4)(a) is inappropriate.

Fourth, for (1.4)(b), substitute (2.7) and (2.8) into (2.6):

E(Y_{10} − Y_{00} + ΔY^± M_1) + E{(Y_{01} − Y_{00})(M_1 − M_0)} = [β_d + β_{dm} {α_1 + α_d + E(X) α_x}] + β_m α_d.

The second term is the indirect effect of M3M, whereas the first term is the sum of the direct and interaction effects, revealing again why (1.4)(b) is better than (1.4)(a).

3 Identification and estimation strategy

Our identification conditions for M3M with X are ("⫫" stands for statistical independence):

C(a): (i) D ⫫ (M_0, M_1) | X, (ii) D ⫫ (Y_{00}, Y_{01}, Y_{10}, Y_{11}) | X;  C(b): (M_0, M_1) ⫫ (Y_{00}, Y_{01}, Y_{10}, Y_{11}) | (D, X);  C(c): 0 < P(D = d, M = m | X) for all d, m = 0, 1 and X.

Conditions C(a) and C(b) are the "ignorability" of confounders in the treatment-mediator, treatment-outcome, and mediator-outcome relationships. C(a) and C(b) operate in two stages: as D precedes M, which in turn precedes Y, the first stage is D being independent of all potential future variables given X, and the second stage is (M_0, M_1) being independent of all potential future versions of Y given (D, X). C(c) is a support-overlap condition for D | X and M | (D, X); for example, C(c) implies

P(D = d | X) = P(D = d, M = 0 | X) + P(D = d, M = 1 | X) > 0 for d = 0, 1, and P(M = m | D = d, X) = P(D = d, M = m | X) / P(D = d | X) > 0 for d, m = 0, 1.

Slightly different conditions from C(a) and C(b) appeared in Imai et al. [3]:

D ⫫ (M_d, Y_{d′m}) | X and M_d ⫫ Y_{d′m} | (D, X), for all d, d′, m = 0, 1,

where the joint distributions of (M_0, M_1) and (Y_{00}, Y_{01}, Y_{10}, Y_{11}) do not appear, unlike in C(a) and C(b). Also, Petersen et al. [21] assumed

D ⫫ M_d | X,  D ⫫ Y_{dm} | X,  M ⫫ Y_{dm} | (D, X),  for all d, m = 0, 1.

Using marginal independence instead of joint independence, we can relax C(a) and C(b), but we continue to assume C(a) and C(b) for simplicity. In the following, we present “causal reduced forms (CRF’s)” for M and Y , which form the basis for identification.

Proposition 2

Under C(a) to C(c), the following CRF’s hold for M and Y :

(3.1) M = ψ_1(X) + ψ_d(X) D + U_m, where U_m ≡ M − E(M | D, X), ψ_1(X) ≡ E(M_0 | X), ψ_d(X) ≡ E(M_1 − M_0 | X);

(3.2) Y = μ_1(X) + μ_d(X) D + μ_m(X) M + μ_{dm}(X) DM + U_y, where U_y ≡ Y − E(Y | D, M, X), μ_1(X) ≡ E(Y_{00} | X), μ_d(X) ≡ E(Y_{10} − Y_{00} | X), μ_m(X) ≡ E(Y_{01} − Y_{00} | X), and μ_{dm}(X) ≡ E(ΔY^± | X),

where ψ_d(X), μ_d(X), μ_m(X), and μ_{dm}(X) are the X-conditional causal effects of interest.

Regarding (3.1), it holds for any form of M as long as M_1 − M_0 makes sense. However, since M3M is based on binary M, we continue to assume binary M. There are two "cells," D = 0, 1, and ψ_1(X) and ψ_d(X) are nonparametrically identified using

E(M | X, D = 0) = E(M_0 | X) = ψ_1(X), and E(M | X, D = 1) = E(M_1 | X) = ψ_1(X) + ψ_d(X).

This point can be understood by considering a nonparametric regression of M on X for the D = 0 and D = 1 groups separately.

Analogously to (3.1), (3.2) also holds for any form of Y, as long as Y_{10} − Y_{00}, Y_{01} − Y_{00}, and ΔY^± make sense; for example, Y can be binary, continuous, etc. There are four cells formed by D = 0, 1 and M = 0, 1, and analogously to the identification of {ψ_1(X), ψ_d(X)}, {μ_1(X), μ_d(X), μ_m(X), μ_{dm}(X)} are nonparametrically identified using the X-conditional means on the cells.
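As a concrete illustration of this cell-based identification (a minimal sketch of mine, assuming a pandas DataFrame df with columns Y, D, M and a discrete covariate X; names are illustrative), the X-conditional cell means directly recover ψ_1, ψ_d and μ_1, μ_d, μ_m, μ_{dm}:

```python
# Cell-mean identification sketch: E(M | X, D) and E(Y | X, D, M) on the (D, M) cells.
import pandas as pd

def cell_means(df: pd.DataFrame) -> pd.DataFrame:
    m_mean = df.groupby(["X", "D"])["M"].mean().unstack("D")                 # E(M | X, D)
    y_mean = df.groupby(["X", "D", "M"])["Y"].mean().unstack(["D", "M"])     # E(Y | X, D, M)
    out = pd.DataFrame(index=m_mean.index)
    out["psi_1"] = m_mean[0]                         # E(M0 | X)
    out["psi_d"] = m_mean[1] - m_mean[0]             # E(M1 - M0 | X)
    out["mu_1"] = y_mean[(0, 0)]                     # E(Y00 | X)
    out["mu_d"] = y_mean[(1, 0)] - y_mean[(0, 0)]    # E(Y10 - Y00 | X)
    out["mu_m"] = y_mean[(0, 1)] - y_mean[(0, 0)]    # E(Y01 - Y00 | X)
    out["mu_dm"] = (y_mean[(1, 1)] - y_mean[(0, 1)]
                    - y_mean[(1, 0)] + y_mean[(0, 0)])   # E(delta Y^+- | X)
    return out
```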

A model is an SF if it is a data-generating process with parameters of interest governing the behavior of the subjects. A model is an RF if it is derived from some SF's; e.g., an SF for Y has D and M on the right-hand side, and substituting out the SF for M yields the RF for Y with only D on the right-hand side. The parameters in an RF are not of direct interest, as they are functions of the underlying SF parameters. A CRF falls between an SF and an RF, as it is a derived form (RF) but with causal parameters of interest, such as μ_d(X), μ_m(X), and μ_{dm}(X) in (3.2). CRF's similar to (3.1) and (3.2) appeared in Lee [22,23], Mao and Li [24], Choi et al. [25], and Lee et al. [26].

To understand the effect decomposition better, substitute (3.1) into (3.2):

(3.3) Y = μ_1(X) + μ_m(X) ψ_1(X) + [μ_d(X) + μ_m(X) ψ_d(X) + μ_{dm}(X) {ψ_1(X) + ψ_d(X)}] D + {μ_m(X) + μ_{dm}(X) D} U_m + U_y.

The slope of D is the "X-conditioned total effect," consisting of the direct (μ_d(X)), indirect (μ_m(X) ψ_d(X)), and interaction (μ_{dm}(X) {ψ_1(X) + ψ_d(X)}) effects. Compare these to the constant-effect versions in (2.9): β_d + β_m α_d + β_{dm} {α_1 + α_d + E(X) α_x}.

If desired, {ψ_1(X), ψ_d(X), μ_1(X), μ_d(X), μ_m(X), μ_{dm}(X)} in (3.1) and (3.2) can be estimated nonparametrically. However, it is more practical to specify those functions as, e.g., linear functions of X, and then apply OLS to the resulting M and Y models. The details of this approach appear in the next section.

The OLS just mentioned is reminiscent of the traditional approach with (1.1), which raises the question: how does the aforementioned OLS differ from the traditional OLS applied to (1.1)? The answer is that there are three critical differences.

First, whereas the SF’s in (1.1), i.e., the data-generating processes, hold only for continuous M and Y in general, the aforementioned CRF’s for M and Y hold more generally for any forms of M and Y .

Second, (1.1) can be generalized to account for effect heterogeneity as in (1.2). However, proceeding with the SF’s in (1.2) makes it unclear what kind of direct, indirect, and interaction effects are actually estimated: e.g., are they (2.3) or M3M?

Third, if one starts off with (1.2), then it is not clear whether α_d(X), β_d(X), β_m(X), and β_{dm}(X) in (1.2) should be restricted or not. In contrast, the definitions of the unknown functions of X in (3.1) and (3.2) show how they might be restricted. For example, since ψ_1(X) ≡ E(M_0 | X), for binary M as in our setup, ψ_1(X) cannot be specified just as X α for a parameter α; rather, ψ_1(X) = Φ(X α) is more suitable for a distribution function Φ(⋅). Another example is μ_d(X) ≡ E(Y_{10} − Y_{00} | X): if Y is binary, E(Y_{10} − Y_{00} | X) should be bounded by [−1, 1], which can be accommodated by a smooth function with range in [−1, 1], such as the arctan function or Φ(X α_{10}) − Φ(X α_{00}), where Φ(X α_{10}) is for E(Y_{10} | X) and Φ(X α_{00}) is for E(Y_{00} | X). If these nonlinear functions are used, then the aforementioned OLS should be replaced by a nonlinear least-squares estimator, which will be further discussed in the next section.

4 Effect estimators using OLS and generalized method of moment (GMM)

To estimate the effects in M3M, we linearly approximate all ψ ( X ) and μ ( X ) terms in the CRF’s (3.1) and (3.2), and then apply OLS to (3.1) and (3.2). This is summarized in Proposition 3, which is followed by discussions on nonlinear estimation when some ψ ( X ) and μ ( X ) terms are nonlinearly specified.

Let X_α, X_1, X_d, X_m, X_{dm} consist of elements of X and their functions, with X_j of dimension k_j × 1, j = α, 1, d, m, dm. Linearly approximate the functions of X in (3.1):

(4.1) M = α_1′X_α + α_d′X_α D + U_m = α_m′Q_m + U_m,  α_m ≡ (α_1′, α_d′)′,  Q_m ≡ (X_α′, X_α′D)′;

α_1′X_α is for ψ_1(X) in (3.1) and α_d′X_α is for ψ_d(X). For simplicity, we use the same X_α in α_1′X_α and α_d′X_α, but different covariates can certainly be used, if desired.

Doing analogously for (3.2), we have

(4.2) Y = β_1′X_1 + β_d′X_d D + β_m′X_m M + β_{dm}′X_{dm} DM + U_y = β_y′Q_y + U_y,  β_y ≡ (β_1′, β_d′, β_m′, β_{dm}′)′,  Q_y ≡ (X_1′, X_d′D, X_m′M, X_{dm}′DM)′,

where β_1′X_1, β_d′X_d, β_m′X_m, and β_{dm}′X_{dm} are for μ_1(X), μ_d(X), μ_m(X), and μ_{dm}(X).

One might use different covariates X_α, X_1, X_d, X_m, and X_{dm} as in (4.1) and (4.2), but this approach would follow the conventional SF view as in (1.2). Rather, since our CRF's are not SF's and the X-conditional causal parameters are of RF type (e.g., ψ_d(X) ≡ E(M_1 − M_0 | X) is an RF function), it is better to control for all of X to ensure C(a) to C(c); i.e., setting X = X_α = X_1 = X_d = X_m = X_{dm} in (4.1) and (4.2) would be fine, as will be done in our simulation and empirical studies.

For asymptotic inference, we condition on X̄, where an upper bar denotes the sample average. This is to ignore errors of the form (X_m X_α′)‾ − E(X_m X_α′), which are relevant for the indirect and interaction effects, as accounting for such errors requires vectorizing the matrix (X_m X_α′)‾ − E(X_m X_α′) — an unnecessary complication. What is gained by conditioning on X̄ is ease of asymptotic inference, and what is lost is some "external validity," as the findings conditioned on X̄ apply, in principle, only to designs with X fixed. Our simulation study with random X will demonstrate, however, that not accounting for errors of the form (X_m X_α′)‾ − E(X_m X_α′) makes little difference.

Let 0_{a×b} be the a × b null vector, and let N be the sample size of independent and identically distributed observations; "β̂_d" denotes an estimator of β_d. All three effects in M3M are found by the two OLS's of M on Q_m and Y on Q_y. The Appendix provides the missing details for Proposition 3. Recall (3.3): the indirect effect is μ_m(X) ψ_d(X) and the interaction effect is μ_{dm}(X) {ψ_1(X) + ψ_d(X)}.

Proposition 3

  1. The direct effect estimator is X̄_d′β̂_d with

     √N X̄_d′(β̂_d − β_d) →_d N(0, Λ_d),  Λ̂_d ≡ (1/N) Σ_i λ̂_di² →_p Λ_d,  λ̂_di ≡ Ĝ_d {(1/N) Σ_i Q_yi Q_yi′}⁻¹ Q_yi Û_yi,  Ĝ_d ≡ (0_{1×k_1}, X̄_d′, 0_{1×(k_m+k_dm)}),  Û_yi ≡ Y_i − β̂_y′Q_yi.

  2. The indirect effect estimator is β̂_m′(X_m X_α′)‾ α̂_d with

     √N {β̂_m′(X_m X_α′)‾ α̂_d − β_m′(X_m X_α′)‾ α_d} →_d N(0, Λ_m),  Λ̂_m ≡ (1/N) Σ_i λ̂_mi² →_p Λ_m,  λ̂_mi ≡ Ĝ_{m1} {(1/N) Σ_i Q_yi Q_yi′}⁻¹ Q_yi Û_yi + Ĝ_{m2} {(1/N) Σ_i Q_mi Q_mi′}⁻¹ Q_mi Û_mi,  Ĝ_{m1} ≡ (0_{1×(k_1+k_d)}, α̂_d′(X_α X_m′)‾, 0_{1×k_dm}),  Ĝ_{m2} ≡ (0_{1×k_α}, β̂_m′(X_m X_α′)‾),  Û_mi ≡ M_i − α̂_m′Q_mi.

  3. The interaction effect estimator is β̂_{dm}′(X_{dm} X_α′)‾ (α̂_1 + α̂_d) with

     √N {β̂_{dm}′(X_{dm} X_α′)‾ (α̂_1 + α̂_d) − β_{dm}′(X_{dm} X_α′)‾ (α_1 + α_d)} →_d N(0, Λ_{dm}),  Λ̂_{dm} ≡ (1/N) Σ_i λ̂_{dm,i}² →_p Λ_{dm},  λ̂_{dm,i} ≡ Ĝ_{dm1} {(1/N) Σ_i Q_yi Q_yi′}⁻¹ Q_yi Û_yi + Ĝ_{dm2} {(1/N) Σ_i Q_mi Q_mi′}⁻¹ Q_mi Û_mi,  Ĝ_{dm1} ≡ (0_{1×(k_1+k_d+k_m)}, (α̂_1 + α̂_d)′(X_α X_{dm}′)‾),  Ĝ_{dm2} ≡ (β̂_{dm}′(X_{dm} X_α′)‾, β̂_{dm}′(X_{dm} X_α′)‾).

  4. The total effect estimator is X̄_d′β̂_d + β̂_m′(X_m X_α′)‾ α̂_d + β̂_{dm}′(X_{dm} X_α′)‾ (α̂_1 + α̂_d), which is asymptotically normal with the variance estimated by N⁻¹ Σ_i (λ̂_di + λ̂_mi + λ̂_{dm,i})².
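The following numpy sketch (my own illustration, not the author's code) implements the two OLS's and the three effect estimators of Proposition 3 under the simplification X = X_α = X_1 = X_d = X_m = X_{dm}; the influence-function standard error is shown for the direct effect, and the other effects are handled analogously.

```python
# Sketch of the M3M effect estimators from two OLS's (illustrative variable names).
import numpy as np

def m3m_ols(Y, D, M, X):
    """Y, D, M: (N,) arrays; X: (N, k) array including a constant column."""
    N = len(Y)
    Qm = np.hstack([X, X * D[:, None]])                                       # regressors of (4.1)
    Qy = np.hstack([X, X * D[:, None], X * M[:, None], X * (D * M)[:, None]]) # regressors of (4.2)
    a = np.linalg.lstsq(Qm, M, rcond=None)[0]            # (alpha_1', alpha_d')'
    b = np.linalg.lstsq(Qy, Y, rcond=None)[0]            # (beta_1', beta_d', beta_m', beta_dm')'
    k = X.shape[1]
    a1, ad = a[:k], a[k:]
    b1, bd, bm, bdm = b[:k], b[k:2*k], b[2*k:3*k], b[3*k:]

    Xbar = X.mean(axis=0)                                # sample average of X
    XXbar = X.T @ X / N                                  # sample average of X X'
    direct = Xbar @ bd
    indirect = bm @ XXbar @ ad
    interaction = bdm @ XXbar @ (a1 + ad)

    # Influence-function SE as in Proposition 3 (direct effect shown; others analogous).
    Uy = Y - Qy @ b
    Gd = np.concatenate([np.zeros(k), Xbar, np.zeros(2 * k)])
    lam_d = (Gd @ np.linalg.inv(Qy.T @ Qy / N)) @ (Qy.T * Uy)   # (N,) influence values
    se_direct = np.sqrt(np.mean(lam_d ** 2) / N)
    return {"direct": direct, "indirect": indirect, "interaction": interaction,
            "total": direct + indirect + interaction, "se_direct": se_direct}
```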

For continuous Y, linear approximations for {μ_1(X), μ_d(X), μ_m(X), μ_{dm}(X)} should be fine, but linear approximations for ψ_1(X) and ψ_d(X) can result in biases because M is binary; the same is true of {μ_1(X), μ_d(X), μ_m(X), μ_{dm}(X)} if Y is binary. In these cases, as was already noted, using functions with ranges in [0, 1] or [−1, 1] would improve the approximations. However, the price to pay is the ensuing complication in estimation, because a nonlinear estimator such as GMM (Hansen [27]) is necessary.

For GMM applied to the M-CRF, consider a nonlinear moment condition:

(4.3) E{θ(Z; α)} = 0,  θ(Z; α) ≡ [M − Φ(α_c′X_α) − {Φ(α_t′X_α) − Φ(α_c′X_α)} D] (X_α′, X_α′D)′,

where Z ≡ (M, X_α′, D)′, Φ(α_c′X_α) = E(M_0 | X_α), Φ(α_t′X_α) = E(M_1 | X_α), and α ≡ (α_c′, α_t′)′; the parameters α_c and α_t are for the control and treatment groups when M is taken as the outcome. The Appendix provides the GMM details.
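A small sketch (mine; the function and variable names are illustrative) of the moment function θ(Z; α) in (4.3), with Φ taken as the standard normal CDF:

```python
# Moment function theta(Z; alpha) for the M-CRF in (4.3).
import numpy as np
from scipy.stats import norm

def theta(a, M, D, Xa):
    """a = (a_c, a_t) stacked; M, D: (N,) arrays, Xa: (N, k). Returns the (N, 2k) moment matrix."""
    k = Xa.shape[1]
    a_c, a_t = a[:k], a[k:]
    resid = M - norm.cdf(Xa @ a_c) - (norm.cdf(Xa @ a_t) - norm.cdf(Xa @ a_c)) * D
    instruments = np.hstack([Xa, Xa * D[:, None]])        # (X_alpha', X_alpha' D)'
    return resid[:, None] * instruments                   # row i is theta(Z_i; a)'
```

The sample moment vector is the column mean of the returned matrix.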

Once the GMM for the M-CRF is done, E(M_1 − M_0 | X_α) = α_d′X_α for the indirect effect and E(M | X, D = 1) = E(M_1 | X) = (α_1 + α_d)′X_α for the interaction effect in Proposition 3 should be replaced by the estimates of Φ(α_t′X_α) − Φ(α_c′X_α) and Φ(α_t′X_α), respectively. The Appendix also addresses binary Y, for which E(Y_{jk} | X) = Φ(β_{jk}′X) with a parameter β_{jk}, j, k = 0, 1, is adopted; if Y is a count or zero-censored, then E(Y_{jk} | X) = exp(β_{jk}′X) can be used instead.

The asymptotic distributions for the effects with nonlinear approximations can be derived as in Proposition 3. However, since Y can take diverse forms, finding the asymptotic distributions of the effects for all forms of Y is cumbersome. Instead, we use the bootstrap for asymptotic inference.
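A minimal nonparametric-bootstrap sketch (my own; `estimator` stands for any of the effect estimators above) that also produces the ad hoc "implied Sd" used later in Section 6:

```python
# Nonparametric bootstrap for any scalar effect estimator mapping (Y, D, M, X) to a number.
import numpy as np

def bootstrap_ci(estimator, Y, D, M, X, B=2000, seed=0, level=0.95):
    rng = np.random.default_rng(seed)
    N = len(Y)
    draws = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, N, size=N)                  # resample rows with replacement
        draws[b] = estimator(Y[idx], D[idx], M[idx], X[idx])
    lo, hi = np.quantile(draws, [(1 - level) / 2, (1 + level) / 2])
    implied_sd = (hi - lo) / (2 * 1.96)                   # "implied Sd" from CI width, Section 6
    return lo, hi, implied_sd
```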

5 Simulation study

Recalling (2.7), we use four designs in our simulation study with D randomized, P(D = 0) = P(D = 1) = 0.5, N = 250 or 1,000, and 5,000 simulation repetitions:

Design 1: M_d = 1[0 < α_1 + α_d d + X α_x − Uni(0,1)], X ~ Uni(0,1), continuous Y_{dm};
Design 2: M_d = 1[0 < α_1 + α_d d + X α_x − Uni(0,1)], X ~ Uni(0,1), probit Y_{dm};
Design 3: M_d = 1[0 < α_1 + α_d d + X α_x + N(0,1)], X ~ N(0, 2²), continuous Y_{dm};
Design 4: M_d = 1[0 < α_1 + α_d d + X α_x + N(0,1)], X ~ N(0, 2²), probit Y_{dm}.

The error terms for M_d and Y_{dm} are independent of each other and of X. "Probit Y_{dm}" means Y_{dm} = 1[0 < continuous Y_{dm}], where "continuous Y_{dm}" is the Y_{dm} in (2.7) with U ~ N(0,1). In Designs 1 and 2, E(M_d | X) is linear, as was seen in (2.8).

As for the parameter values, we set

α_1 = 0, α_d = α_x = 0.5;  β_1 = 0, β_d = β_m = β_{dm} = 0.5, and β_x = −1.

β_x = −1 prevents binary Y_{11} from having too many 0s. We generate M and Y with

M = (1 − D) M_0 + D M_1,  Y = (1 − D)(1 − M) Y_{00} + (1 − D) M Y_{01} + D (1 − M) Y_{10} + D M Y_{11}.
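For reference, a compact sketch (mine, using the parameter values above) of the Design 1/Design 2 data-generating process; replacing the uniform error and X with normals gives Designs 3 and 4:

```python
# Data-generating process for Designs 1 and 2 (illustrative code, not the paper's).
import numpy as np

def simulate_design12(N, probit_y=False, seed=0):
    rng = np.random.default_rng(seed)
    a1, ad, ax = 0.0, 0.5, 0.5
    b1, bd, bm, bdm, bx = 0.0, 0.5, 0.5, 0.5, -1.0
    X = rng.uniform(size=N)
    e = rng.uniform(size=N)
    U = rng.normal(size=N)
    D = (rng.uniform(size=N) < 0.5).astype(float)        # randomized treatment
    M0 = (a1 + X * ax > e).astype(float)
    M1 = (a1 + ad + X * ax > e).astype(float)
    M = (1 - D) * M0 + D * M1
    Ydm = {(d, m): b1 + bd * d + bm * m + bdm * d * m + X * bx + U
           for d in (0, 1) for m in (0, 1)}
    if probit_y:                                         # Design 2: binary (probit) outcome
        Ydm = {key: (val > 0).astype(float) for key, val in Ydm.items()}
    Y = sum(((D == d) & (M == m)) * Ydm[(d, m)] for d in (0, 1) for m in (0, 1))
    return Y, D, M, X
```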

In Design 1, the true effects are all constant: omitting “effect,”

total = direct + indirect + interaction = β_d + β_m α_d + β_{dm} (α_1 + α_d + 0.5 α_x),

where 0.5 comes from E(X) = E{Uni(0,1)} = 0.5. In the other designs, however, the true effects are heterogeneous, and they are found numerically. For example, the true direct, indirect, and interaction effects in Design 2 are, recalling ΔY^± for the interaction effect,

E{Φ(β_1 + β_d + X β_x) − Φ(β_1 + X β_x)},  E{Φ(β_1 + β_m + X β_x) − Φ(β_1 + X β_x)} α_d,  E[{Φ(β_1 + β_d + β_m + β_{dm} + X β_x) − Φ(β_1 + β_d + X β_x) − Φ(β_1 + β_m + X β_x) + Φ(β_1 + X β_x)} (α_1 + α_d + X α_x)].

The effects are more complicated for Designs 3 and 4 due to the N(0,1) error in M_d.
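These Design 2 true effects can be computed numerically, e.g., by averaging the probit contrasts above over X ~ Uni(0,1), as in the following sketch (mine), which reproduces the Design 2 "tru" entries in Table 1:

```python
# Numerical computation of the Design 2 true effects by Monte Carlo integration over X.
import numpy as np
from scipy.stats import norm

a1, ad, ax = 0.0, 0.5, 0.5
b1, bd, bm, bdm, bx = 0.0, 0.5, 0.5, 0.5, -1.0

X = np.random.default_rng(0).uniform(size=2_000_000)
direct = np.mean(norm.cdf(b1 + bd + X * bx) - norm.cdf(b1 + X * bx))
indirect = np.mean(norm.cdf(b1 + bm + X * bx) - norm.cdf(b1 + X * bx)) * ad
interaction = np.mean((norm.cdf(b1 + bd + bm + bdm + X * bx) - norm.cdf(b1 + bd + X * bx)
                       - norm.cdf(b1 + bm + X * bx) + norm.cdf(b1 + X * bx))
                      * (a1 + ad + X * ax))
print(round(direct, 3), round(indirect, 3), round(interaction, 3))  # about 0.184, 0.092, 0.118
```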

We use two OLS's with X = X_α = X_1 = X_d = X_m = X_{dm}: OLS v1 uses the random variable X as the single covariate, and OLS v2 additionally uses X². Since OLS v2 uses one more covariate than OLS v1, OLS v2 is likely to be less biased but more dispersed than OLS v1. Only in Design 4 do we use "OLS v3," which uses one more covariate, Φ(X), than OLS v2 to improve the linear approximation; GMM is also used in Design 4.

Table 1 presents the Design 1 (left half) and Design 2 (right half) results. Each entry has four numbers: |Bias|, standard deviation (Sd), root mean squared error (Rmse), and the average of the 5,000 asymptotic Sd's, to see how accurate the variance formulas in Proposition 3 are compared with the actual simulation Sd. Since the effects vary across the designs, we divide each number by the absolute effect magnitude for standardization.

Table 1

|Bias/effect|, Sd/|effect|, (Rmse/|effect|), and asymptotic Sd/|effect|

Design 1, N = 250 Design 1, N = 1,000 Design 2, N = 250 Design 2, N = 1,000
OLS v1
tot 0.00 0.12 (0.12) 0.12 0.00 0.06 (0.06) 0.06 0.00 0.15 (0.15) 0.14 0.00 0.07 (0.07) 0.07
dir 0.00 0.50 (0.50) 0.47 0.00 0.24 (0.24) 0.24 0.00 0.67 (0.67) 0.63 0.01 0.32 (0.32) 0.32
ind 0.02 0.53 (0.53) 0.50 0.01 0.25 (0.25) 0.25 0.01 0.70 (0.70) 0.65 0.00 0.33 (0.33) 0.33
int 0.01 0.73 (0.73) 0.70 0.00 0.35 (0.35) 0.35 0.00 1.14 (1.14) 1.08 0.00 0.54 (0.54) 0.53
OLS v2
tot 0.00 0.12 (0.12) 0.12 0.00 0.06 (0.06) 0.06 0.00 0.15 (0.15) 0.14 0.00 0.07 (0.07) 0.07
dir 0.00 0.64 (0.64) 0.51 0.00 0.27 (0.27) 0.26 0.02 0.82 (0.82) 0.68 0.00 0.35 (0.35) 0.34
ind 0.00 0.67 (0.67) 0.55 0.00 0.28 (0.28) 0.27 0.02 0.89 (0.89) 0.71 0.00 0.36 (0.36) 0.35
int 0.00 0.93 (0.93) 0.77 0.01 0.39 (0.39) 0.38 0.02 1.45 (1.45) 1.19 0.00 0.59 (0.59) 0.58
tru 1.125, 0.500, 0.250, 0.375 0.395, 0.184, 0.092, 0.118

OLS v1 uses X and OLS v2 uses (X, X²) for the X-heterogeneous effect linear approximations; tot: total effect; dir: direct; ind: indirect; int: interaction; tru: true effect.

In Design 1 with N = 250, all biases are almost zero, and OLS v1 does better than OLS v2, which is more dispersed than OLS v1. With N = 1,000, both OLS's improve, and the performance differences narrow considerably. In Design 2 with binary Y, OLS v1 still does better than OLS v2. The second and fourth numbers in each entry of Table 1 are almost the same when N = 1,000, showing that the asymptotic variance formulas are accurate; for this, not accounting for the errors of the form X̄ − E(X) hardly matters.

Table 2 presents the Design 3 (left half) and Design 4 (right half) results. In Design 3 with continuous Y, OLS v1 still does better than OLS v2, as in Table 1. Both OLS v1 and OLS v2 show little bias in Design 3, even though only linear functions are used for the nonlinear E(M_0 | X) and E(M_1 | X).

Table 2

|Bias/effect|, Sd/|effect|, (Rmse/|effect|), and asymptotic Sd/|effect|

Design 3, N = 250 Design 3, N = 1,000 Design 4, N = 500 Design 4, N = 2,000
OLS v1
tot 0.00 0.15 (0.15) 0.15 0.00 0.07 (0.07) 0.07 0.00 0.19 (0.19) 0.18 0.00 0.09 (0.09) 0.09
dir 0.01 0.53 (0.53) 0.50 0.00 0.25 (0.25) 0.25 0.74 0.88 (1.15) 0.84 0.72 0.42 (0.84) 0.42
ind 0.01 0.63 (0.63) 0.63 0.01 0.29 (0.29) 0.30 0.28 0.61 (0.67) 0.61 0.28 0.29 (0.41) 0.29
int 0.02 0.78 (0.78) 0.74 0.00 0.37 (0.37) 0.37 0.92 1.10 (1.43) 1.06 0.92 0.53 (1.06) 0.53
OLS v2
tot 0.00 0.16 (0.16) 0.15 0.00 0.07 (0.07) 0.07 0.00 0.18 (0.18) 0.18 0.00 0.09 (0.09) 0.09
dir 0.02 0.71 (0.71) 0.62 0.01 0.32 (0.32) 0.31 0.15 1.06 (1.07) 0.97 0.15 0.50 (0.52) 0.49
ind 0.01 0.70 (0.70) 0.71 0.01 0.30 (0.30) 0.31 0.14 0.90 (0.91) 0.87 0.08 0.43 (0.44) 0.42
int 0.03 1.09 (1.09) 0.94 0.01 0.48 (0.48) 0.46 0.23 1.38 (1.40) 1.26 0.22 0.65 (0.68) 0.64
OLS v3
tot 0.00 0.18 (0.18) 0.18 0.00 0.09 (0.09) 0.09
dir 0.04 0.98 (0.98) 0.82 0.05 0.36 (0.36) 0.35
ind 0.05 0.64 (0.64) 0.65 0.04 0.29 (0.29) 0.29
int 0.06 1.31 (1.31) 1.09 0.07 0.47 (0.48) 0.46
GMM v1
tot 0.01 0.18 (0.18) 0.00 0.01 0.09 (0.09) 0.00
dir 0.04 0.66 (0.66) 0.00 0.02 0.31 (0.31) 0.00
ind 0.00 0.57 (0.57) 0.00 0.00 0.27 (0.27) 0.00
int 0.06 0.82 (0.83) 0.00 0.03 0.39 (0.40) 0.00
tru 0.888, 0.500, 0.069, 0.319 0.168, 0.088, 0.015, 0.064

OLS v1, OLS v2, and OLS v3 use X, (X, X²), and {X, X², Φ(X)}, respectively; tot: total effect; dir: direct; ind: indirect; int: interaction; tru: true effect.

In Design 4, we set N = 500 and 2,000 because the GMM often incurred a singularity problem with N = 250. OLS v1 has much larger biases than OLS v2. Since even the biases of OLS v2 are not negligible, to see whether they can be reduced, we use Φ(X) as an extra covariate, which gives OLS v3. Indeed, the biases of OLS v3 are much smaller than those of OLS v1 and OLS v2, and OLS v3 does better than both in terms of Rmse when N = 2,000. We also applied the GMM described in the Appendix to the M and Y CRF's. GMM v1, which uses only X as OLS v1 does, performs best, with its biases clearly decreasing when N quadruples to 2,000, whereas the OLS biases do not.

In summary, first, OLS v2 with the two covariates (X, X²) has higher Sd's than OLS v1 with only X, but depending on the true model, OLS v2 can be better than OLS v1 for a large N when the Sd of OLS v2 becomes much smaller while OLS v1 remains highly biased. Second, the asymptotic Sd formulas for our estimators work well and are not affected by ignoring errors of the form X̄ − E(X). Third, the binary nature of M seems to matter little, but the binary nature of Y does matter, for which GMM with nonlinear approximations is better than OLS with linear approximations. For this, of course, the nonlinear functions in GMM should be correctly specified.

6 Empirical analysis

Our empirical analysis uses the National Longitudinal Survey data in Card [28], which have also been used in Tan [29] and Wang et al. [30], among others. With N = 3,010, Y is ln(wage in 1976), D is the dummy for black, M is the dummy for some college education (i.e., schooling years being 13 or greater), and X consists of age, dummies ("r2, r3, …") for 8 residence regions in 1966, a dummy for living in a standard metropolitan statistical area in 1966 ("SMSA66"), a dummy for living in an SMSA in 1976 ("SMSA"), and a dummy for living in the South in 1976 ("south"). In the original data, there were nine residence-region dummies, but the dummy for region 8 was dropped due to a singularity problem in our OLS's; i.e., with only age being non-binary, we set

X = (1, age, r2, r3, r4, r5, r6, r7, r9, SMSA66, SMSA, south).

The data set is old, but this suits well our purpose of finding the racial discrimination effect on wage, which consists of the direct effect, the indirect effect through college education, and the interaction effect of black and college education. When gender discrimination cases were argued in court, often the counter-argument was that females were less educated/qualified, but the lower education/qualification itself might have been due to gender discrimination. Hence, it is important to account for the indirect discrimination through missed education opportunities, but doing so with recent data would be difficult because discrimination due to denied education opportunities is unlikely to be present in recent data. For this reason, using an old data set such as ours is advantageous.

Table 3 presents the estimation results with X = X_α = X_1 = X_d = X_m = X_{dm}. OLS v1 uses X for all unknown ψ(X) and μ(X) functions in (3.1) and (3.2), whereas OLS v2 additionally uses the interaction terms between age and the other elements (r2–south) of X. OLS v2 uses almost twice as many covariates as OLS v1, but both sets of estimates are almost the same in Table 3, and all effects are statistically significant. Table 3 also presents the results for OLS v0, which is the usual OLS for constant-effect models with "intercept" X β_x. OLS v0 differs much from OLS v1 and OLS v2 in terms of the direct (and thus total) effect.

Table 3

Effect ( t -value) of being black on wage with college education M

OLS v0 (tv)  OLS v1 (tv)  OLS v2 (tv)  GMM v1 (tv)
Total effect  −0.319 (−17)  −0.223 (−9.0)  −0.220 (−8.5)  −0.222 (−9.0)
Direct  −0.336 (−15)  −0.242 (−7.7)  −0.248 (−8.0)  −0.242 (−7.7)
Indirect  −0.028 (−5.1)  −0.029 (−5.2)  −0.023 (−3.5)  −0.029 (−6.1)
Interaction  0.045 (3.6)  0.049 (3.2)  0.051 (3.1)  0.049 (3.4)

OLS v0 is the conventional constant-effect OLS with "intercept" X β_x; OLS v1 approximates the X-heterogeneous effects with linear functions of X; OLS v2 additionally uses the interactions between age and the other covariates; GMM v1, taking M ∈ [0, 1] into account, uses the same set of X as OLS v1.

Table 3 further presents GMM v1 for the M-CRF, which uses the same X as OLS v1. GMM v1 specifies ψ_1(X) = Φ(α_c′X) and ψ_d(X) = Φ(α_t′X) − Φ(α_c′X) to take "M ∈ [0, 1]" into account; GMM v1 still applies OLS to the Y-CRF. For GMM v1, we used the bootstrap with 2,000 repetitions to obtain 95% asymptotic confidence intervals (CIs), and then obtained the ad hoc "implied Sd" by dividing the CI width by 2 × 1.96, because CI width ≈ 2 × 1.96 × Sd. The GMM v1 results are almost the same as those of OLS v1. We also tried "GMM v2," but omitted it, as its bootstrap ran into a singularity problem too often.

The total effect of being black on wage is −22%, with a direct effect of −24 to −25%, an indirect effect of −2.3 to −2.9% through missed college education, and an interaction effect of 4.9 to 5.1%. Had it not been for the indirect effect through college education, the wage discrimination would have been smaller by 2.3 to 2.9%, and college education alleviated the racial discrimination by 4.9 to 5.1% through the interaction effect.

7 Conclusions

A treatment D can affect an outcome Y directly, as well as indirectly through a mediator M. D can also interact with M to affect Y. In the mediation-analysis literature, various (two-, three-, and four-way) decompositions of the total effect into sub-effects have appeared, and recommending one decomposition over the others is, in general, difficult or groundless.

In this article, based on "mediative principal stratification" classifying the mediator into four potential types (never taker, complier, defier, and always taker), we found that a particular three-way decomposition into direct, indirect, and interaction effects is appropriate in the sense that the three effects are associated with the right types.

We further showed how to identify the three effects, and in the process we obtained "CRF's" for M that are linear in D and for Y that are linear in (D, M, DM), despite no explicit linearity assumptions. The CRF's hold for any Y (binary, count, continuous, …), and (D, M, DM) carry unknown functions of X as their slopes, which are the basis for the desired direct, indirect, and interaction effects. A practical estimation scenario is specifying the unknown functions of X as linear or, better yet, as nonlinear functions depending on the form and range of M and Y. Then, OLS or GMM can be used to estimate the CRF's.



Acknowledgement

The author is grateful to the associate editor and two anonymous reviewers for their helpful comments. The author is also grateful to Bora Kim for her help in proof-reading the paper.

  1. Funding information: This research has been supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2024-00337766).

  2. Author contributions: The author confirms the sole responsibility for the conception of the study, presented results, and manuscript preparation.

  3. Compliance with ethical standard: No human or animal subject is involved in this research.

  4. Data availability statement: The data used in this article are publicly available at http://davidcard.berkeley.edu/data_sets.html.

Appendix

Proof for Proposition 1

Substituting (2.1) into the total effect E(Y_{1,M_1} − Y_{0,M_0}) in (1.3) renders

E[Y_{10} + (Y_{11} − Y_{10}) M_1 − {Y_{00} + (Y_{01} − Y_{00}) M_0}].

The existing decompositions are obtained by rewriting this expression in different ways. To obtain M3M, put Y_{10} − Y_{00} together and move (Y_{11} − Y_{10}) M_1 to the last place:

E{Y_{10} − Y_{00} − (Y_{01} − Y_{00}) M_0 + (Y_{11} − Y_{10}) M_1} = E{Y_{10} − Y_{00} + (Y_{01} − Y_{00})(M_1 − M_0) − (Y_{01} − Y_{00}) M_1 + (Y_{11} − Y_{10}) M_1} = E{Y_{10} − Y_{00} + (Y_{01} − Y_{00})(M_1 − M_0) + ΔY^± M_1}. □

Proof for Proposition 2

Take E(⋅ | D, X) on the observed M = M_0 + (M_1 − M_0) D: due to C(a)(i),

E(M | D, X) = E(M_0 | D, X) + E(M_1 − M_0 | D, X) D = ψ_1(X) + ψ_d(X) D.

Then, defining U_m ≡ M − E(M | D, X) renders (3.1).

Before we address (3.2), observe that, using f(⋅ | ⋅) to denote densities/probabilities,

f(M_0, M_1, Y_{00}, Y_{01}, Y_{10}, Y_{11}, D | X) = f(M_0, M_1, Y_{00}, Y_{01}, Y_{10}, Y_{11} | D, X) f(D | X) = f(M_0, M_1 | D, X) f(Y_{00}, Y_{01}, Y_{10}, Y_{11} | D, X) f(D | X) {due to C(b)} = f(M_0, M_1 | D, X) f(Y_{00}, Y_{01}, Y_{10}, Y_{11} | X) f(D | X) {due to C(a)(ii)} = f(D, M_0, M_1 | X) f(Y_{00}, Y_{01}, Y_{10}, Y_{11} | X).

The first and last expressions of this display yield

C(d): (D, M_0, M_1) ⫫ (Y_{00}, Y_{01}, Y_{10}, Y_{11}) | X  {⟹ (D, M) ⫫ (Y_{00}, Y_{01}, Y_{10}, Y_{11}) | X}.

The implication arrow holds because M = M_0 + (M_1 − M_0) D is determined by (M_0, M_1, D). Now, note that the observed Y and E(Y | D, M, X) are, using C(d) just above,

Y = (1 − D)(1 − M) Y_{00} + (1 − D) M Y_{01} + D (1 − M) Y_{10} + D M Y_{11} = Y_{00} + (Y_{10} − Y_{00}) D + (Y_{01} − Y_{00}) M + ΔY^± DM;  E(Y | D, M, X) = μ_1(X) + μ_d(X) D + μ_m(X) M + μ_{dm}(X) DM.

Defining U_y ≡ Y − E(Y | D, M, X) then yields (3.2). □

Proof for Proposition 3

Rewrite the indirect effect E[E{(Y_{01} − Y_{00})(M_1 − M_0) | D, X} | X] as

E{E(Y_{01} − Y_{00} | D, X) E(M_1 − M_0 | D, X) | X} = E{E(Y_{01} − Y_{00} | X) E(M_1 − M_0 | X) | X} = β_m′X_m X_α′α_d,

where C(b) and C(a) are used, respectively. It holds that

√N {β̂_m′(X_m X_α′)‾ α̂_d − β_m′(X_m X_α′)‾ α_d} = √N {β̂_m′(X_m X_α′)‾ α̂_d − β_m′(X_m X_α′)‾ α̂_d + β_m′(X_m X_α′)‾ α̂_d − β_m′(X_m X_α′)‾ α_d} = α_d′(X_α X_m′)‾ √N (β̂_m − β_m) + β_m′(X_m X_α′)‾ √N (α̂_d − α_d) + o_p(1).

This yields the asymptotic distribution for the indirect effect in Proposition 3.

Analogously, for the interaction effect E{E(ΔY^± M_1 | D, X) | X}, we obtain

E{E(ΔY^± | D, X) E(M_1 | D, X) | X} = μ_{dm}(X) E(M_1 | X) = β_{dm}′X_{dm} X_α′(α_1 + α_d).

This yields the asymptotic distribution for the interaction effect in Proposition 3. The total effect part follows from the sum of the three sub-effects.□

GMM estimation. Recall the moment condition E{θ(Z; α)} = 0 in (4.3). Set X = X_α = X_1 = X_d = X_m = X_{dm} to simplify the exposition with little loss of generality, as was already mentioned; then Z = (M, X′, D)′. The number of moments equals the dimension of α ≡ (α_c′, α_t′)′, which is a "just-, not over-, identified" case. The GMM minimizes

(7.1) {(1/N) Σ_i θ(Z_i; a)}′ W_N⁻¹ {(1/N) Σ_i θ(Z_i; a)},  with W_N ≡ (1/N) Σ_i θ(Z_i; a) θ(Z_i; a)′,

with respect to (wrt) a; W_N is to estimate E{θ(Z; α) θ(Z; α)′}.

With the derivative θ_a(Z_i; a_0) of θ(Z_i; a) wrt a evaluated at a_0 (stacked row-wise for each moment), the GMM is implemented by iterating with

(7.2) a_1 = a_0 − {Σ_i θ_a(Z_i; a_0)′ W_N⁻¹ Σ_i θ_a(Z_i; a_0)}⁻¹ Σ_i θ_a(Z_i; a_0)′ W_N⁻¹ Σ_i θ(Z_i; a_0).

W_N is evaluated at a_0, and a_0 is replaced by a_1 at each iteration until a_1 − a_0 becomes negligibly small or the minimum of (7.1) is attained.

In our just-identified case with E{θ_a(Z; α)} invertible, the iteration reduces to

(7.3) a_1 = a_0 − {Σ_i θ_a(Z_i; a_0)}⁻¹ Σ_i θ(Z_i; a_0).
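A minimal sketch (my own, for the just-identified case) of implementing the iteration (7.3) for the M-CRF with a numerical Jacobian of the sample moments, reusing the theta() function sketched in Section 4:

```python
# Just-identified, unweighted GMM for the M-CRF via the Newton-type iteration (7.3).
# Requires theta() from the Section 4 sketch.
import numpy as np

def gmm_m_crf(M, D, Xa, tol=1e-8, max_iter=100):
    k = Xa.shape[1]
    a0 = np.zeros(2 * k)                                 # crude starting value
    def gbar(a):                                         # (2k,) sample moment vector
        return theta(a, M, D, Xa).mean(axis=0)
    for _ in range(max_iter):
        g0 = gbar(a0)
        eps = 1e-6                                       # numerical Jacobian of gbar wrt a
        J = np.column_stack([(gbar(a0 + eps * np.eye(2 * k)[:, j]) - g0) / eps
                             for j in range(2 * k)])
        a1 = a0 - np.linalg.solve(J, g0)                 # the update in (7.3)
        if np.max(np.abs(a1 - a0)) < tol:
            return a1
        a0 = a1
    return a0
```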

Denoting the GMM estimator for α as α̂_gmm, and with E⁻¹(⋅) standing for {E(⋅)}⁻¹, √N (α̂_gmm − α) is asymptotically normal with the asymptotic variance

(7.4) [E{θ_a(Z; α)}′ E⁻¹{θ(Z; α) θ(Z; α)′} E{θ_a(Z; α)}]⁻¹ = E⁻¹{θ_a(Z; α)} E{θ(Z; α) θ(Z; α)′} E⁻¹{θ_a(Z; α)′}.

One "dilemma" is that (7.3) also follows from minimizing (7.1) with W_N removed:

(7.5) {(1/N) Σ_i θ(Z_i; a)}′ {(1/N) Σ_i θ(Z_i; a)}.

The GMM minimizing this is called the "unweighted GMM," as compared with the optimal GMM minimizing (7.1). The minimand matters, because it is used in stopping the iteration and picking the final estimate. We adopt the unweighted GMM in this article because, despite the theoretical superiority of the optimal GMM in combining the moment conditions, weighting often results in numerical instability in practice, which was also the case in our simulation study, although not reported there. The asymptotic variance of the unweighted GMM is still (7.4), as can be conjectured from (7.3).

When Y is binary (other forms of non-continuous Y can be dealt with analogously), to apply GMM to the Y -CRF, we can set, for some β parameters:

μ_1(X) ≡ E(Y_{00} | X) = Φ(β_{00}′X),  μ_d(X) ≡ E(Y_{10} − Y_{00} | X) = Φ(β_{10}′X) − Φ(β_{00}′X),  μ_m(X) ≡ E(Y_{01} − Y_{00} | X) = Φ(β_{01}′X) − Φ(β_{00}′X),  μ_{dm}(X) ≡ E(ΔY^± | X) = Φ(β_{11}′X) − Φ(β_{01}′X) − Φ(β_{10}′X) + Φ(β_{00}′X).

The moment condition for the Y-CRF is E{κ(H; β)} = 0, where H ≡ (Z′, Y)′ and

κ(H; β) ≡ [Y − Φ(β_{00}′X) − {Φ(β_{10}′X) − Φ(β_{00}′X)} D − {Φ(β_{01}′X) − Φ(β_{00}′X)} M − {Φ(β_{11}′X) − Φ(β_{01}′X) − Φ(β_{10}′X) + Φ(β_{00}′X)} DM] (X′, X′D, X′M, X′DM)′.

The remaining steps are analogous to the aforementioned unweighted GMM steps for the M -CRF.
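A companion sketch (mine) of the κ(H; β) moment function for binary Y, parallel to the theta() sketch for the M-CRF:

```python
# Moment function kappa(H; beta) for binary Y; beta stacks (b00, b10, b01, b11), each of length k.
import numpy as np
from scipy.stats import norm

def kappa(beta, Y, D, M, X):
    k = X.shape[1]
    b00, b10, b01, b11 = beta[:k], beta[k:2*k], beta[2*k:3*k], beta[3*k:]
    p00, p10, p01, p11 = (norm.cdf(X @ b) for b in (b00, b10, b01, b11))
    resid = (Y - p00 - (p10 - p00) * D - (p01 - p00) * M
             - (p11 - p01 - p10 + p00) * D * M)
    instruments = np.hstack([X, X * D[:, None], X * M[:, None], X * (D * M)[:, None]])
    return resid[:, None] * instruments                  # row i is kappa(H_i; beta)'
```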

References

[1] MacKinnon D, Fairchild A, Fritz M. Mediation analysis. Ann Rev Psychol. 2007;58:593–614. doi: https://doi.org/10.1146/annurev.psych.58.110405.085542.

[2] Pearl J. Causality. 2nd ed. Cambridge: Cambridge University Press; 2009. doi: https://doi.org/10.1017/CBO9780511803161.

[3] Imai K, Keele L, Yamamoto T. Identification, inference, and sensitivity analysis for causal mediation effects. Stat Sci. 2010;25:51–71. doi: https://doi.org/10.1214/10-STS321.

[4] TenHave T, Joffe M. A review of causal estimation of effects in mediation analyses. Stat Methods Med Res. 2012;21:77–107. doi: https://doi.org/10.1177/0962280210391076.

[5] Preacher K. Advances in mediation analysis: a survey and synthesis of new developments. Ann Rev Psychol. 2015;66:825–52. doi: https://doi.org/10.1146/annurev-psych-010814-015258.

[6] VanderWeele T. Explanation in Causal Inference: Methods for Mediation and Interaction. Oxford University Press; 2015.

[7] VanderWeele T. Mediation analysis: a practitioner's guide. Ann Rev Public Health. 2016;37:17–32. doi: https://doi.org/10.1146/annurev-publhealth-032315-021402.

[8] Nguyen T, Schmid I, Stuart E. Clarifying causal mediation analysis for the applied researcher: defining effects based on what we want to learn. Psychol Methods. 2021;26:255–71. doi: https://doi.org/10.1037/met0000299.

[9] Lee M, Lee S. Review and comparison of treatment effect estimators using propensity and prognostic scores. Int J Biostat. 2022;18:357–80. doi: https://doi.org/10.1515/ijb-2021-0005.

[10] Choi J, Lee M. Overlap weight and propensity score residual for heterogeneous effects: a review with extensions. J Stat Plan Inference. 2023;222:22–37. doi: https://doi.org/10.1016/j.jspi.2022.04.003.

[11] Baron R, Kenny D. The moderator-mediator variable distinction in social psychological research: conceptual, strategic, and statistical considerations. J Personality Soc Psychol. 1986;51:1173–82. doi: https://doi.org/10.1037/0022-3514.51.6.1173.

[12] Pearl J. Direct and indirect effects. San Francisco, CA: Morgan Kaufmann; 2001. p. 411–20.

[13] Robins J. Semantics of causal DAG models and the identification of direct and indirect effects. In: Green P, Hjort N, Richardson S, editors. Highly structured stochastic systems. Oxford: Oxford University Press; 2003. p. 70–81. doi: https://doi.org/10.1093/oso/9780198510550.003.0007.

[14] VanderWeele T. A three-way decomposition of a total effect into direct, indirect, and interactive effects. Epidemiology. 2013;24:224–32. doi: https://doi.org/10.1097/EDE.0b013e318281a64e.

[15] VanderWeele T. A unification of mediation and interaction: a four-way decomposition. Epidemiology. 2014;25:749–61. doi: https://doi.org/10.1097/EDE.0000000000000121.

[16] Imbens G, Angrist J. Identification and estimation of local average treatment effects. Econometrica. 1994;62:467–75. doi: https://doi.org/10.2307/2951620.

[17] Angrist J, Imbens G, Rubin D. Identification of causal effects using instrumental variables. J Amer Stat Assoc. 1996;91:444–55. doi: https://doi.org/10.1080/01621459.1996.10476902.

[18] Frangakis C, Rubin D. Principal stratification in causal inference. Biometrics. 2002;58:21–9. doi: https://doi.org/10.1111/j.0006-341X.2002.00021.x.

[19] VanderWeele T, Tchetgen Tchetgen E. Attributing effects to interactions. Epidemiology. 2014;25:711–22. doi: https://doi.org/10.1097/EDE.0000000000000096.

[20] Choi J, Lee M. Regression discontinuity with multiple running variables allowing partial effects. Political Anal. 2018;26:258–74. doi: https://doi.org/10.1017/pan.2018.13.

[21] Petersen M, Sinisi S, van der Laan M. Estimation of direct causal effects. Epidemiology. 2006;17:276–84. doi: https://doi.org/10.1097/01.ede.0000208475.99429.2d.

[22] Lee M. Simple least squares estimator for treatment effects using propensity score residuals. Biometrika. 2018;105:149–4. doi: https://doi.org/10.1093/biomet/asx062.

[23] Lee M. Instrument residual estimator for any response variable with endogenous binary treatment. J R Stat Soc (Series B). 2021;83:612–35. doi: https://doi.org/10.1111/rssb.12442.

[24] Mao H, Li L. Flexible regression approach to propensity score analysis and its relationship with matching and weighting. Stat Med. 2020;39:2017–34. doi: https://doi.org/10.1002/sim.8526.

[25] Choi J, Lee G, Lee M. Endogenous treatment effect for any response conditional on control propensity score. Stat Probability Letters. 2023;196:109747. doi: https://doi.org/10.1016/j.spl.2022.109747.

[26] Lee G, Choi J, Lee M. Minimally capturing heterogeneous complier effect of endogenous treatment for any outcome variable. J Causal Inference. 2023;11:20220036. doi: https://doi.org/10.1515/jci-2022-0036.

[27] Hansen L. Large sample properties of generalized method of moments estimators. Econometrica. 1982;50:1029–1054. doi: https://doi.org/10.2307/1912775.

[28] Card D. Using geographic variation in college proximity to estimate the return to schooling. In: Christofides L, Grant E, Swidinsky R, editors. Aspects of labor market behavior: essays in honour of John Vanderkamp. Toronto: University of Toronto Press; 1995. p. 201–22.

[29] Tan Z. Marginal and nested structural models using instrumental variables. J Amer Stat Assoc. 2010;105:157–9. doi: https://doi.org/10.1198/jasa.2009.tm08299.

[30] Wang L, Robins J, Richardson T. On falsification of the binary instrumental variable model. Biometrika. 2017;104:229–36. doi: https://doi.org/10.1093/biomet/asw064.

Received: 2023-04-27
Revised: 2024-04-14
Accepted: 2024-04-15
Published Online: 2024-06-21

© 2024 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
