Minimally capturing heterogeneous complier effect of endogenous treatment for any outcome variable

Goeun Lee; Jin-young Choi; Myoung-jae Lee

doi:10.1515/jci-2022-0036

Article Open Access

Minimally capturing heterogeneous complier effect of endogenous treatment for any outcome variable

Goeun Lee , Jin-young Choi and Myoung-jae Lee

Published/Copyright: July 14, 2023

Published by

Become an author with De Gruyter Brill

Submit Manuscript Author Information Explore this Subject

From the journal Journal of Causal Inference Volume 11 Issue 1

Abstract

When a binary treatment D is possibly endogenous, a binary instrument δ is often used to identify the “effect on compliers.” If covariates X affect both D and an outcome Y , X should be controlled to identify the “ X -conditional complier effect.” However, its nonparametric estimation leads to the well-known dimension problem. To avoid this problem while capturing the effect heterogeneity, we identify the complier effect heterogeneous with respect to only the one-dimensional “instrument score” E ( δ ∣ X ) for non-randomized δ . This effect heterogeneity is minimal, in the sense that any other “balancing score” is finer than the instrument score. We establish two critical “reduced-form models” that are linear in D or δ , even though no parametric assumption is imposed. The models hold for any form of Y (continuous, binary, count, …). The desired effect is then estimated using either single index model estimators or an instrumental variable estimator after applying a power approximation to the effect. Simulation and empirical studies are performed to illustrate the proposed approaches.

Keywords: endogenous treatment; complier effect; instrument score; propensity score; single index model; instrumental variable estimator

MSC 2010: 62G05; 62P10; 62P25

1 Introduction

A typical treatment effect analysis [1–6] involves a binary treatment D , an outcome Y , and covariates X . If the potential versions of Y for D = 0 , 1 are denoted as ( Y 0 , Y 1 ) , we can typically find E ( Y 1 − Y 0 ) or E ( Y 1 − Y 0 ∣ X ) when D is exogenous. However, the identification of E ( Y 1 − Y 0 ) or E ( Y 1 − Y 0 ∣ X ) is difficult when D is endogenous. Suppose a binary instrument variable (IV) δ is available for D , which meets some conditions including δ ∐ ( Y 0 , Y 1 ) ∣ X , with ∐ for independence. Denoting the potential treatments of D for δ = 0 , 1 as ( D 0 , D 1 ) , let the compliers (CP) be those with ( D 0 = 0 , D 1 = 1 ) [7,8]. For endogenous D , we can identify E ( Y 1 − Y 0 ∣ CP ) or E ( Y 1 − Y 0 ∣ CP , X ) . Note that D { ≡ D 0 + ( D 1 − D 0 ) δ } = δ holds for CP.

The instrumental variable estimator (IVE) for Y on D with δ as the IV can estimate E ( Y 1 − Y 0 ∣ CP , X = x ) for a discrete X using only the subpopulation X = x . However, for a high-dimensional X , this subsample approach runs into the well-known dimension problem as in Frölich [9] who estimated E ( Y 1 − Y 0 ∣ CP , X ) nonparametrically. To avoid this problem, Frölich [9] considered conditioning on “instrument score (IS),” but did not pursue it owing to an efficiency concern. Abadie [10] parametrized a local average response function (and the IS), and Tan [11] parametrized E ( Y ∣ δ , X ) or E ( δ ∣ X ) . Ogburn et al. [12] estimated E ( Y 1 − Y 0 ∣ CP , X ˜ ) for some X ˜ ∈ X . Estimators based on a parametrization for Y , as in the work of Ogburn et al. [12], are valid only for certain types of Y (e.g., cardinal Y ).

Treatment effect heterogeneity is an important problem that has been addressed by several researchers such as Imai and Ratkovic [13] and Künzel et al. [14]. Athey and Imbens [15] noted the following three problems: (i) estimating heterogeneous effects, (ii) finding an optimal policy allocating subjects to the treatment or control based on X (e.g., [16,17]), and (iii) low dimensionally representing the effect heterogeneity as in Athey and Imbens [18]. We address (iii) in this article by representing the effect heterogeneity in a unidimensional manner.

We find the X -heterogeneous effect to recommend D to those who would benefit from D . Consider a fat ( Y ) reducing drug D . We can search for heterogeneous effects for obese people; however, obesity has many dimensions. Hence, for simplicity, we often make recommendations based on only the body mass index (BMI), e.g., “take the drug if BMI ≥ 25 .” However, since E ( Y 1 − Y 0 ∣ BMI ) is an average, E ( Y 1 − Y 0 ∣ BMI, age ) may take considerably different values, leading to different recommendations depending on (BMI, age). We may further consider (BMI, age, gender), and so on. There exists no limit to augmenting the conditioning set for a maximal heterogeneity. Instead, we seek a minimal heterogeneity representation.

A potential candidate for the one-dimensional heterogeneity function of X is the propensity score (PS) E ( D ∣ X ) when D is exogenous. However, the PS is inappropriate if D is endogenous. Instead, if a binary IV δ is available for D , then the IS λ X ≡ E ( δ ∣ X ) can be used. Since IS is of no use when δ is randomized, we focus on non-randomized IV δ and E ( Y 1 − Y 0 ∣ CP , λ X ) in this article. In a related study, Choi et al. [19] addressed the randomized IV case by using the “control PS” E ( D ∣ δ = 0 , X ) as a dimension-reducing device, whereas we use E ( D ∣ CP , X ) = λ X ; this equality is proven below.

The main motivation to condition on λ X among all functions of X is that λ X summarizes the information in X for the relationship between δ and ( D 0 , D 1 , Y 0 , Y 1 ) in the sense that

δ ∐ ( D 0 , D 1 , Y 0 , Y 1 ) ∣ X ⇒ δ ∐ ( D 0 , D 1 , Y 0 , Y 1 ) ∣ λ X ,

which is proven later. In other words, λ X is a “sufficient statistic” for the parameter capturing the relationship between δ and ( D 0 , D 1 , Y 0 , Y 1 ) . Taking δ as the underlying treatment, as in this study, is well rooted in the literature, particularly in the ratio form “Wald estimator.”

Another attractive property of λ X is that it is a “balancing score” (i.e., δ ∐ X ∣ λ X ). If ω X is also a balancing score ( δ ∐ X ∣ ω X ), then ω X should be finer than λ X , as λ X = f ( ω X ) for a function f . In this sense, λ X minimally captures the effect heterogeneity. If there is no problem accepting PS as a minimal dimension-reducing device because PS is the coarsest balancing score with D ∐ X ∣ PS [20] for exogenous D , then there should be no problem accepting IS with δ ∐ X ∣ IS for endogenous D .

To see how informative λ X is relative to X , note that E ( Y 1 − Y 0 ∣ CP , λ X ) being a non-constant function of λ X indicates at least the presence of effect heterogeneity. Furthermore, knowing λ X is as good as knowing X for CP in the sense E ( D ∣ CP , X ) = IS :

E ( D ∣ CP , X ) = E ( δ ∣ CP , X ) = E ( δ ∣ X ) ≡ λ X under δ ∐ ( D 0 , D 1 , Y 0 , Y 1 ) ∣ X .

For us, this is the primary advantage (along with the weak requisite assumptions) of the CP effect. The disadvantages are that it is specific to each IV, it is not a generally interesting effect, the “monotonicity condition” ( D 0 ≤ D 1 ) can be violated in observational data [21], and the CPs are not the policy-relevant population unless the policy is the same as δ [22].

Let 1 [ A ] ≡ 1 if A holds and 0 otherwise. For parameters ( β δ , ζ x ) with β δ > 0 and an error term U ∐ ( δ , X ) , suppose D = 1 [ 0 < Y 1 − Y 0 + β δ δ ] and Y 1 − Y 0 = X ′ ζ x − U : D = 1 if the “gain” Y 1 − Y 0 (plus β δ δ ) from the treatment is greater than 0, which is plausible. Since D 0 = 1 [ 0 < Y 1 − Y 0 ] and D 1 = 1 [ 0 < Y 1 − Y 0 + β δ ] , the CPs satisfy X ′ ζ x < U < X ′ ζ x + β δ . Then, E ( Y 1 − Y 0 ∣ CP , X ) depends on X only through X ′ ζ X because

E ( X ′ ζ x − U ∣ X ′ ζ x < U < X ′ ζ x + β δ , X ) = X ′ ζ x − ∫ X ′ ζ x X ′ ζ x + β δ u d F u ( u ) / { F u ( X ′ ζ x + β δ ) − F u ( X ′ ζ x ) } ,

F u is the distribution function of U . It would be difficult to extend the CP effect to non-CPs, as E ( Y 1 − Y 0 ∣ CP , X ) is nonlinear in X ′ ζ x , whereas E ( Y 1 − Y 0 ∣ X ) = X ′ ζ x .

In the aforementioned example, although λ X captures the effect heterogeneity minimally, it can capture it “maximally” if E ( ⋅ ∣ CP , λ X ) = E ( ⋅ ∣ CP , X ′ ζ x ) , because E ( Y 1 − Y 0 ∣ CP , λ X ) = E { E ( Y 1 − Y 0 ∣ CP , X ) ∣ CP , λ X } . This can happen if D is designed to benefit some subjects and δ is encouraging them to take D , as both D and δ would be based on the treatment gain Y 1 − Y 0 = X ′ ζ x − U .

To accomplish the goal of this research, we use two nonparametric reduced forms (RFs):

(1) Y = μ 1 ( λ X ) D + μ 0 ( λ X ) + U 1 , E ( U 1 ∣ δ , λ X ) = 0 , μ 1 ( λ X ) ≡ E ( Y 1 − Y 0 ∣ CP , λ X ) , μ 0 ( λ X ) ≡ E { ( Y 1 − Y 0 ) D 0 + Y 0 ∣ λ X } − μ 1 ( λ X ) E ( D 0 ∣ λ X ) ,

(2) Y = μ 1 ( λ X ) P ( CP ∣ λ X ) δ + μ 2 ( λ X ) + U 2 , E ( U 2 ∣ δ , λ X ) = 0 , P ( CP ∣ λ X ) ≡ E ( D 1 − D 0 ∣ λ X ) , μ 2 ( λ X ) ≡ E { ( Y 1 − Y 0 ) D 0 + Y 0 ∣ λ X } ,

where μ 1 , μ 0 , P ( CP ∣ λ X ) , and μ 2 are unknown functions, and ( U 1 , U 2 ) are error terms. These RFs hold for any Y , as long as Y 1 − Y 0 makes sense. For example, for categorical Y , Y 1 − Y 0 does not make sense, but Y j 1 − Y j 0 does for the dummy variable Y j for category j .

We can estimate λ X ≡ E ( δ ∣ X ) nonparametrically, but to make our approaches practical, we either adopt the single index assumption λ X = Λ ( X ′ α ) for an unknown function Λ ( ⋅ ) and parameter α or specify λ X = Φ ( X ′ θ ) for the N ( 0 , 1 ) distribution function Φ ( ⋅ ) and a parameter θ . Clearly, the latter strategy is more restrictive than the former. Considering these aspects, we propose the following three estimators for the λ X -heterogeneous CP effect.

The first estimator is based on the single index assumption λ X = Λ ( X ′ α ) and

(3) E ( Y ∣ λ X , δ = 1 ) − E ( Y ∣ λ X , δ = 0 ) E ( D ∣ λ X , δ = 1 ) − E ( D ∣ λ X , δ = 0 ) = μ 1 ( λ X ) P ( CP ∣ λ X ) P ( CP ∣ λ X ) = μ 1 ( λ X ) { from (2) } .

Hence, we can identify μ 1 ( λ X ) ≡ E ( Y 1 − Y 0 ∣ CP , λ X ) with the ratio on the left side. However, instead of conditioning on λ X = Λ ( X ′ α ) as in (3), we condition on X ′ α because estimating Λ ( ⋅ ) is more challenging than estimating α .

In our second estimator, we assume that λ X = Φ ( X ′ θ ) to avoid estimating Λ ( ⋅ ) . In this case, conditioning on λ X = Φ ( X ′ θ ) as in (3) than on X ′ θ is more advantageous because Φ ( ⋅ ) is a known function well bounded by [ 0 , 1 ] . Because of the assumption λ X = Φ ( X ′ θ ) , the assumptions of the second estimator are more restrictive than those of the first. Both the first and second estimators converge in distribution more slowly than N .

The two ratio estimators suffer from the “excessively small denominator” problem: A near-zero denominator can blow up the ratio. To avoid this, our third estimator applies a power approximation to μ 1 ( λ X ) in (1). Then, the IVE can estimate μ 1 ( λ X ) as the slope of D . Although power approximation is a nonparametric method, we regard the approximation to be exact to make our proposal more practical, and we adopt λ X = Φ ( X ′ θ ) as in the second estimator. Hence, the third estimator is the most restrictive among our three estimators, but it is N -consistent. Because we need only the RF of λ X and not the structural form (SF), i.e., because we only use the scalar λ X (i.e., X ′ α or X ′ θ ) and not individual elements of α or θ , misspecification problems in λ X = Φ ( X ′ θ ) are not considerably worrisome.

If desired, E ( Y 1 − Y 0 ∣ CP ) can be found from μ 1 ( λ X ) . Note that the CPs are identified conditionally on λ X under δ ∐ ( D 0 , D 1 , Y 0 , Y 1 ) ∣ λ X and the monotonicity D 0 ≤ D 1 :

P ( D = 1 ∣ δ = 1 , λ X ) − P ( D = 1 ∣ δ = 0 , λ X ) = P ( AT or CP ∣ λ X ) − P ( AT ∣ λ X ) = P ( CP ∣ λ X ) ,

where AT indicates “always takers” with ( D 0 = 1 , D 1 = 1 ) . Moreover, integrating out λ X renders P ( CP ) . Then, denoting the distribution of A ∣ B as F A ∣ B , we can obtain

E ( Y 1 − Y 0 ∣ CP ) = ∫ μ 1 ( l ) d F λ X ∣ CP ( l ) = ∫ μ 1 ( l ) P ( CP ∣ λ X = l ) d F λ X ( l ) / P ( CP ) .

In the remainder of this article, Sections 2 and 3 explain the two ratio estimators and IVE, respectively; Sections 4 and 5 present simulation and empirical studies, respectively; and Section 6 concludes this article. Most proofs are presented in the appendix. As usual, we assume independent and identically distributed observations ( δ i , D i , X i , Y i ) , i = 1 , … , N .

2 Ratio approaches

2.1 Identification

Since (2) motivates the ratio estimators, we first prove (2) in Theorem 1, and (1) is proven in the next section. As preliminaries, we present our assumptions on IV:

(4) ( i ) : δ ∐ ( D 0 , D 1 , Y 0 , Y 1 ) ∣ λ X for all λ X (IV exogeneity) , ( ii ) : D 0 ≤ D 1 ∣ λ X for all λ X (Monotonicity) , ( iii ) : E ( D 1 − D 0 ∣ λ X ) ≠ 0 for all λ X (IV relevance).

The “IV exclusion restriction” is implicit in the notation ( Y 0 , Y 1 ) , because if δ directly affected Y , then the potential responses would be double-indexed by ( δ , D ) .

The appendix proves the main heterogeneity-dimension reduction idea:

(5) δ ∐ ( D 0 , D 1 , Y 0 , Y 1 ) ∣ X ⇒ δ ∐ ( D 0 , D 1 , Y 0 , Y 1 ) ∣ λ X .

Although (4)(i) is enough for our estimators below, bear in mind that “ λ X = E ( D ∣ CP , X ) ” noted in the preceding section to better motivate/interpret λ X requires “ δ ∐ ( D 0 , D 1 , Y 0 , Y 1 ) ∣ X ” which is stronger than (4)(i) as (5) shows.

The proof for (5) also establishes the “balancing score” property of IS: The distribution of X is the same across the δ = 0 , 1 groups once λ X is conditioned on. Then, it follows from Theorem 2 of Rosenbaum and Rubin [20] that λ X is the coarsest balancing score. In other words, a function, e.g., ω X , of X is a balancing score (i.e., δ ∐ X ∣ ω X ), iff λ X = f ( ω X ) for some function f ( ⋅ ) . In this sense, μ 1 ( λ X ) ≡ E ( Y 1 − Y 0 ∣ CP , λ X ) minimally captures the CP effect heterogeneity, thereby avoiding the dimension problem in E ( Y 1 − Y 0 ∣ CP , X ) .

Theorem 1

Under (4)(i), (4)(ii), and the support-overlap condition 0 < λ X < 1 , (2) holds for any form of Y as long as Y 1 − Y 0 makes sense.

The qualifier “for all λ X ” in (4)(i) and (ii) is sufficient but not necessary, because if the conditions hold only for some values of λ X , then Theorem 1 holds only for those values of λ X for which the CP effect can still be identified.

2.2 Ratio estimator under single index assumption

Our single index assumption for dimension reduction with an unknown Λ ( ⋅ ) and α is

(6) λ X ≡ E ( δ ∣ X ) = Λ ( X ′ α ) .

To estimate Λ ( ⋅ ) and α , we minimize ∑ i { δ i − L ( X i ′ a ) } 2 with respect to (wrt) { L ( ⋅ ) , a } as Ichimura [23] did, which raises identification issues. First, the intercept for X ′ α is not identified, because Λ ( α 0 + X ′ α ) can be written as Λ 0 ( X ′ α ) , where Λ 0 ( ⋅ ) ≡ Λ ( α 0 + ⋅ ) ; i.e., we can identify both { Λ ( ⋅ ) , ( α 0 , α ) } and { Λ 0 ( ⋅ ) , α } equally well. Second, since Λ ( X ′ α ) = Λ ( c ⋅ X ′ α / c ) for any c ≠ 0 , instead of identifying { Λ ( ⋅ ) , α } , we can identify { Λ c ( ⋅ ) , α / c } equally well, where Λ c ( ⋅ ) ≡ Λ ( c ⋅ ) . Clearly, the scale and sign of α are not identified.

One approach to overcome the identification problems is to assume a continuous regressor with a non-zero slope, e.g., the last regressor X k . Then, divide X ′ α by α k ≠ 0 to obtain the identified parameter ( α 1 / α k , … , α k − 1 / α k , 1 ) . If there is no such regressor, then X ′ α would be discrete, and we would not be able to trace the entire shape of Λ ( ⋅ ) with X ′ α .

Since we further assume a strictly increasing Λ ( ⋅ ) , we can identify the sign of α k . We divide X ′ α by ∣ α k ∣ and not by α k to obtain { α 1 / ∣ α k ∣ , … , α k − 1 / ∣ α k ∣ , s i g n ( α k ) } , where s i g n ( α k ) = 1 if α k > 0 , 0 if α k = 0 , and − 1 if α k < 0 . Then, we try both ( α 1 / ∣ α k ∣ , … , α k − 1 / ∣ α k ∣ , 1 ) and ( α 1 / ∣ α k ∣ , … , α k − 1 / ∣ α k ∣ , − 1 ) to select the one that minimizes ∑ i { δ i − L ( X i ′ a ) } 2 .

Let S i ≡ X i ′ α , and let G j denote the group with δ = j , j = 0 , 1 . Given kernel K , bandwidth h , and the sample size N j for G j , and recalling μ 1 ( λ X ) ≡ E ( Y 1 − Y 0 ∣ CP , λ X ) , our first ratio estimator for μ 1 ( s ) = E ( Y 1 − Y 0 ∣ CP , S = s ) is

(7) μ ˆ 1 ( s ) ≡ b ˆ ( s ) a ˆ ( s ) where a ˆ ( s ) ≡ a ˆ 1 ( s ) − a ˆ 0 ( s ) , b ˆ ( s ) ≡ b ˆ 1 ( s ) − b ˆ 0 ( s ) , a ˆ j ( s ) ≡ ( N j h ) − 1 ∑ i ∈ G j K { ( S i − s ) / h } D i ( N j h ) − 1 ∑ i ∈ G j K { ( S i − s ) / h } ≡ a ˆ j d ( s ) c ˆ j ( s ) , j = 0 , 1 , b ˆ j ( s ) ≡ ( N j h ) − 1 ∑ i ∈ G j K { ( S i − s ) / h } Y i ( N j h ) − 1 ∑ i ∈ G j K { ( S i − s ) / h } ≡ b ˆ j y ( s ) c ˆ j ( s ) , j = 0 , 1 .

The numerator of a ˆ j ( s ) is defined as a ˆ j d ( s ) , where subscript d refers to D in a ˆ j ( s ) , and the numerator of b ˆ j ( s ) is defined as b ˆ j y ( s ) for the analogous reason. The common denominator of a ˆ j ( s ) and b ˆ j ( s ) is c ˆ j ( s ) → p f S j ( s ) , which is the density of S ∣ ( δ = j ) .

There are six estimators in μ ˆ 1 ( s ) : { a ˆ 0 d ( s ) , b ˆ 0 y ( s ) , c ˆ 0 ( s ) } and { a ˆ 1 d ( s ) , b ˆ 1 y ( s ) , c ˆ 1 ( s ) } . Computing μ ˆ 1 ( s ) is not involved, but estimating the asymptotic variance of μ ˆ 1 ( s ) is: it consists of six variances, and six covariances among { a ˆ 0 d ( s ) , b ˆ 0 y ( s ) , c ˆ 0 ( s ) } and { a ˆ 1 d ( s ) , b ˆ 1 y ( s ) , c ˆ 1 ( s ) } . Regarding the selection of h , since only one-dimensional nonparametric estimators appear in μ ˆ 1 ( s ) , it is preferable to select h in practice through “eye-balling” or cross-validation.

The probability limits of { a ˆ j ( s ) , b ˆ j ( s ) } and those of { a ˆ ( s ) , b ˆ ( s ) , μ ˆ 1 ( s ) } can be seen in

(8) a j ( s ) ≡ E ( D ∣ S = s , δ = j ) , b j ( s ) ≡ E ( Y ∣ S = s , δ = j ) , ⇒ μ 1 ( s ) = b ( s ) a ( s ) ≡ b 1 ( s ) − b 0 ( s ) a 1 ( s ) − a 0 ( s ) = E ( Y ∣ S = s , δ = 1 ) − E ( Y ∣ S = s , δ = 0 ) E ( D ∣ S = s , δ = 1 ) − E ( D ∣ S = s , δ = 0 ) .

That α in S = X ′ α has to be estimated can be ignored, as α can be estimated N -consistently. In other words, α is as good as known for μ ˆ 1 ( s ) that is N h -consistent. This is the reason why we condition on S = X ′ α instead of Λ ( X ′ α ) . To make conditioning on S = X ′ α equivalent to that on Λ ( X ′ α ) , we require Λ ( ⋅ ) to be strictly increasing. In the following, we often write “ S = s ” or “ P = p ” in conditioning sets simply as “ s ” or “ p ”.

Theorem 2

N h { μ ˆ 1 ( s ) − μ 1 ( s ) } → d N { 0 , ∑ j = 1 6 ( V j + 2 C j ) } , where the V j ’s and C j ’s are in (A6)–(A7) of the appendix, with κ ≡ ∫ K ( t ) 2 d t and π j ≡ lim N → ∞ N j / N , and the requisite assumptions are as follows. (i) (4) holds for all λ X ≡ Λ ( X ′ α ) and 0 < λ X < 1 ; (ii) Λ ( ⋅ ) is strictly increasing and differentiable; (iii) a continuous X k with α k ≠ 0 exists; (iv) α is interior to a compact parameter space A α ; (v) K ( ⋅ ) is symmetric about 0 and twice continuously differentiable with ∫ K ( t ) d t = 1 , the second derivative bounded, and K ( t ) = 0 for ∣ t ∣ > 1 ; (vi) N h 8 → 0 and N h 3 + ν / ( − ln h ) → ∞ for an arbitrarily small ν > 0 as N → ∞ ; (vii) f S j ( s ) > 0 for j = 0 , 1 and s = x ′ α , with x on a compact set X ˘ ; also, f S j ( s ) , E ( Y ∣ s , δ = j ) , E ( D ∣ s , δ = j ) , E ( Y 2 ∣ s , δ = j ) , and E ( Y D ∣ s , δ = j ) are three times continuously differentiable wrt s = x ′ α , with the third derivatives bounded uniformly on A α .

In Theorem 2, (i) is assumed to identify α ; note that (4)(iii) implies a ( s ) > 0 in view of (3), because E ( D 1 − D 0 ∣ λ X ) = P ( CP ∣ λ X ) . The strictly increasing Λ ( ⋅ ) part in (ii) and (iii) is aimed at identifying Λ ( ⋅ ) , and the differentiability in (ii) pertains to Assumption 4.1 of Ichimura [23, p. 81]. Also, (iv) is a standard assumption for asymptotic normality. In (v), “ K ( t ) = 0 for ∣ t ∣ > 1 ” is the same as Assumption 5.6 (4) of Ichimura [23, p. 88]. The conditions for h in (vi) pertain to the N -consistency and asymptotic normality of α ˆ in Ichimura’s Theorem 5.2 [23, p. 94] with his m = ∞ for binary D as the dependent variable. Part of the conditions in (vii) is used to satisfy Assumptions 5.3–5.5 of Ichimura [23, p. 87].

Although not mentioned in Theorem 2, Assumption 4.2 of Ichimura [23, p. 82] must be introduced if several components of X are deterministically determined by other components in X . We do not mention this assumption in our Theorem 2, as the incentive to use such components is weak in single index estimation. For instance, Λ ( X ′ α ) = ( X ′ α ) + ( X ′ α ) 2 means that all squared components of X are used in Λ ( X ′ α ) , and thus, these squared components do not need to be used as part of X .

Theorem 2 and μ ˆ 1 ( s ) may look daunting, but they can be simply summarized as follows. The first step is the single index estimation in which { L ( ⋅ ) , a } is chosen to minimize ∑ i { δ i − L ( X i ′ a ) } 2 . We use the algorithm proposed by Hastie et al. [24, p. 391], which rapidly converges to a local minimum, if not the global minimum. The second step is to obtain the six kernel estimators constituting μ ˆ 1 ( s ) . Hence, the only complication in our proposal is estimating the asymptotic variance of μ ˆ 1 ( s ) . Our simulation study will demonstrate that the asymptotic variance formula works fairly well with N = 500 , and very well with N = 5,000 .

For our third estimator using (1) under λ X = Φ ( X ′ θ ) , let θ ˆ be the probit estimator and θ ˆ k be the slope of X k . In comparing μ ˆ 1 ( s ) to the third estimator, we condition on X ′ α ˆ ⋅ ∣ θ ˆ k ∣ and not on X ′ α ˆ to ensure that the slope of X k in X ′ α ˆ ⋅ ∣ θ ˆ k ∣ becomes θ ˆ k as in X ′ θ ˆ . To show this aspect, suppose θ k > 0 . Then, α k = 1 , and thus, the slope of X k in X ′ α ˆ ⋅ ∣ θ ˆ k ∣ is θ ˆ k as in X ′ θ ˆ . Now, suppose θ k < 0 . Then, α k = − 1 , so that the slope of X k in X ′ α ˆ ⋅ ∣ θ ˆ k ∣ is − ∣ θ ˆ k ∣ = θ ˆ k as in X ′ θ ˆ .

2.3 Ratio estimator under the probit IS

The requisite conditions for the preceding ratio estimator and its asymptotic variance are not simple. Our second ratio estimator is simpler in terms of the requisite conditions and asymptotic variance, although it requires imposing the probit assumption λ X = Φ ( X ′ θ ) . Since θ can be estimated N -consistently by the probit estimator θ ˆ , we treat θ as known for the second ratio estimator conditioning on Φ ( X ′ θ ˆ ) . Although the assumption λ X = Φ ( X ′ θ ) may appear restrictive, it is not so, because we need only the RF λ X . For instance, suppose δ = 1 [ 0 < ξ ( X ) + σ ( X ) ε ] , where ξ ( X ) and σ ( X ) ≠ 0 are unknown functions and ε ∼ N ( 0 , 1 ) ∐ X . Then, E ( δ ∣ X ) = Φ { ξ ( X ) / σ ( X ) } , and we estimate the linearized version X ′ θ ≃ ξ ( X ) / σ ( X ) to use only Φ ( X ′ θ ) and not the individual components of θ .

Under λ X = Φ ( X ′ θ ) , the appendix proves that the ratio in (3) with λ X = p is

(9) E ( B ∣ λ X = p ) E ( A ∣ λ X = p ) where B ≡ Y ( δ − p ) and A ≡ D ( δ − p ) .

Then, with P i ≡ λ X i , our estimator for (9) is

μ ˜ 1 ( p ) ≡ ∑ i K { ( P i − p ) / h } B i ∑ i K { ( P i − p ) / h } ∑ i K { ( P i − p ) / h } A i ∑ i K { ( P i − p ) / h } = ∑ i K { ( P i − p ) / h } B i ∑ i K { ( P i − p ) / h } A i ;

the notation μ ˜ 1 ( ⋅ ) is used for our second ratio estimator.

Theorem 3

N h { μ ˜ 1 ( p ) − μ 1 ( p ) } is asymptotically normal with variance

κ f λ ( p ) { E ( A ∣ p ) } 4 [ E ( A 2 ∣ p ) { E ( B ∣ p ) } 2 + E ( B 2 ∣ p ) { E ( A ∣ p ) } 2 − 2 E ( A ∣ p ) E ( B ∣ p ) E ( A B ∣ p ) ]

under the following assumptions, where κ ≡ ∫ K ( t ) 2 d t , and f λ is the density of λ X . (i) (4) holds for all λ X ≡ Φ ( X ′ θ ) , and 0 < λ X < 1 ; (ii) a continuous X k with θ k ≠ 0 exists; (iii) θ is interior to a compact parameter space A θ ; (iv) K ( ⋅ ) is symmetric about 0 and twice continuously differentiable with ∫ K ( t ) d t = 1 ; (v) N h 4 → 0 and N h → ∞ as N → ∞ ; and (vi) f λ ( ⋅ ) , E ( A ∣ λ X ) , E ( A 2 ∣ λ X ) , E ( B ∣ λ X ) , E ( B 2 ∣ λ X ) , and E ( A B ∣ λ X ) are twice continuously differentiable with E ( A ∣ λ X ) , f λ ( λ X ) > 0 for all λ X .

The aforementioned assumptions are weaker than those in Theorem 2 because single index estimation is not needed. Moreover, (ii) is not essential, because no continuous covariate means that nonparametric estimation is not needed. If D is exogenous such that δ = D , then we can replace A with 1 in μ ˜ 1 ( p ) to obtain the usual kernel estimator for E ( B ∣ p ) . In this case, the asymptotic variance in Theorem 3 is simplified to κ V ( B ∣ p ) / f λ ( p ) . Note that although we use the same notation X for probit, X should be augmented by the constant 1 for probit.

3 Power approximation approach

3.1 Identification

We apply a power approximation to μ 1 ( λ X ) ≡ E ( Y 1 − Y 0 ∣ CP , λ X ) in (1) to obtain

(10) E ( Y 1 − Y 0 ∣ CP , λ X ) = M ′ β , M ≡ ( 1 , λ X , … , λ X J ) ′ , β ≡ ( β 0 , β 1 , … , β J ) ′ ⇒ Y = D M ′ β + μ 0 ( λ X ) + U 1 .

This yields a moment condition E ( IV × error ) = 0 with the IV ( δ − λ X ) M :

0 = E [ ( δ − λ X ) M ⋅ { μ 0 ( λ X ) + U 1 } ] = E { ( δ − λ X ) M ⋅ ( Y − D M ′ β ) } .

With E − 1 ( ⋅ ) denoting { E ( ⋅ ) } − 1 , we solve E { ( δ − λ X ) M ⋅ ( Y − D M ′ β ) } = 0 for β :

(11) β = E − 1 { ( δ − λ X ) DMM ′ } E { ( δ − λ X ) M Y } , E ( Y 1 − Y 0 ∣ CP , λ X ) = M ′ β .

The main identification finding in this section is Theorem 4.

Theorem 4

Under (4)(i) and (ii), (1) holds for any Y as long as Y 1 − Y 0 makes sense. If (4)(iii), 0 < λ X < 1 , and μ 1 ( λ X ) ≡ E ( Y 1 − Y 0 ∣ CP , λ X ) = ∑ j = 0 J β j λ X j = M ′ β hold additionally, then β = ( β 0 , β 1 , … , β J ) ′ is identified.

The proof for Theorem 4 in the appendix reveals

(12) E { ( δ − λ X ) DMM ′ } = E { E ( D 1 − D 0 ∣ λ X ) ( 1 − λ X ) λ X M M ′ } .

Hence, even if E ( D 1 − D 0 ∣ λ X ) ≠ 0 only for some values of λ X in (4)(iii), Theorem 4 holds as long as E { ( δ − λ X ) DMM ′ } is invertible.

3.2 Power approximation estimator

Before estimation, we introduce a modification to Y = μ 1 ( λ X ) D + μ 0 ( λ X ) + U 1 in (1) in case λ X is misspecified. When we use δ − λ X as the IV, we can ignore μ 0 ( λ X ) in the composite error μ 0 ( λ X ) + U 1 , because the IV δ − λ X is orthogonal to μ 0 ( λ X ) . However, λ X may be misspecified in practice, which can result in a bias because the misspecified δ − λ X may be correlated with μ 0 ( λ X ) . Hence, it is better to remove μ 0 ( λ X ) from the error. To this end, we use Y − E ( Y ∣ λ X ) as the outcome variable as follows (see Chernozhukov et al. [25] and Lee [26,27] for closely related ideas).

Take E ( ⋅ ∣ λ X ) on (1) to obtain E ( Y ∣ λ X ) = μ 1 ( λ X ) E ( D ∣ λ X ) + μ 0 ( λ X ) . Subtract this expression from (1) to remove μ 0 ( λ X ) and obtain

(13) Y − E ( Y ∣ λ X ) = μ 1 ( λ X ) D − μ 1 ( λ X ) E ( D ∣ λ X ) + U 1 .

We use (13) instead of (1) to implement our IVE, where − μ 1 ( λ X ) E ( D ∣ λ X ) + U 1 is the composite error. Accounting for E ( Y ∣ λ X ) explicitly in (13) tends to improve the finite sample performance of the following IVE by decreasing the error term standard deviation (SD) and making the IVE relatively insensitive to misspecifications of λ X .

We present two versions of the power approximation estimator: one conditioned on X ′ θ and the other conditioned on Φ ( X ′ θ ) . The former estimator is simpler, but the latter estimator is likely to perform better when X ′ θ is large, thereby causing the power functions of X ′ θ to become even larger to generate outliers. In contrast, power functions of Φ ( X ′ θ ) become smaller as X ′ θ becomes large, and thus, no outlier problem arises.

First, we condition on X ′ θ and apply a power approximation to E ( Y ∣ X ′ θ ) and μ 1 ( X ′ θ ) : For some γ parameters and error term U ,

Y − W ′ γ = D W ′ β − μ 1 ( X ′ θ ) E ( D ∣ X ′ θ ) + U , where W = { 1 , X ′ θ , … , ( X ′ θ ) J } ′ and γ ≡ ( γ 0 , … , γ J ) ′ .

Using the same J -order power approximation for both W ′ γ and W ′ β is not necessary, because W ′ γ in Y − W ′ γ can be replaced by Y ¯ (with J = 0 ) or even zero. If desired, we can also obtain E ( Y 1 − Y 0 ∣ CP , λ X = p ) from W ′ β : Replacing X ′ θ in W with Φ − 1 ( p ) yields

(14) E ( Y 1 − Y 0 ∣ CP , λ X = p ) = [ 1 , Φ − 1 ( p ) , … , { Φ − 1 ( p ) } J ] β .

Let θ ˆ be the probit estimator of δ on X . Replace X ′ θ with X ′ θ ˆ , and then replace E ( Y ∣ X ′ θ ) with the ordinary least-squares estimator (OLS) predicted value W ˆ ′ γ ˆ , where

γ ˆ is the OLS of Y on W ˆ ≡ W ( θ ˆ ) ≡ { 1 , X ′ θ ˆ , … , ( X ′ θ ˆ ) J } ′ .

Then, we obtain the IVE of Y − W ˆ ′ γ ˆ on D W ˆ with ε ˆ W ˆ ≡ { δ − Φ ( X ′ θ ˆ ) } W ˆ as the IV:

(15) β ˆ ≡ ∑ i ε ˆ i D i W ˆ i W ˆ i ′ − 1 ∑ i ε ˆ i W ˆ i ( Y i − W ˆ i ′ γ ˆ ) , ε ˆ i ≡ δ i − Φ ( X i ′ θ ˆ ) .

The appendix proves the next two theorems.

Theorem 5

The IVE in (15) is asymptotically normal with variance Ω 1 :

N ( β ˆ − β ) → d N ( 0 , Ω 1 ) , Ω 1 ≡ E − 1 ( ε DWW ′ ) E ( η 1 η 1 ′ ) E − 1 ( ε DWW ′ ) , η 1 ≡ V ε W − E { D ∇ W ′ β ε W X ′ + V ϕ ( X ′ θ ) W X ′ − V ε ∇ W X ′ } η θ ˆ , ε ≡ δ − Φ ( X ′ θ ) , V ≡ Y − W ′ γ − D W ′ β , ∇ W ≡ { 0 , 1 , 2 ( X ′ θ ) , … , J ( X ′ θ ) J − 1 } ′ ,

and η θ ˆ is the influence function for N ( θ ˆ − θ ) . Ω 1 can be consistently estimated with

Ω ˆ 1 ≡ 1 N ∑ i ε ˆ i D i W ˆ i W ˆ i ′ − 1 1 N ∑ i η ˆ 1 i η ˆ 1 i ′ 1 N ∑ i ε ˆ i D i W ˆ i W ˆ i ′ − 1 , ε ˆ i ≡ δ i − Φ ( X i ′ θ ˆ ) , η ˆ 1 i ≡ V ˆ i ε ˆ i W ˆ i − 1 N ∑ k { D k ∇ W ˆ k ′ β ˆ ε ˆ k W ˆ k X k ′ + V ˆ k ϕ ( X k ′ θ ˆ ) W ˆ k X k ′ − V ˆ k ε ˆ k ∇ W ˆ k X k ′ } η ˆ θ ˆ i , V ˆ i ≡ Y i − W ˆ i ′ γ ˆ − W ˆ i ′ D i ′ β ˆ , ∇ W ˆ i ≡ { 0 , 1 , 2 ( X i ′ θ ˆ ) , … , J ( X i ′ θ ˆ ) J − 1 } ′ , η ˆ θ ˆ i ≡ 1 N ∑ k s ˆ k s ˆ k ′ − 1 s ˆ i , where s ˆ i ≡ { δ i − Φ ( X i ′ θ ˆ ) } ϕ ( X i ′ θ ˆ ) Φ ( X i ′ θ ˆ ) { 1 − Φ ( X i ′ θ ˆ ) } X i .

Now, we condition on λ X and apply power approximations to E { Y ∣ Φ ( X ′ θ ) } and μ 1 { Φ ( X ′ θ ) } :

Y − M ′ γ = D M ′ β − μ 1 { Φ ( X ′ θ ) } E { D ∣ Φ ( X ′ θ ) } + U , M = { 1 , Φ ( X ′ θ ) , … , Φ ( X ′ θ ) J } ′ .

Replace E { Y ∣ Φ ( X ′ θ ) } with the OLS-predicted value M ˆ ′ γ ˜ , where

γ ˜ is the OLS of Y on M ˆ ≡ M ( θ ˆ ) ≡ { 1 , Φ ( X ′ θ ˆ ) , … , Φ ( X ′ θ ˆ ) J } ′ .

Obtain the IVE of Y − M ˆ ′ γ ˜ on D M ˆ with ε ˆ M ˆ ≡ { δ − Φ ( X ′ θ ˆ ) } M ˆ as the IV:

(16) β ˜ ≡ ∑ i ε ˆ i D i M ˆ i M ˆ i ′ − 1 ∑ i ε ˆ i M ˆ i ( Y i − M ˆ i ′ γ ˜ ) .

Theorem 6

The IVE in (16) is asymptotically normal with variance Ω 2 :

N ( β ˜ − β ) → d N ( 0 , Ω 2 ) , Ω 2 ≡ E − 1 ( ε DMM ′ ) E ( η 2 η 2 ′ ) E − 1 ( ε DMM ′ ) , η 2 ≡ Γ ε M − E { D ∇ M ′ β ε M X ′ + Γ ϕ ( X ′ θ ) M X ′ − Γ ε ∇ M X ′ } η θ ˆ , Γ ≡ Y − M ′ γ − D M ′ β , ∇ M ≡ { 0 , ϕ ( X i ′ θ ) , 2 ϕ ( X ′ θ ) , … , J ϕ ( X ′ θ ) J − 1 } ′ , Ω ˆ 2 ≡ 1 N ∑ i ε ˆ i D i M ˆ i M ˆ i ′ − 1 1 N ∑ i η ˆ 2 i η ˆ 2 i ′ 1 N ∑ i ε ˆ i D i M ˆ i M ˆ i ′ − 1 → p Ω 2 , η ˆ 2 i ≡ Γ ˜ i ε ˆ i M ˆ i − 1 N ∑ k { D k ∇ M ˆ k ′ β ˜ ε ˆ k M ˆ k X k ′ + Γ ˜ k ϕ ( X k ′ θ ˆ ) M ˆ k X k ′ − Γ ˜ k ε ˆ k ∇ M ˆ k X k ′ } η ˆ θ ˆ i , Γ ˜ i ≡ Y i − M ˆ i ′ γ ˜ − M ˆ i ′ D i ′ β ˜ , ∇ M ˆ i ≡ { 0 , ϕ ( X i ′ θ ˆ ) , 2 ϕ ( X i ′ θ ˆ ) , … , J ϕ ( X i ′ θ ˆ ) J − 1 } ′ .

4 Simulation study

With N = 500 , 5,000, and 10,000 simulation repetitions, our simulation setting is

δ = 1 [ 0 < π 1 + π 2 X 2 + π 3 X 3 + ξ ] , X 2 discrete uniform on { − 0.5 , − 0.25 , 0 , 0.25 , 0.5 } , X 3 ∼ U [ − 0.5 , 0.5 ] , ξ ∼ N ( 0 , 1 ) ∐ ( X 2 , X 3 ) , π 1 = 0 , π 2 = π 3 = 1 , D = 1 [ 0 < τ 1 + τ 2 X 2 + τ 3 X 3 + τ δ δ + ψ ] , ψ ∼ N ( 0 , 1 ) ∐ ( ξ , X 2 , X 3 ) , τ 1 = 0 , τ 2 = τ 3 = τ δ = 1 , Y 0 ∗ = 1 + X 2 + X 3 + U , U = e + ψ , e ∼ N ( 0 , 1 ) ∐ ( ξ , ψ , X 2 , X 3 ) , Y 1 ∗ = Y 0 ∗ + ρ 1 ( X 2 + X 3 ) + ρ 2 ( X 2 + X 3 ) 2 , ρ 1 = 1 , ρ 2 = 0 , − 0.5 (linear, quadratic effect), Y = Y 0 + ( Y 1 − Y 0 ) D , where Y d = Y d ∗ for continuous Y , and Y d = 1 [ 0 < Y d ∗ ] for binary Y , d = 0 , 1 .

Since π 2 = π 3 = 1 in the δ model, X ′ π = X 2 + X 3 has the range [ − 1 , 1 ] owing to the range of ( X 2 , X 3 ) . The error term U in Y 0 is correlated with the error term ψ in D with Cor ( U , ψ ) ≃ 0.71 to ensure that D is endogenous. We use four simulation designs:

Design 1 : Y is continuous, and ρ 2 = 0 ( effect is X ′ π ) ; Design 2 : Y is continuous, and ρ 2 = − 0.5 ( effect is X ′ π − 0.5 ( X ′ π ) 2 ) ; Design 3 : Y is binary, and ρ 2 = 0 ( effect is non-linear in X ′ π ) ; Design 4 : Y is binary, and ρ 2 = − 0.5 ( effect is non-linear in X ′ π − 0.5 ( X ′ π ) 2 ) .

For the first ratio estimator μ ˆ 1 ( s ) , we can set the slope of X 3 as − 1 or 1, but we set the value only to 1 because the sign of the slope of X 3 can be estimated at a rate faster than N − 1 / 2 as only one of the two values is selected. The lack of consideration of − 1 decreases the simulation time by half. With τ δ = 1 in the D model, the CP effect is not zero, but the simulation program crashes when the denominator of μ ˆ 1 and μ ˜ 1 approaches zero. In this case, the simulation run is abandoned, and the data are redrawn. For K ( ⋅ ) of μ ˆ 1 and μ ˜ 1 , we use the simple N ( 0 , 1 ) kernel. The bounded quartic kernel is used as well, but then dropped, as the kernel choice does not result in a significant difference. The bandwidth h is chosen initially by cross-validation for a number of times and then fixed throughout the simulation repetitions, as doing cross-validation at each simulation run is time-consuming.

Let X = ( X 2 , X 3 ) ′ and X + 1 ≡ ( 1 , X 2 , X 3 ) ′ such that E ( δ ∣ X ) = Φ ( X + 1 ′ θ ) , although we use sometimes the expression E ( δ ∣ X ) = Φ ( X ′ θ ) for simplicity. The following tables present

( 1 ) : μ ˆ 1 ( s ) at s = − 0.5 , 0 , and 0.5 for s = x ′ α ˆ (evaluation points) ; ( 2 ) : μ ˜ 1 ( p ) at p = 0.31 = Φ ( − 0.5 ) , 0.5 = Φ ( 0 ) , and 0.69 = Φ ( 0.5 ) for p = Φ ( x + 1 ′ θ ˆ ) ; ( 3 ) : w ˆ J ′ β ˆ ≡ ∑ j = 0 J β ˆ j ( x + 1 ′ θ ˆ ) j for J = 1 , 2 at x + 1 ′ θ ˆ = − 0.5 , 0 , and 0.5 ; ( 4 ) : m ˆ J ′ β ˜ ≡ ∑ j = 0 J β ˜ j Φ ( x + 1 ′ θ ˆ ) j for J = 1 , 2 at Φ ( x + 1 ′ θ ˆ ) = 0.31 , 0.5 , and 0.69 .

Overall, six estimators are compared at the three evaluation points.

For each entry in each following table, four numbers appear at a given evaluation point: the (i) absolute bias (∣Bias∣); (ii) SD; (iii) averaged SD (across 10,000 repetitions) based on the asymptotic variance to be compared with (ii); and (iv) proportion of the 95% point-wise confidence intervals (CI) capturing the true value. We do not present the root mean-squared error (RMSE) to save space: in most cases, the absolute bias is much smaller than the SD, and thus, the RMSE is similar to the SD. The entries with the subscript “avg” indicate the simple averages across the three evaluation points, which are used as summary measures.

To make s = x ′ α ˆ comparable to p = Φ ( x + 1 ′ θ ˆ ) , we set s = x ′ α ˆ ⋅ θ ˆ 3 and not x ′ α ˆ , so that the slope of x 3 in x ′ α ˆ ⋅ θ ˆ 3 becomes θ ˆ 3 as in x + 1 ′ θ ˆ . For μ ˆ 1 ( s ) and μ ˜ 1 ( p ) , we abandon the simulation run when the denominator is smaller than 0.01. In 10 , 000 repetitions with N = 500 , about 1.67% of the runs are abandoned for μ ˜ 1 ( p ) , but no runs are abandoned for μ ˆ 1 ( s ) .

Table 1 shows the results for continuous Y with N = 500 , where the true effect is linear or quadratic in the single index X ′ π . The ratio estimators μ ˆ 1 and μ ˜ 1 perform well for both linear and quadratic effects. μ ˆ 1 tends to be more biased than μ ˜ 1 but has a smaller SD. Since the bias magnitude is considerably smaller, the difference in SD dominates that in the bias, and thus, μ ˆ 1 performs better than μ ˜ 1 . The highest performing estimators are w ˆ 1 ′ β ˆ and m ˆ 1 ′ β ˜ . The biases of w ˆ 2 ′ β ˆ and m ˆ 2 ′ β ˜ are not large; however, their SDs are very high due to multicollinearity problems among the regressors. Moreover, the corresponding asymptotic variance estimates grossly exaggerate the actual SDs. In terms of SD, m ˆ 2 ′ β ˜ does better than w ˆ 2 ′ β ˆ . The SDs of most of μ ˆ 1 , μ ˜ 1 , w ˆ 1 ′ β ˆ , and m ˆ 1 ′ β ˜ match closely with the averaged asymptotic SDs, which demonstrates the correctness of their asymptotic variances. The CI coverage proportion is too small for μ ˆ 1 and μ ˜ 1 and too large for w ˆ 2 ′ β ˆ and m ˆ 2 ′ β ˜ . Overall, the ranking in Table 1 can be summarized as follows, with “ ≻ ” meaning “better than”:

(17) m ˆ 1 ′ β ˜ ≃ w ˆ 1 ′ β ˆ ≻ μ ˆ 1 ≻ m ˆ 2 ′ β ˜ ≻ μ ˜ 1 ≻ w ˆ 2 ′ β ˆ .

Table 1

Continuous Y : ∣Bias∣, SD, Avg.Asy.SD, and CI coverage proportion ( N = 500 )

Effect	Linear	Quadratic	Effect	Linear	Quadratic
μ ˆ 1 ( − 0.5 )	0.056 0.698 0.488 0.83	0.074 0.588 0.447 0.81	μ ˜ 1 ( 0.31 )	0.018 0.735 0.730 0.89	0.006 0.781 0.765 0.89
μ ˆ 1 ( 0 )	0.044 0.364 0.369 0.96	0.067 0.370 0.368 0.96	μ ˜ 1 ( 0.5 )	0.121 0.899 0.966 0.94	0.111 0.897 0.940 0.94
μ ˆ 1 ( 0.5 )	0.114 1.14 0.782 0.85	0.155 1.11 0.771 0.87	μ ˜ 1 ( 0.69 )	0.019 1.93 2.88 0.84	0.077 1.86 2.76 0.84
μ ˆ 1 Avg	0.071 0.73 0.55 0.88	0.099 0.69 0.53 0.88	μ ˜ 1 Avg	0.053 1.2 1.5 0.89	0.065 1.2 1.5 0.89
w ˆ 1 ′ β ˆ ( − 0.5 )	0.041 0.486 0.503 0.93	0.015 0.486 0.496 0.93	w ˆ 2 ′ β ˆ ( − 0.5 )	0.143 1.70 5.69 0.96	0.180 1.53 5.37 0.96
w ˆ 1 ′ β ˆ ( 0 )	0.034 0.345 0.353 0.96	0.118 0.353 0.353 0.96	w ˆ 2 ′ β ˆ ( 0 )	0.034 1.16 5.50 0.98	0.011 1.28 6.60 0.98
w ˆ 1 ′ β ˆ ( 0.5 )	0.027 0.670 0.685 0.98	0.029 0.660 0.680 0.98	w ˆ 2 ′ β ˆ ( 0.5 )	0.019 3.08 17.0 0.99	0.008 3.49 18.6 1.0
w ˆ 1 ′ β ˆ Avg	0.034 0.50 0.51 0.97	0.054 0.50 0.51 0.95	w ˆ 2 ′ β ˆ Avg	0.065 2.0 9.4 0.98	0.066 2.1 10 0.98
m ˆ 1 ′ β ˜ ( 0.31 )	0.053 0.488 0.502 0.95	0.028 0.490 0.497 0.93	m ˆ 2 ′ β ˜ ( 0.31 )	0.113 0.694 0.847 0.96	0.133 0.755 0.911 0.96
m ˆ 1 ′ β ˜ ( 0.5 )	0.032 0.346 0.354 0.96	0.116 0.354 0.353 0.96	m ˆ 2 ′ β ˜ ( 0.5 )	0.030 0.537 0.746 0.98	0.007 0.545 0.764 0.98
m ˆ 1 ′ β ˜ ( 0.69 )	0.010 0.676 0.689 0.98	0.045 0.668 0.682 0.97	m ˆ 2 ′ β ˜ ( 0.69 )	0.023 1.45 2.46 1.0	0.092 1.49 2.46 1.0
m ˆ 1 ′ β ˜ Avg	0.032 0.50 0.52 0.97	0.063 0.50 0.51 0.95	m ˆ 2 ′ β ˜ Avg	0.055 0.90 1.4 0.98	0.077 0.93 1.4 0.98

Avg.Asy.SD: average across 10,000 reps of the asymptotic SD formula in theorems; μ ˆ 1 and μ ˜ 1 : ratio estimates; w ˆ J ′ β ˆ & m ˆ J ′ β ˜ : power-approximation estimates with J = 1 , 2 ; Avg: simple average across three evaluation points.

When N increases to 5,000 in Table 2, all estimators become stable, and the averaged asymptotic SDs are closely matched with the corresponding simulation SDs. μ ˆ 1 outperforms μ ˜ 1 in terms of both the bias and SD. w ˆ 1 ′ β ˆ and m ˆ 1 ′ β ˜ perform the best when the true effect is linear, but exhibit substantially large biases when the true effect is quadratic. The performances of w ˆ 2 ′ β ˆ and m ˆ 2 ′ β ˜ are satisfactory, even though they were the lowest performing estimators in Table 1, with N = 500 ; w ˆ 2 ′ β ˆ and m ˆ 2 ′ β ˜ exhibit the minimum bias under quadratic effects. μ ˆ 1 is comparable to w ˆ 2 ′ β ˆ and m ˆ 2 ′ β ˜ , and μ ˜ 1 has the largest SDs. The CI coverage proportion is close to 95%, except when the bias is large for w ˆ 1 ′ β ˆ and m ˆ 1 ′ β ˜ under quadratic effects. Overall, the ranking in Table 2 is

(18) m ˆ 2 ′ β ˜ ≻ μ ˆ 1 ≃ w ˆ 2 ′ β ˆ ≻ m ˆ 1 ′ β ˜ ≃ w ˆ 1 ′ β ˆ ≻ μ ˜ 1 .

Table 2

Continuous Y : ∣Bias∣, SD, Avg.Asy.SD, and CI coverage proportion ( N = 5,000 )

Effect	Linear	Quadratic	Effect	Linear	Quadratic
μ ˆ 1 ( − 0.5 )	0.007 0.165 0.152 0.91	0.019 0.165 0.154 0.90	μ ˜ 1 ( 0.31 )	0.031 0.231 0.233 0.94	0.033 0.230 0.234 0.92
μ ˆ 1 ( 0 )	0.015 0.126 0.129 0.96	0.030 0.127 0.130 0.95	μ ˜ 1 ( 0.5 )	0.019 0.246 0.257 0.95	0.011 0.244 0.259 0.95
μ ˆ 1 ( 0.5 )	0.026 0.265 0.244 0.91	0.037 0.259 0.240 0.91	μ ˜ 1 ( 0.69 )	0.027 0.621 0.682 0.91	0.054 0.631 0.689 0.92
μ ˆ 1 Avg	0.016 0.19 0.18 0.93	0.029 0.18 0.18 0.92	μ ˜ 1 Avg	0.026 0.37 0.39 0.93	0.032 0.37 0.39 0.93
w ˆ 1 ′ β ˆ ( − 0.5 )	0.007 0.142 0.142 0.95	0.013 0.138 0.139 0.85	w ˆ 2 ′ β ˆ ( − 0.5 )	0.011 0.145 0.147 0.95	0.010 0.145 0.149 0.95
w ˆ 1 ′ β ˆ ( 0 )	0.003 0.101 0.101 0.95	0.081 0.101 0.102 0.86	w ˆ 2 ′ β ˆ ( 0 )	0.005 0.141 0.150 0.95	0.003 0.132 0.142 0.95
w ˆ 1 ′ β ˆ ( 0.5 )	0.001 0.201 0.199 0.96	0.075 0.197 0.198 0.83	w ˆ 2 ′ β ˆ ( 0.5 )	0.018 0.361 0.362 0.97	0.001 0.312 0.329 0.98
w ˆ 1 ′ β ˆ Avg	0.004 0.15 0.15 0.95	0.056 0.15 0.15 0.85	w ˆ 2 ′ β ˆ Avg	0.011 0.22 0.22 0.96	0.005 0.20 0.21 0.96
m ˆ 1 ′ β ˜ ( 0.31 )	0.018 0.142 0.142 0.95	0.002 0.138 0.140 0.84	m ˆ 2 ′ β ˜ ( 0.31 )	0.022 0.141 0.142 0.96	0.033 0.145 0.146 0.95
m ˆ 1 ′ β ˜ ( 0.5 )	0.003 0.101 0.101 0.95	0.081 0.101 0.102 0.85	m ˆ 2 ′ β ˜ ( 0.5 )	0.001 0.129 0.133 0.95	0.005 0.128 0.134 0.96
m ˆ 1 ′ β ˜ ( 0.69 )	0.012 0.201 0.199 0.95	0.085 0.196 0.197 0.83	m ˆ 2 ′ β ˜ ( 0.69 )	0.012 0.283 0.281 0.97	0.009 0.278 0.276 0.98
m ˆ 1 ′ β ˜ Avg	0.011 0.15 0.15 0.95	0.056 0.15 0.15 0.84	m ˆ 2 ′ β ˜ Avg	0.012 0.18 0.19 0.96	0.015 0.18 0.19 0.96

The findings in Table 3 with binary Y and N = 500 are similar to those in Table 1 except for the CI coverage that is considerably lower for μ ˆ 1 . The ranking (17) still holds for Table 3. When N increases to 5,000 in Table 4, the asymptotic variance formulas work well, and the large biases of m ˆ 1 ′ β ˜ and w ˆ 1 ′ β ˆ (slightly larger than the biases in Table 3) and resulting low CI coverage are notable. Overall, the ranking in Table 4 can be expressed as

(19) m ˆ 2 ′ β ˜ ≃ w ˆ 2 ′ β ˆ ≻ μ ˆ 1 ≻ m ˆ 1 ′ β ˜ ≃ w ˆ 1 ′ β ˆ ≻ μ ˜ 1 .

Table 3

Binary Y : ∣Bias∣, SD, Avg.Asy.SD, and CI coverage proportion ( N = 500 )

Effect	Linear	Quadratic	Effect	Linear	Quadratic
μ ˆ 1 ( − 0.5 )	0.029 0.373 0.274 0.83	0.020 0.359 0.269 0.82	μ ˜ 1 ( 0.31 )	0.051 0.578 0.627 0.92	0.019 0.552 0.602 0.92
μ ˆ 1 ( 0 )	0.024 0.149 0.145 0.95	0.034 0.149 0.147 0.94	μ ˜ 1 ( 0.5 )	0.133 0.564 0.626 0.97	0.158 0.589 0.649 0.97
μ ˆ 1 ( 0.5 )	0.026 0.214 0.176 0.68	0.027 0.215 0.179 0.71	μ ˜ 1 ( 0.69 )	0.098 0.869 1.37 0.86	0.108 0.856 1.40 0.86
μ ˆ 1 Avg	0.026 0.25 0.20 0.82	0.027 0.24 0.20 0.82	μ ˜ 1 Avg	0.094 0.67 0.87 0.92	0.095 0.67 0.89 0.91
w ˆ 1 ′ β ˆ ( − 0.5 )	0.013 0.225 0.227 0.93	0.028 0.226 0.226 0.90	w ˆ 2 ′ β ˆ ( − 0.5 )	0.091 1.50 3.96 0.96	0.052 0.621 2.54 0.94
w ˆ 1 ′ β ˆ ( 0 )	0.059 0.121 0.124 0.94	0.083 0.123 0.126 0.91	w ˆ 2 ′ β ˆ ( 0 )	0.001 0.481 2.63 0.97	0.013 0.450 2.98 0.97
w ˆ 1 ′ β ˆ ( 0.5 )	0.030 0.184 0.195 0.98	0.049 0.182 0.198 0.97	w ˆ 2 ′ β ˆ ( 0.5 )	0.007 1.17 8.37 0.99	0.027 1.11 8.90 0.99
w ˆ 1 ′ β ˆ Avg	0.034 0.18 0.18 0.95	0.053 0.18 0.18 0.93	w ˆ 2 ′ β ˆ Avg	0.033 1.1 5.0 0.97	0.031 0.73 4.8 0.97
m ˆ 1 ′ β ˜ ( 0.31 )	0.008 0.228 0.229 0.93	0.023 0.229 0.229 0.90	m ˆ 2 ′ β ˜ ( 0.31 )	0.047 0.377 0.382 0.96	0.038 0.350 0.369 0.95
m ˆ 1 ′ β ˜ ( 0.5 )	0.058 0.121 0.123 0.94	0.082 0.123 0.126 0.91	m ˆ 2 ′ β ˜ ( 0.5 )	0.004 0.192 0.250 0.98	0.010 0.190 0.242 0.98
m ˆ 1 ′ β ˜ ( 0.69 )	0.036 0.185 0.194 0.98	0.056 0.182 0.196 0.97	m ˆ 2 ′ β ˜ ( 0.69 )	0.005 0.314 0.542 0.99	0.003 0.300 0.526 0.99
m ˆ 1 ′ β ˜ Avg	0.034 0.18 0.18 0.95	0.054 0.18 0.18 0.92	m ˆ 2 ′ β ˜ Avg	0.019 0.30 0.39 0.97	0.017 0.28 0.38 0.97

Table 4

Binary Y : ∣Bias∣, SD, Avg.Asy.SD, and CI coverage proportion ( N = 5,000 )

Effect	Linear	Quadratic	Effect	Linear	Quadratic
μ ˆ 1 ( − 0.5 )	0.002 0.096 0.092 0.90	0.005 0.099 0.094 0.90	μ ˜ 1 ( 0.31 )	0.004 0.170 0.190 0.96	0.001 0.176 0.193 0.96
μ ˆ 1 ( 0 )	0.009 0.053 0.053 0.94	0.011 0.053 0.053 0.93	μ ˜ 1 ( 0.5 )	0.010 0.152 0.168 0.97	0.012 0.152 0.169 0.97
μ ˆ 1 ( 0.5 )	0.008 0.061 0.058 0.89	0.003 0.062 0.060 0.89	μ ˜ 1 ( 0.69 )	0.056 0.313 0.351 0.95	0.051 0.310 0.351 0.95
μ ˆ 1 Avg	0.006 0.070 0.068 0.91	0.006 0.072 0.069 0.91	μ ˜ 1 Avg	0.023 0.21 0.24 0.96	0.022 0.21 0.24 0.96
w ˆ 1 ′ β ˆ ( − 0.5 )	0.025 0.066 0.066 0.85	0.035 0.066 0.066 0.68	w ˆ 2 ′ β ˆ ( − 0.5 )	0.014 0.072 0.072 0.95	0.019 0.071 0.073 0.92
w ˆ 1 ′ β ˆ ( 0 )	0.046 0.036 0.036 0.71	0.068 0.037 0.037 0.45	w ˆ 2 ′ β ˆ ( 0 )	0.009 0.048 0.049 0.95	0.014 0.048 0.048 0.93
w ˆ 1 ′ β ˆ ( 0.5 )	0.044 0.058 0.057 0.61	0.073 0.059 0.059 0.32	w ˆ 2 ′ β ˆ ( 0.5 )	0.008 0.059 0.061 0.98	0.015 0.055 0.057 0.98
w ˆ 1 ′ β ˆ Avg	0.038 0.054 0.053 0.72	0.058 0.054 0.054 0.48	w ˆ 2 ′ β ˆ Avg	0.010 0.060 0.061 0.96	0.016 0.058 0.060 0.94
m ˆ 1 ′ β ˜ ( 0.31 )	0.021 0.067 0.067 0.84	0.031 0.066 0.067 0.66	m ˆ 2 ′ β ˜ ( 0.31 )	0.004 0.074 0.073 0.95	0.006 0.073 0.074 0.93
m ˆ 1 ′ β ˜ ( 0.5 )	0.046 0.036 0.036 0.70	0.068 0.037 0.037 0.44	m ˆ 2 ′ β ˜ ( 0.5 )	0.005 0.048 0.049 0.95	0.007 0.049 0.050 0.95
m ˆ 1 ′ β ˜ ( 0.69 )	0.047 0.057 0.056 0.59	0.077 0.057 0.058 0.30	m ˆ 2 ′ β ˜ ( 0.69 )	0.003 0.056 0.057 0.98	0.010 0.052 0.058 0.98
m ˆ 1 ′ β ˜ Avg	0.038 0.053 0.053 0.71	0.058 0.053 0.054 0.47	m ˆ 2 ′ β ˜ Avg	0.004 0.059 0.060 0.96	0.008 0.058 0.061 0.95

Thus far, we examined the CI coverage at each evaluation point separately. For joint coverage across m evaluation points, the 95% joint coverage requires that the CI at each point has a higher confidence level. Solving ( 1 − α ′ ) m = 1 − α for α ′ with α = 0.05 gives about α ′ = 0.05 / m , allowing the confidence band (CB) across the m points to capture the true effect curve in 95 % of the trials. Figure 1 shows 50 CBs randomly selected from our 10 , 000 simulation runs when Y is continuous, the effect is linear, and the sample size is 5,000, using m = 20 equally spaced evaluation points over p ∈ [ 0.25 , 0.75 ] . For each estimator, only 1 ∼ 3 CBs do not capture the true line, resulting in a joint coverage of 94 ∼ 98 % .

$Figure 1 True effect curve and 50 CBs for continuous Y Y with linear effect and N = 5,000 N=\hspace{0.1em}\text{5,000}\hspace{0.1em} .$

Figure 1

True effect curve and 50 CBs for continuous Y with linear effect and N = 5,000 .

Overall, there exist trade-offs among the bias, SD, CI coverage, ease in implementation, and closeness of the asymptotic variance formula to the actual variance. We note that μ ˆ 1 performs reasonably well overall, whereas μ ˜ 1 exhibits a low performance. However, considering the trade-offs, we recommend the use of ( w ˆ 1 ′ β ˆ , m ˆ 1 ′ β ˜ ) for small samples and ( w ˆ 2 ′ β ˆ , m ˆ 2 ′ β ˜ ) for large samples, which are particularly easy to implement with only probit and OLS.

5 Empirical analysis

Our empirical analysis is for the effects of 401(k) retirement programs D on savings Y . Many studies have investigated whether contributions to tax-deferred retirement plans increase savings or simply crowd out other types of savings [10,28–32]. Since D is correlated with unobserved individual preferences for savings, Abadie [10] used the eligibility δ for 401(k) programs as an IV for D to overcome the endogeneity problem.

Because the eligibility for the programs is exogenously set, δ is unlikely to be correlated with the preferences for savings once we control for X such as the income. The IV exclusion restriction is plausible for δ , as δ is likely to affect savings only through the income. Since only the eligible persons can apply for a 401(k) account, monotonicity holds trivially, and the IV relevance condition is verified by the OLS of D on ( δ , X ) .

We use the same data as those used by Poterba et al. [29] and Abadie [10], derived from the Survey of Income and Program Participation for 1991. The observation unit is a household, and the sample is restricted to households with at least one member employed. Table 5 presents the sample mean (SD) of the variables, where Y is the net financial assets in $1,000. X consists of the household income in $1,000 (Inc), age, marital status (Mar), and household size (Hsize). N d is the group size for D = d . 39% of the households are eligible for the 401(k) programs ( δ = 1 ), and 72 % = 100 × ( 0.28 / 0.39 ) of them have D = 1 .

Table 5

Mean (SD) of variables ( N = 9 , 275 ; N 1 = 2 , 562 , N 0 = 6 , 713 )

	Pooled	D = 1	D = 0		Pooled	D = 1	D = 0
Y	19.1 (64)	38.5 (79.3)	11.7 (055.3)	Inc	39.3 (24.1)	49.8 (26.8)	35.2 (21.6)
				Age	41.1 (10.3)	41.5 (9.65)	41.9 (10.0)
D	0.28	1	0	Mar	0.63	0.69	0.60
δ	0.39	1	0.16	Hsize	2.89 (1.53)	2.92 (1.47)	2.87 (1.55)

Y : Net financial assets in $1,000; Inc: family income; Mar: marital status; Hsize: family size.

Before we examine the effect estimates in Table 6, we explain the steps implemented for facilitating the comparison of the estimators. Recall that μ ˜ 1 and M ˆ J ′ β ˜ are conditioned on λ X , whereas μ ˆ 1 and W ˆ J ′ β ˆ are conditioned on the linear indices. Although (14) is used to make W ˆ J ′ β ˆ comparable to μ ˜ 1 and M ˆ J ′ β ˜ , the same step cannot be implemented for μ ˆ 1 because the single index for μ ˆ 1 does not include an intercept. The only way to make μ ˆ 1 comparable to ( μ ˜ 1 , W ˆ J ′ β ˆ , M ˆ J ′ β ˜ ) is to condition μ ˆ 1 on Λ ˆ ( X ′ α ˆ ) , instead of X ′ α ˆ . However, since Λ ˆ ( X ′ α ˆ ) is N h -consistent, this modification requires the derivation of its asymptotic distribution anew, which is prohibitively complicated. Hence, we use bootstrapping (300 bootstrap repetitions) to do inference for μ ˆ 1 conditioned on Λ ˆ ( X ′ α ˆ ) . If a comparison with the other estimators is not required, conditioning μ ˆ 1 on X ′ α ˆ is adequate.

Table 6

Complier effects for financial assets in $1,000 ( N = 9 , 275 )

p = λ X :	0.2 (SE)	0.3 (SE)	0.4 (SE)	0.5 (SE)	0.6 (SE)
μ ˆ 1	4.37 (4.16)	4.56 (4.94)	9.11 (6.61)	∗ 15.9 (6.60)	+ 21.7 (15.0)
μ ˜ 1	5.77 (4.12)	+ 4.69 (2.85)	∗ ∗ 16.9 (3.04)	∗ ∗ 29.4 (5.52)	∗ 33.2 (13.1)
W ˆ 1 ′ β ˆ	6.38 (11.3)	+ 8.24 (4.80)	∗ ∗ 9.82 (1.70)	+ 11.3 (6.42)	12.8 (11.6)
W ˆ 2 ′ β ˆ	− 7.69 (30.7)	∗ 5.64 (2.30)	+ 14.7 (8.86)	∗ ∗ 21.3 (6.70)	∗ 26.1 (10.6)
M ˆ 1 ′ β ˜	5.76 (8.51)	∗ 7.69 (3.92)	∗ ∗ 9.62 (1.67)	∗ 11.6 (5.88)	13.5 (10.5)
M ˆ 2 ′ β ˜	− 5.11 (18.7)	∗ ∗ 5.96 (1.90)	∗ 14.7 (6.44)	∗ ∗ 21.0 (4.40)	∗ 25.0 (10.9)

∗ ∗ , ∗ , +: 1%, 5%, 10% significance levels; μ ˆ 1 & μ ˜ 1 : 1st & 2nd ratio estimators; W ˆ J ′ β ˆ & M ˆ J ′ β ˜ : power-approximation estimators conditioned on X ′ θ ˆ & Φ ( X ′ θ ˆ ) .

For the estimation, first, both age and age 2 are used as regressors in the probit of δ on X for M ˆ J ′ β ˜ . Notably, the use of age 2 introduces certain difficulties for μ ˆ 1 because using a functionally dependent regressor in μ ˆ 1 requires a complex identification assumption, as mentioned in relation to Theorem 2. Second, for K ( ⋅ ) of μ ˆ 1 and μ ˜ 1 , we use the simple N ( 0 , 1 ) kernel. The bounded quartic kernel is not used because the estimators crash if there are no observations around several evaluation points. According to our simulation study, the choice of the two kernels does not considerably affect the results. Third, the bandwidth h is set as h 0 ∈ [ 1 , 2 ] in h = h 0 S D ( λ ˆ X ) N − 1 / 5 through “eye-balling”; the bandwidth for Λ ˆ ( ⋅ ) is the rule-of-thumb bandwidth h Λ = S D { Λ ( X ′ α ˆ ) } N − 1 / 5 . Fourth, for W ˆ J ′ β ˆ and M ˆ J ′ β ˜ , we set J = 1 , 2 , and thus, six estimators are compared, as in our simulation study. The comparison is performed over λ X ∈ [ 0.2 , 0.6 ] , which contains most λ ˆ X values.

For Λ ˆ ( X ′ α ˆ ) in μ ˆ 1 , Inc, Age, and Mar are the significant variables, whereas Inc, Age, and Hsize are the significant variables for Φ ( X ′ θ ˆ ) , with the standard error (SE) in ( ⋅ ).

	Inc	Age	Age 2	Mar	Hsize
α ˆ in Λ ˆ ( X ′ α ˆ )	∗ ∗ 0.046 (0.011)	∗ ∗ 0.055 (0.018)	∗ −0.002 (0.001)	∗ ∗ −0.20 (0.055)	−0.034 (0.028)
θ ˆ in Φ ( X ′ θ ˆ )	∗ ∗ 0.014 (0.001)	∗ ∗ 0.039 (0.005)	∗ ∗ −0.001 (0.0001)	0.019 (0.037)	∗ ∗ −0.034 (0.011)

In the results of ( α ˆ , θ ˆ ) , the intercept is omitted, and ∗ ∗ and ∗ denote the 1% and 5% significance levels, respectively. To make α ˆ and θ ˆ comparable, we normalize α ˆ with ∣ α ˆ hsize ∣ and then multiply α ˆ by ∣ θ ˆ hsize ∣ , as explained at the end of Section 2.2. Income and age appear to be the two main variables driving the variation in IS.

In Table 6, all estimators show increasing effects of D across p ∈ [ 0.2 , 0.6 ] , which become significant for p ≥ 0.3 , except for μ ˆ 1 . Recalling δ ¯ ≃ 0.4 , the effect on CPs at p = 0.4 is 9–17, and the effect based on W ˆ 2 ′ β ˆ and M ˆ 2 ′ β ˜ is 14.7 ($14,700). Recall that W ˆ 2 ′ β ˆ and M ˆ 2 ′ β ˜ achieve the highest performance in our simulation study with large samples.

Table 6 shows no significant difference between μ ˆ 1 and μ ˜ 1 at p = 0.2 , 0.3 ; however, the estimates become much different for p ≥ 0.4 . For a given J , W ˆ J ′ β ˆ and M ˆ J ′ β ˜ are similar, but ( W ˆ 1 ′ β ˆ , M ˆ 1 ′ β ˜ ) differ much from ( W ˆ 2 ′ β ˆ , M ˆ 2 ′ β ˜ ) . With J = 1 , the effect increases gradually from about 6 at p = 0.2 to about 13 at p = 0.6 , but it becomes insignificant as p increases further. With J = 2 , the effect increases dramatically from − 8 ∼ − 5 to 25 ∼ 26 and is significant even at p = 0.6 . As p increases, W ˆ 2 ′ β ˆ and M ˆ 2 ′ β ˜ deviate from μ ˜ 1 that uses the same IS.

Figure 2 shows μ ˆ 1 , μ ˜ 1 , W ˆ 2 ′ β ˆ , and M ˆ 2 ′ β ˜ over 90% of the support points of IS. We omit W ˆ 1 ′ β ˆ and M ˆ 1 ′ β ˜ , as J = 2 is preferable over J = 1 in large samples. In Figure 2, μ ˆ 1 , W ˆ 2 ′ β ˆ , and M ˆ 2 ′ β ˜ show more or less monotonically increasing effects as p increases, whereas μ ˜ 1 is quadratic: μ ˜ 1 increases up to 36 at p = 0.58 and then decreases sharply to 6 at p = 0.67 . However, this decline is implausible, which might be due to the “boundary problem” in kernel estimators. Overall, recalling the BMI example in the introduction, we can state that the 401(k) plan effect on the savings is positive for those with IS greater than approximately 0.3.

$Figure 2 E ( Y 1 ‒ Y 0 ∣ CP , λ x = p ) E\left({Y}^{1}‒{Y}^{0}| \hspace{0.33em}\hspace{0.1em}\text{CP}\hspace{0.1em},{\lambda }_{x}=p) versus p p .$

Figure 2

E ( Y 1 ‒ Y 0 ∣ CP , λ x = p ) versus p .

It is puzzling why μ ˜ 1 yields considerably different results from M ˆ 2 ′ β ˜ , even though both estimators use λ X = Φ ( X ′ θ ) . For this, note that kernel estimation is a local nonparametric method, whereas power approximation is global. Hence, the former approach is superior when the focus is on an evaluation point, whereas the latter approach is superior when the focus is on the shape of a curve. In the former, observations far from the evaluation point are irrelevant, whereas all observations are relevant for all evaluation points in the latter. This aspect leads to adverse results for the kernel method when it does not work well at a chosen evaluation point. For example, if the monotonicity condition is violated at the point, then the kernel estimate would be unsatisfactory. In contrast, because the power approximation method is global, the effect estimates at other points can help mitigate the poor local estimate. Owing to this property, as well as the more significant estimates in Table 6, M ˆ 2 ′ β ˜ and W ˆ 2 ′ β ˆ are the preferred estimators in this empirical analysis.

The benefits of the global nonparametric approach versus the local approach can be further highlighted. The inverted matrix for M ˆ 2 ′ β ˜ is essentially E { E ( D 1 − D 0 ∣ λ X ) ( 1 − λ X ) λ X M M ′ } as (12) shows, whereas the denominator of μ ˜ 1 is essentially E ( D 1 − D 0 ∣ λ X ) = P ( CP ∣ λ X ) as (3) shows. Consequently, if P ( CP ∣ λ X ) approaches zero at some λ X , μ ˜ 1 would suffer, whereas M ˆ 2 ′ β ˜ would suffer less because it uses the averaged version of P ( CP ∣ λ X ) .

To examine how informative E ( Y 1 − Y 0 ∣ CP , λ X ) is, Figure 3 shows the plots of E ( Y 1 − Y 0 ∣ CP , Inc i ) versus Inc i for all i = 1 , … , N . The first row is for μ ˆ 1 and μ ˜ 1 , and the second row is for W ˆ 2 ′ β ˆ and M ˆ 2 ′ β ˜ . Figure 3, which is based on the income, reveals increasing patterns similar to those shown in Figure 2. This trend suggests that the λ X -heterogeneous effect in Figure 2 is mostly driven by the income. We also tried other covariates, but could not find any informative pattern. In practice, our estimators can be tried initially, and if effect heterogeneity is found, then a more extensive analysis with X can be performed.

$Figure 3 E( Y 1 ‒ Y 0 ∣ {Y}^{1}‒{Y}^{0}| CP, income).$

Figure 3

E( Y 1 ‒ Y 0 ∣ CP, income).

The effect of a covariate can be found using the graph for W ′ β ˆ (Figure 2). For example, when X 2 increases by one unit at X ′ θ ˆ = s (i.e., at W ′ β ˆ = ( 1 , s , … , s J ) β ˆ ), W ′ β ˆ increases by θ ˆ 2 times { 0 , 1 , 2 ( X ′ θ ˆ ) , … , J ( X ′ θ ˆ ) J − 1 } β ˆ , and this effect can be found in Figure 2 by moving to the right from ( 1 , s , … , s J ) β ˆ by θ ˆ 2 ⋅ { 0 , 1 , 2 ( X ′ θ ˆ ) , … , J ( X ′ θ ˆ ) J − 1 } β ˆ and then comparing the values of the graph at the two positions.

Specifically, when the income increases by one unit ( $ 1 , 000 ) at the mean of X ¯ ′ θ ˆ ≈ − 0.287 , the IS Φ ( X ¯ ′ θ ˆ ) hardly changes from 0.39, as the change is only from 0.387 to 0.392. Then, because θ ˆ i n c = 0.014 , W 2 ′ β ˆ increases by 0.014 × { 0 , 1 , − 0.574 } β ˆ ≃ 0.53 . Hence, the effect of the increase in the income of $ 1 , 000 on Y is an increase of $530 for the CPs with the IS of approximately 0.39. This value is considerably smaller than the “naive income effect” observed in Table 5: the mean group difference of Y divided by the mean group difference of the income is ( 38.5 − 11.7 ) / ( 49.8 − 35.2 ) = 1.84 .

6 Conclusion

For a binary treatment D , an outcome Y , and covariates X , denoting the potential responses as ( Y 0 , Y 1 ) for D = 0 , 1 , the treatment effect heterogeneity (i.e., E ( Y 1 − Y 0 ∣ X ) not being a constant) is a rule rather than an exception. However, when X is high-dimensional, the nonparametric estimation of E ( Y 1 − Y 0 ∣ X ) runs into the well-known dimension problem. In the PS literature, E ( Y 1 − Y 0 ∣ P S ) has been estimated instead to overcome the dimension problem under the D -exogeneity “ D ∐ ( Y 0 , Y 1 ) ∣ X .”

When D is endogenous/confounded, however, PS matching and other estimators requiring D -exogeneity cannot be used, and at least a binary instrument (e.g., δ ) is needed to overcome the problem. In this article, defining the potential treatments ( D 0 , D 1 ) for δ = 0 , 1 and “CP” as those with ( D 0 = 0 , D 1 = 1 ) , we showed that the role played by PS for exogenous D can be played by the “IS” λ X ≡ E ( δ ∣ X ) for endogenous D with a non-randomized δ . The IS becomes PS for exogenous D because δ = D .

The dimension reduction achieved by PS for exogenous D cannot be realized by an arbitrary function of X . Similarly, the dimension reduction achieved by IS for endogenous D cannot be realized by an arbitrary function of X . By identifying and estimating E ( Y 1 − Y 0 ∣ CP , λ X ) conditioned only on the scalar λ X , we capture the effect heterogeneity while avoiding the dimension problem. The heterogeneity captured by λ X is minimal, in the sense that λ X is the “coarsest balancing score” ( δ ∐ X ∣ λ X ). Since the endogenous D becomes exogenous for CPs because D = δ , and since it is likely that D = 1 when Y 1 − Y 0 is positive, the IS with D = δ can capture the effect heterogeneity well.

We proposed three estimators for E ( Y 1 − Y 0 ∣ CP , λ X ) , motivated by two critical RF equations that are linear in either D or δ , even though no explicit linearity assumption is imposed. The equations and our three estimators hold for any form of Y (continuous, binary, count, …), as long as Y 1 − Y 0 makes sense. The three estimators require progressively more restrictive assumptions to enhance their applicability.

The first estimator is a kernel nonparametric estimator based on a single index estimator for λ X that enables the use of an unknown link function. The second estimator is the same as the first estimator, except for the assumption of the probit link for λ X . The third estimator is an IVE formulated after approximating E ( Y 1 − Y 0 ∣ CP , λ X ) with a power function of λ X . Since we take the power approximation to be exact, the third estimator is N -consistent, whereas the first two estimators converge at a slower rate. Among the three estimators, we recommend the third estimator because it is easy to implement with only OLS and probit, numerically stable, and not subject to the “excessively small denominator problem” inherent in the first two ratio estimators.

We presented an empirical illustration for the effects of 401(k) retirement programs ( D ) on savings ( Y ) with the eligibility δ for the programs as the IV. Our main finding is that the households with the IS greater than approximately 0.3 would increase their savings due to D , whereas those with the IS smaller than 0.3 would not. This kind of finding could be useful when D is a drug whose administration is self-selected by individuals (and is thus endogenous). If there exists an education/encouragement δ based on X for the possible benefits of the drug, we can find an analogous cutoff to prepare an easy-to-follow guideline that the drug would benefit those with the IS greater than the cutoff.

Acknowledgment

The authors are grateful to two anonymous reviewers for their helpful comments.

Funding information: Jin-young Choi’s research has been supported by the Hankuk University of Foreign Studies Research Fund of 2023. Myoung-jae Lee’s research has been supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (2020R1A2C1A01007786).
Author contributions: All authors have accepted responsibility for the entire content of this manuscript and approved its submission.
Conflict of interest: The authors state no conflict of interest.
Ethical approval: The conducted research is not related to either human or animals use.
Data availability statement: The dataset analyzed during this study is from “Introductory Econometrics: A Modern Approach, 7e” by Jeffrey M. Wooldridge. The dataset named “401ksubs” is available in the textbook e-learning resource repository, https://www.cengage.com/cgi-wadsworth/course_products_wp.pl?fid=M20b&product_isbn_issn=9781111531041.

Appendix A Proofs

A.1 Proof for (5) and balancing score property of IS

Proof

Under δ ∐ ( D 0 , D 1 , Y 0 , Y 1 ) ∣ X , observe

E ( δ ∣ D 0 , D 1 , Y 0 , Y 1 , λ X ) = E { E ( δ ∣ D 0 , D 1 , Y 0 , Y 1 , X ) ∣ D 0 , D 1 , Y 0 , Y 1 , λ X } = E { E ( δ ∣ X ) ∣ D 0 , D 1 , Y 0 , Y 1 , λ X } = E ( λ X ∣ D 0 , D 1 , Y 0 , Y 1 , λ X ) = λ X = E ( δ ∣ λ X ) .

The first and last expressions prove that δ is mean-independent of ( D 0 , D 1 , Y 0 , Y 1 ) given λ X , but since δ is binary, the mean-independence is the same as (4)(i) to prove (5).

IS λ X ≡ E ( δ ∣ X ) is a balancing score because, for any fixed t ,

E ( δ 1 [ X ≤ t ] ∣ λ X ) = E { E ( δ 1 [ X ≤ t ] ∣ X ) ∣ λ X } = E { E ( δ ∣ X ) 1 [ X ≤ t ] ∣ λ X } = λ X P ( X ≤ t ∣ λ X ) = E ( δ ∣ λ X ) ⋅ P ( X ≤ t ∣ λ X ) ,

take E ( ⋅ ∣ λ X ) on λ X ≡ E ( δ ∣ X ) to see λ X = E ( δ ∣ λ X ) . Dividing the first and last expressions with E ( δ ∣ λ X ) , for which 0 < λ X < 1 is assumed, gives P ( X ≤ t ∣ δ = 1 , λ X ) = P ( X ≤ t ∣ λ X ) . Since the class of sets { X ≤ t , t ∈ (real space) } is a “probability determining class,” the distribution of X is the same across the δ = 0 , 1 groups given λ X .□

A.2 Proof for Theorem 1 regarding (2)

Proof

With D = ( D 1 − D 0 ) δ + D 0 , due to (4)(i) and (ii), we have

(A1) E ( D ∣ δ , λ X ) = E ( D 1 − D 0 ∣ δ , λ X ) δ + E ( D 0 ∣ δ , λ X ) = E ( D 1 − D 0 ∣ λ X ) δ + E ( D 0 ∣ λ X ) = P ( D 1 = 1 , D 0 = 0 ∣ λ X ) δ + E ( D 0 ∣ λ X ) = P ( CP ∣ λ X ) δ + E ( D 0 ∣ λ X ) .

The first and last expressions reveal that E ( D 0 ∣ λ X ) is the λ X -conditional intercept, and P ( CP ∣ λ X ) is the λ X -conditional slope of δ to show the effect of δ on D given λ X .

Take E ( ⋅ ∣ δ , λ X ) on Y = ( Y 1 − Y 0 ) { ( D 1 − D 0 ) δ + D 0 } + Y 0 : due to (4)(i) and (ii),

(A2) E ( Y ∣ δ , λ X ) = E [ ( Y 1 − Y 0 ) { ( D 1 − D 0 ) δ + D 0 } ∣ δ , λ X ] + E ( Y 0 ∣ δ , λ X ) = E { ( Y 1 − Y 0 ) ( D 1 − D 0 ) ∣ δ , λ X } δ + E { ( Y 1 − Y 0 ) D 0 + Y 0 ∣ δ , λ X } = E { ( Y 1 − Y 0 ) ( D 1 − D 0 ) ∣ λ X } δ + E { ( Y 1 − Y 0 ) D 0 + Y 0 ∣ λ X } = E ( Y 1 − Y 0 ∣ CP , λ X ) P ( CP ∣ λ X ) δ + E { ( Y 1 − Y 0 ) D 0 + Y 0 ∣ λ X } = μ 1 ( λ X ) P ( CP ∣ λ X ) δ + E { ( Y 1 − Y 0 ) D 0 + Y 0 ∣ λ X } .

Define U 2 ≡ Y − E ( Y ∣ δ , λ X ) ⇔ E ( Y ∣ δ , λ X ) = Y − U 2 ⇒ E ( U 2 ∣ δ , λ X ) = 0 to rewrite the first and last expressions of (A2) as (2).□

A.3 Proof for Theorem 2 (first ratio estimator)

Proof

Under the assumptions in Theorem 2, α ˆ is N -consistent [23, Theorem 5.2]. Hence, using α ˆ in μ ˆ 1 ( s ) is as good as using α , and thus, the following deals with the asymptotic distribution for μ ˆ 1 ( s ) with α known.

(1) Linearization

The linearization of μ ˆ 1 ( s ) to be used is

(A3) b ˆ ( s ) a ˆ ( s ) − b ( s ) a ( s ) = − b ( s ) a ( s ) 2 { a ˆ ( s ) − a ( s ) } + 1 a ( s ) { b ˆ ( s ) − b ( s ) } + o p 1 N h = − b ( s ) a ( s ) 2 [ a ˆ 1 ( s ) − a 1 ( s ) − { a ˆ 0 ( s ) − a 0 ( s ) } ] + 1 a ( s ) [ { b ˆ 1 ( s ) − b 1 ( s ) } − { b ˆ 0 ( s ) − b 0 ( s ) } ] + o p 1 N h ;

suppress o p { ( N h ) − 1 / 2 } henceforth. Apply this linearization also to a ˆ j ( s ) and b ˆ j ( s ) :

a ˆ j ( s ) − a j ( s ) = a ˆ j d ( s ) c ˆ j ( s ) − a j d ( s ) c j ( s ) = − a j d ( s ) c j ( s ) 2 { c ˆ j ( s ) − c j ( s ) } + 1 c j ( s ) { a ˆ j d ( s ) − a j d ( s ) } , b ˆ j ( s ) − b j ( s ) = b ˆ j y ( s ) c ˆ j ( s ) − b j y ( s ) c j ( s ) = − b j y ( s ) c j ( s ) 2 { c ˆ j ( s ) − c j ( s ) } + 1 c j ( s ) { b ˆ j y ( s ) − b j y ( s ) } ,

where

a ˆ j d ( s ) ≡ 1 N j h ∑ i ∈ G j K { ( S i − s ) / h } D i , a j d ( s ) ≡ E ( D ∣ s , δ = j ) f S j ( s ) , b ˆ j y ( s ) ≡ 1 N j h ∑ i ∈ G j K { ( S i − s ) / h } Y i , b j y ( s ) ≡ E ( Y ∣ s , δ = j ) f S j ( s ) , c ˆ j ( s ) ≡ 1 N j h ∑ i ∈ G j K { ( S i − s ) / h } , c j ( s ) ≡ f S j ( s ) .

Substitute the linearizations for a ˆ j ( s ) − a j ( s ) and b ˆ j ( s ) − b j ( s ) into (A3) to obtain

b ˆ ( s ) a ˆ ( s ) − b ( s ) a ( s ) = − b ( s ) a ( s ) 2 − a 1 d ( s ) c 1 ( s ) 2 { c ˆ 1 ( s ) − c 1 ( s ) } + 1 c 1 ( s ) { a ˆ 1 d ( s ) − a 1 d ( s ) } + a 0 d ( s ) c 0 ( s ) 2 { c ˆ 0 ( s ) − c 0 ( s ) } − 1 c 0 ( s ) { a ˆ 0 d ( s ) − a 0 d ( s ) } + 1 a ( s ) − b 1 y ( s ) c 1 ( s ) 2 { c ˆ 1 ( s ) − c 1 ( s ) } + 1 c 1 ( s ) { b ˆ 1 y ( s ) − b 1 y ( s ) } + b 0 y ( s ) c 0 ( s ) 2 { c ˆ 0 ( s ) − c 0 ( s ) } − 1 c 0 ( s ) { b ˆ 0 y ( s ) − b 0 y ( s ) } .

Rewrite the preceding display by collecting terms, and then multiply by N h :

(A4) N h b ˆ ( s ) a ˆ ( s ) − b ( s ) a ( s ) = − b ( s ) a ( s ) 2 1 c 1 ( s ) N 1 h { a ˆ 1 d ( s ) − a 1 d ( s ) } π 1 + b ( s ) a ( s ) 2 1 c 0 ( s ) N 0 h { a ˆ 0 d ( s ) − a 0 d ( s ) } π 0 + 1 a ( s ) 1 c 1 ( s ) N 1 h { b ˆ 1 y ( s ) − b 1 y ( s ) } π 1 − 1 a ( s ) 1 c 0 ( s ) N 0 h { b ˆ 0 y ( s ) − b 0 y ( s ) } π 0 + b ( s ) a ( s ) 2 a 1 d ( s ) c 1 ( s ) 2 − 1 a ( s ) b 1 y ( s ) c 1 ( s ) 2 N 1 h { c ˆ 1 ( s ) − c 1 ( s ) } π 1 − b ( s ) a ( s ) 2 a 0 d ( s ) c 0 ( s ) 2 − 1 a ( s ) b 0 y ( s ) c 0 ( s ) 2 N 0 h { c ˆ 0 ( s ) − c 0 ( s ) } π 0 .

The right side has six terms, which gives six asymptotic variances. Also, the three terms sharing the same subscript 1 are correlated with each other to give three covariances, and the three terms sharing the same subscript 0 also give three covariances. Hence, the asymptotic variance of N h { μ ˆ 1 ( s ) − μ 1 ( s ) } consists of 12 terms. We present some preliminaries next, and then turn to the 12 terms.

(2) Preliminaries with κ ≡ ∫ K ( t ) 2 d t

With ∫ K ( t ) t d t = 0 and the twice continuous differentiability of E ( Y ∣ s , δ ) and f S j ( s ) ,

E 1 h K S − s h Y ∣ δ = j = ∫ 1 h K t − s h E ( Y ∣ t , δ = j ) f S j ( t ) d t = ∫ K ( v ) E ( Y ∣ s + h v , δ = j ) f S j ( s + h v ) d v = E ( Y ∣ s , δ = j ) f S j ( s ) + O ( h 2 ) ; E 1 h K S − s h 2 Y 2 ∣ δ = j = ∫ 1 h K t − s h 2 E ( Y 2 ∣ t , δ = j ) f S j ( t ) d t = ∫ K ( v ) 2 E ( Y 2 ∣ s + h v , δ = j ) f S j ( s + h v ) d v = E ( Y 2 ∣ s , δ = j ) f S j ( s ) κ + O ( h 2 ) .

Analogous expressions hold when Y is replaced by D or Y D . Observe ( E 2 ( ⋅ ) ≡ { E ( ⋅ ) } 2 )

1 N j ∑ i ∈ G j 1 h K S i − s h Y i − E 1 h K S i − s h Y ∣ δ = j = 1 N j ∑ i ∈ G j 1 h K S i − s h Y i − E 1 h K S i − s h Y ∣ δ = j ;

E 1 h K S − s h Y − E 1 h K S − s h Y ∣ δ = j 2 ∣ δ = j = E 1 h 2 K S − s h 2 Y 2 ∣ δ = j − E 2 1 h K S − s h Y ∣ δ = j = h − 1 E ( Y 2 ∣ s , δ = j ) f S j ( s ) ⋅ κ + O ( h ) − { E ( Y ∣ s , δ = j ) f S j ( s ) + O ( h 2 ) } 2 .

Hence, invoking the Lindeberg CLT for triangular arrays,

N j h ∑ i ∈ G j 1 h K S i − s h Y i − E 1 h K S i − s h Y ∣ δ = j ⟶ d N { 0 , E ( Y 2 ∣ s , δ = j ) f S j ( s ) κ } .

An analogous result holds with Y replaced by D , and the asymptotic covariance between the two normalized sums with Y and D is

(A5) E 1 h K S − s h 2 Y D ∣ δ = j = E ( Y D ∣ s , δ = j ) f S j ( s ) κ + O ( h 2 ) .

(3) Variances

Because of

N j h { a ˆ j d ( s ) − a j d ( s ) } ⟶ d N { 0 , E ( D ∣ s , δ = j ) f S j ( s ) κ } ,

the variance of the first and second terms in (A4) is, respectively,

(A6) b ( s ) 2 a ( s ) 4 1 c 1 ( s ) 2 π 1 E ( D ∣ s , δ = 1 ) f S 1 ( s ) κ = b ( s ) 2 a ( s ) 4 1 f S 1 ( s ) π 1 E ( D ∣ s , δ = 1 ) κ ≡ V 1 , b ( s ) 2 a ( s ) 4 1 c 0 ( s ) 2 π 0 E ( D ∣ s , δ = 0 ) f S 0 ( s ) κ = b ( s ) 2 a ( s ) 4 1 f S 0 ( s ) π 0 E ( D ∣ s , δ = 0 ) κ ≡ V 2 .

Because of

N j h { b ˆ j y ( s ) − b j y ( s ) } ⟶ d N { 0 , E ( Y 2 ∣ s , δ = j ) f S j ( s ) κ } ,

the variance of the third and fourth terms in (A4) is, respectively,

1 a ( s ) 2 1 c 1 ( s ) 2 π 1 E ( Y 2 ∣ s , δ = 1 ) f S 1 ( s ) κ = 1 a ( s ) 2 1 f S 1 ( s ) π 1 E ( Y 2 ∣ s , δ = 1 ) κ ≡ V 3 , 1 a ( s ) 2 1 c 0 ( s ) 2 π 0 E ( Y 2 ∣ s , δ = 0 ) f S 0 ( s ) κ = 1 a ( s ) 2 1 f S 0 ( s ) π 0 E ( Y 2 ∣ s , δ = 0 ) κ ≡ V 4 .

Because of N j h { c ˆ j ( s ) − c j ( s ) } ⟶ d N { 0 , f S j ( s ) κ } , the variance of the fifth and sixth terms is, respectively,

b ( s ) a ( s ) 2 a 1 d ( s ) c 1 ( s ) 2 − 1 a ( s ) b 1 y ( s ) c 1 ( s ) 2 2 1 π 1 f S 1 ( s ) κ ≡ V 5 , b ( s ) a ( s ) 2 a 0 d ( s ) c 0 ( s ) 2 − 1 a ( s ) b 0 y ( s ) c 0 ( s ) 2 2 1 π 0 f S 0 ( s ) κ ≡ V 6 .

(4) Covariances

For the covariance between the first and third terms, we need the expected value of the product between N 1 h { a ˆ 1 d ( s ) − a 1 d ( s ) } and N 1 h { b ˆ 1 y ( s ) − b 1 y ( s ) } , where only the overlapping N 1 terms are non-zero whose expected value is E ( Y D ∣ s , δ = 1 ) f S 1 ( s ) κ as (A5) shows. Hence, the desired covariance is

− b ( s ) a ( s ) 2 1 c 1 ( s ) ⋅ 1 a ( s ) 1 c 1 ( s ) π 1 E ( Y D ∣ s , δ = 1 ) f S 1 ( s ) κ = − b ( s ) a ( s ) 3 1 f S 1 ( s ) π 1 E ( Y D ∣ s , δ = 1 ) κ ≡ C 1 .

For the covariance between the first and fifth terms, we need the expected value of the product between N 1 h { a ˆ 1 d ( s ) − a 1 d ( s ) } and N 1 h { c ˆ 1 ( s ) − c 1 ( s ) } , where only the overlapping N 1 terms are non-zero whose expected value is E ( D ∣ s , δ = 1 ) f S 1 ( s ) κ . Hence, the desired covariance is

− b ( s ) a ( s ) 2 b ( s ) a ( s ) 2 a 1 d ( s ) c 1 ( s ) 2 − 1 a ( s ) b 1 y ( s ) c 1 ( s ) 2 1 π 1 E ( D ∣ s , δ = 1 ) κ ≡ C 2 .

Analogously, for the covariance between the third and fifth terms, we need the expected value of the product between N 1 h { b ˆ 1 y ( s ) − b 1 y ( s ) } and N 1 h { c ˆ 1 ( s ) − c 1 ( s ) } , where only the overlapping N 1 terms are non-zero whose expected value is E ( Y ∣ s , δ = 1 ) f S 1 ( s ) κ . Hence, the desired covariance is

1 a ( s ) ⋅ b ( s ) a ( s ) 2 a 1 d ( s ) c 1 ( s ) 2 − 1 a ( s ) b 1 y ( s ) c 1 ( s ) 2 1 π 1 E ( Y ∣ s , δ = 1 ) κ ≡ C 3 .

As for the three covariance terms involving the three terms with the subscript 0, the terms analogous to C 1 , C 2 , and C 3 are, respectively,

(A7)□ − b ( s ) a ( s ) 3 1 f S 0 ( s ) π 0 E ( Y D ∣ s , δ = 0 ) κ ≡ C 4 , − b ( s ) a ( s ) 2 ⋅ b ( s ) a ( s ) 2 a 0 d ( s ) c 0 ( s ) 2 − 1 a ( s ) b 0 y ( s ) c 0 ( s ) 2 1 π 0 E ( D ∣ s , δ = 0 ) κ ≡ C 5 , 1 a ( s ) ⋅ b ( s ) a ( s ) 2 a 0 d ( s ) c 0 ( s ) 2 − 1 a ( s ) b 0 y ( s ) c 0 ( s ) 2 1 π 0 E ( Y ∣ s , δ = 0 ) κ ≡ C 6 .

A.4 Proof for (9)

Proof

Take E ( ⋅ ∣ λ X ) on λ X ≡ E ( δ ∣ X ) to see λ X = E ( δ ∣ λ X ) , which implies p = E ( δ ∣ λ X = p ) . Now, rewrite the ratio in (3) at λ X = p as:

□ E ( Y δ ∣ λ X = p ) / E ( δ ∣ λ X = p ) − [ E { Y ( 1 − δ ) ∣ λ X = p } / { 1 − E ( δ ∣ λ X = p ) } ] E ( D δ ∣ λ X = p ) / E ( δ ∣ λ X = p ) − [ E { D ( 1 − δ ) ∣ λ X = p } / { 1 − E ( δ ∣ λ X = p ) } ] = E [ Y ( δ / p ) − Y { ( 1 − δ ) / ( 1 − p ) } ∣ λ X = p ] E [ D ( δ / p ) − D { ( 1 − δ ) / ( 1 − p ) } ∣ λ X = p ] = E [ { Y δ ( 1 − p ) − Y ( 1 − δ ) p } / { p ( 1 − p ) } ∣ λ X = p ] E [ { D δ ( 1 − p ) − D ( 1 − δ ) p } / { p ( 1 − p ) } ∣ λ X = p ] = E { Y δ ( 1 − p ) − Y ( 1 − δ ) p ∣ λ X = p } E { D δ ( 1 − p ) − D ( 1 − δ ) p ∣ λ X = p } = E { Y ( δ − p ) ∣ λ X = p } E { D ( δ − p ) ∣ λ X = p } .

A.5 Proof for Theorem 3 (second ratio estimator)

Proof

Define a ˆ ( p ) and b ˆ ( p ) so that μ ˜ 1 ( p ) = b ˆ ( p ) / a ˆ ( p ) :

a ˆ ( p ) ≡ 1 N h ∑ i K P i − p h A i and b ˆ ( p ) ≡ 1 N h ∑ i K P i − p h B i .

Also, define their estimands: a ( p ) ≡ E ( A ∣ p ) f λ ( p ) and b ( p ) ≡ E ( B ∣ p ) f λ ( p ) . The linearization analogous to (A3) holds to give

N h b ˆ ( p ) a ˆ ( p ) − b ( p ) a ( p ) = − b ( p ) a ( p ) 2 N h { a ˆ ( p ) − a ( p ) } + 1 a ( p ) N h { b ˆ ( p ) − b ( p ) } + o p ( 1 ) .

The asymptotic variance from the two terms on the right side is

□ b ( p ) 2 a ( p ) 4 E ( A 2 ∣ p ) + 1 a ( p ) 2 E ( B 2 ∣ p ) − 2 b ( p ) a ( p ) 3 E ( A B ∣ p ) ⋅ κ f λ ( p ) = E 2 ( B ∣ p ) f λ ( p ) 2 E 4 ( A ∣ p ) f λ ( p ) 4 E ( A 2 ∣ p ) + 1 E 2 ( A ∣ p ) f λ ( p ) 2 E ( B 2 ∣ p ) − 2 E ( B ∣ p ) f λ ( p ) E 3 ( A ∣ p ) f λ ( p ) 3 E ( A B ∣ p ) κ f λ ( p ) = E 2 ( B ∣ p ) E 4 ( A ∣ p ) f λ ( p ) E ( A 2 ∣ p ) + 1 E 2 ( A ∣ p ) f λ ( p ) E ( B 2 ∣ p ) − 2 E ( B ∣ p ) E 3 ( A ∣ p ) f λ ( p ) E ( A B ∣ p ) κ = κ f λ ( p ) E 4 ( A ∣ p ) { E ( A 2 ∣ p ) E 2 ( B ∣ p ) + E ( B 2 ∣ p ) E 2 ( A ∣ p ) − 2 E ( A ∣ p ) E ( B ∣ p ) E ( A B ∣ p ) } .

A.6 Proof for Theorem 4

Proof

With μ 0 ( λ X ) defined in (1), rewrite the E ( Y ∣ δ , λ X ) equation in (A2) as:

(A8) E ( Y ∣ δ , λ X ) = μ 1 ( λ X ) P ( CP ∣ λ X ) δ + μ 0 ( λ X ) + μ 1 ( λ X ) E ( D 0 ∣ λ X ) = μ 1 ( λ X ) D + μ 0 ( λ X ) + μ 1 ( λ X ) { P ( CP ∣ λ X ) δ + E ( D 0 ∣ λ X ) − D } .

Define ζ ≡ Y − E ( Y ∣ δ , λ X ) ⇔ E ( Y ∣ δ , λ X ) = Y − ζ ⇒ E ( ζ ∣ δ , λ X ) = 0 to have

Y = μ 1 ( λ X ) D + μ 0 ( λ X ) + μ 1 ( λ X ) { P ( CP ∣ λ X ) δ + E ( D 0 ∣ λ X ) − D } + ζ = μ 1 ( λ X ) D + μ 0 ( λ X ) + U 1 , U 1 ≡ μ 1 ( λ X ) { P ( CP ∣ λ X ) δ + E ( D 0 ∣ λ X ) − D } + ζ .

E ( U 1 ∣ δ , λ X ) = 0 , as E ( ζ ∣ δ , λ X ) = 0 and E ( D ∣ δ , λ X ) = P ( CP ∣ λ X ) δ + E ( D 0 ∣ λ X ) in (A1).

Define ( D 0 = 1 , D 1 = 1 ) as “always takers,” and ( D 0 = 0 , D 1 = 0 ) as “never takers”; (4)(ii) rules out “defiers” ( D 0 = 1 , D 1 = 0 ) . With D = ( 1 − δ ) D 0 + D 1 δ , we have

Cov ( δ , D ∣ λ X ) = E ( δ D ∣ λ X ) − λ X E ( D ∣ λ X ) = E ( δ D 1 ∣ λ X ) − λ X E { ( 1 − δ ) D 0 + D 1 δ ∣ λ X } = E ( δ D 1 ∣ λ X ) ( 1 − λ X ) − E { ( 1 − δ ) D 0 ∣ λ X } λ X = P ( D 1 = 1 ∣ δ = 1 , λ X ) λ X ( 1 − λ X ) − P ( D 0 = 1 ∣ δ = 0 , λ X ) ( 1 − λ X ) λ X = { P ( always taker, complier ∣ λ X ) − P ( always taker, defier ∣ λ X ) } ( 1 − λ X ) λ X = P ( complier ∣ λ X ) ( 1 − λ X ) λ X = E ( D 1 − D 0 ∣ λ X ) ( 1 − λ X ) λ X > 0 (defiers ruled out).

Using the first and last expressions gives

□ E { ( δ − λ X ) DMM ′ } = E { Cov ( δ , D ∣ λ X ) M M ′ } = E { E ( D 1 − D 0 ∣ λ X ) ( 1 − λ X ) λ X M M ′ } .

A.7 Proof for Theorem 5

Proof

Let θ be the true value in the parameter space A θ whose generic element is a . Define

W ( θ ) ≡ { 1 , ( X ′ θ ) , ( X ′ θ ) 2 , … , ( X ′ θ ) J } ′ so that ∂ W ( θ ) ∂ a ′ = ∇ W ⋅ X ′ .

The IV for D W ( θ ˆ ) is ε ˆ W ( θ ˆ ) , and the IVE β ˆ satisfies

1 N ∑ i m ( θ ˆ , β ˆ , γ ˆ ) = 0 , m ( a , b , g ) = { Y − W ( a ) ′ g − D W ( a ) ′ b } { δ − Φ ( X ′ a ) } W ( a ) ,

where a is for θ , b is for β , and g is for γ . Taylor-expand the moment condition (times N ) around β to obtain, for some β ˆ ∗ ∈ ( β ˆ , β ) ,

(A9) 0 = 1 N ∑ i m ( θ ˆ , β , γ ˆ ) + 1 N ∑ i ∂ m ( θ ˆ , β ˆ ∗ , γ ˆ ) ∂ b ′ N ( β ˆ − β ) .

Solve this for N ( β ˆ − β ) and then further Taylor-expand m ( θ ˆ , β , γ ˆ ) around m ( θ , β , γ ) :

(A10) N ( β ˆ − β ) = − E − 1 ∂ m ( θ , β , γ ) ∂ b ′ 1 N ∑ i m ( θ , β , γ ) + E { ∂ m ( θ , β , γ ) ∂ a ′ } N ( θ ˆ − θ ) + E { ∂ m ( θ , β , γ ) ∂ g ′ } N ( γ ˆ − γ ) + o p ( 1 ) → d N ( 0 , Ω 1 ) ,

where

Ω 1 = E − 1 ∂ m ( θ , β , γ ) ∂ b ′ ⋅ E ( η 1 η 1 ′ ) ⋅ E − 1 { ∂ m ( θ , β , γ ) ∂ b } , η 1 ≡ m ( θ , β , γ ) + E ∂ m ( θ , β , γ ) ∂ a ′ η θ ˆ + E ∂ m ( θ , β , γ ) ∂ g ′ η γ ˆ ,

and ( η θ ˆ , η γ ˆ ) are the influence functions for ( θ ˆ , γ ˆ ) .

As m ( a , b , g ) = { Y − W ( a ) ′ g − D W ( a ) ′ b } { δ − Φ ( X ′ a ) } W ( a ) and ∂ W ( θ ) / ∂ a ′ = ∇ W X ′ ,

∂ m ( θ , β , γ ) ∂ a ′ = − ( ∇ W ′ γ + D ∇ W ′ β ) ε W X ′ − V ϕ ( X ′ θ ) W X ′ + V ε ∇ W X ′ , ∂ m ( θ , β , γ ) ∂ b ′ = − ε DWW ′ , ∂ m ( θ , β , γ ) ∂ g ′ = − ε W W ′ .

E { ∂ m ( θ , β , γ ) / ∂ g ′ } = 0 due to E ( ε ∣ X ) = E ( δ − λ X ∣ X ) = 0 . As for E { ∂ m ( θ , β , γ ) / ∂ a ′ } , we have E ( ∇ W ′ γ ε W X ′ ) = 0 . Hence, the asymptotic variance is

E − 1 ( ε DWW ′ ) E ( η 1 η 1 ′ ) E − 1 ( ε DWW ′ ) , η 1 ≡ V ε W − E { D ∇ W ′ β ε W X ′ + V ϕ ( X ′ θ ) W X ′ − V ε ∇ W X ′ } η θ ˆ .□

A.8 Proof for Theorem 6

Proof

The proof for Theorem 6 is almost the same as that for Theorem 5. Define

M ( θ ) ≡ { 1 , Φ ( X ′ θ ) , Φ ( X ′ θ ) 2 , … , Φ ( X ′ θ ) J } ′ so that ∂ M ( θ ) ∂ a ′ = ∇ M ⋅ X ′ .

The IV for D M ˆ is ε ˆ M ˆ , and the IVE β ˜ satisfies

1 N ∑ i m ( θ ˆ , β ˜ , γ ˜ ) = 0 , m ( a , b , g ) = { Y − M ( a ) ′ g − D M ( a ) ′ b } { δ − Φ ( X ′ a ) } M ( a ) .

(A9) and (A10) hold with ( β ˆ , γ ˆ ) and ( Ω 1 , η 1 ) replaced by ( β ˜ , γ ˜ ) and

Ω 2 = E − 1 ∂ m ( θ , β , γ ) ∂ b ′ ⋅ E ( η 2 η 2 ′ ) ⋅ E − 1 ∂ m ( θ , β , γ ) ∂ b , η 2 ≡ m ( θ , β , γ ) + E ∂ m ( θ , β , γ ) ∂ a ′ η θ ˆ + E ∂ m ( θ , β , γ ) ∂ g ′ η γ ˜ ,

and η γ ˜ is the influence function for γ ˜ .

With m ( a , b , g ) = { Y − M ( a ) ′ g − D M ( a ) ′ b } { δ − Φ ( X ′ a ) } M ( a ) ,

∂ m ( θ , β , γ ) ∂ a ′ = − ( ∇ M ′ γ + D ∇ M ′ β ) ε M X ′ − Γ ϕ ( X ′ θ ) M X ′ + Γ ε ∇ M X ′ , ∂ m ( θ , β , γ ) ∂ b ′ = − ε DMM ′ , ∂ m ( θ , β , γ ) ∂ g ′ = − ε M M ′ .

E { ∂ m ( θ , β , γ ) / ∂ g ′ } = 0 and E ( ∇ M ′ γ ε M X ′ ) = 0 , and the asymptotic variance is

E − 1 ( ε DMM ′ ) E ( η 2 η 2 ′ ) E − 1 ( ε DMM ′ ) , η 2 ≡ Γ ε M − E { D ∇ M ′ β ε M X ′ + Γ ϕ ( X ′ θ ) M X ′ − Γ ε ∇ M X ′ } η θ ˆ .□

References

[1] Rosenbaum PR. Observational studies. 2nd ed. Springer; 2002. https://doi.org/10.1007/978-1-4757-3692-2. Search in Google Scholar

[2] Lee MJ. Micro-econometrics for policy, program, and treatment effects. Oxford University Press; 2005. https://doi.org/10.1093/0199267693.001.0001. Search in Google Scholar

[3] Lee MJ. Matching, regression discontinuity, difference in differences, and beyond. Oxford University Press; 2016. https://doi.org/10.1093/acprof:oso/9780190258733.001.0001. Search in Google Scholar

[4] Pearl J. Causality. 2nd ed. Cambridge University Press; 2009. https://doi.org/10.1017/CBO9780511803161. Search in Google Scholar

[5] Imbens GW, Rubin DB. Causal inference for statistics, social, and biomedical sciences: an introduction. Cambridge University Press; 2015. https://doi.org/10.1017/CBO9781139025751. Search in Google Scholar

[6] Abadie A, Cattaneo MD. Econometric methods for program evaluation. Annu Rev Econ. 2018;10:465–503. https://doi.org/10.1146/annurev-economics-080217-053402. Search in Google Scholar

[7] Imbens GW, Angrist J. Identification and estimation of local average treatment effects. Econometrica. 1994;62(2):467–76. https://doi.org/10.2307/2951620. Search in Google Scholar

[8] Angrist JD, Imbens GW, Rubin DB. Identification of causal effects using instrumental variables. J Amer Stat Assoc. 1996;91(434):444–55. https://doi.org/10.2307/2291629. Search in Google Scholar

[9] Frölich M. Nonparametric IV estimation of local average treatment effects with covariates. J Econom. 2007;139(1):35–75. https://doi.org/10.1016/j.jeconom.2006.06.004. Search in Google Scholar

[10] Abadie A. Semiparametric instrumental variable estimation of treatment response models. J Econom. 2003;113(2):231–63. https://doi.org/10.1016/S0304-4076(02)00201-4. Search in Google Scholar

[11] Tan Z. Regression and weighting methods for causal inference using instrumental variables. J Amer Stat Assoc. 2006;101(476):1607–18. https://doi.org/10.1198/016214505000001366. Search in Google Scholar

[12] Ogburn EL, Rotnitzky A, Robins JM. Doubly robust estimation of the local average treatment effect curve. J R Stat Soc (Ser B). 2015;77(2):373–96. https://doi.org/10.1111/rssb.12078. Search in Google Scholar PubMed PubMed Central

[13] Imai K, Ratkovic M. Estimating treatment effect heterogeneity in randomized program evaluation. Ann Appl Stat. 2013;7(1):443–70. https://doi.org/10.1214/12-AOAS593. Search in Google Scholar

[14] Künzel SR, Sekhon JS, Bickel PJ, Yu B. Metalearners for estimating heterogeneous treatment effects using machine learning. Proc Nat Acad Sci. 2019;116(10):4156–65. https://doi.org/10.1073/pnas.1804597116. Search in Google Scholar PubMed PubMed Central

[15] Athey S, Imbens GW. Machine learning methods that economists should know about. Ann Rev Econom. 2019;11:685–725. https://doi.org/10.1146/annurev-economics-080217-053433. Search in Google Scholar

[16] Hirano K, Porter JR. Asymptotics for statistical treatment rules. Econometrica. 2009;77(5):1683–701. https://doi.org/10.3982/ECTA6630. Search in Google Scholar

[17] Dudik M, Erhan D, Langford J, Li L. Doubly robust policy evaluation and optimization. Stat Sci. 2014;29(4):485–511. https://doi.org/10.1214/14-STS500. Search in Google Scholar

[18] Athey S, Imbens GW. Recursive partitioning for heterogeneous causal effects. Proc Nat Acad Sci. 2016;113(27):7353–60. https://doi.org/10.1073/pnas.1510489113. Search in Google Scholar PubMed PubMed Central

[19] Choi JY, Lee G, Lee MJ. Endogenous treatment effect for any response conditional on control propensity score. Stat Probability Lett. 2023;196:109747. https://doi.org/10.1016/j.spl.2022.109747. Search in Google Scholar

[20] Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70(1):41–55. https://doi.org/10.1093/biomet/70.1.41. Search in Google Scholar

[21] Swanson SA, Hernán MA. Think globally, act globally: an epidemiologist’s perspective on instrumental variable estimation. Stat Sci. 2014;29(3):371–4. https://doi.org/10.1214/14-STS491. Search in Google Scholar

[22] Mogstad M, Torgovitsky A. Identification and extrapolation of causal effects with instrumental variables. Ann Rev Econ. 2018;10:577–613. https://doi.org/10.1146/annurev-economics-101617-041813. Search in Google Scholar

[23] Ichimura H. Semiparametric least squares (SLS) and weighted SLS estimation of single index models. J Econom. 1993;58(1–2):71–120. https://doi.org/10.1016/0304-4076(93)90114-K. Search in Google Scholar

[24] Hastie T, Tibshirani R, Friedman J. The elements of statistical learning. 2nd ed. Springer; 2009. https://doi.org/10.1007/978-0-387-84858-7. Search in Google Scholar

[25] Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey W, et al. Double/debiased machine learning for treatment and structural parameters. Econom J. 2018;21(1):C1–68. https://doi.org/10.1111/ectj.12097. Search in Google Scholar

[26] Lee MJ. Simple least squares estimator for treatment effects using propensity score residuals. Biometrika. 2018;105(1):149–64. https://doi.org/10.1093/biomet/asx062. Search in Google Scholar

[27] Lee MJ. Instrument residual estimator for any response variable with endogenous binary treatment. J R Stat Soc (Ser B). 2021;83(3):612–35. https://doi.org/10.1111/rssb.12442. Search in Google Scholar

[28] Gale WG, Scholz JK. IRAs and household saving. Amer Econ Rev. 1994;84(5):1233–60. Search in Google Scholar

[29] Poterba JM, Venti SF, Wise DA. Do 401(k) contributions crowd out other personal saving?. J Public Econ. 1995;58(1):1–32. https://doi.org/10.1016/0047-2727(94)01462-W. Search in Google Scholar

[30] Madrian BC, Shea DF. The Power of suggestion: inertia in 401(k) participation and savings behavior. Quarter J Econ. 2001;116(4):1149–87. https://doi.org/10.1162/003355301753265543. Search in Google Scholar

[31] Benjamin DJ. Does 401(k) eligibility increase saving? Evidence from propensity score subclassification. J Public Econ. 2003;87(5–6):1259–90. https://doi.org/10.1016/S0047-2727(01)00167-0. Search in Google Scholar

[32] Chetty R, Friedman JN, Leth-Petersen S, Nielsen TH, Olsen T. Active vs. passive decisions and crowd-out in retirement savings accounts: evidence from Denmark. Quarter J Econ. 2014;129(3):1141–219. https://doi.org/10.1093/qje/qju013. Search in Google Scholar

Received: 2022-05-17

Revised: 2023-05-25

Accepted: 2023-05-26

Published Online: 2023-07-14

This work is licensed under the Creative Commons Attribution 4.0 International License.

Articles in the same Issue

https://doi.org/10.1515/jci-2022-0036

Keywords for this article

endogenous treatment; complier effect; instrument score; propensity score; single index model; instrumental variable estimator

Creative Commons

BY 4.0