A Note on the Control Function Approach with an Instrumental Variable and a Binary Outcome

  • Eric Tchetgen Tchetgen
Published/Copyright: October 7, 2014

Abstract

Unobserved confounding is a well-known threat to causal inference in non-experimental studies. The instrumental variable design can under certain conditions be used to recover an unbiased estimator of a treatment effect even if unobserved confounding cannot be ruled out with certainty. For continuous outcomes, two stage least squares is the most common instrumental variable estimator used in epidemiologic applications. For a rare binary outcome, an analogous linear-logistic two stage procedure can be used. Alternatively, a control function approach is sometimes used which entails entering the residual from the first stage linear model for exposure as a covariate in a second stage logistic regression of the outcome on the treatment. Both strategies for binary response have previously formally been justified only for continuous exposure, which has impeded widespread use of the approach outside of this setting. In this note, we consider the important setting of binary exposure in the context of a binary outcome. We provide an alternative motivation for the control function approach which is appropriate for binary exposure, thus establishing simple conditions under which the approach may be used for instrumental variable estimation when the outcome is rare. In the proposed approach, the first stage regression involves a logistic model of the exposure conditional on the instrumental variable, and the second stage regression is a logistic regression of the outcome on the exposure adjusting for the first stage residual. In the event of a non-rare outcome, we recommend replacing the second stage logistic model with a risk ratio regression.

In recent years, the instrumental variable (IV) design has gained popularity in epidemiology as a strategy to recover an unbiased estimate of an exposure or treatment causal effect in settings where unobserved confounding is suspected to be present (Greenland, 2000; Davey Smith and Ebrahim, 2003; Hernán and Robins, 2006; Lawlor et al., 2008; Palmer et al., 2011). For continuous outcomes, the most common IV estimator used in practice is two stage least squares, which involves fitting a linear regression of the outcome on an estimate of the exposure mean, obtained by regressing the exposure on the IV in a first stage linear model (Wooldridge, 2002). For a rare binary outcome, an analogous two stage procedure is sometimes used, in which linear regression is used in the first stage; however, a logistic regression is substituted in the second stage to account for the binary nature of the outcome (Theil, 1953; Basmann, 1957; Angrist, 2001; Wooldridge, 2002; Didelez et al., 2010). A variation of the approach simply adjusts for the residual of the first stage linear regression of the exposure in a second stage logistic regression of the outcome on the observed treatment; this strategy is described in the literature as a control function approach (Garen, 1984; Wooldridge, 1997; Nagelkerke et al., 2000; Blundell and Powell, 2003; Terza et al., 2008). Both strategies for binary response have previously been formally justified only for continuous exposure (Mullahy, 1997; Didelez et al., 2010; Vansteelandt et al., 2011), which has impeded widespread application of either method outside of this setting. In this note, we consider the important setting of binary treatment in the context of a binary outcome. We provide an alternative formulation of the second strategy which applies for binary exposure, thus establishing simple conditions under which the approach may be used for IV estimation when the outcome is rare and the treatment is binary. In the proposed control function approach, the first stage regression involves a logistic model of the exposure conditional on the IV, and the second stage regression is a logistic regression of the outcome on the exposure adjusting for the first stage residual. In the event of a non-rare outcome, we recommend replacing the second stage with a risk ratio regression.

1 Review of control function for continuous treatment

Suppose that one has observed a rare binary outcome $Y$, a continuous exposure $A$, and a binary IV $Z$. Throughout, $U$ will refer to an unmeasured continuous confounder of the $A$-$Y$ causal association. A standard formulation of a data generating model for the control function approach assumes the outcome is generated from the log-linear model

[1] $\log \Pr(Y = 1 \mid A, Z, U) = \beta_0 + \beta_A A + U,$

with $\beta_A$ the log risk ratio causal association between $A$ and $Y$ conditional on $U$ (see for example, Palmer et al., 2011). The model rules out the possibility of latent effect heterogeneity of $A$ with respect to $U$ on the multiplicative scale. For continuous $A$, the model further posits the following regression for the exposure:

[2] $A = \gamma_0 + \gamma_1 Z + \gamma_2 U + \varepsilon$, where $\varepsilon$ is an independent mean-zero error,

and $\gamma_2 \neq 0$. In addition, the model assumes that $U$ and $Z$ are independent, as would be the case for a valid IV. Note that the above model encodes explicitly the assumption that $Z$ is a valid IV, which satisfies the following conditions:

  1. Z only affects Y through its association with A, which is encoded by the fact that although Z appears in the conditioning event on the left-hand side of eq. [1], it does not appear on the right-hand side of the equation.

  2. The unmeasured confounder of the exposure effect on the outcome is independent of the IV, thus Z is independent of U.

  3. The IV is relevant for the exposure, i.e. $Z$ predicts $A$ and thus $\gamma_1 \neq 0$.

Note that this formulation assumes U does not interact with Z in the model for A. Under these assumptions, one can show that

$\log \Pr(Y = 1 \mid A, Z) = \beta_0 + \beta_A A + \alpha_1 \Delta$

where

$\Delta = A - E(A \mid Z).$

The above equation gives a simple parametrization for the log-linear regression of $Y$ on $(A, Z)$, which allows one to recover, under the stated assumptions, the causal log risk ratio association $\beta_A$ between $A$ and $Y$. Estimation typically proceeds in two stages. In the first stage, one fits a standard linear regression to estimate the exposure model

$E(A \mid Z) = \alpha_0 + \alpha_1 Z$

by ordinary least squares (OLS), which in turn is used to estimate the residual $\Delta$ with $\hat\Delta = A - (\hat\alpha_0 + \hat\alpha_1 Z)$, where $(\hat\alpha_0, \hat\alpha_1)$ are the OLS estimates. Then, in a second stage, one regresses $Y$ on $(A, \hat\Delta)$ using standard logistic regression, as a suitable approximation for the log-linear model [4]. The regression coefficient for the exposure in the second stage logistic regression will then be approximately unbiased for $\beta_A$. We will refer to the above two stage procedure as the linear-logistic control function approach. The large sample variance of the resulting estimator of $\beta_A$ must acknowledge the first stage estimation of $E(A \mid Z)$; it is easily obtained from standard M-estimation theory (see the Appendix). Alternatively, when convenient, one could also use the nonparametric bootstrap to obtain confidence intervals.
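As a minimal sketch of the first stage for a continuous exposure, the following pure-Python snippet fits the OLS regression of A on a single instrument Z and forms the estimated residual. The function name `first_stage_ols` and the toy data are illustrative assumptions, not part of the original note; the second stage would then pass the exposure and this residual to any standard logistic regression routine.

```python
from statistics import fmean


def first_stage_ols(a, z):
    """OLS fit of E(A | Z) = alpha0 + alpha1 * Z for a single instrument.

    Returns (alpha0_hat, alpha1_hat, residuals), where
    residuals[i] = a[i] - (alpha0_hat + alpha1_hat * z[i]).
    """
    if len(a) != len(z) or len(a) < 2:
        raise ValueError("a and z must have the same length (at least 2)")
    a_bar, z_bar = fmean(a), fmean(z)
    s_zz = sum((zi - z_bar) ** 2 for zi in z)
    if s_zz == 0:
        raise ValueError("the instrument must vary in the sample")
    s_za = sum((zi - z_bar) * (ai - a_bar) for zi, ai in zip(z, a))
    alpha1 = s_za / s_zz
    alpha0 = a_bar - alpha1 * z_bar
    residuals = [ai - (alpha0 + alpha1 * zi) for ai, zi in zip(a, z)]
    return alpha0, alpha1, residuals


# Toy data (hypothetical): binary instrument, continuous exposure.
z = [0, 0, 0, 1, 1, 1]
a = [1.0, 2.0, 3.0, 3.0, 4.0, 5.0]
alpha0_hat, alpha1_hat, delta_hat = first_stage_ols(a, z)
```

By construction the OLS residuals sum to zero and are uncorrelated with Z in the sample, which is why the residual can carry only the part of the exposure that the instrument does not explain.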

One can assess the extent of unobserved confounding by evaluating the strength of association between Y and Δ, which can be performed with a test of the null hypothesis that the coefficient of Δ in the second stage regression is zero, i.e. α1=0.

2 Control function for binary treatment

Now suppose that $A$ is dichotomous. As noted by Didelez et al. (2010), assumption [2] cannot be satisfied for binary exposure. Thus, we will consider an alternative formulation, whereby assumption [2] is replaced by the following location shift model for $U$:

[3] $U = E(U \mid A, Z) + \delta$, where $\delta$ is independent of $(A, Z)$.

The assumption would hold if U were normally distributed given A,Z with homoscedastic variance; however, this is not strictly required and the model allows for an arbitrary distribution for δ.

Result 1: Under assumptions [1], [3], and the assumption that U and Z are independent, we have that

[4] $\log \Pr(Y = 1 \mid A, Z) = \beta_0^* + \beta_A A + (\omega_1 + \omega_2 Z)\Delta,$

where $(\beta_0^*, \omega_1, \omega_2)$ are defined in the Appendix.

Result 1 provides formal justification for a generalization of the control function approach in the context of a binary treatment, with the standard control function approach recovered by setting $\omega_2 = 0$. Interestingly, unlike the standard formulation for continuous treatment, the regression model [4] formally allows for heterogeneity in the degree of selection bias due to confounding if $\omega_2 \neq 0$.
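To see how model [4] pins down $\beta_A$ when both $A$ and $Z$ are binary, the following worked derivation may help. The shorthand $p_{az} = \Pr(Y=1 \mid A=a, Z=z)$ and $\pi_z = \Pr(A=1 \mid Z=z)$ is introduced here purely for illustration, and the sketch assumes $\pi_0 \neq \pi_1$ (IV relevance).

```latex
\begin{align*}
\log\frac{p_{1z}}{p_{0z}} &= \beta_A + \omega_1 + \omega_2 z, \qquad z \in \{0,1\},\\
\omega_2 &= \log\frac{p_{11}}{p_{01}} - \log\frac{p_{10}}{p_{00}},\\
\log\frac{p_{01}}{p_{00}} &= \omega_1(\pi_0 - \pi_1) - \omega_2\,\pi_1
\;\Longrightarrow\;
\omega_1 = \frac{\log(p_{01}/p_{00}) + \omega_2\,\pi_1}{\pi_0 - \pi_1},\\
\beta_A &= \log\frac{p_{10}}{p_{00}} - \omega_1 .
\end{align*}
```

In this special case four parameters are matched to four cell risks, so the model is saturated, and the crude log risk ratio at $Z=0$ differs from $\beta_A$ by exactly $\omega_1$.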

Implementation of the approach in practice is fairly straightforward. The main adjustment to account for binary treatment is in the first stage estimation of the treatment model, whereby OLS estimation of a linear model for continuous treatment can be replaced with maximum likelihood estimation (MLE) of a logistic regression for binary treatment:

$\operatorname{logit} \Pr(A = 1 \mid Z) = \alpha_0 + \alpha_1 Z$

to produce an estimated propensity score $\hat\pi(Z) = \widehat{\Pr}(A = 1 \mid Z)$ using the MLE $(\hat\alpha_0, \hat\alpha_1)$. The second stage of the approach proceeds by estimating the logistic regression of $Y$ on $(A, \hat\Delta, Z\hat\Delta)$, upon redefining the estimated residual as $\hat\Delta = A - \hat\pi(Z)$. The resulting estimator of the regression coefficient for the exposure in the second stage logistic regression will be approximately unbiased for $\beta_A$ provided that the assumptions of Result 1 hold. For inference, one may use M-estimation theory to derive the large sample variance of the estimator, which is provided in the Appendix; alternatively, one can proceed with the nonparametric bootstrap.
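The two stages can be sketched in a few lines of pure Python for the special case of a single binary IV and no covariates. In that case both logistic models are saturated, so their maximum likelihood fits reproduce the observed proportions and the coefficients follow by solving four linear equations; this shortcut is an assumption of the sketch, and with covariates or a continuous IV a logistic regression routine would be needed instead. The function name `control_function_binary` and the table of counts are hypothetical.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def control_function_binary(n, y):
    """Two-stage control function estimate for binary A and binary Z.

    n[(a, z)]: number of subjects with A=a, Z=z.
    y[(a, z)]: number of those subjects with Y=1 (must be > 0 and < n).
    Returns a dict with keys beta0, beta_A, omega1, omega2.
    """
    # First stage: with a single binary Z the logistic model is saturated,
    # so its MLE reproduces the observed proportion exposed at each Z level.
    pi = {z: n[(1, z)] / (n[(0, z)] + n[(1, z)]) for z in (0, 1)}
    if pi[0] == pi[1]:
        raise ValueError("Z does not predict A in the sample (pi_0 == pi_1)")

    # Second stage: the four-parameter logistic model in (A, Delta, Z*Delta)
    # is also saturated, so fitted cell risks equal observed cell risks.
    eta = {cell: logit(y[cell] / n[cell]) for cell in n}

    # Solve eta[a, z] = beta0 + beta_A*a + (omega1 + omega2*z)*(a - pi[z]).
    omega2 = (eta[(1, 1)] - eta[(0, 1)]) - (eta[(1, 0)] - eta[(0, 0)])
    omega1 = (eta[(0, 1)] - eta[(0, 0)] + omega2 * pi[1]) / (pi[0] - pi[1])
    beta_A = (eta[(1, 0)] - eta[(0, 0)]) - omega1
    beta0 = eta[(0, 0)] + omega1 * pi[0]
    return {"beta0": beta0, "beta_A": beta_A, "omega1": omega1, "omega2": omega2}


# Hypothetical 2x2x2 table, keyed by (a, z); for illustration only.
n = {(0, 0): 700, (1, 0): 300, (0, 1): 400, (1, 1): 600}
y = {(0, 0): 14, (1, 0): 12, (0, 1): 10, (1, 1): 21}
est = control_function_binary(n, y)
```

The division by the difference in exposure prevalence across IV levels makes the estimate unstable when the instrument is weak, which is one reason to report bootstrap or M-estimation based intervals alongside the point estimate.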

3 Control function without rare disease assumption

If Y is not rare in the target population, one may adopt one of several existing methods to estimate the risk ratio regression [4] that do not require the rare disease assumption, including the log-binomial model of Wacholder (1986), the Poisson regression approach of Zou (2004), and the semiparametric locally efficient approach of Tchetgen Tchetgen (2013a).

4 Control function under case–control sampling

Case–control studies are a common design in epidemiologic practice, particularly in settings where Y is rare in the population, or measuring Z or A is costly. Accounting for case–control ascertainment is fairly straightforward in the case of a rare outcome, since logistic regression, which appropriately accounts for the sampling design, can continue to be used in the second stage; however, the first stage regression model must be modified to account for possible selection bias. A simple strategy entails restricting estimation of the first stage regression of A on Z to the subset of controls with Y=0, which should yield a reasonable approximation of the population regression model. This approach may, however, be inefficient, since it does not make use of the exposure and IV measured among cases. Under certain assumptions, it may be possible to improve the efficiency of the first stage regression, which in turn may lead to a more efficient second stage estimator of the treatment effect. This can be achieved by using all available information on both cases and controls, and by adjusting for case–control status in estimating the first stage regression model. For instance, for continuous A, one may modify the first stage regression and instead estimate:

E(A|Z,Y)=α0+α1Z+α2Y,

which involves adjusting for ascertainment by directly conditioning on case–control status in the regression model. Under the rare disease assumption, the above model would in principle recover an unbiased estimator $(\tilde\alpha_0, \tilde\alpha_1)$ of $(\alpha_0, \alpha_1)$ provided the degree of ascertainment bias (here encoded by a non-null value of $\alpha_2$) does not vary with $Z$ (Tchetgen Tchetgen et al., 2013; Tchetgen Tchetgen, 2013b). It is important to note that some care is needed in forming the residual used in the second stage logistic regression, which must reflect the residual value for $A$ in the underlying population and is therefore obtained by evaluating the predicted mean of $A$ under the above estimated mean model, after setting $Y=0$ for both cases and controls (Tchetgen Tchetgen et al., 2013), i.e.

$\hat\Delta = A - \hat{E}(A \mid Z, Y=0) = A - \tilde\alpha_0 - \tilde\alpha_1 Z.$

Tchetgen Tchetgen (2013b) discusses analogous methodology to account for possible heterogeneity in the degree of selection bias, and similar techniques are also developed in the context of logistic regression for binary A, and are likewise extended to account for case–control ascertainment when Y is not necessarily rare in the population. However, similar to standard inverse probability weighting, which may also be used to account for the sampling design (although it may be relatively inefficient), the sampling fractions for cases and controls must be available to account for sampling conditional on Y which may not be rare in the target population (Tchetgen Tchetgen, 2013b).
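The residual construction for continuous exposure under case–control sampling can be sketched as follows. This pure-Python snippet fits the first stage with case–control status as a covariate by OLS and then evaluates the residual with Y set to 0 for every subject; the helper names `solve_linear` and `case_control_residual`, and the toy data, are illustrative assumptions rather than code from the cited papers.

```python
def solve_linear(m, b):
    """Solve m x = b by Gauss-Jordan elimination with partial pivoting."""
    k = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(m, b)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular design: check that Z and Y both vary")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * p for x, p in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def case_control_residual(a, z, y):
    """OLS fit of E(A | Z, Y) = a0 + a1*Z + a2*Y on cases and controls,
    then residuals evaluated at Y = 0 for everyone: A - a0 - a1*Z."""
    rows = [(1.0, zi, yi) for zi, yi in zip(z, y)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xta = [sum(r[i] * ai for r, ai in zip(rows, a)) for i in range(3)]
    a0, a1, a2 = solve_linear(xtx, xta)
    # The Y coefficient a2 is deliberately left out of the prediction.
    delta = [ai - a0 - a1 * zi for ai, zi in zip(a, z)]
    return (a0, a1, a2), delta


# Toy case-control sample (hypothetical values).
z = [0, 1, 0, 1, 0, 1, 0, 1]
y = [0, 0, 0, 0, 1, 1, 1, 1]
a = [1.1, 2.9, 0.9, 3.1, 1.6, 3.4, 1.4, 3.6]
coef, delta_hat = case_control_residual(a, z, y)
```

Note that the residuals for cases are not centered at zero: they retain the estimated shift attributed to case status, which is the intended behavior when the residual is meant to reflect the population exposure model.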

5 Adjusting for covariates

Here we consider a straightforward generalization to allow for the presence of covariates C such that Z is a valid IV conditional on C but not necessarily so upon marginalizing over any component of C. Assuming standard prospective sampling, in order to incorporate such covariates, it suffices to modify regression models used in the first and second stages, such that in the case of continuous exposure, the first stage regression further adjusts for C, e.g.

E(A|Z,C)=α0+α1Z+α2C,

and likewise, for binary A, one could specify

logitPr(A=1|Z,C)=α0+α1Z+α2C.

The second stage regression in the rare outcome situation could also be modified accordingly, e.g.

$\operatorname{logit} \Pr(Y = 1 \mid A, Z, C) = \beta_0 + \beta_C^T C + \beta_A A + (\omega_1 + \omega_2 Z)\Delta$

with $\Delta$ the estimated residual $A - \hat{E}(A \mid Z, C)$, and analogous adjustments can be made to the risk ratio regression approach recommended for non-rare outcomes, as well as under case–control sampling.

6 Conclusion

In this note, an alternative framework is proposed to motivate the control function IV approach in the context of a binary outcome with binary treatment. Although emphasis is given to binary treatment, the approach can be modified without much difficulty to handle other types of discrete treatments. The approach can also be used with a continuous IV without further difficulty. In addition, unlike previous formulations of the control function approach, the proposed framework allows for heterogeneity in the magnitude of selection bias (on the risk ratio scale) with respect to the IV. Such heterogeneity may reflect latent heterogeneity in the degree of association of the IV with the treatment, the presence of which cannot be ruled out with certainty in practice. Ignoring such heterogeneity when present may invalidate the commonly used control function approach, and therefore the proposed methods provide a valuable framework for relaxing this assumption.

Appendix

Proof of Result 1: Note that

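$\Pr(Y=1 \mid A,Z) = E\{\exp(\beta_0+\beta_A A+U) \mid A,Z\} = \exp(\beta_0+\beta_A A)\,E\{\exp(U) \mid A,Z\} = \exp[\beta_0+\beta_A A+\log E\{\exp(\delta)\}] \times \exp\{E(U \mid A,Z)\}.$
Further note that
$E(U \mid A,Z) = E(U \mid A,Z) - E(U \mid A=0,Z) - \int \{E(U \mid a,Z) - E(U \mid A=0,Z)\}\,dF(a \mid Z) + E(U \mid Z) = \omega_1 A + \omega_2 AZ - \omega_1 E(A \mid Z) - \omega_2 E(A \mid Z) Z + E(U),$
where
$\omega_1 = E(U \mid A=1,Z=0) - E(U \mid A=0,Z=0)$
and
$\omega_2 = E(U \mid A=1,Z=1) - E(U \mid A=0,Z=1) - E(U \mid A=1,Z=0) + E(U \mid A=0,Z=0);$
therefore
$\Pr(Y=1 \mid A,Z) = \exp\{\beta_0^* + \beta_A A + (\omega_1+\omega_2 Z)\Delta\},$
where
$\beta_0^* = \beta_0 + \log E\{\exp(\delta)\} + E(U),$
proving the result.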

Asymptotic variance of the control function approach

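Let $S_\alpha(\alpha) = (1, Z)^T \Delta(\alpha)$ denote an individual score equation for the first stage logistic regression evaluated under the true parameter values, where we make explicit the dependence on $\alpha$. Likewise, let $S_\beta(\beta,\alpha) = (1, A, \Delta(\alpha), Z\Delta(\alpha))^T \varepsilon(\beta,\alpha)$ denote the score equation for the second stage logistic regression evaluated at the truth, where $\varepsilon(\beta,\alpha) = Y - \Pr(Y=1 \mid A,Z;\beta,\alpha)$, with $\beta = (\beta_0^*, \beta_A, \omega_1, \omega_2)$ and $\operatorname{logit} \Pr(Y=1 \mid A,Z;\beta,\alpha) = \beta_0^* + \beta_A A + (\omega_1+\omega_2 Z)\Delta(\alpha)$. The asymptotic variance of the control function estimate of $(\beta_0^*, \beta_A, \omega_1, \omega_2)$ is given by $E(JJ^T)$, where

$J = -\left[E\left\{\dfrac{\partial S_\beta(\beta,\alpha)}{\partial \beta^T}\right\}\right]^{-1} \left[ S_\beta(\beta,\alpha) + E\left\{\dfrac{\partial S_\beta(\beta,\alpha)}{\partial \alpha^T}\right\} E\left\{S_\alpha(\alpha) S_\alpha^T(\alpha)\right\}^{-1} S_\alpha(\alpha) \right].$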
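In practice the expectations in $J$ are unknown. A natural plug-in sketch, stated here as an assumption rather than as a formula taken from the note, replaces them by sample averages over the $n$ observations, evaluated at the estimates $(\hat\alpha, \hat\beta)$; the subscript $i$ indexes subjects.

```latex
\begin{align*}
\hat J_i &= -\Bigl[\tfrac{1}{n}\sum_{j=1}^{n}
  \frac{\partial S_{\beta,j}(\hat\beta,\hat\alpha)}{\partial\beta^{T}}\Bigr]^{-1}
\Bigl\{ S_{\beta,i}(\hat\beta,\hat\alpha)
 + \Bigl[\tfrac{1}{n}\sum_{j=1}^{n}
  \frac{\partial S_{\beta,j}(\hat\beta,\hat\alpha)}{\partial\alpha^{T}}\Bigr]
 \Bigl[\tfrac{1}{n}\sum_{j=1}^{n} S_{\alpha,j}(\hat\alpha)\,S_{\alpha,j}^{T}(\hat\alpha)\Bigr]^{-1}
 S_{\alpha,i}(\hat\alpha)\Bigr\},\\
\widehat{\operatorname{Var}}(\hat\beta) &= \frac{1}{n^{2}}\sum_{i=1}^{n}\hat J_i\,\hat J_i^{T}.
\end{align*}
```

The square root of the diagonal entry corresponding to $\beta_A$ then gives a standard error that accounts for the first stage estimation; the overall sign of $\hat J_i$ does not affect the result.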

References

Angrist, J. D. (2001). Estimation of limited dependent variable models with dummy endogenous regressors: Simple strategies for empirical practice. Journal of Business and Economic Statistics, 19(1):2–28.

Basmann, R. L. (1957). A generalized classical method of linear estimation of coefficients in a structural equation. Econometrica, 25(1):77–83.

Blundell, R. W., and Powell, J. L. (2003). Endogeneity in nonparametric and semiparametric regression models. In: Advances in Economics and Econometrics: Theory and Applications. 8th World Congress of the Econometric Society, M. Dewatripont, L. P. Hansen, and S. J. Turnovsky (Eds.), 312–357. Cambridge, UK: Cambridge University Press.

Davey Smith, G., and Ebrahim, S. (2003). “Mendelian randomization”: Can genetic epidemiology contribute to understanding environmental determinants of disease? International Journal of Epidemiology, 32(1):1–22.

Didelez, V., Meng, S., and Sheehan, N. A. (2010). Assumptions of IV methods for observational epidemiology. Statistical Science, 25(1):22–40.

Garen, J. (1984). The returns to schooling: A selectivity bias approach with a continuous choice variable. Econometrica, 52(5):1199–1218.

Greenland, S. (2000). An introduction to instrumental variables for epidemiologists. International Journal of Epidemiology, 29(4):722–729.

Hernán, M. A., and Robins, J. M. (2006). Instruments for causal inference: An epidemiologist’s dream? Epidemiology, 17(4):360–372. doi: 10.1097/01.ede.0000222409.00878.37.

Lawlor, D. A., Harbord, R. M., Sterne, J. A., et al. (2008). Mendelian randomization: Using genes as instruments for making causal inferences in epidemiology. Statistics in Medicine, 27(8):1133–1163.

Mullahy, J. (1997). Instrumental-variable estimation of count data models: Applications to models of cigarette smoking behaviour. The Review of Economics and Statistics, 79(4):568–593.

Nagelkerke, N., Fidler, V., Bernsen, R., et al. (2000). Estimating treatment effects in randomized clinical trials in the presence of non-compliance. Statistics in Medicine, 19(14):1849–1864.

Palmer, T. M., et al. (2011). Instrumental variable estimation of causal risk ratios and causal odds ratios in Mendelian randomization analyses. American Journal of Epidemiology, 173(12):1392–1403.

Tchetgen Tchetgen, E. J. (2013a). Estimation of risk ratios in cohort studies with a common outcome: A simple and efficient two-stage approach. The International Journal of Biostatistics, 9(2):251–264. doi: 10.1515/ijb-2013-0007.

Tchetgen Tchetgen, E. J. (2013b). A general regression framework for a secondary outcome in case-control studies. Biostatistics, 15(1):117–128. doi: 10.1093/biostatistics/kxt041.

Tchetgen Tchetgen, E. J., Walter, S., and Glymour, M. M. (2013). Building an evidence base for Mendelian randomization studies. International Journal of Epidemiology, 42(1):328–331.

Terza, J. V., Basu, A., and Rathouz, P. J. (2008). Two-stage residual inclusion estimation: Addressing endogeneity in health econometric modeling. Journal of Health Economics, 27(3):531–543.

Theil, H. (1953). Repeated Least Squares Applied to Complete Equation Systems. The Hague, the Netherlands: Central Planning Bureau.

Vansteelandt, S., et al. (2011). On instrumental variables estimation of causal odds ratios. Statistical Science, 26(3):403–422.

Wacholder, S. (1986). Binomial regression in GLIM: Estimating risk ratios and risk differences. American Journal of Epidemiology, 123:174–184.

Wooldridge, J. M. (1997). On two stage least squares estimation of the average treatment effect in a random coefficient model. Economics Letters, 56(2):129–133.

Wooldridge, J. M. (2002). Econometric Analysis of Cross Section and Panel Data. 2nd Edition. Cambridge, MA: MIT Press.

Zou, G. Y. (2004). A modified Poisson regression approach to prospective studies with binary data. American Journal of Epidemiology, 159:702–706.

Published Online: 2014-10-7
Published in Print: 2014-12-1

©2014 by De Gruyter

Downloaded on 1.10.2025 from https://www.degruyterbrill.com/document/doi/10.1515/em-2014-0009/html