Propensity Score Estimation Using Classification and Regression Trees in the Presence of Missing Covariate Data

Bas B.L. Penning de Vries; Maarten van Smeden; Rolf H.H. Groenwold

doi:10.1515/em-2017-0020

Article Publicly Available

Propensity Score Estimation Using Classification and Regression Trees in the Presence of Missing Covariate Data

Bas B.L. Penning de Vries , Maarten van Smeden and Rolf H.H. Groenwold

Published/Copyright: September 18, 2018

Published by

Become an author with De Gruyter Brill

Submit Manuscript Author Information

From the journal Epidemiologic Methods Volume 7 Issue 1

Abstract

Data mining and machine learning techniques such as classification and regression trees (CART) represent a promising alternative to conventional logistic regression for propensity score estimation. Whereas incomplete data preclude the fitting of a logistic regression on all subjects, CART is appealing in part because some implementations allow for incomplete records to be incorporated in the tree fitting and provide propensity score estimates for all subjects. Based on theoretical considerations, we argue that the automatic handling of missing data by CART may however not be appropriate. Using a series of simulation experiments, we examined the performance of different approaches to handling missing covariate data; (i) applying the CART algorithm directly to the (partially) incomplete data, (ii) complete case analysis, and (iii) multiple imputation. Performance was assessed in terms of bias in estimating exposure-outcome effects among the exposed, standard error, mean squared error and coverage. Applying the CART algorithm directly to incomplete data resulted in bias, even in scenarios where data were missing completely at random. Overall, multiple imputation followed by CART resulted in the best performance. Our study showed that automatic handling of missing data in CART can cause serious bias and does not outperform multiple imputation as a means to account for missing data.

Keywords: causal inference; propensity score; missing data; CART; multiple imputation

1 Introduction

Propensity score analysis has gained increasing popularity as means to adjust for measured confounding (Rosenbaum and Rubin 1983; Stürmer et al. 2006). Inference typically proceeds by stratification on the propensity score, propensity score adjustment in a regression model, inverse probability weighting (IPW) or matching based on propensity scores given measured covariates (Austin 2011a; Rosenbaum and Rubin 1983). It is standard practice to obtain estimates of the propensity score by a parametric (logistic) regression of the exposure on measured covariates. However, parametric models rely on assumptions about the distribution of variables in relation to one another, including the functional form and the presence or absence of interactions. If any of these are violated, covariate balance may not be attained, potentially leading to bias in making causal inferences about the exposure-outcome relation of interest (Drake 1993).

It has been suggested that machine learning and data mining methods, such as classification and regression tree analysis (CART), be used to estimate the relationship between the exposure and measured covariates. These methods avoid making the assumptions regarding functional form and interaction as in a standard logistic regression. The utility of data mining methods to estimate propensity scores in complete data settings has been studied previously (Setoguchi et al. 2008; Lee et al. 2010; Westreich et al. 2010; Wyss et al. 2014). However, in practice, researchers are often faced with missing values on the measured variables. Whereas incomplete data preclude logistic regression on all subjects, some CART algorithms allow for incomplete records to be incorporated in the tree fitting and provide propensity score estimates for all subjects. The ability of CART to accommodate missing values has been described as advantageous (Lee et al. 2010; McCaffrey et al. 2004; 2008; Rai et al. 2017). However, the precise impact of missing data on the performance of CART-based propensity score estimators has received little attention. The objective of this study was therefore to examine the performance of various CART-based propensity score estimation procedures in the presence of missing data. Throughout, particular emphasis is placed on the causal odds ratio for the marginal effect among the exposed (or Average Effect among the ‘Treated’, ATT) as the effect measure of interest.

The remainder of this article is structured as follows. In Section 2, we briefly review pertinent theory. Based on analytical work, we identify caveats in the handling of missing data by CART. Section 3 describes a series of Monte Carlo simulations that were used to evaluate the performance of various approaches to handling missing data, including (i) subjecting incomplete data directly to the CART algorithm, (ii) complete case analysis, and (iii) multiple imputation. In Section 4, we apply and compare the approaches in a case study on the effect of influenza vaccination and mortality. We conclude with a summary and discussion of our findings in the context of the existing literature.

2 Theory

2.1 Propensity score analysis of complete data

Counterfactual outcomes and estimating causal effects

We adopt a perspective of potential or counterfactual outcomes, formal accounts of which are given for example by Neyman et al. (1935), Holland (1986), Rubin (1974), Holland (1988), and Pearl (2009).

Consider a sequence S=(X1,X2,...,Xn) of variables and let F=(fX1,fX2,...,fXn) be a collection of functions fXj that deterministically map a realisation of the predecessors (Xi:i<j) of Xj and of exogenous variable εXj into a realisation of Xj. We may write the random variable Xj as follows:

(1)Xj=fXj(fX1(X1,εX1),fX2(X1,X2,εX2),...,fXj−1(X1,X2,...,Xj−2,εXj−1),εXj).

For any intervention setting Xt=xt for t in a subset T of {1,...,n}\{j}, the counterfactual version of Xj is obtained by evaluating the right-hand side of eq. (1) with Xt replaced by xt for all t∈T. Specifically, let S = (W,A,Y,R) so that W=fW(εW), A=fA(W,εA), Y=fY(W,A,εY), and R=fR(W,A,Y,εR). W may be thought of as a (random vector of) baseline or pre-exposure variable(s), A denotes the binary exposure of interest, Y the outcome, and R a missing indicator vector of W. A subject’s counterfactual outcomes Y0 and Y1, obtained if exposure A were set possibly contrary to fact to 0 and 1, respectively, are defined such that Y0=fY(W,0,εY) and Y1=fY(W,1,εY).

Causal effects are readily defined in terms of counterfactual outcomes. In this article, the focus is on the causal odds ratio (OR) for the marginal effect of exposure A on binary outcome Y among the exposed (A = 1), that is,

OR=E[Y1|A=1]/(1−E[Y1|A=1])E[Y0|A=1]/(1−E[Y0|A=1]).

Under consistency as defined by Cole and Frangakis (2009), Y1 is equal to the observed outcome Y for subjects in the exposure group. Y0, on the other hand, is not observed for exposed subjects. We may, however, validly estimate the causal OR under a set of conditions, which includes no interference between subjects (or Stable Unit Treatment Value Assumption, Tchetgen and VanderWeele, 2012), consistency, positivity, and conditional exchangeability (Lesko et al. 2017). To simplify arguments and notation, we shall assume that all of these conditions hold, with the exception of conditional exchangeability, unless otherwise indicated. If there exists a (set of) variable(s) Z such that the potential outcomes are conditionally independent of exposure status given Z, we may write

P(Y0|A=1)=E[P(Y0|A=1,Z)|A=1]=E[P(Y0|A=0,Z)|A=1]=E[P(Y|A=0,Z)|A=1],

so that the causal OR may be expressed entirely in terms of observables

OR=E[Y|A=1]/(1−E[Y|A=1])E{E[Y|A=0,Z]|A=1}/(1−E{E[Y|A=0,Z]|A=1}).

W satisfies the definition of Z whenever εY⊥⊥εA|W. In practice, validly estimating E[Y|A=0,Z] may be difficult when Z is multidimensional and Y is rare (Albert and Anderson 1984). In this case, it may be desirable to summarise Z in a single balancing score (Rosenbaum and Rubin 1983).

The propensity score

The propensity score e(W), defined as the conditional probability of exposure given covariates W, satisfies a number of balancing properties. First, covariate(s) W and exposure A are conditionally independent given the propensity score, and conditional exchangeability given covariate(s) W implies conditional exchangeability given e(W) (Rosenbaum and Rubin 1983, Theorems 1 and 3). Thus, the causal OR becomes

OR=E[Y|A=1]/(1−E[Y|A=1])E{E[Y|A=0,e(W)]|A=1}/(1−E{E[Y|A=0,e(W)]|A=1}).

This formulation has motivated the propensity score matching approach as discussed by Rosenbaum and Rubin (1983).

Balance may also be attained by inverse probability weighting (Appendix A). To simplify arguments and notation, we assume that W and Y take a discrete joint distribution; however, the results extend to continuous or mixed discrete/continuous distributions. To obtain an IPW estimator of the ATT, let

φ(w,a)=φ∗(w,a)E[φ∗(W,A)|A=a],φ∗(w,a)=I(a=1)+I(a=0)e(w)1−e(w),

for realisations w of W and a of A, where I denotes the indicator function taken the value 1 if the argument is true and 0 otherwise. Weighting by φ yields independence between covariate(s) W and A; that is, for all w,

φ(w,0)Pr(W=w|A=0)=φ(w,1)Pr(W=w|A=1).

Also, conditional exchangeability given W implies exchangeability following weighting by ϕ; that is, if (Y0,Y1)⊥⊥A|W=w for all w, then

∑wφ(w,0)Pr(Y0=y0,Y1=y1,W=w|A=0)=∑wφ(w,1)Pr(Y0=y0,Y1=y1,W=w|A=1)

for all y0,y1. Thus, the causal OR becomes

OR=∑wφ(w,1)Pr(Y=1,W=w|A=1){1−∑wφ(w,1)Pr(Y=1,W=w|A=1)}/∑wφ(w,0)Pr(Y=1,W=w|A=0){1−∑wφ(w,0)Pr(Y=1,W=w|A=0)}.

In words, this means that the causal odds ratio is equal to the crude odds ratio of the ATT in the (pseudo-)population that is obtained by weighting each observation by φ.

Ensemble CART methods in the absence of missing data

We will now briefly describe how CART can be applied to estimate the propensity score. Detailed information can be found elsewhere (Breiman 1996; Ridgeway 1999; Breiman 2001; McCaffrey et al. 2004; Elith et al. 2008; Hastie et al. 2009). CART is a type of supervised learning task that entails finding a set of rules, subject to constraints, that partition the data into regions based on the input data (covariates) such that within regions, target values (e.g., exposure levels) meet a desirable level of homogeneity. Typically, a tree is built in a recursive manner by splitting the dataset into increasingly homogeneous subsets and choosing the splitting rule at each step or node that best splits the data further, with ‘best’ referring to the greatest improvement in terms of some homogeneity metric, such as the Gini index (Therneau and Atkinson 2017). Ensemble techniques by definition fit more than one tree to the data and combine them to form a single predictor of ‘the outcome’ (in the case of propensity score, the assigned exposure). The aim of ensemble techniques is to enhance performance and reduce issues of overfitting by a single tree (Elith et al. 2008; 2008; Hastie et al. 2009). We focus here on two popular CART ensemble methods, namely boostrap aggregated (bagged) CART and boosted CART.

Bootstrap aggregated CART

Bagged CART involves drawing bootstrap samples form the original study sample (Breiman 1996). A CART tree is formed in each bootstrap sample, yielding multiple predictors of the target variable. For each subject, the final prediction is formed by the average or majority vote across all predictors. In the context of propensity scores, the prediction of a single tree for any given subject may be defined as the proportion of exposed subjects among those individuals that are assigned to the same region by the given tree. The final propensity score is the average of the predictions across all bootstrap samples. Propensity score matching may then be thought of as matching exposed subjects to unexposed subjects from the same or ‘nearby’ region.

Boosted CART

Boosted CART is related to bagged CART in the sense that it is an ensemble method; multiple trees are fit and merged to form a single predictor. With boosted CART, trees are fit in a forward, stagewise procedure. In boosting, trees are fit iteratively to the data such those observations whose observed exposure levels are poorly predicted by the predictor of the previous iteration receive greater weight at the current iteration (Ridgeway 1999; Elith et al. 2008). Some implementations construct trees using data splits aimed not at achieving homogeneity of the exposure values themselves, but at achieving homogeneity of prediction error of the estimator obtained in the previous step (McCaffrey et al. 2004; Elith et al. 2008). With each iteration, a new predictor is formed by making adjustments to the predictor obtained in the previous step. The final predictor is constructed with contributions from all trees.

2.2 Ignorable missing data and generalised propensity scores

In this section, we briefly review the concept of ignorable missing data, and discuss a generalisation of the propensity score which allows for missing data as well as strategies to incorporate missing data directly in the CART fitting. For certain CART algorithms (in our case boosted CART), the inherent missing data strategy yields estimates of the generalised propensity score.

Ignorable missing data

Suppose W=(W1,W2,...,Wp) and R=(R1,R2,...,Rp) are random vectors of length p such that for j = 1,2,...,p, Rj=0 if Wj is missing and Rj=1 if Wj is observed. Following Rubin (1976), define the extended random vector V=(V1,V2,...,Vp) with range to include the special value ∗ to indicate a missing datum: Vj=Wj if Rj=1 and Vj=∗ if Rj=0. Let v be a particular sample realisation of V, so that each vj is either a known quantity or ∗. These values imply a realisation for the random variable R, denoted r. For notational convenience, we write W=(Wobs,Wmis) and V=(Vobs,Vmis) to indicate that each may be partitioned into two vectors corresponding to all j such that rj=1 for observed data and rj=0 for missing data. It is important to note that these partitions are defined with respect to r, the observed pattern of missing data. Given a realisation r of R, and provided that A,Y are observed, covariate data are said to be missing at random (MAR) if Pr(R=r|Wobs,Wmis=u,A,Y) and Pr(R=r|Wobs,Wmis=u′,A,Y) are the same for all u,u^ʹ and at each possible value of the parameter vector φ that fully characterises the missing data mechanism (Rubin 1976). If in addition to MAR, the parameter φ is distinct, in the sense of Rubin (1976), from the vector θ that parameterises the distribution of the data that we would have based inference on had there been no missingness, then missing data is said to be ignorable and it is not necessary to consider the missing data or the missing data mechanism in making inferences about θ (Rubin 1976, 1987; Schafer 1997). Thus, if the missing data mechanism is ignorable, one may validly model the complete data to create imputations for the missing data (Rubin 1987; Van Buuren 2012).

The generalised propensity score

The generalised propensity score e∗(V) is defined as the conditional exposure probability given the extended covariate vector V (D’Agostino and Rubin 2000). That is,

e∗(V)=Pr(A=1|Wobs,R)=∑wPr(A=1|W,R)Pr(Wmis=w|Wobs,R).

Using the same argumentation to establish the balancing properties of the usual propensity score, it can be shown that the generalised propensity score has the same balancing properties with respect to V as the usual propensity score has with respect to W. Thus, the observed covariate data and missingness information and exposure A are conditionally independent given the generalised propensity score, and conditional exchangeability given the extended covariate(s) V implies conditional exchangeability given the generalised propensity score e∗(V).

To obtain an IPW estimator of the ATT, let

γ(v,a)=γ∗(v,a)E[γ∗(V,A)|A=a],γ∗(v,a)=I(a=1)+I(a=0)e∗(v)1−e(v),

for realisations v of V and a of A, Then, weighting by γ renders V independent of A; that is, for all v,

γ(v,0)Pr(V=v|A=0)=γ(v,1)Pr(V=v|A=1).

Also, conditional exchangeability given V implies conditional exchangeability following weighting by γ; that is, if (Y0,Y1)⊥⊥A|V, then

∑vγ(v,0)Pr(Y0=y0,Y1=y1,V=v|A=0)=∑vγ(v,1)Pr(Y0=y0,Y1=y1,V=v|A=1)

for all y0,y1. Importantly, the propensity score e(W) need not equal the generalised propensity score e∗(V). That is, given the observed covariate data, the unobserved covariate data need not provide the same information about exposure allocation as does the missing data pattern. In addition, neither covariate balance given the propensity score (W⊥⊥A|e(W)) nor balance of the observed data and missingness information given the generalised propensity score (V⊥⊥A|e∗(V)) generally implies covariate balance given the generalised propensity score (W⊥⊥A|e∗(V)).

More crucially perhaps, conditional exchangeability given the generalised propensity score is not guaranteed even if conditional exchangeability given the usual propensity score holds or the generalised propensity score balances both observed and unobserved covariate data (i.e., neither (Y0,Y1)⊥⊥A|e(W) nor W⊥⊥A|e∗(V) nor both imply that (Y0,Y1)⊥⊥A|e∗(V); see Appendix B for an example).

This suggests that it is not generally desirable to distribute across exposure groups both the observed data and the missingness information by adjusting for the generalised propensity score. However, there are situations conceivable in which it is appropriate to base inference on the generalised rather than the usual propensity score. Until now, we have assumed an ordering of the variables in which the outcome Y precedes R, the missingness pattern of W. Consequently, Y was defined as a function fY of W, A and exogenous variable εY and not of R. Consider now a setting where S = (W,R,A,Y) so that R forms a predecessor of A and Y (and, therefore, a potential common cause of A and Y). Then, if exchangeability can be attained by conditioning on e∗(V), conditional exchangeability given e(W) need not hold (see Appendix C for an example).

Thus, the choice between adjustment for the generalised versus the usual propensity score should ideally rest on the relative extent to which conditional exchangeability holds given the generalised versus the usual propensity score. In practice, it is not possible to estimate directly the true propensity score when covariate data are missing (Rosenbaum and Rubin 1983; D’Agostino and Rubin 2000; Cham and West 2016). However, under ignorability of missing data, one may ‘recover’ the unobserved data, e.g., via multiple imputation (Rubin 1987; Van Buuren 2012), prior to estimating propensity scores. Henceforth, we assume that exchangeability can be attained by conditioning on the complete covariate data or, therefore, the usual propensity score, if data were not missing. We also assume that missing data is ignorable.

Applying CART to incomplete data

Bootstrap aggregated CART

In this study, we used bagged CART as implemented in the R package ipred (Peters and Hothorn 2017, version 0.9-6). This implementation allows for missing data by first evaluating homogeneity at a given node among only those observations whose candidate splitting variable is observed. Once the splitting variable and split point have been decided, the algorithm uses a surrogate splits approach to classify records whose splitting variable is missing based on the other variables included in the tree fitting (Therneau and Atkinson 2017).

The bagged CART algorithm replaces missing confounder values without regard of the outcome or exposure status. As a result, any two subjects whose covariate data are identical, except possibly for the missing covariate, would be allocated to the same covariate region by any given tree. However, subjects within a given region need not be exchangeable. In fact, systematic differences in the outcome of the causal model (Y) between exposed and unexposed subjects may be in part attributable to the missing covariate (confounder). As such, even under completely at random missingness (MCAR), we would expect propensity score matching or IPW based on bagged CART to yield bias in the direction of confounding by the missing covariate.

Boosted CART

An implementation of boosted CART to estimate propensity scores is available in the R package twang (Ridgeway et al. 2017, version 1.5). This implementation allows for incomplete records to be incorporated in the tree fitting by regarding missingness as a special covariate level and assigning to a given (non-terminal) node three child nodes; one to which any individual is allocated whose splitting variable is missing, one for observed values that exceed some threshold, and one for the remainder. That is, rather than modelling the relationship between exposure and covariates, an attempt is made to model the association between exposure on the one hand and observed covariate data and missingness information on the other hand, and, therefore, to construct scores that balance the missingness across the matched or weighted exposure groups. In other words, the algorithm represents an estimator of the generalised propensity score.

While boosted CART may be successful at distributing missingness rates across exposure groups, it makes no attempt at distributing the unobserved values. If the partially observed covariate represents a confounder, systematic differences across exposure groups may persist after propensity score matching or IPW based on the generalised propensity score. As such, under MCAR, we would expect boosted CART to yield a propensity score matching or IPW estimator that is biased in the direction of confounding by the partially observed covariate. When missingness is MAR dependent on the outcome, boosted CART tends to render exposure groups more comparable in terms of the outcome and, therefore, attenuate the apparent exposure-outcome effect.

Bias when applying CART to incomplete data

In summary, above, we argued that using either boosted CART or bagged CART to estimate propensity scores may yield a biased estimator of the causal ATT, when applying the CART algorithm directly to the (partially) incomplete data. In bagged CART, missing confounder values are replaced, yet this procedure may not be appropriate, since exposure and outcome status are ignored in this process. Boosted CART, on the other hand, balances observed covariate values as well as missing indicator values. Since the latter may depend on the outcome (under the assumption of ignorability), boosted CART potentially balances outcome values too, yielding a biased estimator of the causal effect.

3 Monte Carlo simulations

We now describe a simulation study in which we evaluated the performance of CART-based propensity score matching and IPW in the presence of missing confounder data.

3.1 Methods

Simulation structure

We performed a series of Monte-Carlo simulation experiments based on the simulation structure described in Setoguchi et al. (2008) with modifications so as to allow for missing data. For n = 2000 subjects, we independently generated 10 covariates Wi (four confounders, three predictors of the exposure only, and three predictors of the outcome only), a binary exposure variable A, and a binary outcome Y (Figure 1). Missing data were introduced into one or two covariates. A number of CART-based approaches were used to estimate propensity scores, before and after the introduction of missing data, and in turn the log odds ratio for the exposure-outcome effect among the treated. For comparison, we also estimated propensity scores in imputed datasets using a correctly specified propensity score model, and using a logistic model with main effects only. The process was repeated 5000 times for each of eight simulation scenarios that varied primarily by missing data mechanism. All simulations were conducted with R-3.2.2 on a Windows 7 (64-bit) platform (R Core Team 2016).

Figure 1:

Complete data structure for simulation experiments. Dashed arcs without arrowheads connecting variables indicate non-zero entries for the corresponding variables in the covariance matrix of the joint distribution of all Wi, i = 1,2,...,10.

Data generation

Data were generated by sequentially going through the following steps. First, the covariates were generated by sampling from a multivariate normal distribution with zero means and unit variances; correlations were set to zero except for the correlations between W1 and W5, W2 and W6, W3 and W8, and W4 and W9, which were set to 0.2, 0.9, 0.2, and 0.9, respectively. Second, covariates W1, W3, W5, W6, W8, and W9 were dichotomised, setting any value to 1 if greater than 0 and to 0 otherwise.

Following Setoguchi et al. (2008), the binary exposure variable A was related to the covariate vector following the propensity score model Pr(A=1|W)=expit{0.8W1−0.25W2+0.6W3−0.4W4−0.8W5−0.5W6+0.7W7−0.25W22−0.4W42+0.7W72+0.4W1W3−0.175W2W4+0.3W3W5−0.28W4W6−0.4W5W7+0.4W1W6−0.175W2W3+0.3W3W4−0.2W4W5−0.4W5W6}. Realisations a for A were generated by drawing a pseudo-value from the uniform(0,1) distribution and setting a to 1 if this number was less than the true propensity score and to 0 otherwise. Consequently, A can be thought of as a exposure that is generated by a non-linear and non-additive propensity score model. This model assigns approximately half of subjects to the exposure group.

Outcomes were generated following the mechanism described by Setoguchi et al. (2008) with slight modifications to increase the outcome fraction (from approximately 2% to 20% or 40%). Specifically, the binary outcome, Y, was modelled as a Bernoulli random variable given A and W: an independent random number, εY, was drawn from the uniform distribution; Y was set to 1 if this number was less than the inverse logit (expit) of a linear transformation η(A,W)=−1+0.3W1−0.36W2−0.73W3−0.2W4+0.71W8−0.19W9+0.26W10+γA of A and W and to 0 otherwise. The true conditional log odds ratio for the exposure-outcome effect was set to 1 or −1 depending on the scenario. The outcome incidence was roughly 40% for scenarios with γ = 1; and 20% for scenarios with γ=−1. The counterfactual outcomes Y0 and Y1 for any subject with realisations w of W and u of εY are found by computing I(u < expit{η(0,w)}) and I(u < expit{η(1,w)}), I denoting the indicator function. With knowledge of the counterfactual outcomes, it can be inferred that with γ = 1, the marginal log odds ratio for the true exposure-outcome effect among the exposed (or treated; ATT) is approximately 0.906; with γ=−1, the marginal log odds ratio is approximately −0.926 (Hernán and Robins 2017). Note that these are different from the conditional causal odds ratios as a result of the non-collapsibility property of the odds ratio.

We considered ignorable missing data mechanisms for introducing missing data.

MCAR missingness. For all subjects, irrespective of complete data, values of W3 were set to missing with probability p, characterising the MCAR mechanism. The missingness probability of the other variables was set to zero.

MAR missingness. Let M3 be a missing indicator variable that takes the value of one if and only if the value of W3 is missing. Similarly, define M4 to be the missing indicator variable pertaining to W4. Given the full data, W3 and W4 were set to missing independently of one and other and with probability equal to Pr(M3=1|W,A,Y)=p and Pr(M4=1|W,A,Y)=expit{α0+α1W1+α2A+α3Y}. The missingness probability of the other variables was set to zero.

Scenarios

We evaluated the performance of various CART-based methods in eight scenarios (Table 1). The intercepts α0 in scenarios five through eight were chosen so as to yield roughly the same average proportion of missing data points per generated dataset of 24000 data points (2000 records on 10 covariates, one exposure and one outcome variable), namely 3%. The average proportion of missing data points and the fraction of incomplete records were largest in scenario 2 (5% and 60%, respectively; see Table 1). In all of the scenarios considered, data are ‘missing at random’ and it is assumed that there is conditional exchangeability given measured covariates (i.e., (Y0,Y1)⊥⊥A|W).

Table 1:

Description of scenarios. γ equals the conditional log odds ratio for the effect of A on Y given W. Given the full data, variables W3 and W4 were set to missing independently of one and other and with probabilities p and expit{α0+α1W1+α2A+α3Y}, respectively. Abbreviations: MCAR, missing completely at random; MAR, missing at random; PMP, average proportion of missing data points; PIR, average proportion of incomplete records.

Scenario	γ	MCAR/MAR	p	α0	α1	α2	α3	PMP	PIR
1	1	MCAR	0.3	–	–	–	–	0.03	0.30
2	1	MCAR	0.6	–	–	–	–	0.05	0.60
3	1	MAR	0.0	−0.7	0.0	0.0	1.5	0.04	0.48
4	−1	MAR	0.0	−1.0	0.0	0.0	1.5	0.03	0.35
5	1	MAR	0.1	−1.6	0.5	0.5	0.5	0.03	0.37
6	1	MAR	0.1	−2.1	0.5	0.5	1.5	0.03	0.37
7	1	MAR	0.1	−2.3	0.5	1.5	0.5	0.03	0.36
8	1	MAR	0.1	−2.2	1.5	0.5	0.5	0.03	0.37

Note that in scenarios 3 trough 8, conditioning on M may break the independence between A and unobserved outcome predictor εY through what is known as collider stratification (cf. Pearl, 2009). One might therefore expect that that discarding incomplete records in these scenarios would result in bias. In scenarios 3 and 4, however, covariate missingness M is conditionally independent of exposure status and covariate data given the outcome (i.e., M⊥⊥(A,W)|Y). As a result, in these scenarios, the conditional OR for the effect of A on Y given W is equal to the conditional OR given W among the complete cases (Westreich 2012). Bias of complete case estimators in scenarios 3 and 4 therefore cannot be attributed to collider stratification, despite the presence of an unobserved outcome predictor. Instead, it could result from the non-collapsibility of the odds ratio and changes in the covariate distribution brought about by narrowing the focus of inference to the complete cases (Hernán and Robins 2017).

Estimators

Bagged CART was based on 100 bootstrap replicates (Lee et al. 2010). We imposed complexity constraints on the tree fitting algorithm using the rpart package default control settings. For boosted CART, we used 20000 iterations, a shrinkage parameter of 0.0005, and an iteration stopping rule based on the mean Kolmogorov-Smirnov test statistic (Lee et al. 2010; McCaffrey et al. 2004).

The CART methods were combined with several common approaches to handling missing data: leaving missingness information as is (i.e., subjecting incomplete data directly to the CART algorithm); complete case analysis (CCA); and multiple imputation (MI). MI was implemented with the mice package (version 2.46.0) using the logreg and norm options to impute missing binary and continuous variables, respectively, and otherwise default settings (Van Buuren and Groothuis-Oudshoorn 2011). Imputation models included, apart from the variable to be imputed, all other variables, including the outcome, as untransformed main effects only. Propensity score analysis was performed within imputed datasets using the respective sets of estimated propensity scores (Penning de Vries and Groenwold 2016) and results were combined using Rubin’s (1987) rules.

In addition to using CART, as stated, we also estimated propensity scores in imputed datasets using a correctly specified propensity score model (LRc), and using a logistic model with main effects only (LRm).

Within each (multiply imputed) dataset, the ATT was estimated from a logistic model with robust variance estimation using the survey package (Lumley 2014, version 3.31). We used both propensity score matching and inverse probability weighting. Matching was performed using a greedy 1:1 nearest neighbour algorithm, matching exposed (A = 1) to unexposed individuals (A = 0) (Austin 2011a). For any given (imputed) dataset, matching was performed on the logit of the propensity score, using a calliper distance of 20% of the standard deviation of the logit propensity score estimates (Austin 2011b). With the ATT as the estimand, IPW weights were defined as 1 for exposed subjects and PS/(1−PS) for unexposed subjects (PS denoting the estimated propensity score). To avoid undefined weights (1/0) or logit propensity scores (logit(0)), we placed bounds on the estimated propensity scores, truncating all estimates less than 0.001 to 0.001 and setting estimates greater than 0.999 to 0.999. MI-based estimates were pooled using Rubin’s rules to yield for each original dataset a single effect estimate, standard error estimate, and 90% confidence interval (90%CI).

Performance metrics

We evaluated the performance of the various methods through several measures: bias, estimated by the mean deviation of the estimated from the true marginal exposure-outcome effect on the log scale; empirical standard error; mean estimated standard error; mean squared error (MSE); and 90%CI coverage, estimated by the percentage of the 5000 data sets in which the 90%CI included the true exposure-outcome effect. Based on 5000 simulation runs, the Monte Carlo standard error for the true coverage probability of 0.90 is √(0.90(1−0.90)/5000)≈0.0042, implying that the estimated coverage probability is expected to lie with 95% probability between 0.893 and 0.907. Empirical coverage rates outside this interval provide evidence against the true coverage probabilities being equivalent to the nominal level of 0.90. The primary interest, however, was to gauge the effect of missing data on the various effect estimators. Therefore, we also compared, for each scenario, the effect estimates before and after the introduction of missing data.

3.2 Results

In this section, we present (Table 2) and describe the results for IPW-based estimators only. Trends for estimators based on propensity score matching are similar and the results are presented in full in the Supplementary Material.

Bias

Before the introduction of missing data, baCART and bCART showed small to no bias (with absolute values ranging from 0.000 to 0.011 on the log odds ratio scale). MI+LRc performed generally well and to a similar extent as MI+baCART and MI+bCART. MI+LRm consistently underestimated the true effect when inference was based on IPW; this trend was weaker for inference based on propensity score matching (Supplementary Table 1). Among all CART-based missing data approaches considered, multiple imputation yielded the least biased estimators overall (with a maximum absolute value of bias of 0.026 versus 0.221 and 0.138 for CART-only and CCA estimators, respectively), whereas bCART deviated on average the most from the true effect after the introduction of missingness.

As expected, baCART and bCART were biased (with −0.029 and −0.039, respectively, for scenario 1 and −0.064 and −0.072 for scenario 2) under MCAR in the direction of confounding by W3, whereas CCA and MI produced exposure-outcome effect estimates that were on average very close to the true effect. In scenarios 3 and 4, where missingness was outcome-dependent, bCART was biased toward the null after the introduction of missingness (with bias estimates of −0.117 and 0.064 for scenario 3 and 4, respectively, where the causal log odds ratios were approximately 0.906 and −0.926); baCART was downwardly biased in both scenarios (with bias estimates of −0.088 and −0.112). Estimators based on CCA or MI with CART were considerably less biased. In scenarios 5 through 8, CCA estimators systematically underestimated the true effect, particularly when the effect of the exposure or the outcome on the missingness probability was large (scenarios 6 and 7, where bias estimates ranged from −0.116 to −0.138). In these scenarios (5 through 8), baCART produced estimates that were on average close to the true effect, except in scenario 6, where the effect was clearly underestimated (estimated bias −0.050). bCART resulted in estimates that deviated in the same direction and to a similar or greater extent from the true effect as compared with CCA estimators. Again, MI with CART resulted in estimates that were on average close to the true effect. Increasing the effect of covariate W1 on the missingness probability (scenario 8 versus 5) had no evident impact on the results of any of the estimators.

Other performance

As expected, discarding incomplete records (CCA) resulted in relatively large empirical standard errors. Interestingly, MI+LRc had the largest empirical standard error in most scenarios, probably as a consequence of the complexity of the fitted propensity score models. In comparing empirical and mean estimated standard errors, note that multiple imputation produced generally conservative estimates of the standard error. This is consistent with previous observations (Van Buuren 2012). Among the CART-based estimators, the MSE was largest for CCA in nearly all scenarios. MI estimators had consistently small MSE. Overall, the best performance in terms of MSE was attained by MI estimators, followed by baCART and bCART. Multiple imputation with CART resulted in empirical coverage rates close to or slightly higher than the nominal 90% and those of the other estimators.

Table 2:

Performance metrics of inverse probability weighting estimators in 5000 simulated datasets with and without missing data. Abbreviations: SE, standard error; MSE, mean squared error; 90%CI, 90% confidence interval; CART, classification and regression trees; baCART, bootstrap aggregated CART; bCART, boosted CART; CCA, complete case analysis; MI, multiple imputation; LRc, logistic regression with correctly specified model; LRm, logistic regression with main effects only.

	Missing		Scenario
Metric	data	Method	1	2	3	4	5	6	7	8
Bias	Without	baCART	0.009	0.011	0.009	0.011	0.007	0.007	0.008	0.010
		bCART	−0.001	0.002	−0.000	0.004	−0.001	−0.000	−0.001	0.001
	With	baCART	−0.029	−0.064	−0.088	−0.112	−0.011	−0.050	0.010	−0.004
		bCART	−0.037	−0.072	−0.117	0.064	−0.057	−0.221	−0.123	−0.053
		CCA+baCART	0.001	0.001	0.016	−0.023	−0.040	−0.138	−0.116	−0.032
		CCA+bCART	0.000	0.006	0.022	−0.010	−0.046	−0.136	−0.129	−0.035
		MI+baCART	0.001	0.002	0.025	0.018	0.020	0.016	0.022	0.026
		MI+bCART	−0.008	−0.011	−0.024	−0.025	−0.010	−0.023	−0.008	−0.007
		MI+LRc	0.002	−0.007	−0.028	−0.029	−0.007	−0.025	−0.004	−0.002
		MI+LRm	−0.099	−0.094	−0.072	−0.075	−0.087	−0.083	−0.089	−0.082

Empirical	Without	baCART	0.116	0.114	0.116	0.129	0.115	0.115	0.114	0.116
SE		bCART	0.134	0.133	0.136	0.147	0.135	0.133	0.133	0.135
	With	baCART	0.118	0.121	0.126	0.136	0.119	0.121	0.119	0.119
		bCART	0.132	0.128	0.125	0.137	0.131	0.127	0.149	0.131
		CCA+baCART	0.141	0.186	0.189	0.199	0.147	0.156	0.143	0.148
		CCA+bCART	0.158	0.202	0.211	0.221	0.165	0.172	0.154	0.163
		MI+baCART	0.116	0.116	0.115	0.129	0.114	0.115	0.113	0.116
		MI+bCART	0.132	0.130	0.129	0.140	0.128	0.126	0.126	0.128
		MI+LRc	0.216	0.205	0.202	0.218	0.210	0.211	0.203	0.214
		MI+LRm	0.125	0.123	0.125	0.138	0.122	0.121	0.124	0.123

Mean	Without	baCART	0.114	0.114	0.114	0.128	0.114	0.114	0.114	0.114
SE^		bCART	0.136	0.136	0.136	0.149	0.136	0.136	0.136	0.136
	With	baCART	0.115	0.119	0.118	0.129	0.117	0.117	0.119	0.117
		bCART	0.134	0.132	0.131	0.144	0.134	0.135	0.153	0.135
		CCA+baCART	0.139	0.189	0.190	0.199	0.148	0.156	0.145	0.148
		CCA+bCART	0.160	0.204	0.211	0.221	0.166	0.174	0.160	0.163
		MI+baCART	0.116	0.120	0.115	0.130	0.116	0.116	0.115	0.116
		MI+bCART	0.140	0.143	0.137	0.150	0.138	0.138	0.137	0.138
		MI+LRc	0.196	0.198	0.187	0.201	0.189	0.191	0.185	0.191
		MI+LRm	0.131	0.135	0.128	0.142	0.128	0.128	0.128	0.128

MSE	Without	baCART	0.014	0.013	0.013	0.017	0.013	0.013	0.013	0.014
		bCART	0.018	0.018	0.019	0.022	0.018	0.018	0.018	0.018
	With	baCART	0.015	0.019	0.024	0.031	0.014	0.017	0.014	0.014
		bCART	0.019	0.022	0.029	0.023	0.020	0.065	0.037	0.020
		CCA+baCART	0.020	0.034	0.036	0.040	0.023	0.043	0.034	0.023
		CCA+bCART	0.025	0.041	0.045	0.049	0.029	0.048	0.040	0.028
		MI+baCART	0.013	0.013	0.014	0.017	0.013	0.013	0.013	0.014
		MI+bCART	0.018	0.017	0.017	0.020	0.016	0.016	0.016	0.016
		MI+LRc	0.047	0.042	0.042	0.048	0.044	0.045	0.041	0.046
		MI+LRm	0.025	0.024	0.021	0.025	0.023	0.021	0.023	0.022

Empirical	Without	baCART	0.896	0.901	0.895	0.897	0.892	0.899	0.898	0.890
90%CI		bCART	0.907	0.909	0.903	0.904	0.904	0.909	0.909	0.907
coverage	With	baCART	0.881	0.847	0.784	0.753	0.890	0.854	0.898	0.893
		bCART	0.897	0.861	0.772	0.886	0.878	0.497	0.797	0.880
		CCA+baCART	0.893	0.905	0.894	0.901	0.888	0.760	0.792	0.884
		CCA+bCART	0.906	0.905	0.902	0.904	0.890	0.798	0.804	0.886
		MI+baCART	0.904	0.914	0.894	0.897	0.895	0.900	0.903	0.896
		MI+bCART	0.922	0.931	0.915	0.919	0.918	0.928	0.926	0.923
		MI+LRc	0.911	0.920	0.909	0.906	0.908	0.921	0.919	0.906
		MI+LRm	0.815	0.841	0.858	0.865	0.831	0.848	0.827	0.841

3.3 Additional simulation experiment

To investigate the estimator performances in a simpler setting, we repeated the simulation experiment of scenario 2 with the squared and interaction terms removed from the exposure allocation model of the data generating mechanism. The results, presented in Supplementary Table 2, indicate generally the same trends as previously noted. Of note, in the absence of missing data, inverse weighting based on CART showed noticeably more bias than in scenarios 1 through 8. This is probably related to CART’s inherent limited ability to model smooth functions. Multiply imputing missing data followed by CART yielded approximately the same extent of bias. However, this bias appears to be partially cancelled out by the bias introduced by CART’s automatic handling of missing data to the extent that CART alone performed better with than without missing data. Nonetheless, relative to the extent of bias of the respective CART algorithm before the introduction of missing data, multiple imputation with CART outperformed both CCA with CART and CART applied directly to incomplete data in terms of bias.

4 Case study

In this section, we illustrate the application of the CART-based estimators to an empirical dataset, constructed to assess the association between annual influenza vaccination and mortality risk among elderly Groenwold et al. (2009). The dataset comprises 44418 complete records on vaccination status, mortality during the influenza epidemic period and potential confounders (age, sex, health status and prior health care and medication use). Among the 32388 vaccinated individuals 266 died, whereas 113 out of 12030 nonvaccinated individuals died (crude odds ratio 0.87, 90%CI 0.73–1.05). To control for measured confounders, propensity scores were estimated via bCART and a pseudopopulation was constructed using IPW such as to preserve the covariate distribution of the vaccination group. This resulted in an odds ratio of 0.60 (90%CI 0.49–0.73) for the marginal effect of vaccination on mortality risk among the vaccinated. Substituting bCART with baCART yielded an odds ratio of 0.65 (90% 0.53–0.81). As expected, introducing MCAR missingness into a confounder by setting a random 50% of subjects’ number of prior general practitioner (GP) visits to missing, resulted in odds ratio estimates that were closer to the crude effect. Setting the number of GP visits to missing with probability 0.5 for all subjects who died and zero otherwise, resulted in estimates substantially closer to the null for bCART and away from the null for baCART. Thus, as in our simulations, outcome-dependent MAR missingness resulted in apparent attenuation of the exposure-outcome effect as estimated by bCART. Table 3 shows the results also for the complete case and multiple imputation equivalents of baCART and bCART as well as for IPW based on propensity score estimation using main effects logistic regression and with weights truncated to the interval from the 0th to the 97.5th percentile. To better handle potential violations of standard imputation model assumptions, we used a nonparametric multiple imputation strategy (option cart rather than norm in the mice package) to estimate the effect of vaccination on mortality risk (Doove et al. 2014). Interestingly, in the MAR setting, bCART and baCART yielded the two most extreme estimates for the effect of vaccination on mortality risk among the elderly.

Table 3:

Estimated effects of vaccination on mortality risk among the elderly in dataset with no missing data, MCAR missingness or outcome-dependent MAR missingness. Estimates are adjusted for age, sex, health status and prior health care and medication use. Abbreviations: MCAR, missing completely at random; MAR, missing at random; OR, odds ratio; 90%CI, 90% confidence interval; CART, classification and regression trees, baCART, bootstrap aggregated CART; bCART, boosted CART; CCA, complete case analysis; MI, multiple imputation; LRm, main effects logistic regression. ^†In case of (MCAR or MAR) missingness, MI was implemented before LRm.

Missingness	None	MCAR	MAR

Method	OR (90%CI)	OR (90%CI)	OR (90%CI)
baCART	0.65 (0.53–0.81)	0.69 (0.56–0.85)	0.53 (0.44–0.66)
bCART	0.60 (0.49–0.73)	0.63 (0.51–0.77)	0.79 (0.63–0.98)
CCA+baCART	–	0.55 (0.41–0.73)	0.62 (0.46–0.84)
CCA+bCART	–	0.50 (0.37–0.66)	0.56 (0.42–0.75)
MI+baCART	–	0.60 (0.47–0.75)	0.70 (0.55–0.89)
MI+bCART	–	0.58 (0.47–0.72)	0.63 (0.51–0.78)
LRm^†	0.59 (0.49–0.71)	0.62 (0.51–0.76)	0.70 (0.57–0.86)

5 Discussion

In this paper we examined the workings of CART based propensity score estimators in scenarios with missing covariate data. Although the CART has been described as a promising approach to automatically handle missing covariate data when developing a propensity score (Setoguchi et al. 2008; Lee et al. 2010), there has been little discussion on the performance of these methods. Through analysis and simulations we showed that the application of CART for propensity score estimation can yield serious bias in estimates of exposure-outcome relations. We showed that this problem not only pertains to the situation of MAR but critically also to the situations with MCAR, which are often considered harmless when bias is concerned resulting only in larger variance of the estimator of exposure-outcome relations.

An attractive property of CART-based methods relative to standard logistic regression procedures, is perhaps not having to discard incomplete records. Indeed, in our simulations, discarding incomplete records resulted in the largest empirical standard errors. Alternatively, multiple imputation may be used to replace missing values under MCAR or MAR prior to propensity score estimation. This approach was shown to work well in our simulations. One criticism of multiple imputation in its parametric form is that it makes possibly erroneous distributional assumptions. In particular, the standard multiple imputation algorithms do not properly capture nonlinear relations like interaction effects (Cham and West 2016). Multiple imputation algorithms that use nonparametric methods have been developed. For example, Doove et al. (2014), following Burgette and Reiter (2010), proposed CART to be incorporated as imputation method in the multiple imputation by chained equations framework. As with parametric multiple imputation, the algorithm is designed to account for the inherent variability in the data. However, while the approach of Doove et al. (2014) seems promising, there is still room for improvement. Particularly, the algorithm does not explicitly account for uncertainty about the (implicit) CART trees’ model parameters. To address this, Shah et al. (2014) proposed a promising algorithm in which random forest CART is embedded in the multiple imputation by chained equations framework and imputation models are fitted to bootstrap samples. An implementation is available via the R package CALIBERrfimpute (Shah 2014).

In interpreting our findings, it is important to note that we considered only a small number of scenarios. We assumed throughout that data were MCAR or MAR and that there was no unmeasured confounding (conditional exchangeability given measured confounders). As noted, there are situations conceivable in which it is not problematic to estimate the generalised propensity score. If the missingness information conveys information about a strong unmeasured confounder, estimating the generalised propensity score may allow for partial control of unmeasured confounding. On the other hand, adjusting for missingness information, e.g., through the generalised propensity score (estimated by some CART algorithms), may be problematic particularly when it is a strong proxy for the outcome, an intermediate, or a common effect of the exposure and outcome.

Our arguments for caution when using CART to estimate propensity scores in the presence of missing data are in line with the recommendation to incorporate information on the outcome in imputing missing covariate data (Penning de Vries and Groenwold 2016; Leyrat et al. 2017; Moons et al. 2006). Since propensity score estimation is typically done without any information on the outcome (Rubin et al. 2008), any missing data imputation (e.g., with a surrogate) that is inherent to the propensity score estimation procedure will likely fail. An important feature of the propensity score matching or weighting methodology is that, in the absence of missing data, it need not make distributional assumptions about the outcome in relation to the exposure and covariates in constructing a matched or weighted dataset. In the presence of missing covariate data, omitting information on the outcome in imputing missing covariate data, however, imposes a structure on the data that likely contrasts with the true data distribution and the analysis model. This is similar to the idea of models being “uncongenial” in the sense of Meng (1994). The current study also relates to the literature on the missing indicator method, given its resemblance with the approach to handling missing data taken by the boosted CART algorithm. Like the automatic handing of missing data by the boosted CART algorithm, the missing indicator method typically results in bias (Groenwold et al. 2012).

It has been suggested to perform balance diagnostics on the matched or weighted study sample at hand (Austin 2011a). If systematic differences persist between exposure groups following matching or weighting, this may be an indication that the propensity score estimation algorithm requires modification (Austin and Stuart 2015). In the context of CART, one may assign greater weight to subjects at a certain covariate level in evaluating exposure homogeneity at any given node. We did not adopt an iterative approach to propensity score estimation and balance diagnostics in our simulation studies for several reasons. First, doing so would increase the computational burden of the simulations. Second, whereas CART facilitates the estimation of propensity scores that balance the entire covariate joint distribution across exposure groups, standard balance diagnostics procedures typically ignore the complex relationship between exposure and covariates. For example, when using the standardised mean difference, it is typically assumed that all variables that need to be balanced with respect to the mean are identified and included in the set over which a summary (e.g., weighted mean or maximum) standardised mean difference is calculated. The utility of the metric may be poor if important variables (e.g., higher order moments) are omitted. Other balance metrics, such as the Kolomogorov-Smirnov metric, Lévy distance, and overlapping coefficient (Belitser et al. Belitser et al. 2011; Franklin et al. 2014; Ali et al. 2015) often fail to reflect the extent of imbalance with respect to the entire covariate joint distribution. In addition, what constitutes good balance ultimately depends on the outcome model too. Substantial imbalance may be acceptable for covariates that are weakly predictive of the outcome, while small departures from perfect balance may be problematic for covariates that are strongly predictive of the outcome.

We emphasise that our simulations were not designed to compare CART versus logistic regression as means to estimate propensity scores. Main effects logistic regression here and in previous studies demonstrated a robust performance against model misspecification in terms of bias when inference was based on propensity score matching (Setoguchi et al. 2008). This is likely attributable to the set-up of the simulations. The outcome model included homogeneous exposure-outcome effects and main effects only. Since between-exposure-group imbalances with respect to interaction terms or higher order moments of covariates need not accompany systematic differences in outcomes, it is not surprising that propensity score matching based on main effects logistic regression may perform roughly the same in terms of bias as propensity score matching based on logistic regression with correct model specification. Further studies comparing CART versus main effects logistic regression may well demonstrate more clearly the advantageous properties of CART in settings with both complex propensity score and complex outcome models.

In summary, we compared various approaches to handling missing data in estimating propensity scores via CART. While the use of machine learning in estimating propensity scores seems promising for handling complex full data structures, it unlikely represents a suitable substitute for well-established methods, such as multiple imputation, to deal with missing data.

A Appendix

For realisations w of W and a of A, let

φ(w,a)=φ∗(w,a)E[φ∗(W,A)|A=a],φ∗(w,a)=I(a=1)+I(a=0)e(w)1−e(w).

We show in this subsection that weighting by φ yields independence between covariate(s) W and A; that is, for all w,

φ(w,0)Pr(W=w|A=0)=φ(w,1)Pr(W=w|A=1).

Also, conditional exchangeability given W implies exchangeability following weighting by φ; that is, if (Y0,Y1)⊥⊥A|W=w for all w, then

∑wφ(w,0)Pr(Y0=y0,Y1=y1,W=w|A=0)=∑wφ(w,1)Pr(Y0=y0,Y1=y1,W=w|A=1)

for all y0,y1.

We begin by considering E[φ∗(W,A)|A=a]. It is easy to see that E[φ∗(W,A)|A=1]=1. For a = 0, we have

Since φ(w,1) = 1, to prove the first statement it suffices to show that φ(w,0)Pr(W=w|A=0)=Pr(W=w|A=1) for all w. Now,

for all w, as desired.

To complete this proof, observe that

B Appendix

In this subsection, we give an example of a simple setting where (Y0,Y1)⊥⊥A|e(W) and W⊥⊥A|e∗(V) hold, yet (Y0,Y1)⊥⊥A|e∗(V). Let W, A and Y be binary mutually independent random variables and suppose that covariate missingness is MAR dependent on Y. Specifically, let Pr(W = 1) = 0.5 and Pr(A=1|W=w)=Pr(A=1)=0.5 for all w. Further, define Y=I(εY<(1+A)/10), where εY∼U(0,1) such that εY⊥⊥(A,W). Thus, there is conditional exchangeability given W, so that Pr(Y=1|A=a,W=w)=Pr(Ya=1|A=a,W=w)=(1+a)/10 for all a,w. Rosenbaum and Rubin (1983, Theorems 1 and 3) and Appendix A establish conditional exchangeability given e(W) and exchangeability following inverse probability weighting with weights defined on the basis of e(W). Now, let Pr(R=0|W,A,Y,εY)=0.1+0.5Y. It is easily verified that W⊥⊥A|e∗(V). However, e∗(V)=4/7 if and only if V = * or, equivalently, R = 0. Since R⊥⊥εY|(A,Y) and R⊥⊥A|Y, for any u∈(0,1), we therefore have

Pr(εY≤u|A=a,e∗(V)=4/7)=Pr(εY≤u|A=a,R=0)=∑yPr(εY≤u|A=a,Y=y)Pr(Y=y|A=a,R=0)=∑y{Pr(εY≤u|A=a,Y=y)×Pr(R=0|Y=y)Pr(Y=y|A=a)∑y′Pr(R=0|Y=y′)Pr(Y=y′|A=a)}=∑yPr(εY≤u|A=a,Y=y)(1+5y)[(1+a)y+(9−a)(1−y)]15+5a,

where

Pr(εY≤u|A=a,Y=y)=Pr(Y=y|A=a,εY≤u)Pr(εY≤u)Pr(Y=y|A=a)=q(y,u,a)u(1+a)y/10+(9−a)(1−y)/10,

with q(y,u,a)=1−y+(−1)1−ymin{(1+a)/10,u}/u. In particular, Pr(εY≤0.5|A=a,e∗(V)=4/7) equals 2/3 if a = 0 and 3/4 if a = 1. Hence, εY⊥⊥A|e∗(V) and, given the definitions of Y, Y0 and Y1, we have (Y0,Y1)⊥⊥A|e∗(V).

C Appendix

This subsection details an example where (Y0,Y1)⊥⊥A|e∗(V), yet (Y0,Y1)⊥⊥A|e(W). Suppose that W and that A and Y are all binary random variables. Further, let (A,R) be marginally independent of W, let A conditionally depend on R given W, and let Y conditionally depend on A and R given W. Specifically, let Pr(W = 1) = 0.5, Pr(R=0|W)=0.1, Pr(A=1|R,W)=2(1+R)/10, and Y=I(εY<2(1+2R)/20), where εY⊥⊥(W,R,A). To see that (Y0,Y1)⊥⊥A|e∗(V), first note that

for w = 0,1, and that e∗(v) equals Pr(A=1|R=0)=0.20 if v = * and Pr(A=1|R=1)=0.40 otherwise. Now,

for a = 0,1. Also, Pr(Y0=1|A,e∗(V)=0.40)=0.10. Thus, Y0⊥⊥A|e∗(V). Similarly, it can be shown that (Y0,Y1)⊥⊥A|e∗(V). Next, observe that

which is not invariant to changes in a = 0,1. Hence, (Y0,Y1)⊥⊥A|e(W) .

References

Albert, A., and Anderson J. (1984). On the existence of maximum likelihood estimates in logistic regression models. Biometrika, 71:1–10.10.1093/biomet/71.1.1Search in Google Scholar

Ali, M., Groenwold, R., Belitser, S., Pestman, W., Hoes, A., Roes, K., de Boer, A., and Klungel, O. (2015). Reporting of covariate selection and balance assessment in propensity score analysis is suboptimal: a systematic review. Journal of Clinical Epidemiology, 68:122–131.10.1016/j.jclinepi.2014.08.011Search in Google Scholar PubMed

Austin, P. (2011a). An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research, 46:399–424.10.1080/00273171.2011.568786Search in Google Scholar PubMed PubMed Central

Austin, P. (2011b). Optimal caliper widths for propensity-score matching when estimating differences in means and differences in proportions in observational studies. Pharmaceutical Statistics, 10:150–161.10.1002/pst.433Search in Google Scholar PubMed PubMed Central

Austin, P., and Stuart, E. (2015). Moving towards best practice when using inverse probability of treatment weighting (IPTW) using the propensity score to estimate causal treatment effects in observational studies. Statistics in Medicine, 34:3661–3679.10.1002/sim.6607Search in Google Scholar PubMed PubMed Central

Belitser, S., Martens, E., Pestman, W., Groenwold, R., Boer, A., and Klungel, O. (2011). Measuring balance and model selection in propensity score methods. Pharmacoepidemiology and Drug Safety, 20:1115–1129.10.1002/pds.2188Search in Google Scholar PubMed

Breiman, L. (1996). Bagging predictors. Machine Learning, 24:123–140.10.1007/BF00058655Search in Google Scholar

Breiman, L. (2001). Random forests. Machine Learning, 45:5–32.10.1023/A:1010933404324Search in Google Scholar

Burgette, L., and Reiter, J. (2010). Multiple imputation for missing data via sequential regression trees. American journal of epidemiology, 172:1070–1076.10.1093/aje/kwq260Search in Google Scholar PubMed

Cham, H., and West, S. (2016). Propensity score analysis with missing data. Psychological Methods, 21:427–445.10.1037/met0000076Search in Google Scholar PubMed

Cole, S., and Frangakis, C. (2009). The consistency statement in causal inference: a definition or an assumption? Epidemiology, 20:3–5.10.1097/EDE.0b013e31818ef366Search in Google Scholar PubMed

D’Agostino Jr., R., and Rubin, D. (2000). Estimating and using propensity scores with partially missing data. Journal of the American Statistical Association, 95:749–759.10.1017/CBO9780511810725.029Search in Google Scholar

Doove, L., van Buuren, S., and Dusseldorp, E. (2014). Recursive partitioning for missing data imputation in the presence of interaction effects. Computational Statistics & Data Analysis, 72:92–104.10.1016/j.csda.2013.10.025Search in Google Scholar

Drake, C. (1993). Effects of misspecification of the propensity score on estimators of treatment effect. Biometrics, 49:1231–1236.10.2307/2532266Search in Google Scholar

Elith, J., Leathwick, J., and Hastie, T. (2008). A working guide to boosted regression trees. Journal of Animal Ecology, 77:802–813.10.1111/j.1365-2656.2008.01390.xSearch in Google Scholar PubMed

Franklin, J., Rassen, J., Ackermann, D., Bartels, D., and Schneeweiss, S. (2014). Metrics for covariate balance in cohort studies of causal effects. Statistics in Medicine, 33:1685–1699.10.1002/sim.6058Search in Google Scholar PubMed

Groenwold, R., Nelson, D., Nichol, K., Hoes, A., and Hak, E. (2009). Sensitivity analyses to estimate the potential impact of unmeasured confounding in causal research. International Journal of Epidemiology, 39:107–117.10.1093/ije/dyp332Search in Google Scholar PubMed

Groenwold, R. H., White, I. R, Donders, A. R. T., Carpenter, J. R, Altman, D. G, and Moons K. G. (2012). Missing covariate data in clinical research: when and when not to use the missing-indicator method for analysis. Canadian Medical Association Journal, 184:1265–1269.10.1503/cmaj.110977Search in Google Scholar PubMed PubMed Central

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd Edition. New York: Springer.10.1007/978-0-387-84858-7Search in Google Scholar

Hernán, M., and Robins, J. (2017). Fine point 4.3: Collapsibility of the odds ratio. In: Causal Inference, M. Hernán, and J. Robins (Eds.). Boca Raton: Chapman & Hall/CRC. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/, forthcoming.Search in Google Scholar

Holland, P. (1986). Statistics in causal inference. Journal of the American Statistical Association, 81:945–960.10.1080/01621459.1986.10478354Search in Google Scholar

Holland, P. (1988). Causal inference, path analysis, and recursive structural equations models. Sociological Methodology, 18:449–484.10.2307/271055Search in Google Scholar

Lee, B., Lessler, J., and Stuart, E. (2010). Improving propensity score weighting using machine learning. Statistics in Medicine, 29:337–346.10.1002/sim.3782Search in Google Scholar PubMed PubMed Central

Lesko, C., Buchanan, A., Westreich, D., Edwards, J., Hudgens, M., and Cole, S. (2017). Generalizing study results: a potential outcomes perspective. Epidemiology, 28:553–561.10.1097/EDE.0000000000000664Search in Google Scholar PubMed PubMed Central

Leyrat, C., Seaman, S. R, White, I. R, Douglas, I., Smeeth, L., Kim, J., Resche-Rigon, M., Carpenter, J. R, and Williamson, E. J. (2017). Propensity score analysis with partially observed covariates: How should multiple imputation be used? Statistical methods in medical research, 0962280217713032.10.1177/0962280217713032Search in Google Scholar PubMed PubMed Central

Lumley, T. (2014). survey: Analysis of complex survey samples (R package, version 3.31). Comprehensive R Archive Network, Vienna, Austria. http://cran.r-project.org/web/packages/survey/index.html.Search in Google Scholar

McCaffrey, D., Ridgeway, G., and Morral, A. (2004). Propensity score estimation with boosted regression for evaluating adolescent substance abuse treatment. Psychological Methods, 9:403–425.10.1037/1082-989X.9.4.403Search in Google Scholar PubMed

Meng, X.-L. (1994). Multiple-imputation inferences with uncongenial sources of input. Statistical Science, 538–558.10.1214/ss/1177010269Search in Google Scholar

Moisen, G. (2008). Classification and regression trees. In: Encyclopedia of Ecology, volume 1, S. Jorgensen, and B. Fath (Eds.). Oxford: Elsevier.Search in Google Scholar

Moons, K. G., Donders, R. A, Stijnen, T., and Harrell Jr F. E. (2006). Using the outcome for imputation of missing predictor values was preferred. Journal of clinical epidemiology, 59:1092–1101.10.1016/j.jclinepi.2006.01.009Search in Google Scholar PubMed

Neyman, J., Iwaszkiewicz, K., and St. Kolodziejczyk (1935). Statistical problems in agricultural experimentation. Supplement to the Journal of the Royal Statistical Society, 2:107–180.Search in Google Scholar

Pearl, J. (2009), Causality: Models, Reasoning and Inference. New York: Cambridge University Press.10.1017/CBO9780511803161Search in Google Scholar

Penning de Vries, B., and Groenwold, R. (2016). Comments on propensity score matching following multiple imputation. Statistical Methods in Medical Research, 25:3066–3068.Search in Google Scholar

Peters, A., and Hothorn, T. (2017). ipred: Improved Predictors (R package, version 0.9-6), Comprehensive R Archive Network, Vienna, Austria. http://cran.r-project.org/web/packages/ipred/index.html.Search in Google Scholar

R Core Team. (2016). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.Search in Google Scholar

Rai, D., Lee, B., Dalman, C., Newschaffer, C., Lewis, G., and Magnusson, C. (2017). Antidepressants during pregnancy and autism in offspring: population based cohort study. BMJ, 385:j2811.10.1136/bmj.j2811Search in Google Scholar PubMed PubMed Central

Ridgeway, G. (1999). The state of boosting. Computing Science and Statistics, 31:172–181.Search in Google Scholar

Ridgeway, G., McCaffrey, D., Morral, A., Griffin, B., and Burgette, L. (2017). twang: Toolkit for Weighting and Analysis of Nonequivalent Groups (R package, version 1.5). Comprehensive R Archive Network, Vienna, Austria. http://cran.r project.org/web/packages/twang/index.html.Search in Google Scholar

Rosenbaum, P., and Rubin, D. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70:41–55.10.21236/ADA114514Search in Google Scholar

Rubin, D. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66:688–701.10.1037/h0037350Search in Google Scholar

Rubin, D. (1976). Inference and missing data. Biometrika, 63:581–592.10.1093/biomet/63.3.581Search in Google Scholar

Rubin, D. (1987). Multiple imputation for nonresponse in surveys. New York: Wiley.10.1002/9780470316696Search in Google Scholar

Rubin, D. B., et al. (2008). For objective causal inference, design trumps analysis. The Annals of Applied Statistics, 2:808–840.10.1214/08-AOAS187Search in Google Scholar

Schafer, J. (1997). Analysis of incomplete multivariate data. Boca Raton: CRC Press.10.1201/9781439821862Search in Google Scholar

Setoguchi, S., Schneeweiss, S., MA, M. B, Glynn, R., and Cook, E. (2008). Evaluating uses of data mining techniques in propensity score estimation: a simulation study. Pharmacoepidemiology and Drug Safety, 17:546–555.Search in Google Scholar

Shah, A. (2014). CALIBERrfimpute: Imputation in MICE using Random Forest (R package, version 0.1-2). Comprehensive R Archive Network, Vienna, Austria. http://cran.r-project.org/web/packages/CALIBERrfimpute/index.html.Search in Google Scholar

Shah, A., Bartlett, J., Carpenter, J., Nicholas, O., and Hemingway, H. (2014). Comparison of random forest and parametric imputation models for imputing missing data using mice: a caliber study. American Journal of Epidemiology, 179:764–774.10.1093/aje/kwt312Search in Google Scholar PubMed PubMed Central

Stürmer, T., Joshi, M., Glynn, R., Avorn, J., Rothman, K., and Schneeweiss, S. (2006). A review of the application of propensity score methods yielded increasing use, advantages in specific settings, but not substantially different estimates compared with conventional multivariable methods. Journal of Clinical Epidemiology, 59:437–e1.10.1016/j.jclinepi.2005.07.004Search in Google Scholar PubMed PubMed Central

Tchetgen, E. T., and VanderWeele, T. (2012). On causal inference in the presence of interference. Statistical Methods in Medical Research, 21:55–75.10.1177/0962280210386779Search in Google Scholar PubMed PubMed Central

Therneau, T., and Atkinson, E. (2017). An introduction to recursive partitioning using the RPART routines. Rochester: Mayo Foundation.Search in Google Scholar

Van Buuren, S. (2012). Flexible imputation of missing data. Boca Raton: CRC Press.10.1201/b11826Search in Google Scholar

Van Buuren, S., and Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45:1–67.10.18637/jss.v045.i03Search in Google Scholar

Westreich, D. (2012). Berkson’s bias, selection bias, and missing data. Epidemiology, 23:159–164.10.1097/EDE.0b013e31823b6296Search in Google Scholar PubMed PubMed Central

Westreich, D., Lessler, J., and Jonsson Funk M. (2010). Propensity score estimation: neural networks, support vector machines, decision trees (cart), and meta-classifiers as alternatives to logistic regression. Journal of clinical epidemiology, 63:826–833.10.1016/j.jclinepi.2009.11.020Search in Google Scholar PubMed PubMed Central

Wyss, R., Ellis, A., Brookhart, M., Girman, C., Jonsson Funk, M., LoCasale, R., and Stürmer, T. (2014). The role of prediction modeling in propensity score estimation: an evaluation of logistic regression, bcart, and the covariate-balancing propensity score. American Journal of Epidemiology, 180:645–655.10.1093/aje/kwu181Search in Google Scholar PubMed PubMed Central

Supplementary Material

The online version of this article offers supplementary material (DOI:https://doi.org/10.1515/em-2017-0020).

Received: 2017-12-08

Revised: 2018-07-25

Accepted: 2018-07-25

Published Online: 2018-09-18

Supplementary Material Details

Articles in the same Issue

https://doi.org/10.1515/em-2017-0020

Keywords for this article

causal inference; propensity score; missing data; CART; multiple imputation