
Marginal Structural Models with Counterfactual Effect Modifiers

Published/Copyright: June 8, 2018

Abstract

In health and social sciences, research questions often involve systematic assessment of the modification of treatment causal effect by patient characteristics. In longitudinal settings, time-varying or post-intervention effect modifiers are also of interest. In this work, we investigate the robust and efficient estimation of the Counterfactual-History-Adjusted Marginal Structural Model (van der Laan MJ, Petersen M. Statistical learning of origin-specific statically optimal individualized treatment rules. Int J Biostat. 2007;3), which models the conditional intervention-specific mean outcome given a counterfactual modifier history in an ideal experiment. We establish the semiparametric efficiency theory for these models, and present a substitution-based, semiparametric efficient and doubly robust estimator using the targeted maximum likelihood estimation methodology (TMLE, e.g. van der Laan MJ, Rubin DB. Targeted maximum likelihood learning. Int J Biostat. 2006;2, van der Laan MJ, Rose S. Targeted learning: causal inference for observational and experimental data, 1st ed. Springer Series in Statistics. Springer, 2011). To facilitate implementation in applications where the effect modifier is high dimensional, our third contribution is a projected influence function (and the corresponding projected TMLE estimator), which retains most of the robustness of its efficient peer and can be easily implemented in applications where the use of the efficient influence function becomes taxing. We compare the projected TMLE estimator with an Inverse Probability of Treatment Weighted estimator (e.g. Robins JM. Marginal structural models. In: Proceedings of the American Statistical Association. Section on Bayesian Statistical Science, 1-10. 1997a, Hernan MA, Brumback B, Robins JM. Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. Epidemiology. 2000;11:561–570), and a non-targeted G-computation estimator (Robins JM. 
A new approach to causal inference in mortality studies with sustained exposure periods - application to control of the healthy worker survivor effect. Math Modell. 1986;7:1393–1512). The comparative performance of these estimators is assessed in a simulation study. The use of the projected TMLE estimator is illustrated in a secondary data analysis of the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) trial, where effect modifiers are subject to missingness at random.

1 Introduction

In social and health sciences, research questions often involve systematic comparison of the effectiveness of different longitudinal exposures or treatment strategies on an outcome of interest. Specifically, consider a study where individuals are followed over time. In addition to their baseline covariates, we record their time-varying treatments, time-varying covariates, and the outcomes of interest. Time-varying confounding is ubiquitous in these settings: the treatment of interest depends on past covariates and in turn affects future covariates. Right censoring is often present, in response to past covariates and treatments. It has been widely recognized that in these cases, conventional analytic methods, such as multiple regressions, often fail to properly account for the time-varying confounding of the treatment effect (e.g. [7]). Marginal Structural Models (MSMs), introduced by [4], are well-established and widely used tools to address the problem of time-varying confounding. These models estimate the marginal expectation of an intervention-specific counterfactual outcome, i.e. the mean outcome of a subject in an ideal experiment where he/she was assigned to follow a given intervention.

To assess effect modification, MSMs are traditionally used to model the conditional counterfactual mean outcome given an observed history. Yet, in many settings one may wish to model the conditional counterfactual mean outcome given a counterfactual history. Consider the simple case of effect modification by baseline covariates. A traditional observed baseline MSM conditions on the observed baseline modifiers [7] and allows one to assess how the treatment effect changes as a function of the observed modifier values. (Observed) History-Adjusted MSMs (HA-MSMs), introduced in [8, 9], generalize the observed baseline MSMs by modeling the counterfactual mean outcome given the observed history of treatment and modifiers of interest up to a given time point.

However, since the modifiers of interest may be affected by their preceding treatment assignments, which may in turn depend on past modifiers and other covariates, the counterfactual mean within each stratum of this history will also be affected by the observed treatment mechanism. In this case, the parameters of the HA-MSM would not be generalizable to an equivalent population with a different or counterfactual treatment mechanism [10]. We instead consider a model for the conditional mean outcome given modifier history in an ideal experiment where subjects were assigned a given intervention of interest, and hence a counterfactual history of the time-varying modifiers of interest.

To model these conditional counterfactual mean outcomes given a counterfactual history, we can use the so-called Counterfactual History Adjusted MSMs (CHA-MSM), introduced by [1]. Inverse Probability of Treatment Weighted (IPTW) estimators for CHA-MSM were proposed in [1]. These estimators are very intuitive, can be easily implemented using standard software, and offer influence function based standard error estimates. However, their consistency relies on consistent estimation of all the treatment weights. Doubly Robust IPTW (DR-IPTW) estimators for CHA-MSM were also described in [1]; they were based on estimating equations derived by orthogonalizing the IPTW estimating function with respect to the treatment mechanism. These DR-IPTW estimators are consistent if the treatment weights or the conditional covariate and outcome densities are consistently estimated. Moreover, they are solutions to the estimating equation defined by the efficient influence function (EIF), and thus are asymptotically semi-parametric efficient.

Despite these advances, there are still many gaps in the efficiency theory and robust estimation of CHA-MSMs. Firstly, even though the EIF is a key actor in semi-parametric estimation, an explicit representation of it as an orthogonal decomposition over the nuisance parameters corresponding to the time-varying covariates is still lacking. Compared to the IPTW-orthogonalized representation in [1], such an explicit representation would provide a comprehensive picture of the efficiency theory for CHA-MSM. In particular, by shining a light on the role of the nuisance parameters in the EIF, such a representation can inform the study of semi-parametric estimation for these models, advise on the trade-offs in estimating different nuisance parameters, and provide insights on the challenges of, and solutions to, handling high-dimensional covariates. Secondly, estimating equation based estimators, e.g. the IPTW and the DR-IPTW, may be unstable in the presence of near positivity violations [11], resulting in biased point and standard error estimates. In applications with dynamic treatment regimes, this instability is especially difficult to circumvent due to the limitations of effective weight stabilization. In contrast, a substitution-based estimator for these models can provide a way to maximize finite sample performance by preserving global information about the parameters.

This paper aims to fill these gaps in the literature by establishing the efficiency theory for CHA-MSM and providing substitution-based, semi-parametric efficient and robust estimators. Firstly, we describe the identification of the conditional counterfactual intervention-specific mean outcome given a counterfactual history, and the identification of the corresponding MSM parameters of interest. Secondly, we determine the EIF for these statistical parameters as an orthogonal decomposition of the nuisance parameters. This EIF is used to construct a substitution-based, semi-parametric efficient and doubly robust estimator using the targeted maximum likelihood estimation methodology (TMLE, e.g. [2, 3]). However, as we shall see, due to the form of the EIF, the computations in this estimator may prove arduous in applications where the effect modifiers are high dimensional. To address this problem, our third contribution is a projected influence function (and the corresponding TMLE estimator), which retains most of the robustness of its efficient peer and can be used in applications where the use of the EIF becomes taxing.

1.1 Illustrative example

Throughout this paper, we will illustrate our presentation using an example from mental health research. The STAR*D (Sequenced Treatment Alternatives to Relieve Depression) trial is a multi-level, longitudinal pragmatic trial of treatment strategies for major depression. After an initial screening, patients are enrolled into level 1 of the study, where everyone is assigned citalopram. At the start of each subsequent level, if a patient still remains in the study, then he/she is randomized to an option within one of the chosen treatment strategies. At level 2, the patient and his/her provider can choose between augmenting the citalopram with multiple options or switching to a new regimen with multiple options. We are interested in the comparative effect of augmenting vs switching medication. Because these two strategies are not randomized, this analysis is analogous to an observational study.

Suppose we wish to assess the effect modification of the level 2 treatment strategy (augmenting vs switching) by the depression symptoms measured prior to entering level 2. For our purposes, covariates measured either at enrollment or at level 1 exit are both considered baseline modifiers. These symptom measures are obtained at clinical visits and in the level 1 exit survey. It is reasonable to believe that more depressed patients and those less satisfied with their level 1 treatment may be less inclined to follow up with clinical visits or to respond to surveys. In this case, a simple complete-case analysis of effect modification may only provide results applicable to patients with less severe symptoms or greater satisfaction with their level 1 treatment. If we assume that this missingness, while associated with the outcome, can be predicted using the covariate history collected up to level 1 exit, then we can regain some of this generalizability. Indeed, we cast these symptom measures as counterfactual variables under an intervention on their missingness mechanism. This way, the target parameter is formulated in terms of an ideal experiment where the report of the symptom measures was always ensured (i.e., intervened to be non-missing).

1.2 Organization of the article

This paper is organized as follows. In Section 2, we formulate the causal inference problem and establish the identifiability of the desired causal parameters from the observed data distribution. In Section 3, we present the EIF for the parameter of interest under a saturated semiparametric model, as well as a projected influence function. The robustness conditions for the efficient and projected influence functions are also established. We then present the construction of two TMLE estimators, one using the EIF (the full TMLE) and one using the projected influence function (the projected TMLE). For comparison, we also present an IPTW estimator and a non-targeted substitution estimator. In Section 4, a simulation study demonstrates the robustness properties of the projected TMLE. In Section 5, we use the STAR*D trial to illustrate the application of the projected TMLE. A final summary concludes the article.

2 Data structure and parameters of interest

For simplicity of exposition, we consider the case where one wishes to assess the conditional intervention-specific mean outcome given the counterfactual history up to the first time point.

2.1 Data structure

Specifically, consider a longitudinal data structure

$$O = (W, A_0, V, L_0, A_1, L_1, \ldots, A_K, L_K) \sim P_0,$$

where $W$ encodes baseline covariates, $A_t$ encodes the exposures of interest and censoring indicators at time $t$, $V$ encodes the time-varying history (between treatments $A_0$ and $A_1$) on which one wishes to condition, and $L_t$ encodes covariates (including time-varying confounders and the outcome process $Y_t$) measured between $A_t$ and $A_{t+1}$. The variable $Y_K$ is the final outcome of interest. For the sake of discussion, assume that $Y_K$ is either a binary or a bounded continuous variable (without loss of generality, we may assume it is bounded in $(0,1)$). We note that in general, $A_t$ encodes intervention variables of interest, i.e. variables whose distribution one would set in an ideal experiment. These include exposure variables (which one would randomize in an ideal experiment) as well as censoring variables (which one would prevent in an ideal experiment).

For later convenience, we introduce some useful notation. For a time-varying variable $X_s$, let $\bar X_s \equiv (X_1, \ldots, X_s)$ and $\bar X_{s,t} \equiv (X_s, \ldots, X_t)$ for $s < t$. We will also use the shorthand $\bar X = \bar X_K$. The equivalent notations in lower case apply to realizations of the corresponding variables. Finally, variables with degenerate indices, such as $\bar L_t$ at $t = -1$, are empty.

The time-ordering assumptions can be captured by a nonparametric structural equations model (NPSEM, [12]):

(1)
$$\begin{aligned}
W &= f_W(U_W); \quad A_0 = f_{A_0}(W, U_{A_0}); \quad V = f_V(W, A_0, U_V); \quad L_0 = f_{L_0}(W, A_0, V, U_{L_0}); \\
A_t &= f_{A_t}(W, A_0, V, \bar A_{t-1}, \bar L_{t-1}, U_{A_t}); \quad L_t = f_{L_t}(W, A_0, V, \bar A_t, \bar L_{t-1}, U_{L_t}).
\end{aligned}$$

This framework assumes that each variable X in the observed data structure is an unknown deterministic function of all its preceding observed variables and some unmeasured exogenous random factors U. We will refer to the observed variables in the input of fX as the parents of X and denote their set as Pa(X).

Concretely, we describe the data example introduced in Section 1.1 with these notations. Our goal is to assess the effect modification of level 2 treatment strategy (augmenting vs switching) by the depression symptoms measured prior to entering level 2; yet many of these baseline effect modifiers were missing at random. By study protocol, all subjects will have either entered remission (treatment success), moved onto the next level (treatment failure), or dropped out (right censoring), by the end of 23 weeks of the level 2 period. For our study goal, the variables that we would enforce/randomize in an ideal experiment are the missingness status of the baseline effect modifier, the treatment assignment at the start of level 2, and the censoring status in each of the 23 weeks.

To this end, let $V$ be the baseline effect modifier of interest. Let $A_0$ be the indicator of non-missingness of $V$, and $A_1$ be the treatment strategy received by a patient at level 2 ($A_1 = 0$ for augmenting medication and $A_1 = 1$ for switching medication). We use $W$ to encode the baseline covariates that may affect the measurement status $A_0$, and $L_0$ to encode other baseline covariates (besides the modifier of interest) measured prior to treatment assignment $A_1$. Hence, under this indexing, $K = 24$ and the first week of level 2 starts at $t = 2$. For $t \geq 2$, $A_t$ is a counting process which drops to 0 if the patient was censored by time $t$, and $L_t$ are time-varying covariates such as visit statistics (duration in level thus far, visit frequency, etc.), side-effect burden and symptom measures at time $t$. The variable $L_t$ also contains two counting processes: the outcome process $Y_t$, which is a binary indicator of entering remission by time $t$, and an exit process $E_t$ that jumps to 1 if a patient is moved to the next level, in which case the remission status will be zero for this level and the patient is considered non-censored (since the outcome was observed to be unsuccessful). Our final outcome of interest is $Y_K$, the remission status by the end of 23 weeks.

2.2 Parameter of interest

To assess the effect of the interventions, we study the intervention-specific counterfactual mean outcomes as a function of the interventions. These counterfactual mean outcomes can be obtained from an ideal experiment where one intervenes to assign $\bar A = \bar a$, and measures the resulting covariates and the outcome of interest. In our example, an ideal experiment would set $A_0 = 1$ (always measure the baseline modifier), $A_1 = a_1$ (switching or augmenting), and $\bar A_{2,K} = 1$ (always prevent dropout). This data-generating process can be formalized using eq. (1) by setting $A_t = a_t$ in the equations for $A_t$ and in the parents of the non-intervention variables $V$, $L_t$ and $Y$. We denote the resulting, possibly counterfactual, covariates as $V(a_0)$, $L_t(\bar a_t)$ and $Y_K(\bar a)$.

Now, we wish to assess effect modification of such treatment effects by changing values of $V$. That is, we ask “how does the differential effect of $(a_0, \bar a_{1,K})$ vs $(a_0, \bar a'_{1,K})$, and of $(a'_0, \bar a_{1,K})$ vs $(a'_0, \bar a'_{1,K})$, differ as a function of $V$, which is affected by the treatment assignment $A_0$?”. To answer this question, our parameters of interest are the so-called Counterfactual-History-Adjusted mean outcomes [1]

(2)
$$\rho(\bar a, v) \equiv E\big[ Y(\bar a) \mid V(a_0) = v \big],$$

and our goal is to study how these mean outcomes change as a function of $a_0$, $\bar a_{1,K}$ and $v$. As discussed in Section 1, eq. (2) differs from the traditional History-Adjusted mean outcomes, $E\big[ Y(A_0, \bar a_{1,K}) \mid V = v, A_0 = a_0 \big]$, in that these traditional parameters condition on the observed treatment and modifier values. Hence within each such stratum, the conditional mean outcome may still depend on the observed treatment mechanism. As such, these parameters are affected by potential selection bias.

In our example, the conditional mean outcomes we wish to study are $E\big( Y(1, \bar a_{1,K}) \mid V(1) = v \big)$, with $\bar a_{1,K} = (a_1, \ldots, a_K)$ such that $a_1 \in \{0, 1\}$ and $a_t = 1$ for $t \geq 2$. In words, this is the intervention-specific counterfactual mean outcome in an ideal experiment without missing measurements or censoring.

A challenge to the study of eq. (2) is that the curse of dimensionality arises in applications with more than two time points and/or with categorical or continuous $V$. To address this issue, it is useful to summarize eq. (2) using a working MSM, $m_\psi(\bar a, v)$, which is parametrized by a finite-dimensional $\psi \in \mathcal S \subseteq \mathbb R^d$. Since the outcome $Y$ is binary (or bounded within the unit interval), concretely we consider a logistic MSM

(3)
$$m_\psi(\bar a, v) = \operatorname{expit}\big( \psi^\top \phi(\bar a, v) \big),$$

where $\phi(\bar a, v)$ is the vector of linear predictors in the generalized linear model. The methods presented here are easily modified to other forms of MSM.
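As a toy illustration of evaluating the working model eq. (3), consider the following minimal sketch. The linear predictor $\phi(\bar a, v) = (1, a_1, v, a_1 v)$ and the coefficient values are hypothetical choices for illustration, not the paper's specification:

```python
import numpy as np

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-x))

def phi(a1, v):
    # hypothetical linear predictor: intercept, treatment, modifier, interaction
    return np.array([1.0, a1, v, a1 * v])

def m(psi, a1, v):
    # working MSM: m_psi(a, v) = expit(psi' phi(a, v))
    return expit(psi @ phi(a1, v))

psi = np.array([-0.5, 0.3, 0.2, -0.1])
print(m(psi, a1=1, v=0.0))   # mean outcome under switching at v = 0
print(m(psi, a1=0, v=0.0))   # mean outcome under augmenting at v = 0
```

Varying `v` in such a sketch traces out how the modeled treatment effect changes with the counterfactual modifier value.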

Given a CHA-MSM eq. (3) for the conditional mean outcomes eq. (2), the true MSM parameter $\psi$ can be interpreted as the best summary measure of $\rho(\bar a, v)$ as a function of $\bar a$ and $v$. Formally, let $\mathcal A$ denote the set of interventions of interest and $\mathcal V$ denote the set of modifier values of interest. For a given kernel weight function $h(\bar a, v)$, the true MSM parameter $\psi$ in eq. (3) is defined as

(4)
$$\psi = \arg\min_{\psi \in \mathcal S} \; - \sum_{(\bar a, v) \in \mathcal A \times \mathcal V} P(V(a_0) = v)\, h(\bar a, v) \Big\{ \rho(\bar a, v) \log m_\psi(\bar a, v) + \big( 1 - \rho(\bar a, v) \big) \log\big( 1 - m_\psi(\bar a, v) \big) \Big\}.$$

In words, $\psi$ yields the best weighted approximation of the counterfactual conditional dose-response curve $\rho(\bar a, v)$, according to the quasi-loglikelihood loss, kernel weights and working model $m_\psi(\bar a, v)$.

The rest of this paper is devoted to the identification and inference of eq. (4) from the observed data.

2.3 Statistical estimand

To identify eq. (4) from the data-generating distribution $P_0$, we make the Positivity Assumption (PA) and the Sequential Randomization Assumption (SRA, derived by [13]). Specifically, under the PA, there exists $\alpha_t > 0$ such that $\alpha_t \leq P_0(A_t = a_t \mid Pa(A_t))$, for all $t$ and $\bar a \in \mathcal A$, almost everywhere. The SRA assumes that $A_t \perp \big( V(a_0), \{ L_t(\bar a) : t \} \big) \mid Pa(A_t)$. Under these conditions, the joint distribution of $\big( V(a_0), \{ L_t(\bar a) : t \} \big)$ is identifiable from the observed data distribution $P_0$. In our STAR*D example, the plausibility of the SRA can be fortified by measuring sufficiently many confounders of the modifier’s missingness, the treatment selection, and the censoring mechanism.

By calculations shown in the Appendix, the SRA allows us to identify P(V(a0)=v) as

(5)
$$\gamma(a_0, v) \equiv E_W\big[ P(V = v \mid A_0 = a_0, W) \big],$$

and the counterfactual mean outcome ρ(a,v) as

(6)
$$\rho(\bar a, v) \equiv E_W\left[ \frac{P(V = v \mid A_0 = a_0, W)}{\gamma(a_0, v)} \times Q_0^{\bar a}(V = v, W) \right],$$

where, for $t = 0, \ldots, K$,

(7)
$$Q_t^{\bar a}(\bar L_{t-1}, V, W) \equiv \sum_{\bar l_{t,K}} y_K \prod_{j=t}^{K} P\big( l_j \mid \bar A_j = \bar a_j, \bar L_{t-1}, \bar l_{t,j-1}, V, A_0 = a_0, W \big),$$

where $\bar l_{t,j-1} = (l_t, \ldots, l_{j-1})$ for $t < j - 1$. Equation (7) represents a conditional expectation of the final outcome $Y_K$ given observed histories $(\bar L_{t-1}, V, W)$ and under an intervention $\bar a$ on $\bar A$.

For simplicity, we have suppressed the notation for the functionals $\gamma, \rho, Q_t^{\bar a}$, as each should be evaluated with respect to a given distribution. We will use $\gamma(P), \rho(P), Q_t^{\bar a}(P)$ to express these functionals evaluated at a specific distribution $P$ when needed, but will keep the shorthand notation otherwise.
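To make the identification formula eq. (5) concrete: a plug-in estimate of $\gamma(a_0, v)$ averages a fitted conditional probability over the empirical distribution of $W$. A minimal sketch, where `qV_hat` is a hypothetical stand-in for the user's fitted model of $P(V = v \mid A_0 = a_0, W)$:

```python
import numpy as np

def gamma_hat(qV_hat, W, a0, v):
    # plug-in for gamma(a0, v) = E_W[ P(V = v | A0 = a0, W) ]:
    # average the fitted conditional probability over the empirical W
    return np.mean([qV_hat(v, a0, w) for w in W])

# toy stand-in for a fitted model of P(V = v | A0 = a0, W = w), V binary
def qV_hat(v, a0, w):
    p1 = 1.0 / (1.0 + np.exp(-w))   # fitted probability that V = 1
    return p1 if v == 1 else 1.0 - p1

W = np.array([-1.0, 0.0, 1.0])
print(gamma_hat(qV_hat, W, a0=1, v=1))  # -> 0.5 for this symmetric toy W
```

The same empirical-average device is what the substitution estimators of Section 3.2 use for the marginal distribution of $W$.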

In comparison to eq. (6), parameters in conventional settings where the modifier V is not counterfactual can be written as

$$E_W\left[ \frac{1/P(A_0 = a_0 \mid W)}{E_W\big[ 1/P(A_0 = a_0 \mid W) \mid V = v, A_0 = a_0 \big]} \; Q_1^{\bar a}(L_0, V = v, W) \;\middle|\; V = v, A_0 = a_0 \right].$$

The weight $\frac{1/P(A_0 = a_0 \mid W)}{E\big[ 1/P(A_0 = a_0 \mid W) \mid V = v, A_0 = a_0 \big]}$ adjusts for potential selection bias introduced by the treatment assignment $A_0 = a_0$. Indeed, if $A_0$ does not depend on $W$ (in our STAR*D example, this means the missingness is completely at random), then this weight equals 1, in which case $\rho(\bar a, v)$ is equivalent to the estimand in an analysis stratified by $A_0$.

Combining eqs. (5) and (6), the causal MSM parameter ψ in eq. (4) identifies to

(8)
$$\psi_0 \equiv \Psi(P_0) \equiv \arg\min_{\psi \in \mathcal S} \; - \sum_{(\bar a, v) \in \mathcal A \times \mathcal V} \gamma(a_0, v)\, h(\bar a, v) \Big\{ \rho(\bar a, v) \log m_\psi(\bar a, v) + \big( 1 - \rho(\bar a, v) \big) \log\big( 1 - m_\psi(\bar a, v) \big) \Big\},$$

where the functionals γ,ρ are evaluated at P0. In the forthcoming sections, we study the statistical inference of Ψ(P0).

2.4 Notations

Before we proceed, let us introduce some more useful definitions and notations. Let $\mathcal M$ be a saturated semiparametric model containing our data-generating distribution $P_0$. The parameter of interest in eq. (8) is the map $P \mapsto \Psi(P)$, from $\mathcal M$ to $\mathbb R^d$, evaluated at $P_0$.

Suppose we observe $n$ i.i.d. copies of $O \sim P_0$. Let $P_n$ denote the empirical distribution of this sample. For a function $f$ of $O$, we will write $P_n f \equiv \frac{1}{n} \sum_{i=1}^n f(O_i)$, and for a distribution $P$, we will write $P f \equiv E_P f(O)$.

Bang and Robins [14] noted the following recursive property when dealing with longitudinal intervention-specific mean outcomes:

(9)
$$Q_t^{\bar a}(\bar L_{t-1}, V, W) = E_P\big[ Q_{t+1}^{\bar a}(\bar L_t, V, W) \,\big|\, \bar A_t = \bar a_t, \bar L_{t-1}, V, A_0 = a_0, W \big],$$

for $t = 0, \ldots, K$. This will prove useful in our upcoming endeavor. We also adopt the notations $Q_W(P)$ for the marginal distribution of $W$, $Q_V(P)$ for the conditional distribution $P(V \mid W, A_0)$, and $Q(P) \equiv \big( Q_W(P), Q_V(P), Q_t^{\bar a}(P) : t = 0, \ldots, K \big)$. We write $g$ for the treatment allocation probabilities $P(A_t \mid \bar A_{t-1}, \bar L_{t-1}, V, W)$. When referring to a generic $P \in \mathcal M$, we may sometimes write $Q$ and $g$ in place of $Q(P)$ and $g(P)$, and similarly for their respective components. When referring to the functionals at the data-generating distribution $P_0$, we may sometimes write $Q_0$ and $g_0$ in place of $Q(P_0)$ and $g(P_0)$. When there is no confusion, we will simply write $Q$ and $g$ under the implied distribution.

3 Statistical inference: A tale of two influence functions

As defined in eq. (8), $\Psi(P)$ optimizes a function of the functionals $\gamma$ and $\rho$ evaluated at $P$. Note also that $\rho(\bar a, v) = \eta(\bar a, v) / \gamma(a_0, v)$, where

$$\eta(\bar a, v) \equiv E_W\big[ P(V = v \mid A_0 = a_0, W) \times Q_0^{\bar a}(V = v, W) \big].$$

We will make use of the following useful characterizations for Ψ(P):

Remark 1

For $m_\psi(\bar a, v) = \operatorname{expit}\big( \psi^\top \phi(\bar a, v) \big)$, $\Psi(P)$ defined in eq. (8), and the functionals $\eta$ and $\gamma$ evaluated at $P$, we have

$$\begin{aligned}
0 = U(\Psi(P)) &\equiv \sum_{(\bar a, v) \in \mathcal A \times \mathcal V} h(\bar a, v)\, \phi(\bar a, v)\, \gamma(a_0, v) \left( \frac{\eta(\bar a, v)}{\gamma(a_0, v)} - m_\Psi(\bar a, v) \right) \\
&= E_W \sum_{(\bar a, v) \in \mathcal A \times \mathcal V} \tilde h(\bar a, v)\, P(V = v \mid A_0 = a_0, W) \big( Q_0^{\bar a}(V = v, W) - m_\Psi(\bar a, v) \big) \\
&= E_W \sum_{\bar a \in \mathcal A} E_V\Big[ \tilde h(\bar a, V) \big( Q_0^{\bar a}(V, W) - m_\Psi(\bar a, V) \big) \,\Big|\, A_0 = a_0, W \Big] \\
&= E \sum_{\bar a \in \mathcal A} \frac{I(A_0 = a_0)}{g(A_0 = a_0 \mid W)}\, \tilde h(\bar a, V) \big( Q_0^{\bar a}(V, W) - m_\Psi(\bar a, V) \big),
\end{aligned}$$

where $\tilde h(\bar a, v) \equiv h(\bar a, v)\, \phi(\bar a, v)$, and the last expectation is taken over $(W, A_0, V)$.

The derivations are straightforward, and we leave them in the Appendix for reference.

Suppose that the following normalizing matrix is invertible at $\psi = \Psi(P)$:

(10)
$$M(\psi) = -\frac{\partial}{\partial \psi} \sum_{(\bar a, v) \in \mathcal A \times \mathcal V} \gamma(a_0, v)\, h(\bar a, v)\, \phi(\bar a, v) \big( \rho(\bar a, v) - m_\psi(\bar a, v) \big) = E \sum_{\bar a \in \mathcal A} \frac{I(A_0 = a_0)}{g(a_0 \mid W)}\, h(\bar a, V)\, \phi(\bar a, V)\, \phi(\bar a, V)^\top m_\psi(\bar a, V) \big( 1 - m_\psi(\bar a, V) \big).$$

3.1 Inverse probability weighted estimator $\psi^{IPW}$

Define

$$C_t^{\bar a}(g) \equiv \frac{I(\bar A_t = \bar a_t)}{g(A_0 = a_0 \mid W)\, \prod_{j=1}^{t} g\big( A_j = a_j \mid \bar A_{j-1} = \bar a_{j-1}, \bar L_{j-1}, V, W \big)},$$

for $t = 0, \ldots, K$. We note that these are functionals of $g(P)$, but we have suppressed the notation here.

From Remark 1, a valid estimating function for Ψ is given by

(11)
$$D^{IPW}(g, \psi) \equiv \sum_{\bar a \in \mathcal A} C_K^{\bar a}(g)\, \tilde h(\bar a, V) \big( Y_K - m_\psi(\bar a, V) \big).$$

Up to a normalizing matrix $M(\psi)^{-1}$, $D^{IPW}(g, \psi)$ is a gradient for $\Psi$ under a model $\mathcal M_g$ in which $g$ is known. Note that $P_0 D^{IPW}(g(P_0), \psi) = U(\psi)$, and the corresponding equation set to zero is solved by $\psi_0 = \Psi(P_0)$. Therefore, this is an unbiased estimating function for $\psi_0$, and $\psi^{IPW}$ is an unbiased estimator if $g = g(P_0)$.

To implement the IPW estimator, we first obtain an estimator $\hat g$ of the treatment mechanism $g$. For an observation in the dataset whose $\bar A$ takes a value in $\mathcal A$, the variable $C_K^{\bar A}(\hat g)$ has numerator equal to 1 and denominator given by the product of the probabilities $\hat g\big( A_j \mid \bar A_{j-1}, \bar L_{j-1}, V, W \big)$; otherwise $C_K^{\bar A}(\hat g) = 0$. The estimator $\psi^{IPW}$ can be obtained by fitting a weighted logistic regression of $Y_K$ on $\phi(\bar A, V)$, with weights $C_K^{\bar A}(\hat g)\, \tilde h(\bar A, V)$. This $\psi^{IPW}$ satisfies $P_n D^{IPW}(\hat g, \psi^{IPW}) = 0$, and it is consistent if $\hat g$ consistently estimates $g$. Under standard regularity and empirical process conditions, $\psi^{IPW}$ is asymptotically linear with influence function $M(\psi_0)^{-1} D^{IPW}(g, \psi_0)$. The asymptotic covariance of $\sqrt n (\psi^{IPW} - \psi_0)$ can be estimated by the sample covariance matrix $\Sigma^{IPW}$ of $M(\psi^{IPW})^{-1} D^{IPW}(\hat g, \psi^{IPW})$.
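To illustrate the weight construction, here is a minimal sketch of computing $C_K^{\bar A}(\hat g)$ under the simplifying assumption that $\hat g$ has already been evaluated as per-time probabilities of the regime assignments given each subject's observed past (all array names are hypothetical):

```python
import numpy as np

def cumulative_weights(A, ghat_a, a):
    """
    Sketch of C_K^a(ghat): A is (n, K+1) observed treatments A_0..A_K,
    ghat_a is (n, K+1) estimated probabilities ghat(A_j = a_j | past),
    a is the regime (K+1,). Followers of the regime get the
    inverse-product weight; everyone else gets weight 0.
    """
    follows = np.all(A == a, axis=1)           # indicator I(A-bar_K = a-bar_K)
    return follows / np.prod(ghat_a, axis=1)   # numerator 1 over the product

A = np.array([[1, 1, 1],
              [1, 0, 1]])
ghat_a = np.array([[0.8, 0.5, 0.9],
                   [0.8, 0.5, 0.9]])
a = np.array([1, 1, 1])
print(cumulative_weights(A, ghat_a, a))  # second subject deviates, weight 0
```

These weights, multiplied by $\tilde h(\bar A, V)$, are then supplied to any standard weighted logistic regression routine of $Y_K$ on $\phi(\bar A, V)$.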

Due to its inverse weighting by products of treatment and censoring probabilities, this estimator may be sensitive to near positivity violations, wherein observations with small estimated treatment propensities are given excessive weight. This can be particularly pronounced as the number of time points increases. Substitution estimators like the ones we propose in Sections 3.2 and 3.3 can mitigate this problem somewhat by incorporating global information in the parameter map. However, the effect of near positivity violations can still take the form of poor estimation of the nuisance parameters.

3.2 Non-targeted substitution estimators $\psi^{subs}_V$ and $\psi^{subs}_{A_0}$

From eq. (8) and Remark 1, we can formulate $\Psi$ as $\Psi(Q_0^{\bar a}, Q_V)$ or $\Psi(Q_0^{\bar a}, g)$. These two formulations lead to two non-targeted substitution estimators, $\psi^{subs}_V$ and $\psi^{subs}_{A_0}$. Note the latter option opens the door for G-computation estimators in applications with high-dimensional $V$.

We start by obtaining an estimator of $Q_0^{\bar a}(V, W)$ for a given $\bar a \in \mathcal A$. We will exploit the recursive property of $Q_t^{\bar a}$ noted in eq. (9), so that we can avoid estimation of densities, and will instead employ regression-based procedures to estimate the conditional expectations.

  1. Initiate at $t = K$: We first regress $Y_K$ on $(\bar L_{K-1}, \bar A, V, W)$ among observations uncensored by time $K$, to obtain an estimator of $E\big[ Y_K \mid \bar L_{K-1}, \bar A, V, W \big]$. This can be implemented using data-adaptive (e.g. SuperLearner by [15]) or parametric procedures. We then evaluate this fitted estimator at the observed $(\bar L_{K-1}, V, W)$ with $\bar A$ set to $\bar a$, to obtain an estimator $\hat Q_K^{\bar a}(\bar L_{K-1}, V, W)$ for each observation uncensored at time $K - 1$.

  2. At each $t = K-1, \ldots, 0$, in decreasing order, we would have obtained an estimator $\hat Q_{t+1}^{\bar a}(\bar L_t, V, W)$ for observations uncensored by time $t + 1$. We now regress this variable on $(\bar L_{t-1}, \bar A_t, V, W)$ among observations uncensored by time $t$, and then evaluate the resulting fit at the observed $(\bar L_{t-1}, V, W)$ with $\bar A_t$ set to $\bar a_t$ among observations uncensored at time $t - 1$.

  3. After running step 2 sequentially from $t = K-1$ down to $t = 0$, we obtain an estimator $\hat Q_0^{\bar a}(V, W)$ for each of the $n$ observations. We can now use $\hat Q_0^{\bar a}(V, W)$ to estimate $\Psi$ through the characterizing equations in Remark 1.
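The three steps above can be sketched as follows. This is a minimal illustration on simulated stand-in data, with all subjects uncensored and following the regime (so the treatment covariates drop out of the design), and with ordinary least squares standing in for the user's chosen regression procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 500, 2

# simulated stand-in data: everyone uncensored and following the regime
W = rng.normal(size=n)
V = rng.normal(size=n)
L = rng.normal(size=(n, K))                  # time-varying covariates L_0..L_{K-1}
Y = (rng.uniform(size=n) < 0.5).astype(float)  # final outcome Y_K

def regress(X, y):
    # OLS fit with intercept as a stand-in for the regression step;
    # returns fitted values at the observed covariates
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return Xd @ beta

# Step 1: regress Y_K on (L-bar_{K-1}, V, W)
Qhat = regress(np.column_stack([L, V, W]), Y)

# Step 2: recurse t = K-1, ..., 0, regressing the previous fit on the
# shorter covariate history (L-bar_{t-1}, V, W)
for t in range(K - 1, -1, -1):
    Qhat = regress(np.column_stack([L[:, :t], V, W]), Qhat)

# Step 3: Qhat now estimates Q_0^a(V, W) for each observation
print(Qhat.mean())
```

Because each OLS fit with an intercept preserves the sample mean of its outcome, the mean of the final `Qhat` matches the mean of `Y` in this uncensored toy setting, mirroring the iterated-expectation structure of eq. (9).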

Consider the first representation $\Psi(Q_0^{\bar a}, Q_V)$. Let $\hat Q_V$ be an estimator of $Q_V(v \mid A_0 = a_0, W)$. For each $\bar a \in \mathcal A$ and each $v \in \mathcal V$, we create a row of data consisting of $\hat Q_0^{\bar a}(v, W)$, $\phi(\bar a, v)$, $h(\bar a, v)$ and $\hat Q_V$ for each observation in the dataset. We then pool together these datasets characterized by $(\bar a, v)$. The estimator $\psi^{subs}_V$ can be obtained by a weighted logistic regression of $\hat Q_0^{\bar a}(v, W)$ onto $\phi(\bar a, v)$ on the pooled dataset, with weights $h(\bar a, v)\, \hat Q_V$. This $\psi^{subs}_V$ is consistent if both $\hat Q_0^{\bar a}$ and $\hat Q_V$ are consistent. Here we use the empirical distribution to estimate the marginal distribution of $W$.

Consider now the alternative representation $\Psi(Q_0^{\bar a}, g_0)$, from the last equality in Remark 1. Let $\hat g$ be an estimator of $g$. For each $\bar a \in \mathcal A$, we create a row of data consisting of $\hat Q_0^{\bar a}(V, W)$, $\phi(\bar a, V)$, $h(\bar a, V)$ and $I(A_0 = a_0)/\hat g(a_0 \mid W)$ for each observation in the dataset. We then pool together these datasets characterized by $\bar a$. The estimator $\psi^{subs}_{A_0}$ can be obtained by a weighted logistic regression of $\hat Q_0^{\bar a}(V, W)$ onto $\phi(\bar a, V)$ on the pooled dataset, with weights $h(\bar a, V)\, I(A_0 = a_0)/\hat g(a_0 \mid W)$. This $\psi^{subs}_{A_0}$ is consistent if both $\hat Q_0^{\bar a}$ and $\hat g$ are consistent. Here we use the empirical distribution to estimate the marginal distribution of $(W, A_0, V)$.
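Both substitution estimators reduce to a weighted logistic (quasi-likelihood) regression on a pooled dataset, where the outcome may be fractional (a fitted $\hat Q_0^{\bar a}$ in $(0,1)$). A minimal IRLS sketch of such a fit, offered only as an illustration of the fitting step, not as the paper's implementation:

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def weighted_logistic(X, y, w, iters=25):
    """
    IRLS for a weighted logistic regression; y may be fractional in (0, 1)
    (quasi-likelihood), as when the outcome is a fitted Q-hat.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = expit(X @ beta)
        s = w * p * (1.0 - p) + 1e-12                       # IRLS working weights
        z = X @ beta + (y - p) / (p * (1.0 - p) + 1e-12)    # working response
        beta = np.linalg.solve(X.T @ (s[:, None] * X), X.T @ (s * z))
    return beta

# sanity check: recover known coefficients from noiseless fractional outcomes
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
beta_true = np.array([0.4, -0.7])
y = expit(X @ beta_true)          # fractional outcomes on the expit scale
w = np.ones(200)
print(weighted_logistic(X, y, w))
```

In practice, the design matrix would stack $\phi(\bar a, v)$ across the pooled $(\bar a, v)$ rows and `w` would carry the $h \hat Q_V$ or $h\, I(A_0 = a_0)/\hat g$ weights described above.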

While these substitution estimators use weighted regressions to obtain estimators of $\psi_0$, they are less susceptible to extremely large weights than the IPW estimator described above. Indeed, $\hat Q_V$ is bounded within $(0,1)$, while the inverse $1/\hat g$ involves only a single conditional probability rather than a product over time points. If the $Q$ components are correctly specified parametric models, then the substitution estimators are also asymptotically linear and converge to a normal random variable at rate $n^{-1/2}$. However, the parametric modeling assumptions are difficult to justify in scientific investigations. Data-adaptive alternatives may yield estimators that are not $n^{1/2}$-consistent, which prohibits statistical inference.

3.3 Targeted maximum likelihood estimators

In a non-targeted substitution estimator, we can estimate the nuisance parameters of the data-generating distribution by stratification (nonparametric MLE), by fitting a parametric statistical model, or by using a machine learning algorithm. These estimates are then used to evaluate the parameter of interest. As the number of potential confounders increases, these methods may break down due to the curse of dimensionality, or yield a bias-variance trade-off that is not optimal for the parameter of interest. The targeted learning approach aims at developing optimal ($n^{1/2}$-consistent, asymptotically normal, and efficient) estimators of smooth low-dimensional parameters through the use of data-adaptive machine learning [16].

A targeted MLE adds a loss-based targeting step that modifies the initial estimators towards the parameter of interest, providing potential robustness and semiparametric efficiency gains. The targeting is guided by an influence function for the target parameter eq. (8). As a result of this targeting step, the final estimates of the nuisance parameters and of the target parameter satisfy a user-chosen score equation. Under certain regularity conditions, this yields inference based on the Central Limit Theorem. For example, it has been shown that both TMLE and the estimating equation based estimators (using the EIF as estimating equation) are doubly robust, asymptotically efficient estimators under a Donsker class condition and a second-order remainder condition [16]. The Donsker class condition roughly states that the EIF at $Q_n$ and $g_n$ falls in a Donsker class, such as multivariate real-valued càdlàg functions with uniformly bounded sectional variation norm. The second-order condition states that the product of (a) the $L^2$ norm of the estimated outcome regressions minus their true counterparts with (b) the $L^2$ norm of the estimated treatment/censoring mechanism minus its true counterpart converges to zero at a rate faster than $n^{-1/2}$. The TMLE estimators in this section are asymptotically linear and efficient under these conditions. They also allow different combinations of rates for $Q_n$ and $g_n$; for example, if $g$ is known, then only consistency of $Q_n$ is needed for asymptotic efficiency.

Another advantage of the targeted learning approach is that it does not rely on the existence of a unique solution to the estimating equation. For marginal structural working models that are non-linear in their parameters, non-targeted estimating-equation estimators may have multiple solutions or no solution at all. We refer to [2, 3] for the general methodology.

3.3.1 Efficient influence function

We first establish the EIF for our parameter of interest. By a fundamental result in [17], under standard regularity conditions, the variance of the canonical gradient of Ψ provides a generalized Cramér–Rao lower bound for any regular and asymptotically linear estimator of Ψ. Therefore, this canonical gradient is a vital ingredient in building asymptotically linear and efficient estimators; fittingly, it is also commonly known as the EIF. For parameters in causal inference and missing data applications (such as those in our examples), the EIF also provides insights into the potential robustness against model mis-specifications.

From the first equality in Remark 1, we can obtain the EIF for Ψ via implicit differentiation. We formally state the result here and defer the proof to the Appendix.

Lemma 1

(Efficient Influence Function)

The EIF of $\Psi$ at $P \in \mathcal{M}$ is given by $M(\psi)^{-1} D(\psi)$, where

$$D(\psi) = \sum_{t=0}^{K} D_t(Q,g) + D_V(Q,g,\psi) + D_W(Q,\psi), \qquad (12)$$

with
$$D_t(Q,g) \equiv \sum_{a \in \mathcal{A}} C_t^a(g)\, \tilde h(a,V) \left[ Q_{t+1}^a(\bar L_t, V, W) - Q_t^a(\bar L_{t-1}, V, W) \right],$$
$$D_V(Q,g,\psi) \equiv \sum_{a \in \mathcal{A}} \frac{I(A_0 = a_0)}{g(A_0 = a_0 \mid W)} \left\{ \tilde h(a,V)\left[ Q_0^a(V,W) - m_\psi(a,V) \right] - E_V\left[ \tilde h(a,V)\left( Q_0^a(V,W) - m_\psi(a,V) \right) \,\middle|\, A_0, W \right] \right\},$$
$$D_W(Q,\psi) \equiv \sum_{a \in \mathcal{A}} E_V\left[ \tilde h(a,V)\left( Q_0^a(V,W) - m_\psi(a,V) \right) \,\middle|\, A_0 = a_0, W \right].$$

Moreover, if $Q(P) = Q(P_0)$ or $g(P) = g(P_0)$, then $E_{P_0} D(P) = 0$ implies $\Psi(P) = \Psi(P_0)$.

Proof

The proof is given in the Appendix.

3.3.2 Targeted MLE using the EIF D(ψ)

The steps described in Section 3.2 can be used to obtain initial estimates of the Q components. The targeted estimators will sequentially update these initial estimates by finding a best fluctuation along a submodel given by components of D(ψ) in eq. (12).

Regarding $Q_t^a$ as a conditional expectation of $Q_{t+1}^a$, as given in eq. (9), we use the quasi log-likelihood loss function for $Q_t^a$:

$$L(Q_t^a) = -\left[ Q_{t+1}^a \log Q_t^a + \left(1 - Q_{t+1}^a\right) \log\left(1 - Q_t^a\right) \right]. \qquad (13)$$

For a given $(Q,g)$, and each $t = 0, \dots, K$, consider the $d$-dimensional working submodel $\{Q_t^a(\epsilon) : \epsilon \in \mathbb{R}^d\}$, with

$$Q_t^a(\epsilon) = \operatorname{expit}\left( \operatorname{logit} Q_t^a + \epsilon^\top \tilde h(a,v)\, C_t^a(g) \right). \qquad (14)$$

This submodel satisfies $\left\langle \left. \frac{d}{d\epsilon} \sum_a L\left(Q_t^a(\epsilon)\right) \right|_{\epsilon=0} \right\rangle \supseteq D_t(Q,g)$, where $\langle x \rangle$ represents the linear span of the components of a vector $x$.
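As an illustrative sketch (not the authors' implementation; all variable names are hypothetical), the fluctuation submodel of eq. (14) paired with the quasi log-likelihood loss of eq. (13) can be fit numerically: given an initial estimate `Q_init`, a clever-covariate matrix `H` (playing the role of $\tilde h(a,v)\, C_t^a(g)$), a pseudo-outcome in $[0,1]$, and observation weights, the fluctuation $\hat\epsilon$ minimizes the empirical loss.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, logit

def quasi_loglik_loss(eps, Q_init, H, target, weights):
    """Mean weighted quasi log-likelihood loss, eq. (13), at fluctuation eps."""
    Q_eps = expit(logit(Q_init) + H @ eps)          # submodel of eq. (14)
    Q_eps = np.clip(Q_eps, 1e-10, 1 - 1e-10)
    loss = -(target * np.log(Q_eps) + (1 - target) * np.log(1 - Q_eps))
    return np.mean(weights * loss)

def fluctuate(Q_init, H, target, weights):
    """Return eps_hat and the updated (targeted) estimate Q(eps_hat)."""
    d = H.shape[1]
    res = minimize(quasi_loglik_loss, np.zeros(d),
                   args=(Q_init, H, target, weights), method="BFGS")
    eps_hat = res.x
    return eps_hat, expit(logit(Q_init) + H @ eps_hat)

# Toy data standing in for one targeting step.
rng = np.random.default_rng(0)
n = 500
Q_init = rng.uniform(0.1, 0.9, n)              # initial fit in (0, 1)
H = rng.normal(size=(n, 2))                    # clever covariate, d = 2
target = rng.binomial(1, 0.5, n).astype(float) # pseudo-outcome
w = np.ones(n)
eps_hat, Q_star = fluctuate(Q_init, H, target, w)
```

In standard GLM software, the same update is a weighted logistic regression of the pseudo-outcome on `H` with offset $\operatorname{logit} Q$.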

Next, we rewrite $D_V$ and $D_W$ in eq. (12) as

$$D_V(Q,g,\psi) \equiv \sum_{(a,v) \in \mathcal{A} \times \mathcal{V}} \frac{I(A_0 = a_0)}{g(a_0 \mid W)} \left\{ \tilde h(a,v) \left[ Q_0^a(V{=}v, W) - m_\psi(a,v) \right] \times \left[ I(V{=}v) - Q_V(v \mid A_0, W) \right] \right\},$$
$$D_W(Q,\psi) \equiv \sum_{(a,v) \in \mathcal{A} \times \mathcal{V}} \tilde h(a,v)\, Q_V(v \mid A_0 = a_0, W) \left[ Q_0^a(V{=}v, W) - m_\psi(a,v) \right].$$

We also consider the loss function $L(Q_V) \equiv -\log Q_V$, and a $d$-dimensional fluctuation model through a given $Q_V$ at $\epsilon = 0$ given by

$$Q_V(\epsilon) \equiv \frac{Q_V(V \mid A_0 = a_0, W) \exp\left( \epsilon^\top B(Q,\psi)(V) \right)}{\sum_v Q_V(v \mid A_0 = a_0, W) \exp\left( \epsilon^\top B(Q,\psi)(v) \right)},$$

where $B(Q,\psi) \equiv \sum_a \frac{I(A_0 = a_0)}{g(a_0 \mid W)} \tilde h(a,V) \left[ Q_0^a(V,W) - m_\psi(a,V) \right]$. It is easy to verify that $\left\langle \left. \frac{d}{d\epsilon} L\left(Q_V(\epsilon)\right) \right|_{\epsilon=0} \right\rangle \supseteq D_V(Q,g,\psi)$.

  1. Start at $t = K$: regress $Y_K$ on $(\bar L_{K-1}, \bar A, L_0, V, W)$, and then evaluate at $\bar A = a$ to obtain an initial estimator $\hat Q_K^a$ of $Q_K^a$. The optimal fluctuation amount around this initial estimate is given by

    $$\hat\epsilon_K \equiv \arg\min_\epsilon \sum_a P_n L\left( \hat Q_K^a(\epsilon) \right).$$

    This can be implemented by creating one row for each individual and each $a \in \mathcal{A}$, and fitting a weighted logistic regression of $Y_K$ on the multivariate covariate $\phi(a,v)\, C_K^a(\hat g)$ on these observations, with weights $h(a,v)$ and offset $\operatorname{logit} \hat Q_K^a(\bar L_{K-1}, V, W)$. Update the initial estimator using $\hat Q_K^{*,a} \equiv \hat Q_K^a(\hat\epsilon_K)$.

  2. At each subsequent step $t = K-1, \dots, 0$, we have thus far obtained an updated estimator $\hat Q_{t+1}^{*,a}$ for each $a \in \mathcal{A}$. Regress $\hat Q_{t+1}^{*,a}$ on $(\bar L_{t-1}, \bar A_t, L_0, W, V)$ and evaluate at $\bar A_t = \bar a_t$ to get an initial estimator $\hat Q_t^a$ of $Q_t^a$. The optimal fluctuation amount around this initial estimator is given by $\hat\epsilon_t = \arg\min_\epsilon \sum_a P_n L(\hat Q_t^a(\epsilon))$, and can be obtained in a manner analogous to step 1. The updated estimator is $\hat Q_t^{*,a} \equiv \hat Q_t^a(\hat\epsilon_t)$.

  3. After sequentially performing step (2) in decreasing order of $t$, we now have a targeted estimator $\hat Q_0^{*,a}$ of $Q_0^a$.

  4. Let $\hat Q_V$ be an estimator of $Q_V$. For each $a \in \mathcal{A}$, $v \in \mathcal{V}$ and individual $i$, create a row of data consisting of $\hat Q_0^{*,a}(V{=}v, W)$, $h(a,v)$, $\phi(a,v)$ and $\hat Q_V(v \mid A_0 = a_0, W)$. Obtain a first-iteration estimator $\psi^1$ of $\psi_0$ by fitting a weighted logistic regression of $\hat Q_0^{*,a}(V{=}v, W)$ onto $\phi(a,v)$, with weights $h(a,v) \times \hat Q_V(v \mid A_0 = a_0, W)$, on this pooled data.

  5. Given the $\psi^1$ obtained in the previous step, we update the estimator for $Q_V$ as follows. Using the previously obtained $\hat Q_0^{*,a}$, $\hat g$, and $\psi^1$, the optimal fluctuation amount around the initial $\hat Q_V$ is given by $\hat\epsilon_V = \arg\min_\epsilon P_n \left[ \frac{I(A_0 = a_0)}{\hat g(a_0 \mid W)} L(\hat Q_V(\epsilon)) \right]$. This can be obtained by solving for $\epsilon$ in the equation

    $$0 = \sum_{i=1}^n \frac{I\left( (A_0)_i = a_0 \right)}{\hat g(a_0 \mid W_i)} \left[ \hat B(W_i, V_i) - \frac{\sum_v \hat B(W_i, v)\, \hat Q_V(v \mid A_0 = a_0, W_i) \exp\left( \epsilon^\top \hat B(W_i, v) \right)}{\sum_v \hat Q_V(v \mid A_0 = a_0, W_i) \exp\left( \epsilon^\top \hat B(W_i, v) \right)} \right],$$

    where $\hat B \equiv B\left( (\hat Q_0^{*,a} : a), \psi^1 \right)$. The updated density is given by $\hat Q_V^1 \equiv \hat Q_V(\hat\epsilon_V)$.

  6. Having obtained a parameter estimate $\psi^j$ and an updated density $\hat Q_V^j$ at the $j$-th iteration, repeat steps (4) and (5) to obtain a targeted estimate $\psi^{j+1}$ and $\hat Q_V^{j+1}$, until $\hat\epsilon_V$ converges to 0. In practice, this convergence (close to 0) is achieved after a few iterations. We denote the final updates as $\psi^{*,\mathrm{TMLE}}$ and $\hat Q_V^{*}$. We call this estimator, $\psi^{*,\mathrm{TMLE}}$, the full TMLE.

Let $\hat Q^* \equiv \left( \hat Q_W, \hat Q_V^{*}, \hat Q_t^{*,a} \right)$, where $\hat Q_W$ is the empirical distribution of $W$. By design, $P_n D(\hat Q^*, \hat g, \psi^{*,\mathrm{TMLE}}) = 0$. From Lemma 1, we know that $\psi^{*,\mathrm{TMLE}}$ is consistent if either $\hat Q^* = Q_0$ or $\hat g = g_0$. Under standard regularity and empirical process conditions, $\psi^{*,\mathrm{TMLE}}$ is asymptotically linear with influence function $M(\Psi)^{-1} D(\Psi)$. The asymptotic covariance of $\sqrt{n}\left( \psi_n^{*,\mathrm{TMLE}} - \psi_0 \right)$ can be estimated by the sample covariance matrix $\Sigma^{*,\mathrm{TMLE}}$ of $M(\psi^{*,\mathrm{TMLE}})^{-1} D\left( \hat Q^*, \hat g, \psi^{*,\mathrm{TMLE}} \right)$.

In particular, since $M(\Psi)^{-1} D(\Psi)$ is the canonical gradient of $\Psi$, the estimator $\psi^{*,\mathrm{TMLE}}$ is asymptotically efficient if all relevant components in $D$ are consistently estimated.

Remark 2

To implement the proposed targeted estimator and satisfy the efficient score equation related to eq. (12), we must specify (and iteratively update) the conditional density of $V$ given $A_0, W$. When $V$ is high-dimensional, correct specification of these densities may be difficult to accomplish.

As an alternative, a regression-based estimator can be used to directly estimate the conditional expectation $E_V\left[ \tilde h(a,V) \left( \hat Q_0^{*,a}(V,W) - m_\psi(a,V) \right) \mid A_0 = a_0, W \right]$ in $D_V$ and $D_W$. A root $\psi$ that solves the corresponding $D_V + D_W$ can be computed via numerical methods. For each $\psi$ in the solution space, one must evaluate the corresponding conditional expectation for each $a \in \mathcal{A}$. This can become computationally intractable. Moreover, in such an estimating-equation-based approach, the root $\psi$ of the efficient score equation may not be unique.

If the components corresponding to $V$ are not estimated correctly, the resulting parameter estimate will not be efficient, and its consistency will hinge on correct specification of $g$. In the following section, we trade off estimation of the $V$ components by using a projected influence function that retains most of the robustness properties of the EIF; the resulting estimation is computationally less demanding, as it avoids estimating the components involving $V$.

3.3.3 Projected influence function

Lemma 2

(Projected Influence Function)

Consider the setup in Lemma 1. The influence function for $\Psi$ at $P$ under the model $\mathcal{M}^{A_0}$, in which $g(A_0 \mid W)$ is known, is $M(\psi)^{-1} D^{A_0}(\psi)$, where

$$D^{A_0}(\psi) = \sum_{t=1}^{K} D_t(\psi) + D_W^{A_0}(Q, g, \psi), \qquad (15)$$

with
$$D_W^{A_0}(Q, g, \psi) \equiv \sum_{a \in \mathcal{A}} \frac{I(A_0 = a_0)}{g(a_0 \mid W)} \tilde h(a,V) \left[ Q_1^a(L_0, V, W) - m_\psi(a,V) \right].$$

In particular, it is a valid estimating function for $\Psi: \mathcal{M} \to \mathbb{R}^d$.

Suppose $g(A_0 \mid W)$ is correctly specified. If, in addition, either $g(A_t \mid \bar A_{t-1}, \bar L_{t-1}, V, W)$ or $Q_t^a$ are correct for all $t \geq 1$, then $E_{P_0} D^{A_0}(\Psi) = 0$ implies $\Psi(P) = \Psi(P_0)$.

Proof

The proof is given in the Appendix.

At face value, the proposed $D^{A_0}$ may seem less robust than the EIF $D$, as it always relies on consistent estimation of $g(A_0 \mid W)$. However, as we noted in Remark 2, when $V$ is high-dimensional, there are more standard machine learning algorithms available for estimating $g(A_0 \mid W)$, which is the propensity score of a binary treatment. Moreover, as becomes apparent in the next section, the estimator that utilizes $D^{A_0}$, which we call the projected TMLE, is also easier to implement, since standard software packages can be used and no iterated updates are required in the targeting step.

3.3.4 Targeted MLE using the PIF $D^{A_0}(\psi)$

  1. We follow steps (1) and (2) in Section 3.3.2, using the loss functions in eq. (13) and the fluctuation models in eq. (14) to estimate and update $\hat Q_t^{*,a}$, sequentially from $t = K$ to $t = 1$.

  2. We now have a targeted estimator $\hat Q_1^{*,a}$. We obtain the estimator $\psi^{\mathrm{pTMLE}}$, i.e. the projected TMLE of $\psi_0$, by fitting a weighted logistic regression of $\hat Q_1^{*,a}(L_0, V, W)$ on $\phi(a, V)$, with weights $h(a,v)\, \frac{I(A_0 = a_0)}{\hat g(a_0 \mid W)}$.

By construction, $P_n D^{A_0}\left( (\hat Q_t^{*,a}), \hat g, \psi^{\mathrm{pTMLE}} \right) = 0$. From Lemma 2, $\psi^{\mathrm{pTMLE}}$ is a consistent estimator of $\psi_0$ if $\hat g(A_0 \mid W)$ is consistent, and either $\hat Q_t^{*,a}$ or $\hat g(A_t \mid \bar A_{t-1}, \bar L_{t-1}, V, W)$, for $t = 1, \dots, K$, are consistent.

Compared with the full TMLE using $D(\psi)$ in the previous section, this estimator is particularly appealing when $V$ is high-dimensional, and it still provides more robustness protection than the IPTW or the non-targeted substitution estimators. Moreover, under standard regularity and empirical process conditions, $\psi^{\mathrm{pTMLE}}$ is asymptotically linear with influence function $M(\Psi)^{-1} D^{A_0}$. The asymptotic covariance of $\sqrt{n}\left( \psi^{\mathrm{pTMLE}} - \psi_0 \right)$ can be estimated by the sample covariance matrix $\Sigma^{\mathrm{pTMLE}}$ of $M(\psi^{\mathrm{pTMLE}})^{-1} D^{A_0}\left( (\hat Q_t^{*,a}), \hat g, \psi^{\mathrm{pTMLE}} \right)$.
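The final regression in step (2) above is a weighted logistic regression with a pseudo-outcome in $[0,1]$, which standard GLM software fits via the quasi-binomial family. As a hedged sketch (hypothetical names; not the authors' code), the same fit can be obtained by directly minimizing the weighted quasi log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def weighted_quasi_logistic(phi, Qbar, weights):
    """Return psi minimizing the weighted quasi log-likelihood loss of a
    [0, 1]-valued pseudo-outcome Qbar regressed on the MSM design phi."""
    def loss(psi):
        m = np.clip(expit(phi @ psi), 1e-10, 1 - 1e-10)
        ll = Qbar * np.log(m) + (1 - Qbar) * np.log(1 - m)
        return -np.mean(weights * ll)
    res = minimize(loss, np.zeros(phi.shape[1]), method="BFGS")
    return res.x

# Toy check: if the pseudo-outcome follows the working MSM exactly,
# the weighted fit recovers psi regardless of the (positive) weights.
rng = np.random.default_rng(1)
n = 2000
a1 = rng.binomial(1, 0.5, n)
v = rng.binomial(1, 0.4, n)
phi = np.column_stack([np.ones(n), a1, v, a1 * v])  # MSM design phi(a, v)
psi_true = np.array([-0.5, 0.8, 0.3, -0.6])
Qbar = expit(phi @ psi_true)                        # ideal pseudo-outcome
weights = rng.uniform(0.5, 2.0, n)                  # stand-in for h * I/g
psi_hat = weighted_quasi_logistic(phi, Qbar, weights)
```

The weights here stand in for $h(a,v)\, I(A_0{=}a_0)/\hat g(a_0 \mid W)$; in the real estimator the pseudo-outcome is the targeted $\hat Q_1^{*,a}$.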

4 Simulation study

In this section, we examine the relative performance of the IPTW estimator (Section 3.1), the G-computation (i.e., non-targeted substitution) estimator (Section 3.2), and the projected TMLE estimator (Section 3.3.4) for the parameters of an MSM. Though the full TMLE is the most efficient, as noted in Remark 2, when $V$ is high-dimensional, correct specification of all the conditional densities of $V$ given $A_0, W$ may be difficult. The simulation thus focuses on the performance of the projected TMLE in comparison with the IPTW and G-computation estimators.

4.1 Data generating process and target parameter

We consider an example with data structure $O = (W, A_0, V, L_0, (A_t, L_t) : t = 1, \dots, K)$ with $K = 3$, where $A_0$ is the missingness mechanism for $V$, $A_1$ is the treatment assignment, and $A_t$ for $t > 1$ is the indicator of remaining in the study by time $t$. The time-varying covariate $L_t$ consists of $L_t^1$, $L_t^2$, and the death indicator $Y_t$. The data-generating process is as follows:

$$W^1, W^2 \sim \mathrm{Bern}(0.3),\ \mathrm{Bern}(0.7);$$
$$A_0 \sim \mathrm{Bern}\big(\operatorname{expit}(1 + 2W^1 + 0.1W^2)\big);$$
$$V \in \{0,1,2\}:\quad I(V{=}1) \sim \mathrm{Bern}\big(\operatorname{expit}(2 + 1.2W^1 + 0.7W^2)\big),\quad I(V{=}2) \mid V \neq 1 \sim \mathrm{Bern}\big(\operatorname{expit}(0.7 + 1.2W^1 + W^2)\big);$$
$$L_0^1 \sim \mathrm{Bern}\big(\operatorname{expit}(0.2 + 2W^1 + 0.5W^2 + 0.2 I(V{=}1) + 0.4 I(V{=}2))\big),$$
$$L_0^2 \sim \mathrm{Bern}\big(\operatorname{expit}(0.8 + W^1 + W^2 - 0.3 I(V{=}1) - 0.1 I(V{=}2))\big);$$
$$A_1 \sim \mathrm{Bern}\big(\operatorname{expit}(1 + W^1 + 1.3W^2 + 0.1 I(V{=}1) + 0.1 I(V{=}2) + 1.2 L_0^1 + L_0^2 - 0.7 W^1 \times L_0^2 - 0.5 W^2 \times L_0^1)\big);$$
$$A_t \sim \mathrm{Bern}\big(\operatorname{expit}(2 + W^1 + W^2 + 0.1 I(V{=}1) + 0.1 I(V{=}2) + 0.6 L_0^1 + 1.2 L_0^2 - 0.5 A_{t-1} - 0.1 t + 0.8 L_{t-1}^1 - 0.3 L_{t-1}^2 + 0.1 L_{t-2}^1 - 0.2 L_{t-2}^2 - 0.3 A_{t-1} \times L_0^2 + 0.2 A_{t-1} \times W^1 - 0.3 A_{t-1} \times L_{t-1}^2 - 0.2 A_{t-1} \times L_{t-2}^1)\big), \text{ for } t > 1;$$
$$L_t^1 \sim \mathrm{Bern}\big(\operatorname{expit}(1 + W^1 + 0.1 W^2 - 0.5 I(V{=}1) - 0.7 I(V{=}2) + L_0^1 + 0.3 L_0^2 + 1.5 A_t + 0.4 t - L_{t-1}^1 - 0.8 A_t \times I(V{=}1) - 0.2 A_t \times I(V{=}2) - 0.3 A_t \times W^1)\big);$$
$$L_t^2 \sim \mathrm{Bern}\big(\operatorname{expit}(2 + 0.1 W^1 + 0.1 W^2 + 0.5 I(V{=}1) + 0.5 I(V{=}2) + 0.7 L_0^1 + 0.2 L_0^2 - A_t + 0.2 t + L_{t-1}^2 - 0.2 A_t \times I(V{=}1) - 0.4 A_t \times I(V{=}2) - 0.3 A_t \times L_0^2)\big);$$
$$Y_t \sim \mathrm{Bern}\big(\operatorname{expit}(1.4 + 1.5 W^1 + W^2 - 0.7 I(V{=}1) - 0.8 I(V{=}2) + L_0^1 + L_0^2 + A_t - L_t^1 - 0.1 L_t^2 - L_{t-1}^1 - 0.3 L_{t-1}^2 + A_t \times I(V{=}1) + 0.8 A_t \times I(V{=}2) - 0.3 A_t \times L_t^1 - 0.4 A_t \times L_{t-1}^1 - 0.3 A_t \times L_0^2 - 0.2 A_t \times W^1)\big).$$

Once either the censoring process equals 0 or the death process equals 1, all subsequent variables are encoded by carrying forward their last observation.

Our interventions of interest are $\mathcal{A} = \{(1,0,1,1), (1,1,1,1)\}$. Under the above distribution, $0.1 < P(A_1 = 1) < 0.95$, and $P(A_t = 1) > 0.5$ for all $t \geq 2$.

We model the dose response $\{\rho(a,v) : a, v\}$ by the MSM

$$m_\psi(a,v) = \operatorname{expit}\left( \psi_1 + \psi_2 a_1 + \psi_3 v_1 + \psi_4 v_2 + \psi_5 a_1 v_1 + \psi_6 a_1 v_2 \right) = \operatorname{expit}\left( \psi^\top \phi(a,v) \right),$$

where $\phi(a,v) = (1, a_1, v_1, v_2, a_1 v_1, a_1 v_2)$, with kernel weights $h(a,v) = P(a_1 \mid v, A_0 = a_0)$. Note that in this case, the kernel weights are assumed to be known. The target parameter defined in eq. (8) takes the value $\psi_0 = (0.825, 0.105, 0.249, 0.046, 1.474, 0.960)$.
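The working MSM above is straightforward to evaluate directly. As a small sketch (the function and variable names are ours, not the paper's; the $\psi$ values are the magnitudes printed in the text):

```python
import numpy as np
from scipy.special import expit

def phi(a1, v1, v2):
    """MSM design phi(a, v) = (1, a1, v1, v2, a1*v1, a1*v2)."""
    return np.array([1.0, a1, v1, v2, a1 * v1, a1 * v2])

def m_psi(psi, a1, v1, v2):
    """Working MSM m_psi(a, v) = expit(psi^T phi(a, v))."""
    return expit(psi @ phi(a1, v1, v2))

# Example evaluation at the printed target-parameter magnitudes, for the
# regime with a1 = 1 and modifier level V = 1 (i.e. v1 = 1, v2 = 0).
psi0 = np.array([0.825, 0.105, 0.249, 0.046, 1.474, 0.960])
p = m_psi(psi0, a1=1, v1=1, v2=0)
```

Here $v_1 = I(V{=}1)$ and $v_2 = I(V{=}2)$, so the interaction coefficients $\psi_5, \psi_6$ shift the effect of $a_1$ across the three modifier levels.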

4.2 Estimators

The propensity score $g(A_0 \mid W)$ is estimated using Super Learner [15], with base algorithms glm and nnet, adjusting for $W^1$ and $W^2$. This is considered a correct specification of the propensity for $A_0$. We use two estimators for the subsequent propensities at $t \geq 1$: a correctly specified logistic model (shorthand 'gc'), and a misspecified logistic model ('gm') that omits some key baseline and time-varying covariates. The denominator of each $C_t^a(g)$ is truncated below at 0.025. We use two estimators for $Q_t^a$: a correctly specified estimator ('Qc'), and a misspecified estimator ('Qm') that only adjusts for $V_1$ and $V_2$ at each time $t$. Both estimators use Super Learner with candidate fitting algorithms glm and nnet.

We consider three cases of model specification for $Q_t^a$ and $g(A_t)$ at $t \geq 1$: all correct ('Qc, gc'); correct $Q_t^a$ and misspecified $g$ ('Qc, gm'); misspecified $Q_t^a$ and correct $g$ ('Qm, gc'). In all three cases we use the same correctly specified $g(A_0 \mid W)$. We implement the second version of the G-computation estimator in Section 3.2, where the weights are given by $g$. This estimator changes only under specifications 'Qc, gc' and 'Qm, gc'. The IPTW estimator changes only under specifications 'Qc, gc' and 'Qc, gm'.

4.3 Results

The bias, variance, and coverage probability (for the influence function-based confidence intervals) are appraised over repeated draws at sample sizes of 500 and 2000.

In Table 1, we see that when $g$ is misspecified, the projected TMLE using a correct $Q_t^a$ reduces bias over the misspecified IPTW estimator. Similarly, when $Q_t^a$ is misspecified, the projected TMLE using the correct $g$ reduces bias over the misspecified G-computation estimator. When comparing the correct vs misspecified G-computation estimators, and the correct vs misspecified IPTW, the coefficients involving the adjusted covariates ($V_1$, $V_2$) were still estimated very well by the misspecified estimator. Under 'Qc, gc', the correct G-computation estimator converges much more slowly than the IPTW and projected TMLE estimators. We posit that this may be due to its sole reliance on the nonparametric likelihood estimates. As expected, G-computation has the smallest sample variance, and IPTW has the largest sample variance despite the truncated estimators for $g$.

Under certain regularity conditions, the IPTW and projected TMLE estimators are asymptotically linear. Table 2 tabulates the coverage probability of their influence function-based confidence intervals. At the correct models 'Qc, gc', IPTW and projected TMLE are asymptotically linear with influence functions $D^{IPW}$ and $D^{A_0}$, respectively. We used $\widehat{\mathrm{var}}(D_n^{IPW})/n$ and $\widehat{\mathrm{var}}(D_n^{A_0})/n$ to estimate their respective standard errors. As the sample size grows, the actual coverage probability is quite close to the nominal coverage probability, with IPTW having slightly better coverage. When one of the components is misspecified, the projected TMLE still provides very good coverage, even though theoretically $D^{A_0}$ is only part of its influence curve; we postulate that this is because the influence function-based standard error estimates are large relative to the finite-sample bias. The misspecified IPTW has very good coverage for the coefficients of the covariates adjusted for in the misspecified model, but very poor coverage for the confounded coefficients ($A$ and the intercept).

Table 1:

Results: bias, variance, and MSE for estimators of $\psi_0$. Qc = correct $Q_t^a$, Qm = misspecified $Q_t^a$, gc = correct $g$, gm = misspecified $g$.

ψ Intercept A V 1 V 2 A × V 1 A × V 2
n 500 2000 500 2000 500 2000 500 2000 500 2000 500 2000
Bias

Qc, gc
Gcomp 0.334 0.364 0.606 0.603 0.82 0.818 0.560 0.552 1.377 1.390 0.906 0.904
IPW 0.016 0.017 0.056 0.063 0.169 0.131 0.070 0.073 0.402 0.117 0.083 0.099
projected TMLE 0.036 0.001 0.021 0.002 0.078 0.054 0.014 0.008 0.519 0.018 0.002 0.006

Qc, gm
IPW 0.492 0.512 0.831 0.836 0.094 0.131 0.075 0.072 0.638 0.160 0.026 0.004
projected TMLE 0.022 0.033 0.029 0.028 0.069 0.029 0.02 0.006 0.499 0.015 0.008 0.008

Qm, gc
Gcomp 0.751 0.773 1.383 1.353 0.66 0.640 0.431 0.414 1.089 1.069 0.752 0.727
projected TMLE 0.01 0.024 0.04 0.064 0.123 0.124 0.048 0.062 0.462 0.103 0.041 0.081

Variance

Qc, gc
Gcomp 0.082 0.019 0.112 0.023 0.137 0.033 0.084 0.017 0.038 0.008 0.022 0.005
IPW 0.134 0.035 0.226 0.051 0.939 0.118 0.353 0.068 9.605 0.235 0.555 0.112
projected TMLE 0.113 0.029 0.187 0.041 0.734 0.091 0.279 0.053 9.149 0.169 0.398 0.092

Qc, gm
IPW 0.114 0.030 0.196 0.045 0.897 0.093 0.259 0.056 9.298 0.184 0.418 0.094
projected TMLE 0.097 0.026 0.16 0.036 0.718 0.078 0.21 0.048 9.062 0.141 0.319 0.078

Qm, gc
Gcomp 0.095 0.024 0.146 0.035 0.147 0.045 0.091 0.024 0.191 0.057 0.071 0.025
projected TMLE 0.123 0.031 0.201 0.044 0.839 0.105 0.316 0.057 9.424 0.207 0.464 0.096

MSE

Qc, gc
Gcomp 0.195 0.151 0.479 0.387 0.809 0.701 0.397 0.322 1.934 1.941 0.842 0.822
IPW 0.134 0.035 0.229 0.054 0.967 0.135 0.358 0.073 9.767 0.248 0.562 0.122
projected TMLE 0.115 0.029 0.187 0.041 0.740 0.094 0.279 0.053 9.419 0.17 0.398 0.092

Qc, gm
IPW 0.357 0.292 0.887 0.744 0.906 0.111 0.264 0.062 9.704 0.21 0.419 0.094
projected TMLE 0.097 0.027 0.16 0.036 0.723 0.079 0.210 0.048 9.311 0.141 0.319 0.078

Qm, gc
Gcomp 0.66 0.621 2.059 1.865 0.583 0.455 0.277 0.195 1.377 1.2 0.6378 0.554
projected TMLE 0.123 0.031 0.203 0.048 0.854 0.12 0.318 0.061 9.637 0.218 0.466 0.103
  1. Gcomp: G-computation estimator. IPW: inverse probability weighted estimator. TMLE: targeted maximum likelihood estimator.

Table 2:

Coverage probability for the asymptotically linear estimators, using influence function-based confidence intervals. Qc = correct $Q_t^a$, Qm = misspecified $Q_t^a$, gc = correct $g$, gm = misspecified $g$.

Intercept A V 1 V 2 A × V 1 A × V 2
n 500 2000 500 2000 500 2000 500 2000 500 2000 500 2000
Qc, gc
IPW 0.948 0.946 0.944 0.940 0.966 0.930 0.940 0.952 0.912 0.950 0.956 0.960
projected TMLE 0.934 0.932 0.934 0.942 0.936 0.924 0.918 0.946 0.906 0.942 0.942 0.938

Qc, gm
IPW 0.698 0.146 0.556 0.022 0.956 0.928 0.958 0.958 0.928 0.942 0.958 0.964
projected TMLE 0.964 0.956 0.954 0.956 0.984 0.984 0.972 0.986 0.932 0.954 0.966 0.964

Qm, gc
projected TMLE 0.942 0.930 0.942 0.948 0.956 0.930 0.936 0.954 0.896 0.942 0.954 0.958
  1. IPW: inverse probability weighted estimator.

5 Data analysis example

To illustrate the application of the projected TMLE, we revisit our STAR*D example. After an initial screening, patients are enrolled into level 1 of treatment, where everyone is treated with citalopram. At the start of each subsequent level, if a patient still remains in the study, then he/she is randomized to an option (i.e. a particular medication) within one of his/her accepted treatment strategies (augmenting vs switching). Regular follow-up visits are conducted throughout each level. At each follow-up visit, covariates are collected, and the patient is subject to dropout, entering remission, or moving on to the next level if treatment fails.

In the field of depression research, there is very little consistent literature concerning individual baseline characteristics that may differentially modify the effect of switching vs augmenting medication on remission. Because the strategies themselves are not randomized, this analysis is de facto equivalent to that of an observational study. Our estimators will account for baseline and time-varying confounding. All candidate modifiers are a priori selected through a literature review of previous STAR*D publications. These covariates are measured prior to the assignment of level 2 treatment, either at study screening, clinical visits or exit surveys at the end of level 1. Many of these candidate modifiers are subject to missingness, often due to missed visits and surveys; herein lies the need for the tools developed here. Our study population is the set of 1395 patients at level 2 who found medication strategies acceptable. Table 3 tabulates the events in level 2 by strategies received. Note that there are three strategies, but we compare only switching medication vs augmenting medication. The data structure was described in detail in Section 2 as part of our running example.

We consider here two types of potential effect modifiers: some measured at enrollment and some measured at exit of level 1. If $V$ is a baseline covariate, $W$ includes all demographic variables and medical and psychiatric history prior to enrollment; if $V$ is a level 1 exit covariate, then we add to $W$ variables summarizing the number of visits, adherence to study protocol, and time spent in level 1. Table 4 summarizes the percentage of missingness, range, and scale of each effect modifier. Of all the candidate modifiers proposed by the literature review, we exclude from our analysis the history of amphetamine use at baseline and the history of drug abuse, since these two variables are missing for more than 70% of the patients. The multivariate nature of most of the effect modifiers underscores the need for the projected TMLE.

Table 3:

Level 2 events by acceptable strategy received.

Received Medication-Sw Medication-Aug Cognitive Therapy

n 727 565 103
Remission 257 (35%) 287 (51%) 53 (51%)
No Remission 227 (32%) 132 (23%) 22 (21%)
Dropout 243 (33%) 146 (26%) 28 (27%)
  1. The analysis below compares medication switching (Sw) and medication augmentation (Aug) only.

Table 4:

Candidate effect modifiers of interest for level 2. Candidates with more than 50% missing in the study population will be discarded from the analysis.

% missing Type (# of categories, or range of semi-continuous variable)
At screening

Race 0 Categorical (3)
Years of schooling completed 0 Semi-continuous (0,27)
Employment status 0 Categorical (3)
Education level 0 Categorical (3)
Attempted suicide 0 Categorical (2)
Duration of current major depression episode (in units of 6 months) 1 Semi-continuous (0,111)
AxI: Amphetamine 73 (will not use) Categorical (2)
History of drug abuse 76 (will not use) Categorical (2)
# of Cumulative Illness Rating (CIRS) categories 0 Semi-continuous (0,12)
CIRS severity score 0 Semi-continuous (0,4)
Have AxI Obsessive-Compulsive Disorder 1 Categorical (2)
Have AxI Panic Disorder 1 Categorical (2)
Have AxI Post Traumatic Stress Disorder 2 Categorical (2)
Have AxI Agoraphobia 1 Categorical (2)
Have AxI Social Anxiety Disorder 2 Categorical (2)
Have AxI Generalized Anxiety Disorder 2 Categorical (2)
Anxious Depression at screening 6 Categorical (2)
Atypical Depression at screening 5 Categorical (2)
Melancholic Depression at screening 5 Categorical (2)

At level 1 exit

Maximum Side Effects 1 Categorical (7)
SF-36 Physical Health (in units of 5) 23 Semi-continuous (0,12)
SF-36 Mental Health (in units of 5) 23 Semi-continuous (0,12)
Quality of Life 23 Semi-continuous (0,98)
Side effect burden 1 Categorical (7)
Quick Inventory of Depressive Symptom (QIDS) total score (self-reported) 1 Semi-continuous (0,27)
Hamilton rating scale for depression (17-item) score 12 Semi-continuous (0,43)
IDS-C (30-item) score 13 Semi-continuous (0,74)
  1. AxI: axis I disorders. CIRS: Cumulative Illness Rating Scale. IDS-C: 30-item Inventory of Depressive Symptomatology — Clinician Rated.

The MSM is a generalized linear model with logit link. The linear predictor uses $\phi(a,v) = (1, a_1, v_1, \dots, v_k, a_1 v_1, \dots, a_1 v_k)$ and $\psi \equiv (\psi_1, \psi_2, \psi_{3,1}, \dots, \psi_{3,k}, \psi_{4,1}, \dots, \psi_{4,k}) \in \mathbb{R}^{2k+2}$ for a categorical $V$ with $k+1$ categories, and $\phi(a,v) = (1, a_1, v, v^2, a_1 v, a_1 v^2)$ and $\psi \equiv (\psi_1, \psi_2, \psi_{3,1}, \psi_{3,2}, \psi_{4,1}, \psi_{4,2}) \in \mathbb{R}^6$ for a semi-continuous $V$. The kernel weights are $h(a,v) = P(A_1 = a_1 \mid V)$, to be estimated using Super Learner with fitting algorithms glm, knnreg, nnet and bayesglm. The Super Learner uses 10-fold cross-validation to select the best weighted combination of these fitting algorithms to estimate $h(a,v)$. The initial estimators of $g$ and $Q_t^a$ adjust for all baseline covariates and for time-varying covariates with up to a 2-time lag (each covariate is coupled with its missingness indicator). Each fitting algorithm is coupled with each of the following screening algorithms: Spearman correlation tests at significance levels 0.01, 0.05, 0.1, 0.2; and ranking the p-values from the correlation tests and taking the top $k$ variables, where $k$ ranges from 10 variables up to 30% of the total number of variables considered, in increments of 20. The Super Learner selects the best linear combination of all screening-fitting couples.

The goal of the analysis is to identify potential effect modifiers from a pool of candidate pre-treatment covariates. Our strategy is to measure the treatment heterogeneity for each of these covariates using an MSM, and then apply multiple testing methods to identify those for which the treatment heterogeneity is significant. A common way to assess treatment heterogeneity across strata of $V$ is to test whether the coefficients of the interaction terms $\psi_4$ are different from 0. We perform the corresponding Wald test in Table 5, with the null hypothesis $\psi_4 = 0$ and test statistic $T_n = \psi_{4,n}^\top \Sigma_n^{-1} \psi_{4,n} \sim \chi_k^2$, where $\psi_{4,n}$ is the projected TMLE estimator, $\Sigma_n = \widehat{\mathrm{Cov}}(D^{A_0}_{\psi_4})/n$, and $k$ is the number of interaction terms. The false discovery rate (FDR) of the simultaneous comparisons is controlled at 0.05 with the Benjamini-Hochberg procedure.
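The testing pipeline above is simple to reproduce. As an illustrative sketch (hypothetical names; the plug-in estimates themselves come from the projected TMLE), the multivariate Wald test and the Benjamini-Hochberg step are:

```python
import numpy as np
from scipy.stats import chi2

def wald_pvalue(psi4_hat, Sigma_hat):
    """Multivariate Wald test of H0: psi4 = 0, with Sigma_hat the
    IF-based covariance estimate of psi4_hat; T_n ~ chi2(k) under H0."""
    T = psi4_hat @ np.linalg.solve(Sigma_hat, psi4_hat)
    return chi2.sf(T, df=len(psi4_hat))

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean rejection vector controlling the FDR at level q."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresh = q * np.arange(1, m + 1) / m       # step-up thresholds
    below = p[order] <= thresh
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])       # largest index passing
        reject[order[:k + 1]] = True
    return reject
```

Each candidate modifier contributes one p-value from `wald_pvalue`; `benjamini_hochberg` is then applied once across all candidates.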

Table 5:

Data analysis results: multivariate Wald test. $\psi_4 = (\psi_{4,1}, \dots, \psi_{4,k})$ are the coefficients of the interaction terms in the MSM; if $V$ is semi-continuous, the interaction terms are $\psi_4^\top (v, v^2)$; if $V$ is categorical with $k+1$ categories, the interaction terms are $\psi_4^\top (v_1, \dots, v_k)$. $H_0: \psi_4 = 0$. $T_n = \psi_{4,n}^\top \Sigma_n^{-1} \psi_{4,n} \sim \chi_k^2$, where $\psi_n$ is given by the projected TMLE, and $\Sigma_n$ is the corresponding covariance estimator based on the covariance matrix of the projected IF. The false discovery rate is controlled at 0.05 with the Benjamini-Hochberg procedure.

P-value Reject $\psi_4$ $SE(\psi_4)$
At screening

Race 4.97e01 F (1.68e01, 4.62e01) (2.62e01, 4.27e01)
Years of schooling completed 2.89e01 F ( 2.50e01, 1.10e02) (1.71e01, 7.07e03)
Employment status 1.50e01 F ( 3.92e01, 3.94e01) (2.30e01, 5.89e01)
Education level 1.13e01 F ( 4.89e01, 5.69e03) (3.13e01, 3.80e01)
Attempted suicide 1.80e05 T (1.10e+00) (2.56e01)
Duration of current major depression episode (in units of 6 months) 4.44e02 F ( 1.92e02, 9.41e04) (2.78e02, 5.59e04)
# of Cumulative Illness Rating (CIRS) categories 2.05e02 F ( 3.40e01, 4.51e02) (1.42e01, 1.65e02)
CIRS severity score 8.09e02 F (6.21e01, 3.65e01) (4.41e01, 1.85e01)
Have AxI Obsessive-Compulsive Disorder 5.20e01 F (1.86e01) (2.89e01)
Have AxI Panic Disorder 1.17e01 F ( 5.53e01) (3.53e01)
Have AxI Post-Traumatic Stress Disorder 8.78e01 F ( 3.86e02) (2.51e01)
Have AxI Agoraphobia Disorder 9.73e01 F ( 1.04e02) (3.12e01)
Have AxI Social Anxiety Disorder 5.15e01 F ( 1.81e01) (2.78e01)
Have AxI Generalized Anxiety Disorder 1.90e01 F ( 3.68e01) (2.81e01)
Anxious Depression at screening 6.89e01 F (9.17e02) (2.3e01)
Atypical Depression at screening 4.15e04 T ( 9.79e01) (2.77e01)
Melancholic Depression at screening 1.58e01 F (3.48e01) (2.46e01)

At level 1 exit

Maximum side effects 6.46e09 T ( 5.53e01, 8.10e02, 1.85e+00, 2.62e01, 1.39e+00, 5.88e01) (5.27e01, 5.25e01, 4.59e01, 5.02e01, 5.20e01, 5.65e01)
SF-36 Physical Health (in units of 5) 6.83e09 T ( 2.55e+00, 1.74e01) (4.17e01, 2.91e02)
SF-36 Mental Health (in units of 5) 1.36e01 F ( 5.55e01, 5.48e02) (2.78e01, 2.86e02)
Quality of Life 4.84e04 T (7.53e02, 5.35e04) (3.12e02, 3.46e04)
Side effect burden 0.00e+00 T ( 2.91e01, 5.18e01, 6.26e02, 1.29e+00, 2.73e+00, 5.85e01) (2.00e03, 2.41e03, 2.60e03, 2.50e03, 1.96e03, 1.21e03)
Quick Inventory of Depressive Symptom (QIDS) total score (self-reported) 7.61e02 F ( 2.82e01, 1.03e02) (1.24e01, 4.62e03)
Hamilton rating scale for depression (17-item) score 4.83e02 F ( 1.16e01, 1.88e03) (8.45e02, 2.17e03)
IDS-C (30 item) score 1.27e01 F ( 5.06e02, 4.75e04) (4.47e02, 6.72e04)
  1. AxI: axis I disorders. IDS-C: 30-item Inventory of Depressive Symptomatology — Clinician Rated.

Since these candidate covariates differ in type (semi-continuous vs binary vs multiple categories), and the corresponding MSM parameters are therefore of different dimensions, we also consider the measure of treatment heterogeneity

$$\beta(\psi) \equiv \max_v OR(v;\psi) - \min_v OR(v;\psi),$$

where

$$OR(v;\psi) = \log \frac{m(1,v;\psi)}{1 - m(1,v;\psi)} - \log \frac{m(0,v;\psi)}{1 - m(0,v;\psi)} = \begin{cases} \psi_2 + \psi_{4,1} v_1 + \dots + \psi_{4,k} v_k, & \text{for categorical } V, \\ \psi_2 + \psi_{4,1} v + \psi_{4,2} v^2, & \text{for semi-continuous } V. \end{cases}$$

This measure quantifies the largest change in log odds ratio between any two values of $V$. Consider the null hypothesis $H_0: \beta(\psi_0) = 0$. Using the projected TMLE, we obtain an estimator $\beta_n = \beta(\psi_n^{\mathrm{pTMLE}})$ for each $V$. An application of the functional delta method, with the covariance matrix $\Sigma_n$ of the estimated influence function $D_n^{A_0}$, yields a standard error estimate $SE_n$ for $\beta_n$. We use the test statistic $T_n = \beta_n / SE_n$, which is asymptotically $N(0,1)$ under $H_0$. The results of the analysis are summarized in Table 6. Interestingly, this approach identifies almost all the same potential effect modifiers as Table 5.
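The delta-method computation for $\beta$ is mechanical once the argmax and argmin of $OR(v;\psi)$ are located. A hedged sketch for the semi-continuous case (hypothetical names; we assume $\psi$ is ordered as $(\psi_1, \psi_2, \psi_{3,1}, \psi_{3,2}, \psi_{4,1}, \psi_{4,2})$ and that the search over $v$ is done on a grid):

```python
import numpy as np

def beta_and_se(psi, Sigma, v_grid):
    """Plug-in beta = max_v OR - min_v OR, with OR(v) = psi2 + psi41*v
    + psi42*v^2, and its delta-method SE given the covariance Sigma of
    the full psi vector (IF-based, already divided by n)."""
    psi2, psi41, psi42 = psi[1], psi[4], psi[5]
    OR = psi2 + psi41 * v_grid + psi42 * v_grid ** 2
    i_max, i_min = np.argmax(OR), np.argmin(OR)
    beta = OR[i_max] - OR[i_min]
    # Gradient of beta w.r.t. psi: the psi2 terms cancel in the
    # difference, leaving the basis (v, v^2) at argmax minus argmin.
    grad = np.zeros(6)
    grad[4] = v_grid[i_max] - v_grid[i_min]
    grad[5] = v_grid[i_max] ** 2 - v_grid[i_min] ** 2
    se = float(np.sqrt(grad @ Sigma @ grad))
    return beta, se, v_grid[i_min], v_grid[i_max]

# Toy example: a linear OR in v on the grid {0, ..., 4}.
b, se, vmin, vmax = beta_and_se(np.array([0.0, 0.5, 0.0, 0.0, 1.0, 0.0]),
                                np.eye(6) * 0.01, np.arange(0.0, 5.0))
```

Note this treats the argmax/argmin as fixed, which is the usual first-order delta-method approximation when the extremizing values of $v$ are well separated.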

Using our methods, we identified six factors, two at screening and four at level 1 exit, that modified the comparative effect of augmenting vs switching medication. These effect modifiers are: prior suicide attempts, atypical depression, level 1 side-effect frequency and burden (impairment), SF-36 physical functionality, and quality of life at level 1 exit. In general, the heterogeneity of the treatment effect is more pronounced for the level 1 exit modifiers than for those measured at baseline, indicating opportunities for close monitoring and medication management.

There has been only one publication that addressed the heterogeneous treatment effects between switching and augmenting in level 2 of STAR*D [18]. The authors employed propensity score and weighting methods to examine heterogeneity of the treatment effect among the treated (medication augmentation) through stratification by propensity score decile. This approach does not explicitly identify the factors contributing to heterogeneity. Missing data are less problematic in that analysis, and were handled by multiple imputation.

6 Summary

In this work, we studied causal effect modification by a counterfactual modifier. The tools developed here are applicable in situations where the effect modifier of interest may be better cast as a counterfactual variable. Examples of such situations include the study of time-varying effect modification, and the study of baseline effect modification with modifiers that are missing at random.

We established the efficient influence function (EIF) for the corresponding marginal structural model parameters, which provides the semiparametric efficiency bound for all regular asymptotically linear estimators. This EIF is also doubly robust in that it remains an unbiased estimating function if either 1) the outcome expectations and the modifier densities, or 2) the treatment and censoring mechanisms, are consistently estimated. However, in applications with high-dimensional effect modifiers, we saw that it may be difficult to fully utilize the EIF without potentially compromising consistency. To solve this problem, we presented a projected influence function (PIF), which equals the EIF in a model where the missingness mechanism (or, more generally, the pre-modifier intervention $A_0$) is known. Though not fully efficient under the larger model, this PIF is robust against misspecification of the outcome models or the treatment mechanisms, whenever the missingness mechanism is consistently estimated. We presented two TMLE estimators using the EIF and the PIF. We also described an IPTW estimator that is unbiased if the intervention probabilities are consistently estimated, and a non-targeted G-computation estimator that is unbiased if the outcome expectations and either the modifier densities or the missingness mechanisms are consistently estimated. Under standard regularity and empirical process conditions, the two TMLE estimators and the IPTW estimator are asymptotically linear, thereby allowing Central Limit Theorem-based standard error estimates. Moreover, the full TMLE estimator using the EIF will be semiparametric efficient if all the components of the likelihood are consistently estimated.

Table 6:

Data analysis results: $\beta_0 = \max_v OR(v;\psi_0) - \min_v OR(v;\psi_0)$, $H_0: \beta_0 = 0$, $T_n = \beta_n / SE_n \sim N(0,1)$, where $\psi_n$ is given by the projected TMLE. The standard error estimate $\widehat{SE}(\beta_n)$ is computed using the variance of the projected influence function and the functional delta method. The false discovery rate is controlled at 0.05 with the Benjamini-Hochberg procedure.

| Effect modifier | P | Reject | $\beta_n$ | $\widehat{SE}(\beta_n)$ | $\arg\min_v OR(v;\psi_0)$ | $\arg\max_v OR(v;\psi_0)$ |
|---|---|---|---|---|---|---|
| At screening | | | | | | |
| Race | 2.8e-01 | F | 4.6e-01 | 4.3e-01 | 3.0e+00 | 2.0e+00 |
| Years of schooling completed | 1.2e-01 | F | 2.7e+00 | 1.7e+00 | 2.7e+01 | 1.1e+01 |
| Employment status | 1.9e-01 | F | 7.9e-01 | 6.0e-01 | 3.0e+00 | 2.0e+00 |
| Education level | 9.2e-02 | F | 4.9e-01 | 2.9e-01 | 3.0e+00 | 2.0e+00 |
| Attempted suicide | 1.8e-05 | T | 1.1e+00 | 2.6e-01 | 1.0e+00 | 0.0e+00 |
| Duration of current major depression episode (in units of 6 months) | 3.3e-02 | F | 9.6e+00 | 4.5e+00 | 1.1e+02 | 1.0e+01 |
| # of Cumulative Illness Rating (CIRS) categories | 5.9e-03 | T | 3.1e+00 | 1.1e+00 | 1.2e+01 | 3.8e+00 |
| CIRS severity score | 2.7e-02 | F | 3.6e+00 | 1.6e+00 | 8.5e-01 | 4.0e+00 |
| Have AxI Obsessive-Compulsive Disorder | 5.2e-01 | F | 1.9e-01 | 2.9e-01 | 1.0e+00 | 0.0e+00 |
| Have AxI Panic Disorder | 1.2e-01 | F | 5.5e-01 | 3.5e-01 | 0.0e+00 | 1.0e+00 |
| Have AxI Post-Traumatic Stress Disorder | 8.8e-01 | F | 3.9e-02 | 2.5e-01 | 0.0e+00 | 1.0e+00 |
| Have AxI Agoraphobia Disorder | 9.7e-01 | F | 1.0e-02 | 3.1e-01 | 0.0e+00 | 1.0e+00 |
| Have AxI Social Anxiety Disorder | 5.2e-01 | F | 1.8e-01 | 2.8e-01 | 0.0e+00 | 1.0e+00 |
| Have AxI Generalized Anxiety Disorder | 1.9e-01 | F | 3.7e-01 | 2.8e-01 | 0.0e+00 | 1.0e+00 |
| Anxious Depression at screening | 6.9e-01 | F | 9.2e-02 | 2.3e-01 | 1.0e+00 | 0.0e+00 |
| Atypical Depression at screening | 4.2e-04 | T | 9.8e-01 | 2.8e-01 | 0.0e+00 | 1.0e+00 |
| Melancholic Depression at screening | 1.6e-01 | F | 3.5e-01 | 2.5e-01 | 1.0e+00 | 0.0e+00 |
| At level 1 exit | | | | | | |
| Maximum side effects | 8.1e-09 | T | 2.1e+00 | 3.7e-01 | 4.0e+00 | 3.0e+00 |
| SF-36 Physical Health (in units of 5) | 9.0e-10 | T | 9.3e+00 | 1.5e+00 | 0.0e+00 | 7.3e+00 |
| SF-36 Mental Health (in units of 5) | 1.0e-01 | F | 2.6e+00 | 1.6e+00 | 1.2e+01 | 5.1e+00 |
| Quality of Life | 1.4e-04 | T | 2.6e+00 | 7.0e-01 | 7.0e+01 | 0.0e+00 |
| Side effect burden | 0.0e+00 | T | 3.3e+00 | 2.3e-03 | 6.0e+00 | 5.0e+00 |
| Quick Inventory of Depressive Symptomatology (QIDS) total score (self-reported) | 2.7e-02 | F | 1.9e+00 | 8.7e-01 | 0.0e+00 | 1.4e+01 |
| Hamilton rating scale for depression (17-item) score | 2.0e-02 | F | 1.8e+00 | 7.6e-01 | 0.0e+00 | 3.1e+01 |
| IDS-C (30 item) score | 5.1e-02 | F | 1.3e+00 | 6.9e-01 | 0.0e+00 | 5.3e+01 |

  1. AxI: axis I disorders. IDS-C: 30-item Inventory of Depressive Symptomatology - Clinician Rated.
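The hypothesis tests in Table 6 can be reproduced mechanically from the reported $\beta_n$ and $\widehat{SE}(\beta_n)$ columns: form the Wald statistic $T_n=\beta_n/SE_n$, convert it to a two-sided normal p-value, and apply the Benjamini-Hochberg step-up rule at FDR level 0.05. A Python sketch, using a few illustrative rows from the table with abbreviated labels (p-values recomputed from the rounded table entries differ slightly from the reported ones, which were computed from unrounded estimates):

```python
import math

# Wald test T = beta / SE with two-sided normal p-value, then Benjamini-Hochberg
# at FDR level q = 0.05, as in the caption of Table 6. The (beta, SE) pairs
# below are a few illustrative rows from the table.
rows = {
    "Attempted suicide": (1.1, 0.26),
    "CIRS categories":   (3.1, 1.1),
    "Education level":   (0.49, 0.29),
    "Race":              (0.46, 0.43),
}

def two_sided_p(beta, se):
    # 2 * (1 - Phi(|T|)) via the complementary error function
    return math.erfc(abs(beta / se) / math.sqrt(2))

pvals = {k: two_sided_p(b, s) for k, (b, s) in rows.items()}

def benjamini_hochberg(pvals, q=0.05):
    # Step-up rule: reject the k smallest p-values, where k is the largest
    # index i with p_(i) <= i * q / m.
    m = len(pvals)
    ordered = sorted(pvals.items(), key=lambda kv: kv[1])
    cutoff = 0
    for i, (_, p) in enumerate(ordered, start=1):
        if p <= i * q / m:
            cutoff = i
    return {k for k, _ in ordered[:cutoff]}

rejected = benjamini_hochberg(pvals)
print(sorted(rejected))   # -> ['Attempted suicide', 'CIRS categories']
```

The two rejections agree with the "Reject" column of the corresponding rows in Table 6.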

Funding statement: This work was supported by NIH grants RC4MH092737 (Principal Investigator Z. Luo) and R01 AI074345 (Principal Investigator M. van der Laan).

Acknowledgements

We thank Josh Schwab for his invaluable help in implementing the projected TMLE.

Appendix

A Proof of identifiability

$$
\begin{aligned}
P(V(a_0)=v) &= \sum_w P(w, V(a_0)=v) = \sum_w P(w)\,P(V(a_0)=v \mid w)\\
&\stackrel{SRA}{=} \sum_w P(w)\,P(V=v \mid A_0=a_0, W=w) = E_W\, P(V=v \mid A_0=a_0, W) \equiv \gamma(a_0,v).\\
E\big[Y_K(a)\mid V(a_0)=v\big] &= \sum_{w,l} y_K\, \frac{P\big(\bar L(a)=\bar l,\; L_0(a_0)=l_0,\; V(a_0)=v,\; W=w\big)}{P(V(a_0)=v)}\\
&\stackrel{SRA}{=} \sum_{w,l} y_K\, \frac{P(w)\,P(V=v\mid A_0=a_0,w)\,P(l_0\mid w, A_0=a_0, V=v)\,\prod_{t=1}^{K} P\big(l_t \mid \bar A_t=\bar a_t,\, \bar l_{t-1},\, V=v,\, A_0=a_0,\, w\big)}{P(V(a_0)=v)}\\
&= \frac{1}{\gamma(P)(a_0,v)}\, E_P\Big[P(V=v\mid A_0=a_0, W)\, Q_0^a(V=v, W)\Big].
\end{aligned}
$$
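The first identity, $\gamma(a_0,v)=E_W\,P(V=v\mid A_0=a_0,W)$, can be checked numerically on a toy data-generating process in which the counterfactual $V(a_0)$ is observable by construction. The following Python sketch (the particular structural equations are illustrative assumptions, not taken from the paper's data analysis) simulates a randomized $A_0$ satisfying the sequential randomization assumption and verifies the identity by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

# Structural model in which A0 is randomized given W, so SRA holds:
# W ~ Bern(0.5); potential modifiers V(0), V(1) depend only on W;
# A0 ~ Bern(g(W)); observed V = V(A0).
W = rng.binomial(1, 0.5 * np.ones(n))
V0 = rng.binomial(1, 0.2 + 0.3 * W)        # V(a0 = 0)
V1 = rng.binomial(1, 0.6 - 0.2 * W)        # V(a0 = 1)
gW = 0.3 + 0.4 * W                         # g(A0 = 1 | W)
A0 = rng.binomial(1, gW)
V = np.where(A0 == 1, V1, V0)

# Counterfactual target P(V(a0 = 1) = 1), known here because we see V1.
truth = V1.mean()

# Identified functional gamma(1, 1) = E_W P(V = 1 | A0 = 1, W),
# computed from the observed data (W, A0, V) only.
pv_w = [V[(A0 == 1) & (W == w)].mean() for w in (0, 1)]
gamma = sum((W == w).mean() * pv_w[w] for w in (0, 1))

print(truth, gamma)
```

The crude stratified mean is enough here because W is binary; the point is only that the observed-data functional matches the counterfactual probability under randomization.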

B Proof of Remark 1

The first equality in the remark follows from the definition of $\Psi(P)$ and the choice of $m_\psi(a,v)$; the third equality is trivial. We show only the second.

$$
\begin{aligned}
&\sum_{(a,v)\in\mathcal A\times\mathcal V}\tilde h(a,v)\big[\eta(P)(a,v)-\gamma(P)(a_0,v)\,m_{\Psi(P)}(a,v)\big]\\
&\quad=\sum_{(a,v)\in\mathcal A\times\mathcal V}\tilde h(a,v)\big[E_W\, p(V=v\mid a_0,W)\,Q_0^a(v,W)-E_W\, p(V=v\mid a_0,W)\,m_{\Psi(P)}(a,v)\big]\\
&\quad=E_W\sum_{a,v} p(V=v\mid a_0,W)\,\tilde h(a,v)\big[Q_0^a(v,W)-m_{\Psi(P)}(a,v)\big].
\end{aligned}
$$

The first line on the right-hand side of the above equation is zero by definition of $\eta(P)(a,v)$.

C Proof of Lemma 1: Efficient influence function for Ψ(P) under M

In this appendix, we derive the EIF at $P$ of the map $\Psi:\mathcal M\to\mathbb R^d$. For each $P\in\mathcal M$, let $H_P$ denote the Hilbert space of 1-dimensional, mean-zero, measurable functions of $O$ with finite variance, endowed with the covariance inner product. For an $r\in H_P$, define a 1-dimensional parametric submodel $\{P_r(\alpha):|\alpha|<1/\lVert r\rVert_\infty\}$ through $P$ at $\alpha=0$, given by $\frac{dP_r(\alpha)}{dP}=1+\alpha r(O)$. Since we are working under a saturated model $\mathcal M$, this submodel is indeed contained in $\mathcal M$. We shall consider the class of all such 1-dimensional submodels indexed by $r\in H_P$.

For a given $\tilde D\in H_P^d$, we define the vector inner product $E_P(\tilde D\times r)$ as the vector of the component-wise inner products $\{E_P(\tilde D_j\times r): j=1,\ldots,d\}$. We wish to show that $D$ satisfies

$$\left.\frac{d\Psi(P_r(\alpha))}{d\alpha}\right|_{\alpha=0}=M_{\Psi(P),P}^{-1}\,E_P(D\times r).$$

From the definitions of the maps $\Psi$, $\gamma(a_0,v)$ and $\eta(a,v)$, and our choice of working model $m_\psi(a,v)$, we know that at each $P_r(\alpha)$

$$0=\sum_{(a,v)\in\mathcal A\times\mathcal V}\gamma(P_r(\alpha))(a_0,v)\,h(a,v)\,\phi(a,v)\left[\frac{\eta(P_r(\alpha))(a,v)}{\gamma(P_r(\alpha))(a_0,v)}-m_{\Psi(P_r(\alpha))}(a,v)\right].\tag{16}$$

We perform an implicit differentiation with respect to α on the above equation, at α=0, to obtain the equality

$$\left.\frac{d\Psi(P_r(\alpha))}{d\alpha}\right|_{\alpha=0}\times\sum_{(a,v)\in\mathcal A\times\mathcal V}\gamma(P)(a_0,v)\,h(a,v)\,\phi(a,v)\,m_{\Psi(P)}(a,v)\big[1-m_{\Psi(P)}(a,v)\big]\phi(a,v)^\top=\sum_{(a,v)\in\mathcal A\times\mathcal V}h(a,v)\,\phi(a,v)\left[\left.\frac{d\eta(P_r(\alpha))(a,v)}{d\alpha}\right|_{\alpha=0}-\left.\frac{d\gamma(P_r(\alpha))(a_0,v)}{d\alpha}\right|_{\alpha=0}m_{\Psi(P)}(a,v)\right].\tag{17}$$

Next, for a given $(a,v)$, we proceed to express $\left.\frac{d\gamma(P_r(\alpha))}{d\alpha}\right|_{\alpha=0}$ and $\left.\frac{d\eta(P_r(\alpha))}{d\alpha}\right|_{\alpha=0}$ as $E_P(D_\gamma\times r)$ and $E_P(D_\eta\times r)$, respectively, for some functions $D_\gamma$ and $D_\eta$ belonging to the Hilbert space $H_P$.

For convenience of indexing, for a given vector $\bar l_K$, we shall use the shorthand $H_t^{a,v}=(A_0=a_0,\,V=v,\,\bar L_{t-1}=\bar l_{t-1},\,\bar A_t=\bar a_t)$ for $t=1,\ldots,K$, and $H_0^{a,v}=(A_0=a_0,\,V=v)$ for $t=0$. From the definition of $P_r(\alpha)$, it follows that

$$
\begin{aligned}
P_r(\alpha)(W)&=P(W)\big[1+\alpha E_P(r\mid W)\big],\\
P_r(\alpha)(v\mid A_0=a_0,W)&=P(v\mid A_0=a_0,W)\,\frac{1+\alpha E_P(r\mid V=v,A_0=a_0,W)}{1+\alpha E_P(r\mid A_0=a_0,W)},\quad\text{and}\\
P_r(\alpha)(l_t\mid W,H_t^{a,v})&=P(l_t\mid W,H_t^{a,v})\,\frac{1+\alpha E_P(r\mid l_t,W,H_t^{a,v})}{1+\alpha E_P(r\mid W,H_t^{a,v})}.
\end{aligned}
$$
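These three factorizations can be verified exactly on a discrete toy law, since every marginal and conditional of $P_r(\alpha)$ is a finite sum over the fluctuated joint pmf. A small Python check (the four binary coordinates are an illustrative stand-in for $(W, A_0, V, L)$):

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)

# Discrete toy O = (W, A0, V, L), all binary, to check the factorization of the
# fluctuated law dP_r(alpha)/dP = 1 + alpha * r(O).
states = list(itertools.product([0, 1], repeat=4))
p = rng.dirichlet(np.ones(len(states)))             # joint pmf of O under P
r = rng.normal(size=len(states))
r -= (r * p).sum()                                  # make r mean zero under P
alpha = 0.05                                        # small enough that 1 + alpha*r > 0
pa = p * (1 + alpha * r)                            # fluctuated joint pmf

IDX = {"W": 0, "A0": 1, "V": 2, "L": 3}

def pmf(q, **fix):                                  # prob of fixed coordinates under q
    return sum(qi for o, qi in zip(states, q)
               if all(o[IDX[k]] == val for k, val in fix.items()))

def cond_E_r(**fix):                                # E_P(r | fixed coordinates)
    sel = [i for i, o in enumerate(states)
           if all(o[IDX[k]] == val for k, val in fix.items())]
    w = p[sel]
    return (r[sel] * w).sum() / w.sum()

# P_r(alpha)(W = w) = P(W = w) * (1 + alpha * E_P(r | W = w))
for w in (0, 1):
    assert abs(pmf(pa, W=w) - pmf(p, W=w) * (1 + alpha * cond_E_r(W=w))) < 1e-12

# P_r(alpha)(v | a0, w)
#   = P(v | a0, w) * (1 + alpha E(r | v, a0, w)) / (1 + alpha E(r | a0, w))
for w, a0, v in itertools.product([0, 1], repeat=3):
    lhs = pmf(pa, W=w, A0=a0, V=v) / pmf(pa, W=w, A0=a0)
    rhs = (pmf(p, W=w, A0=a0, V=v) / pmf(p, W=w, A0=a0)
           * (1 + alpha * cond_E_r(W=w, A0=a0, V=v))
           / (1 + alpha * cond_E_r(W=w, A0=a0)))
    assert abs(lhs - rhs) < 1e-12

print("factorization verified")
```

The third display (the conditional of $l_t$) follows by the identical computation with more coordinates held fixed.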

Proposition 1

For a given $(a_0,v)$, $\left.\frac{d\gamma(P_r(\alpha))}{d\alpha}\right|_{\alpha=0}=E_P(D_\gamma\times r)$, where

$$D_\gamma(P)=\frac{I(A_0=a_0)}{g(a_0\mid W)}\big[I(V=v)-P(v\mid A_0=a_0,W)\big]+P(v\mid A_0=a_0,W)-\gamma(P)(a_0,v).\tag{18}$$

Proof

$$
\begin{aligned}
\left.\frac{d\gamma(P_r(\alpha))}{d\alpha}\right|_{\alpha=0}
&=\lim_{\alpha\to0}\frac{\gamma(P_r(\alpha))-\gamma(P)}{\alpha}\\
&=\lim_{\alpha\to0}\frac{1}{\alpha}\sum_W P(W)\,P(v\mid A_0=a_0,W)\,\alpha\,\frac{E_P(r\mid v,A_0=a_0,W)-E_P(r\mid A_0=a_0,W)+E_P(r\mid W)+\alpha\,E_P(r\mid W)\,E_P(r\mid v,A_0=a_0,W)}{1+\alpha\,E_P(r\mid A_0=a_0,W)}\\
&=\sum_W P(W)\Big[E_P\big(r\,I(V=v)\mid A_0=a_0,W\big)-P(v\mid A_0=a_0,W)\,E_P(r\mid A_0=a_0,W)+P(v\mid A_0=a_0,W)\,E_P(r\mid W)\Big]\\
&=E_P\Big(\Big[\frac{I(A_0=a_0)}{g(a_0\mid W)}\big\{I(V=v)-P(v\mid A_0=a_0,W)\big\}+P(v\mid A_0=a_0,W)-\gamma(P)(a_0,v)\Big]\times r(O)\Big)\\
&=E_P\big(D_\gamma(P)\times r(O)\big).
\end{aligned}
$$

In obtaining the last equality, we note that centering the left factor of the integrand by γ(P)(a0,v) does not change the expression because EP(γ(P)r(O))=γ(P)EP(r(O))=0 by definition of r. It is straightforward to check that indeed EPDγ(P)=0. Moreover, under our saturated model Dγ(P) is in fact in the tangent space. This concludes the proof of Proposition 1.
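Proposition 1 lends itself to a direct numerical check: on a discrete toy law, $\gamma(P_r(\alpha))$ can be computed exactly, its pathwise derivative approximated by a central finite difference, and the result compared with the inner product $E_P(D_\gamma\times r)$. A Python sketch (the binary toy law over $(W,A_0,V)$ is an illustrative assumption):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete law over O = (W, A0, V), each binary.
states = list(itertools.product([0, 1], repeat=3))
p = rng.dirichlet(np.ones(len(states)))            # joint pmf P(w, a0, v)
P = {o: pi for o, pi in zip(states, p)}

def marg(P, w):                                    # P(W = w)
    return sum(P[(w, a0, v)] for a0 in (0, 1) for v in (0, 1))

def cond_v(P, v, a0, w):                           # P(V = v | A0 = a0, W = w)
    return P[(w, a0, v)] / sum(P[(w, a0, vv)] for vv in (0, 1))

def gamma(P, a0, v):                               # gamma(P)(a0, v) = E_W P(v | a0, W)
    return sum(marg(P, w) * cond_v(P, v, a0, w) for w in (0, 1))

a0s, vs = 1, 1                                     # target (a0, v)

# Mean-zero perturbation r and the fluctuated law dP_alpha = (1 + alpha r) dP.
r = rng.normal(size=len(states))
r -= (r * p).sum() / p.sum()
R = {o: ri for o, ri in zip(states, r)}
def P_alpha(alpha):
    return {o: P[o] * (1 + alpha * R[o]) for o in states}

# Pathwise derivative via a central finite difference.
eps = 1e-5
fd = (gamma(P_alpha(eps), a0s, vs) - gamma(P_alpha(-eps), a0s, vs)) / (2 * eps)

# Inner product E_P(D_gamma x r) with D_gamma from eq. (18).
g_a0 = {w: sum(P[(w, a0s, v)] for v in (0, 1)) / marg(P, w) for w in (0, 1)}
gam = gamma(P, a0s, vs)
ip = 0.0
for (w, a0, v) in states:
    D = ((a0 == a0s) / g_a0[w] * ((v == vs) - cond_v(P, vs, a0s, w))
         + cond_v(P, vs, a0s, w) - gam)
    ip += P[(w, a0, v)] * D * R[(w, a0, v)]

print(fd, ip)
```

The two numbers agree to the accuracy of the finite difference, which is the content of Proposition 1 for this toy law.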

Proposition 2

For given $a\in\mathcal A$, $v\in\mathcal V$, $\left.\frac{d\eta(P_r(\alpha))}{d\alpha}\right|_{\alpha=0}=E_P(D_\eta\times r)$, where

$$D_\eta(P)\equiv I(V=v)\sum_{t=0}^{K}C_t^a\big[Q_{t+1}^a(\bar L_t,v,W)-Q_t^a(\bar L_{t-1},v,W)\big]+\frac{I(A_0=a_0)}{g(a_0\mid W)}\,Q_0^a(v,W)\big[I(V=v)-P(v\mid A_0=a_0,W)\big]+P(v\mid A_0=a_0,W)\,Q_0^a(v,W)-\eta(P)(a,v),\tag{19}$$

where $C_t^a=\dfrac{I(\bar A_t=\bar a_t)}{\prod_{j=0}^{t}g\big(a_j\mid \bar A_{j-1}=\bar a_{j-1},\bar L_{j-1},V,W\big)}$.

Proof

$$
\begin{aligned}
\left.\frac{d\eta(P_r(\alpha))(a,v)}{d\alpha}\right|_{\alpha=0}
&=\lim_{\alpha\to0}\frac{\eta(P_r(\alpha))(a,v)-\eta(P)(a,v)}{\alpha}\\
&=\lim_{\alpha\to0}\frac{1}{\alpha}\sum_{w,l_0,\ldots,l_K}y_K\,P(w)\,P(v\mid A_0=a_0,w)\prod_{j=0}^{K}P(l_j\mid w,H_j^{a,v})\\
&\qquad\times\alpha\,\frac{\sum_{t=0}^{K}E_P(r\mid l_t,w,H_t^{a,v})-\sum_{t=0}^{K}E_P(r\mid w,H_t^{a,v})+E_P(r\mid v,A_0=a_0,w)-E_P(r\mid A_0=a_0,w)+E_P(r\mid w)+\alpha M_P(w,\bar l_K)}{1+\alpha M'_P(w,\bar l_K)}\\
&=\sum_{w,l_0,\ldots,l_K}y_K\,P(w)\,P(v\mid A_0=a_0,w)\prod_{j=0}^{K}P(l_j\mid w,H_j^{a,v})\\
&\qquad\times\Big\{\sum_{t=0}^{K}E_P(r\mid l_t,w,H_t^{a,v})-\sum_{t=0}^{K}E_P(r\mid w,H_t^{a,v})+E_P(r\mid v,A_0=a_0,w)-E_P(r\mid A_0=a_0,w)+E_P(r\mid w)\Big\}\\
&=E_P\Big\{I(V=v)\sum_{t=0}^{K}C_t^a\big[Q_{t+1}^a(\bar L_t,v,W)-Q_t^a(\bar L_{t-1},v,W)\big]\times r(O)\Big\}\\
&\qquad+E_P\Big\{\frac{I(A_0=a_0)}{g(a_0\mid W)}\,Q_0^a(V=v,W)\big[I(V=v)-P(v\mid A_0=a_0,W)\big]\times r(O)\Big\}\\
&\qquad+E_P\Big\{P(v\mid A_0=a_0,W)\,Q_0^a(V=v,W)\times r(O)\Big\}\\
&=E_P(D_\eta\times r).
\end{aligned}
$$

In the first equality, $M_P$ and $M'_P$ are shorthand for the remaining terms in the expansion of the products. This concludes the proof of Proposition 2.

Now we derive the EIF for $\Psi$ at $P\in\mathcal M$. From Propositions 1 and 2, after some simplification, we conclude that the right-hand side of eq. (17) can be written as

$$
\begin{aligned}
&\sum_{(a,v)\in\mathcal A\times\mathcal V}h(a,v)\,\phi(a,v)\left[\left.\frac{d\eta(P_r(\alpha))(a,v)}{d\alpha}\right|_{\alpha=0}-\left.\frac{d\gamma(P_r(\alpha))(a_0,v)}{d\alpha}\right|_{\alpha=0}m_{\Psi(P)}(a,v)\right]\\
&\quad=\sum_{(a,v)\in\mathcal A\times\mathcal V}h(a,v)\,\phi(a,v)\big[E_P(D_\eta(P)\times r)-E_P(D_\gamma(P)\times r)\,m_{\Psi(P)}(a,v)\big]\\
&\quad=E_P\Bigg(\sum_{(a,v)\in\mathcal A\times\mathcal V}\tilde h(a,v)\big[D_\eta(P)-D_\gamma(P)\,m_{\Psi(P)}(a,v)\big]\times r(O)\Bigg)\\
&\quad=E_P\big(D(Q,g,\Psi(P))\times r\big),
\end{aligned}
$$

where

$$
\begin{aligned}
D(Q,g,\psi)&=\sum_{(a,v)\in\mathcal A\times\mathcal V}\tilde h(a,v)\big[D_\eta(P)-D_\gamma(P)\,m_{\Psi(P)}(a,v)\big]\\
&=\sum_{t=0}^{K}\sum_a\tilde h(a,V)\,C_t^a\big[Q_{t+1}^a(\bar L_t,V,W)-Q_t^a(\bar L_{t-1},V,W)\big]\\
&\quad+\sum_a\frac{I(A_0=a_0)}{g(a_0\mid W)}\Big\{\tilde h(a,V)\big[Q_0^a(V,W)-m_{\Psi(P)}(a,V)\big]-E_V\big[\tilde h(a,V)\big(Q_0^a(V,W)-m_{\Psi(P)}(a,V)\big)\mid A_0=a_0,W\big]\Big\}\\
&\quad+\sum_a E_V\big[\tilde h(a,V)\big(Q_0^a(V,W)-m_{\Psi(P)}(a,V)\big)\mid A_0=a_0,W\big]\\
&\quad-\sum_{a,v}\tilde h(a,v)\big[\eta(P)(a,v)-m_{\Psi(P)}(a,v)\,\gamma(P)(a_0,v)\big].
\end{aligned}
$$

It follows from Remark 1 that the last summand, $\sum_{a,v}\tilde h(a,v)\big[\eta(P)(a,v)-m_{\Psi(P)}(a,v)\,\gamma(P)(a_0,v)\big]=U(P,\Psi(P))$, equals 0. To emphasize the role of $P$, we shall write $D(Q,g,\Psi(P))$ as $D(P)$. To see that $E_P D(P)=0$, we first note that all but the last line are centered functions with respect to $P$; secondly, from Remark 1, the last line also has zero expectation under $P$ when $\psi=\Psi(P)$. Since we are operating under a saturated model, each component of $D(P)$ is in the tangent space; hence it is in fact the efficient influence curve.

To finish the proof of the first part of Lemma 1, it suffices to see from eq. (17) that the normalizing quantity is given by the inverse of

$$
\begin{aligned}
E_P\left[\frac{\partial}{\partial\Psi(P)}D^W(P,\Psi(P))\right]
&=\sum_{(a,v)\in\mathcal A\times\mathcal V}\gamma(P)(a_0,v)\,h(a,v)\,\phi(a,v)\,m_{\Psi(P)}(a,v)\big[1-m_{\Psi(P)}(a,v)\big]\phi(a,v)^\top\\
&=\sum_W P(W)\sum_{(a,v)}P(v\mid A_0=a_0,W)\,h(a,v)\,\phi(a,v)\,m_{\Psi(P)}(a,v)\big[1-m_{\Psi(P)}(a,v)\big]\phi(a,v)^\top\\
&=E\left[\sum_a\frac{I(A_0=a_0)}{g(a_0\mid W)}\,h(a,V)\,\phi(a,V)\,m_{\Psi(P)}(a,V)\big[1-m_{\Psi(P)}(a,V)\big]\phi(a,V)^\top\right]\\
&=M_{\Psi(P),P}.
\end{aligned}
$$

To prove the robustness statement, we first consider the case $Q(P)=Q(P_0)$. All but the last line in $D(P)$ are centered about components of $Q(P_0)$, and therefore have expectation zero under $P_0$. The expectation under $P_0$ of the last line is $D^W(P_0,\Psi(P_0))$, which equals zero by definition of $\psi_0$.

Next, we consider the case $g(P)=g_0$. By telescoping the sums in eq. (12) and applying the definition of $Q_0^a(P_0)$, we obtain

$$P_0\,D(Q(P),g_0,\psi_0)=E_{P_0}\sum_{(a,v)\in\mathcal A\times\mathcal V}P_0(V=v\mid A_0=a_0,W)\,\tilde h(a,v)\,Q_0^a(P_0)(V=v,W)-E_{P_0}\sum_{(a,v)\in\mathcal A\times\mathcal V}P_0(V=v\mid A_0=a_0,W)\,\tilde h(a,v)\,m_{\psi_0}(a,v),$$

which equals 0 by Remark 1.

D Proof of Lemma 2

To see that $D^{A_0}$ is a score, up to a normalizing matrix, in the model where $g(A_0\mid W)$ is known, we repeat the steps in Appendix C, but this time we only consider the class of 1-dimensional submodels indexed by

$$\mathcal R_P=\big\{r'=r-E_P(r\mid A_0,W)+E_P(r\mid W)-E_P(r)\;:\;r\in H_P\big\}\subset H_P.$$

It is straightforward to verify that $E_P(r'\mid A_0=a_0,W)=0=E_P(r'\mid W)$ for such $r'\in\mathcal R_P$. Therefore, $D_\gamma(P)(a_0,v)$ in this case becomes $\frac{I(A_0=a_0)}{g(a_0\mid W)}I(V=v)-\gamma(P)(a_0,v)$ and

$$D_\eta(P)=I(V=v)\sum_{t=0}^{K}C_t^a\big[Q_{t+1}^a(\bar L_t,v,W)-Q_t^a(\bar L_{t-1},v,W)\big]+\frac{I(A_0=a_0)}{g(a_0\mid W)}\,I(V=v)\,Q_0^a(v,W)-\eta(a,v).$$

The rest is straightforward from eq. (16).

To see that $P\,D^{A_0}(Q,g,\Psi(P))=0$, it suffices to show that $P\,D_W^{A_0}(Q,g,\Psi(P))=0$:

$$
\begin{aligned}
&P\,\frac{I(A_0=a_0)}{g(a_0\mid W)}\sum_a h(a,V)\,\phi(a,V)\big[Q_1^a(L_0,V,W)-m_{\Psi(P)}(a,V)\big]\\
&\quad=P\,\frac{I(A_0=a_0)}{g(a_0\mid W)}\sum_a h(a,V)\,\phi(a,V)\big[Q_1^a(L_0,V,W)-Q_0^a(V,W)\big]\\
&\qquad+P\,\frac{I(A_0=a_0)}{g(a_0\mid W)}\sum_a h(a,V)\,\phi(a,V)\big[Q_0^a(V,W)-m_{\Psi(P)}(a,V)\big].
\end{aligned}
$$

The first term on the right-hand side of the equality is zero by definition of $Q_1^a$; the second term is zero by Remark 1. Therefore $D^{A_0}$ is a valid estimating function.

To see its robustness properties under $g=g(P_0)$, first consider the case $Q_t^a(P)=Q_t^a(P_0)$ for $t=1,\ldots,K$. Trivially, $P_0 D_t(Q,g)=0$ for each $t\geq 1$ by definition of $Q_t^a$. For the term $D_W^{A_0}$, we have

$$P_0\,\frac{I(A_0=a_0)}{g(P_0)(a_0\mid W)}\sum_a\tilde h(a,V)\big[Q_1^a(P_0)(L_0,V,W)-m_{\psi_0}(a,V)\big]=P_0\sum_v Q_V(P_0)(v\mid A_0=a_0,W)\sum_a\tilde h(a,v)\big[Q_0^a(P_0)(V=v,W)-m_{\psi_0}(a,v)\big]=0,$$

per definition of Q0a and ψ0.

Next, we consider the case $g(P)=g_0$. By telescoping the sums in eq. (15) and applying the definition of $Q_0^a$, we again obtain

$$P_0\,D^{A_0}(Q,g_0,\psi_0)=P_0\sum_{(a,v)\in\mathcal A\times\mathcal V}P_0(V=v\mid A_0=a_0,W)\,\tilde h(a,v)\,Q_0^a(P_0)(V=v,W)-P_0\sum_{(a,v)\in\mathcal A\times\mathcal V}P_0(V=v\mid A_0=a_0,W)\,\tilde h(a,v)\,m_{\psi_0}(a,v)=0.$$

This completes the proof of Lemma 2.

References

[1] van der Laan MJ, Petersen M. Statistical learning of origin-specific statically optimal individualized treatment rules. Int J Biostat. 2007;3:Article 6. doi:10.2202/1557-4679.1040.

[2] van der Laan MJ, Rubin DB. Targeted maximum likelihood learning. Int J Biostat. 2006;2:Article 11. doi:10.2202/1557-4679.1043.

[3] van der Laan MJ, Rose S. Targeted learning: causal inference for observational and experimental data. Springer Series in Statistics, 1st ed. New York, NY: Springer, 2011. doi:10.1007/978-1-4419-9782-1.

[4] Robins JM. Marginal structural models. In: Proceedings of the American Statistical Association, Section on Bayesian Statistical Science. 1997:1-10.

[5] Hernan MA, Brumback B, Robins JM. Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. Epidemiology. 2000;11:561-570. doi:10.1097/00001648-200009000-00012.

[6] Robins JM. A new approach to causal inference in mortality studies with sustained exposure periods - application to control of the healthy worker survivor effect. Math Modell. 1986;7:1393-1512. doi:10.1016/0270-0255(86)90088-6.

[7] Robins JM, Hernan MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000;11:550-560. doi:10.1097/00001648-200009000-00011.

[8] van der Laan MJ, Petersen ML, Joffe MM. History-adjusted marginal structural models and statically-optimal dynamic treatment regimens. Int J Biostat. 2005;1(1):Article 4. doi:10.2202/1557-4679.1003.

[9] Petersen M, Deeks S, Martin J, van der Laan MJ. History-adjusted marginal structural models for estimating time-varying effect modification. Am J Epidemiol. 2007;166:985-993. doi:10.1093/aje/kwm232.

[10] Petersen M, Deeks S, van der Laan MJ. Individualized treatment rules: generating candidate clinical trials. Stat Med. 2007;26:4578-4601. doi:10.1002/sim.2888.

[11] Petersen M, Porter K, Gruber S, Wang Y, van der Laan MJ. Diagnosing and responding to violations in the positivity assumption. Technical report 269, Division of Biostatistics, University of California, Berkeley, 2010. http://www.bepress.com/ucbbiostat/paper269.

[12] Pearl J. Causality: models, reasoning and inference, 2nd ed. New York: Cambridge University Press, 2009. doi:10.1017/CBO9780511803161.

[13] Robins JM, Rose S. Causal inference from complex longitudinal data. In: Berkane M, editor. Latent variable modeling and applications to causality. New York, NY: Springer Verlag, 1997:69-117. doi:10.1007/978-1-4612-1842-5_4.

[14] Bang H, Robins JM. Doubly robust estimation in missing data and causal inference models. Biometrics. 2005;61:962-972. doi:10.1111/j.1541-0420.2005.00377.x.

[15] van der Laan MJ, Polley EC, Hubbard AE. Super learner. Stat Appl Genet Mol Biol. 2007;6:Article 25. doi:10.2202/1544-6115.1309.

[16] Diaz I, van der Laan M. Doubly robust inference for targeted minimum loss-based estimation in randomized trials with missing outcome data. Stat Med. 2017;36:3807-3819. doi:10.1002/sim.7389.

[17] Bickel PJ, Klaassen CA, Ritov Y, Wellner J. Efficient and adaptive estimation for semiparametric models. New York, NY: Springer-Verlag, 1997.

[18] Ellis A, Dusetzina S, Hansen R, Gaynes B, Farley J, Stumer T. Investigating differences in treatment effect estimates between propensity score matching and weighting: a demonstration using STAR*D trial data. Pharmacoepidemiol Drug Saf. 2013;22:138-144. doi:10.1002/pds.3396.

Received: 2018-04-03
Accepted: 2018-05-09
Published Online: 2018-06-08

© 2018 Wenjing Zheng et al., published by De Gruyter, Berlin/Boston

This work is licensed under the Creative Commons Attribution 4.0 International License.
