Doubly Robust and Efficient Estimation of Marginal Structural Models for the Hazard Function

Wenjing Zheng; Maya Petersen; Mark J. van der Laan

doi:10.1515/ijb-2015-0036

Article Publicly Available

Doubly Robust and Efficient Estimation of Marginal Structural Models for the Hazard Function

Wenjing Zheng , Maya Petersen and Mark J. van der Laan

Published/Copyright: May 26, 2016

Published by

Become an author with De Gruyter Brill

Submit Manuscript Author Information

From the journal The International Journal of Biostatistics Volume 12 Issue 1

Abstract

In social and health sciences, many research questions involve understanding the causal effect of a longitudinal treatment on mortality (or time-to-event outcomes in general). Often, treatment status may change in response to past covariates that are risk factors for mortality, and in turn, treatment status may also affect such subsequent covariates. In these situations, Marginal Structural Models (MSMs), introduced by Robins (1997. Marginal structural models Proceedings of the American Statistical Association. Section on Bayesian Statistical Science, 1–10), are well-established and widely used tools to account for time-varying confounding. In particular, a MSM can be used to specify the intervention-specific counterfactual hazard function, i. e. the hazard for the outcome of a subject in an ideal experiment where he/she was assigned to follow a given intervention on their treatment variables. The parameters of this hazard MSM are traditionally estimated using the Inverse Probability Weighted estimation Robins (1999. Marginal structural models versus structural nested models as tools for causal inference. In: Statistical models in epidemiology: the environment and clinical trials. Springer-Verlag, 1999:95–134), Robins et al. (2000), (IPTW, van der Laan and Petersen (2007. Causal effect models for realistic individualized treatment and intention to treat rules. Int J Biostat 2007;3:Article 3), Robins et al. (2008. Estimaton and extrapolation of optimal treatment and testing strategies. Statistics in Medicine 2008;27(23):4678–721)). This estimator is easy to implement and admits Wald-type confidence intervals. However, its consistency hinges on the correct specification of the treatment allocation probabilities, and the estimates are generally sensitive to large treatment weights (especially in the presence of strong confounding), which are difficult to stabilize for dynamic treatment regimes. In this paper, we present a pooled targeted maximum likelihood estimator (TMLE, van der Laan and Rubin (2006. Targeted maximum likelihood learning. The International Journal of Biostatistics 2006;2:1–40)) for MSM for the hazard function under longitudinal dynamic treatment regimes. The proposed estimator is semiparametric efficient and doubly robust, offering bias reduction over the incumbent IPTW estimator when treatment probabilities may be misspecified. Moreover, the substitution principle rooted in the TMLE potentially mitigates the sensitivity to large treatment weights in IPTW. We compare the performance of the proposed estimator with the IPTW and a on-targeted substitution estimator in a simulation study.

Keywords: targeted maximum likelihood estimation; inverse probability weighting; doubly robust; efficient; marginal structural models

1 Introduction

In social and health sciences, many research questions involve understanding the causal effect of a longitudinal treatment on a time-to-event outcome (say mortality). Often, treatment status may change in response to past covariates that are risk factors for mortality, and in turn, treatment status may also affect such subsequent covariates. Moreover, right censoring may also be present in a study of this nature, typically in response to past covariates and treatment. In these situations, conventional time-dependent regression methods, which predict mortality rate at each time based on history up to that time, may fail to properly account for this time-varying confounding of the treatment effect (e. g. Robins et al. [1]). Marginal Structural Models (MSMs), introduced by Robins [2], are well-established and widely used tools in this setting.

To define the causal effect of a longitudinal treatment on mortality, we can use a formal causal framework (e. g. Robins [3], Pearl [4]). In it, we formulate the research question of interest in terms of an ideal experiment that randomizes interventions on the treatment variables of interest. The comparative effect of the different interventions is assessed by comparing the distribution of the would-be (counterfactual) outcome processes under said interventions. The MSMs specify the marginal distribution of these intervention-specific counterfactual outcome processes.

In survival analysis, the effect of treatment interventions on mortality can be assessed by studying how the hazard of the intervention-specific counterfactual mean outcome changes as a function of time and the intervention, possibly conditional on a set of baseline covariates. In many applications, this hazard function may be high-dimensional, in which case a MSM for the hazard can be used to summarize its key features. This hazard MSM can provide adequate dimension reduction over a potentially complicated hazard function, and may also mitigate near positivity violations (due to lack of experimental support for an intervention) by extrapolation.

Depending on the research questions of interest, the interventions can be static – treatment option is the same for all subjects, or they can be dynamic – treatment option is a function of past covariates. For example, among HIV+ patients who failed their first line antiretroviral regimen, we wish to evaluate how their survival depends on when the regimen modification takes place. On the one hand, we may be interested in comparing the lengths of delay in regimen modification: Petersen et al. [5] and [6] defined static interventions by fixed lengths of delay and used a hazard MSM to assess how the survival changes as a function of the length of delay. On the other hand, we can take a patient-centered approach (perhaps more inline with clinical practice) and conceptualize the delay in regimen modification as a function of past CD4 T-cell counts: in van der Laan and Petersen [7], dynamic interventions prescribe regimen modification only after CD4 counts dropped below a given threshold, and a hazard MSM assesses how the survival changes as a function of such CD4 count thresholds. Other examples of hazard MSMs include the study of the effect of “when to initiate” antiretroviral therapy on survival, under static interventions (Hernan et al. [8]), using Case-Cohort designs (Cole et al. [9]), or under CD4-based dynamic interventions (Cain et al. [10] and HIV-CAUSAL-Collaboration et al. [11]).

The parameters of an MSM are traditionally estimated using Inverse Probability of Treatment Weighted estimation (IPTW, Robins [1], Robins [12], van der Laan and Petersen [7], Robins et al. [13]). This estimator is intuitive, can be implemented using standard software, and admits influence curve based variance estimates and confidence intervals. However, its consistency hinges on correct specification of the conditional treatment probabilities. Moreover, in the presence of strong confounding, the inverse probability weights can become unwieldily large, thus producing very unstable estimates. In the case of static intervention, this instability can be attenuated by introducing marginal kernel weights that down-weight treatment options with little data support; but this solution has limited applicability for dynamic interventions prescribed by time-varying history.

To address the sensitivity to model misspecification of IPTW, a doubly robust and efficient Augmented-IPTW (A-IPTW) estimator for a MSM under static interventions was proposed in in Robins and Rotnitzky [14], Robins [15] and [16]. Under this framework, estimators are defined as solutions to the estimating equation given by the efficient influence curve, which also depends on nuisance parameters orthogonal to the treatment probabilities. Robins [15], Robins [17] and Bang and Robins [18] offered an innovative insight that the corresponding efficient influence curves for the MSM for the intervention-specific mean can be expressed in terms of sequential conditional expectations. This observation allowed for construction of an A-IPTW with estimation of minimal nuisance parameters beyond the treatment mechanism. While A-IPTW estimator provides efficiency gain and bias reduction over a misspecified IPTW estimator, as an estimation equation-based method, it still suffers from the same general sensitivity to large inverse probability weights, and it may involve solving estimating equations that have no unique solution.

To improve the stability of the estimates, the targeted maximum likelihood estimation (TMLE, van der Laan and Rubin [19] and Gruber and van der Laan [20]) provides a general doubly robust and efficient estimator using the plug-in principle, which incorporates global information encoded in the parameter map and the model. A TMLE estimator for longitudinal static MSMs using a stratified approach was proposed by Schnitzer et al. [21]. A stratified TMLE uses the longitudinal TMLE for mean outcomes (van der Laan and Gruber [22]) to separately estimate each intervention-specific mean; these means are then used to fit MSMs for the hazard and survival functions. This estimator readily improves upon IPTW in both robustness and efficiency. However, in applications with numerous interventions of interest, the stratified TMLE is vulnerable to insufficient support for certain interventions, as it does not directly target the MSM parameters to take advantage of the extrapolation across interventions. Most recently Petersen et al. [23], building on the results from Robins [15], Robins [17] and Bang and Robins [18], presented a pooled TMLE estimator for a longitudinal dynamic MSM for intervention-specific means (in particular, for survival functions). This estimator directly targets the parameters of the MSM, and pools over all interventions of interest in the updating step, potentially weakening the data support needed to achieve efficiency and double robustness.

This paper builds upon the work of of Petersen et al. [23] to present a pooled TMLE for the MSM for the hazard function under longitudinal dynamic interventions. The efficient influence curve of the hazard MSM requires that a plug-in estimator be updated iteratively until convergence, but we show that, in practice, a stopping rule based on the relative sizes of the empirical mean of efficient score equation and the estimated standard error will suffice to produce MSM estimates with the desired theoretical properties. The proposed estimator is efficient and doubly robust, hence offers an improvement over traditional IPTW estimator; it directly targets the MSM parameters by pooling across interventions to update the initial nuisance parameter estimates, hence is less sensitive to data support issues than the stratified TMLE.

In practice, when the research question of interest directly concerns the survival function and one can correctly specify its MSM, using a TMLE for the survival MSM, as proposed in Petersen et al. [23], would be a more straightforward approach than using our proposed estimator for the hazard MSM. However, in many applications the hazard function itself (or failure rate, as coined in some disciplines) is of substantive interest, since it describes how the rate of mortality changes with time. Moreover, in some cases, one may have more domain knowledge to specify the hazard MSM than to specify the survival MSM. In particular, the survival function is cumulative and hence must be monotonic in time, whereas the hazard function can take different shapes, and hence can capture more domain information (such as a decreasing hazard, or a U-shaped hazard with spikes at the beginning and end time points). In applications where the research question is best served (or the domain knowledge is best used) by studying the hazard function, our proposed estimator which directly estimates the parameters of a hazard MSM would be a more targeted approach. In addition to these application considerations, the hazard MSM also allows for the use of more stable weights that only need to be defined among subjects that are still alive.

1.1 Organization of paper

This paper is organized as follows. Section 2, describes the data structure and defines the parameter using a nonparametric causal framework. Section 3 first reviews the non-targeted substitution estimator (a.k.a. G-computation estimator) and the IPTW estimator for the parameters of interest, and then presents the efficient influence curve, and describes the proposed TMLE estimator. Section 4 evaluates the performance of these three estimators in a simulation study mimicking an observational cohort study. This paper concludes with a summary.

2 Parameters of interest

Consider a longitudinal data structure

O=L0,A0,L1,A1,…,Lt,At,…,LK,AK,LK+1,

where K+1 is a user-supplied maximum follow-up time of interest. Here, L0 denotes the baseline covariates; At encodes the treatment variable At1 and the censoring indicator At2, where At2=1 indicates the subject was right censored at or before time t; and Lt denotes all the time-varying covariates measured between At−1 and At. In particular, Lt includes the counting outcome process Yt⊂Lt, where Yt=1 indicates event (say death) has occurred at or before time t. For notational convenience, after death or censoring, all variables are encoded by carrying forward the last observation. We shall use the boldface notation Lt≡L0,…,Lt, Lj,t≡Lj,…,Lt and L−1≡∅; similarly for their realizations ℓt≡ℓ0,…,ℓt,ℓj,t≡ℓj,…,ℓt. Analogous notations apply to the vector At. The observed data consist of n independent and identically distributed (i.i.d.) copies of O drawn from a distribution P0. Let m be a statistical model for P0; the assumptions on this statistical model are limited to true user knowledge, in particular, we avoid strong and restrictive parametric assumptions.

To illustrate these notations and the subsequent concepts, let us consider an example from HIV research (van der Laan and Petersen [7]). In HIV+patients with virologic failure on first line antiretroviral therapy (ART), current guidelines (DHHS [24]) recommend prompt switch to a second line regimen. However, delayed switch is common in resource limited settings. Suppose we wish to assess the effect of delayed switch on mortality (e. g. van der Laan and Petersen [7], Petersen et al. [5, 6]). Our sample consists of n individuals drawn from a target population of HIV+patients with confirmed virologic failure on their first line ART. The baseline t=0 is the time of confirmed failure, and data is collected on a monthly basis thereafter. Variable Lt encompasses time varying covariates at time t, including CD4+T cell counts and Yt, an indicator of death by time t. Variable L0 consists of the baseline values of these covariates at t=0, as well as time-independent variables such as patient demography and history prior to virologic failure. The treatment variable At1 is the indicator of switching to second line regimen by time t. For simplicity sake, for now we assume no right censoring in this example.

2.1 Causal model and causal parameters

The causal assumptions implied in the notation of O can be made explicit by a nonparametric structural equations model (NPSEM, Pearl [4]):

Lt=fLtLt−1,At−1,ULt;At=fAtLt,At−1,UAt,1ptfor t=0,…,K+11pt.

This causal model assumes that each variable X in the observed data structure is an unknown deterministic function fX of certain observed variables, which we refer to as the parents of X and denote by Pa(X), and some unmeasured exogenous random factors UX. This causal model defines a random variable with distribution PO,U on a unit.

In practice, additional exclusion restriction assumptions informed by subject matter knowledge can be represented by limiting the variables in Pa(X). In the HIV example, at each time t, the investigators may specify that CD4 counts, death, and regimen modification all depend on the patient’s entire observed past and unmeasured factors. But if they know that the decision to switch regimen is based only on the most recent CD4 measurement, then Pa(At1) may be restricted to exclude all earlier CD4 measurements.

As we alluded to in Section 1, research questions can be formalized in terms of intervention rules on the treatment variables. An intervention rule d is a function that deterministically assigns treatment at time t based on covariate history according to At=dLt. This rule may be static – assigning the same treatment option at regardless of covariate history, i. e. At=dLt≡at; or it may be dynamic – assigning different treatment options to different covariate histories. We will use the boldface notation At=dLt to denote the vector Ak=dLkk=0,…t; this means that each of the variables Ak, from k=0 to t, was assigned value according to rule d and its covariate history. In the HIV example, a static rule would be to always switch regimen at m months after virologic failure, and a dynamic rule would be to switch ART at the first time t when the patient’s CD4 counts drop below a pre-specified threshold θ.

Given a set d of intervention rules of interest, investigators are often concerned about their comparative causal effect on the outcome process Yt. To be more precise, consider an ideal experiment where all subjects are assigned treatment under an intervention rule d and right censoring is also prevented; the covariates, on the other hand, take the value that they may in response to rule d. We call the variables At the intervention variables, as they are the ones that are subject to manipulation. This ideal experiment can be formalized in the NPSEM by setting the equations for At to At=dLt, and replacing the input At−1 in fLt with dLt−1, which are treatment assignments from time 0 to t−1 under rule d and history Lt−1. As a result, the only random endogenous variables of the system are the covariates; we use Lt(d) to denote the time varying covariates that result under the intervention rule d, in particular Yt(d) is the indicator of death under such a regimen. The comparative causal effect of rule d1 vs rule d2 can be assessed by comparing the distributions of the outcome processes Yt(d1) vs Yt(d2).

Suppose we wish to understand how the outcome process Yt(d) changes as a function of d, t and some baseline covariate V⊂L0. To this end, we study the intervention-specific hazard function

(1)λ(d,t,V)≡P(Yt(d)=1|Yt−1(d)=0,V)

on the space d×τ×v, where d is the set of rules we wish to compare, τ={1,…K+1} contains all the follow-up times of interest, and v is the outcome space of V.

Studying the entire function (d,t,V)↦λ(d,t,V) is often difficult due to feasibility of computation, interpretation of results, and other challenges associated with high-dimensionality. Instead, we can study a more tractable, simplified model/summary of this function which captures how λ changes as a function of (d,t,V) in only a few summarizing parameters. More specifically, following the marginal structural working model framework in Neugebauer and van der Laan [25], let mψ(d,t,V) be a working MSM for λ(d,t,V), parameterized by ψ∈RJ. To be concrete, from here on, we will use a generalized linear hazard MSM with logit link,

mψ(d,t,V)=expitψ⋅ϕ(d,t,V),

where ϕ(d,t,V) is the vector of linear predictors. However, it is important to note that the methods presented here can be generalized to other working MSM.

In addition to specifying the working MSM, we also specify a user-supplied kernel weight function h(d,t,V), and the standard log-likelihood loss. Formally, the causal parameter of interest is defined as

(2)Ψ(PO,U)≡argminψE0{∑t∈τ,d∈dh(d,t,V)I(Yt−1(d)=0)×logmψ(d,t,V)Yt(d)1−mψ(d,t,V)1−Yt(d)}.=argminψE0∑t∈τ,d∈dh(d,t,V)P(Yt−1(d)=0|L0)×logmψ(d,t,V)λ(d,t,L0)1−mψ(d,t,V)1−λ(d,t,L0).

In words, our causal parameter of interest is the vector of coefficients of the MSM weighted by h. Under the working MSM framework, this parameter can be understood as the best (assessed by the loss function) weighted (by h) approximation (given by the working model mψ) of the hazard function λ. Embedded in the definition of the parameter are the stabilizing weights h(d,t,V). To mitigate positivity violations, h(d,t,V) can be chosen to down weight a rule d which has little data support at time t within strata of marginal covariates V. But the effectiveness of this stabilization will be limited when d uses time-varying tailoring variables.

Continuing our HIV example, suppose we wish to assess how delay in switching regimen affect mortality. Let dm denote the rule that dictates switching regimen at m months after virologic failure. That is, At=dm(Lt)=0 for t<m and At=dm(Lt)=1 for t≥m, given Yt=0. Let d be the set of all possible switching times we wish to compare, i. e. d={dm:m=0,…,K,∞}. If we are only interested in the marginal hazard, then we set V=∅. Or, if we wish to assess the hazard function stratified by CD4 at the time of virologic failure, then we can set V=1CD40<500cells/μℓ. For simplicity, let us continue this example with the former option, and let the weights be h(d,t)=1. For this example, we choose the logistic working model to be mψ(dm,t,V)=expitψ0+ψ1t+ψ2max(t−m,0), where max(t−m,0)=0 if the patient has not switched by time t−1, and max(t−m,0)=t−m encodes how long the patient has been on his second line regimen if he has already switched.

2.2 Statistical parameter

Thus far, we have used the NPSEM to formulate the parameter of interest Ψ(PO,U) in eq. (2). This parameter is a function of the distributions of Lt(d) and Yt(d), which are generated within an ideal experiment. Unfortunately, such ideal experiments are not always possible in real life. Then, what are the sufficient assumptions on the data-generating process under which the parameter Ψ(PO,U) can be estimated using the observed data?

To answer this question, we review the expression in eq. (2) and note that P(Yt−1(d)=0|L0)=1−E(Yt−1(d)|L0) and

(3)λ(d,t,L0)=P(Yt(d)=1,L0)−P(Yt(d)=1,Yt−1(d)=1,L0)P(Yt−1(d)=0,L0)=E(Yt(d)|L0)−E(Yt−1(d)|L0)1−E(Yt−1(d)|L0).

Consequently, to identify Ψ(PO,U) as a function of the data generating distribution P0, it suffices to establish the identification of the time-dependent causal dose-response curve {E(Yt(d)|L0):d∈d,t∈τ}.

To this end, we make the sequential randomization assumption (Robins [3]):

(4)Ak⊥L(d)|Pa(Ak),

and the positivity assumption

(5)P0Ak=d(Lk)|Lk,Ak−1=d(Lk−1)>0 a.e.,

for every d∈d, k∈τ, and V∈v with h(d,t,V)>0. Assumption (4) specifies that at each k, the variable Ak is randomized conditional on its parent variables. In our HIV example, this can be satisfied if the parent variables of Ak contain all common determinants of mortality and the decision to switch regimen. Assumption (5) requires that under P0, for each rule d and its compatible covariate history for which the weight h(d,k,V) is non-zero, there is non-zero probability that the subject’s treatment will continue to follow this rule. In our HIV example with h(d,t,V)=1, given one has not yet switched regimen, there should be non-zero probability that the patient will switch at any given time m in d. Assumption (5) would be violated if, say, d contains the rule to switch at 6 months but no patient in the study population actually switches at 6 months post-failure. This assumption would also be violated if, say, all patients with switch regimen at month 1.

Under assumptions (4) and (5), we obtain identification of the intervention-specific mean:

(6)E(Yt(d)|L0)=∑ℓ1,tyt∏j=1tP0(Lj=ℓj|L0,ℓ1,j−1, Aj−1=d(L0,ℓ1,j−1))≡Q1d,t(P0)(L0).

We are reminded that ℓ1,j−1≡ℓ1,...,ℓj−1. This is generally known as the G-computation formula (Robins [3]) for the intervention-specific mean. This in turn identifies the intervention-specific hazard function:

(7)λ(d,t,L0)=Q1d,t(P0)(L0)−Q1d,t−1(P0)(L0)1−Q1d,t−1(P0)(L0).

Consequently, the causal parameter of interest Ψ(PO,U) in eq. (2) is identified as a function of the observed data distribution given by

(8)ψ0≡Ψ(P0)≡argminψ{EP0∑t∈τ,d∈dh(d,t,V)1−Q1,0d,t−1(L0)×Q1,0d,t(L0)−Q1,0d,t−1(L0)1−Q1,0d,t−1(L0)logmψ(d,t,V)+1−Q1,0d,t(L0)1−Q1,0d,t−1(L0)log1−mψ(d,t,V)}.

Note that the sequential randomization assumption (4) is an untestable assumption on the data-generating process under which we can claim the causal parameter (2) equals the statistical parameter (8). It provides a guiding principle to judge the causal interpretation of the statistical parameter. By contrast, the positivity assumption (5) is a testable assumption on the data-generating distribution P0 under which the statistical parameter (8) is well-defined.

The remainder of this paper focuses on statistical inference for ψ0.

3 Estimators for ψ0

Recall that the observed data consist of n i.i.d. copies of O∼P0∈m. Before we introduce the proposed efficient and doubly robust estimator, we will review two available estimators for ψ0: the non-targeted substitution G-computation estimator (Robins [3], Taubman et al. [26]) and the aforementioned IPTW estimator. The proposed targeted maximum likelihood estimator uses either of these to obtain an initial estimator of ψ0. But before we proceed, we shall agree on the following notation.

3.1 Notations

We use Pn to denote the empirical distribution of n i.i.d. copies of O∼P0. Given a function O↦f(O), Pnf denotes the empirical mean Pnf≡1n∑i=1nf(Oi). More generally, for any P∈m, Pf≡EPf(O).

For a generic P∈m. We use QL0 to denote the marginal distribution of L0. Generalizing the functional in eq. (6), for a given t∈τ and d∈d, denote at P∈m:

(9)Qkd,t(P)(Lk−1)≡∑ℓk,tyt∏j=ktP(Lj=ℓj|Lk−1,ℓk,j−1,Aj−1=d(Lk−1,ℓk,j−1)),1pt for k≤t,1pt

and Qt+1d,t(⋅)≡Yt. In the above notation, the superscript t signals the expectant outcome variable Yt and the subscript k signals the length of the conditioning covariate history. Under assumptions (4) and (5), Qkd,t(P0)(Lk−1)=E(Yt(d)|Lk−1,Ak−1=d(Lk−1)), which is the conditional mean of Yt(d) given an observed past that has followed the rule d up to time k−1. For simplicity, we may sometimes write Qkd,t instead of Qkd,t(P) when referring to the functional evaluated at a generic P∈m, and Qk,0d,t(Lk−1) instead of Qkd,t(P0)(Lk−1) when the functional is evaluated at P0∈m. Bang and Robins [18] made a key observation that these functionals satisfy the relation

(10)Qkd,t(Lk−1)=EP|Qk+1d,t(Lk)Lk−1.

For a counting outcome process, these functionals are also monotonic in t, i. e. for a given k, Qkd,t≤Qkd,t+1 for all t≥k. We denote the components corresponding to the non-intervention variables as Q≡QL0,Qkd,t:,t∈τ,k≤t,d∈d. For the intervention variables, we denote the treatment allocation probabilities P(Ak|Lk,Ak−1) as g(Ak|Lk,Ak−1), and the product ∏k=1tg(Ak|Lk,Ak−1) as g(At|Lt). For our purposes, the couple (Q,g) readily specifies a distribution P∈m, so sometimes we may abuse notation and write P=(Q,g). At the data generating distribution P0, we adopt the subscripts Q0 and g0.

3.2 G-computation Estimator

The G-computation formula in eq. (6) readily delivers a non-targeted substitution estimator (as opposed to the targeted substitution estimator that is TMLE), which is generally known as the parametric G-computation (Gcomp) estimator. More precisely, using the notations in Section (3.1), the statistical parameter Ψ(P0) in eq. (8) can be expressed as Ψ(Q0). Therefore, fitting the hazard MSM using a non-targeted estimator Qn of Q0 yields a non-targeted substitution estimator Ψ(Qn) of Ψ(Q0).

Recall that Q0 consists of the marginal distribution of L0, QL0(P0), and the conditional means of Yt(d), Qkd,t(P0) defined in eq. (9). To estimate the marginal distribution QL0(P0), we can use the empirical distribution of L0, denoted QnL0. To estimate the conditional means Qkd,t(P0), one approach is to estimate the conditional densities for each Lt given its parents and use the definition in eq. (9). While this density-based approach ensures that the monotonicity in t of Qkd,t is preserved, the dimension of the nuisance parameter and the computational cost grows with the dimension of Lt and the number of time points. One way to implement this approach is to use simplifying parametric modeling assumptions on these nuisance parameters. We refer to Taubman et al. [26] and Young et al. [27] for expositions of this technique.

Another approach to estimate the conditional means Qkd,t(P0) is to minimize estimation of nuisance parameters by exploiting the recursive relation eq. (10) noted by Bang and Robins [18], which allows one to use the many available regression techniques (parametric or data-adaptive) in the literature. We can implement this regression-based approach by running the following two-level algorithm. Recall that in our notation Qkd,t(P0), d is the intervention of interest, t is the time of the expectant outcome, and k is the length of the conditioning history.

Starting at t=K+1, estimate the conditional means Qkd,t(P0): k≤t,d∈d as follows:
1. Initiate at k=t: Recall that Qtd,t(P0)≡EP0Yt|Lt−1,At−1=d(Lt−1) and Qt+1d,t≡Yt. We obtain an estimator Qt,nd,t of Qtd,t(P0) by regressing Yt on the observed values Lt−1 and At−11 among observations that remain uncensored by time t−1, and evaluate the fitted function at the observed Lt−1 and the intervened values At−11=d(Lt−1), for these uncensored observations. For a subject that has died before time t, the value is deterministically set to 1. We can organize the data by having one row for each patient i and each regimen d.
2. At each subsequent k=t−1,…,1: At the previous step, we have thus far obtained an estimator Qk+1,nd,t of Qk+1d,t(P0). To obtain an estimator Qk,nd,t of Qkd,t(P0), we regress Qk+1,nd,t(Lk) on the observed values Lk−1 and Ak−11 among those observations that remained uncensored by time k−1, and evaluate the fitted function at the observed Lk−1 and the intervened values Ak−11=d(Lk−1) on these uncensored observations. For observations that had died before time k, the value is deterministically set to 1.
3. After iterating step (b) in order of decreasing k, we have obtained a sequence of means Qk,nd,t:k≤t,d∈d.
Repeat step 1 in order of decreasing t, from t=K to t=1. At the end, we will have obtained estimators Qk,nd,t:t∈τ,k≤t,d∈d

For each history k, monotonicity in t≥k of the expectations, i. e. Qk,nd,t≤Qk,nd,t+1, can be enforced in step 1.b with respect to Qk,nd,t+1 obtained at the previous t+1 level through a simple sequential truncation. To do this, for subjects that have not died or censored, we set Qk,nd,t=Qk,nd,t+1 whenever Qk,nd,t>Qk,nd,t+1. Other more sophisticated approaches are available, but they are outside the scope of this paper.

At the end of this algorithm, we have the sequence of conditional means. {Q1,nd,t(L0): t∈τ,d∈d} for each of the n observations. Pooling together these estimates Q1,nd,t(L0) over all d and t, we have one row per patient i, rule d∈d and t∈τ. The G-computation estimator ψnGcomp≡Ψ(Qn) is obtained by fitting a weighted logistic regression of Q1,nd,t(L0)−Q1,nd,t−1(L0)1−Q1,nd,t−1(L0) according to the MSM, with weights h(d,t,V)(1−Q1,nd,t−1(L0)).

Consistency of ψnGcomp relies on consistency of Qn. In the density-based approach, this means consistent estimation of all the conditional densities of Lt given its parents; in the regression approach, this means consistent estimation of the conditional means Qkd,t(P0). In either case, correct specification of Q0 under a finite dimensional parametric model is rarely possible in practice. Alternatively, we may use machine learning algorithms, such as Super Learner. This option is more enticing, especially when used with the regression-based approach, since there are more data-adaptive techniques available to estimate a conditional mean via regression. However, unlike the Inverse Probability of Treatment Weighted Estimator (Section 3.3) and the Targeted Maximum Likelihood Estimator (Section 3.4) which satisfy a central limit theorem under appropriate empirical process conditions, analogous results for ψnGcomp are not available, regardless of model size. In addition, a non-targeted estimator Qn of Q0 is obtained by minimizing a global loss function for Q0, not for Ψ(Q0). This means, in particular, that the bias-variance tradeoff in Qn is optimized for the high-dimensional nuisance parameter Q0, instead of a much lower-dimensional parameter of interest Ψ(Q0). The proposed targeted estimator in Section 3.4 aims to address these two issues by providing a substitution estimator that is asymptotically linear (under appropriate regularity conditions), and optimizes the bias-variance tradeoff of Qn towards Ψ(Q0) via an updating step.

3.3 Inverse probability of treatment weighted estimator

Inverse probability of treatment weighted estimation is the standard methodology for estimating the parameters of a marginal structural model in the presence of time-varying confounding [1]), due to its ease of implementation and its asymptotic linearity, which allows for construction of Wald-type confidence intervals.

To begin, we first note that the parameter in eq. (8) can be rewritten in an IPTW form:

(11)Ψ(P0)=ΨIPTW(g0)≡argminψ{EP0∑t∈τ,d∈dh(d,t,V)I(At−1=d(Lt−1))g0At−1=d(Lt−1)|Lt−11−Yt−1×logmψ(d,t,V)Yt1−mψ(d,t,V)1−Yt}.

The IPTW estimator ψnIPTW is obtained by fitting a weighted logistic regression of Yt according to the MSM, with weights h(d,t,V)I(At−1=d(Lt−1))gnAt−1=d(Lt−1)|Lt−11−Yt−1, where gn is an estimator for g0.

The asymptotic theory of the IPTW estimator is well understood in the literature. We refer the reader to Robins [12], van der Laan and Robins [28] and van der Laan and Petersen [7], where the last reference specifically addresses dynamic intervention rules. In summary, ψnIPTW described above satisfies the estimating equation PnDIPTW(ψ,gn)=0, where

(12)DIPTW(ψ,g)=∑t∈τ,d∈dh˜(d,t,V)I(At−1=d(Lt−1))gAt−1=d(Lt−1)|Lt−1(1−Yt−1)Yt−mψ(d,t,V),

with h˜(d,t,V)≡h(d,t,V)ϕ(d,t,V). The estimator ψnIPTW is well-defined if it is a unique minimizer, and it is a consistent estimator of ψ0 provided gn is a consistent estimator of g0. Moreover, ψnIPTW is asymptotically linear with influence curve

(13)ICIPTW(g0)=MIPTW(ψ0,g0,P0)−1DIPTW(ψ0,g0),

where

MIPTW(ψ′,g,P)≡−∂PDIPTW(ψ,g)∂ψ|ψ=ψ′=EP{∑t∈τ,d∈dh(d,t,V)ϕ(d,t,V)ϕ(d,t,V)TI(At−1=d(Lt−1))gAt−1=d(Lt−1)|Lt−1(1−Yt−1)×mψ′(d,t,V)(1−mψ′(d,t,V))}.

The sample covariance matrix of the estimated influence curve is given by ΣnIPTW≡MIPTW(ψnIPTW,gn,Pn)−1DIPTW(ψnIPTW,gn). The variance of n(ψnIPTW−ψ0) can be estimated using ΣnIPTW. Consequently, we can construct Wald confidence intervals of level (1−α) as ψnIPTW±ξ1−α/2(ΣnIPTW/n)1/2, where ξ1−α/2 is the (1−α/2)-quantile of the standard normal distribution. These confidence intervals ought to be conservative when g0 is estimated using a correctly specified parametric model. But no such theoretical guarantee for data-adaptive estimation of g. Note that this influence curve based variance estimate assumes that the weights h(d,t,V) are known functions; when these weights are estimated, this variance estimate should be interpreted as estimating the variance of an estimator of MSM parameters defined by the estimated weights.

Though both G-computation estimator and IPTW estimator properly account for time-varying confounding, the popularity of IPTW over G-computation estimator is apparent from its straightforward implementation and its theoretical validity for confidence intervals. Moreover, the treatment probabilities g0 may arguably be easier to specify correctly than the sequential conditional means Qkd,t(P0). However, the IPTW estimator may be generally more susceptible to near positivity violations due to the inverse probability weighting in eq. (11). Often, the kernel functions h(d,t,V) are chosen to stabilize these inverse probability weights. This remedy, however, is less effective when the rules d are dynamic and use time varying covariates as tailoring variables, since the marginal function h(d,t,V) cannot depend on the time varying covariates Lt.

3.4 Targeted maximum likelihood estimator

As discussed earlier, consistency of the IPTW estimator relies on consistency of gn, while consistency of the G-computation estimator relies on consistency of Qn. In this section, we propose a semiparametric efficient estimator that is robust against misspecification of either Q0 or g0. These theoretical promises hinge on the use of one important ingredient – the efficient influence curve for ψ0.

3.4.1 The efficient influence curve

Central to our methods is viewing the parameter of interest ψ0 as the value of the map Ψ: m→RJ evaluated at P0, where Ψ(P) is given by eq. (8), with the corresponding functionals at P∈m. It is also straightforward to note from eq. (8) that Ψ(P)=Ψ(Q). From its definition, we see that ψ=Ψ(P) satisfies the characterizing equation

(14)0=U(ψ,Q,P)≡EP∑t∈τ,d∈dh˜(d,t,V)1−Q1d,t−1(L0)Q1d,t(L0)−Q1d,t−1(L0)1−Q1d,t−1(L0)−mψ(d,t,V).

The mapping Ψ is pathwise differentiable on m; its efficient influence curve (EIC) sheds light on the asymptotic properties of all regular and asymptotically linear estimators of Ψ(P0). The latter statement is formalized in the following lemma. We refer the reader to Bickel et al. [29], van der Laan and Robins [28] van der Vaart and Wellner [30] for definitions and proofs about properties of efficient influence curves in general. The efficient influence curve for the mapping P↦Ψ(P) in 8 can be derived using the characterizing eq. (14) via the functional delta method; this derivation can be found in Petersen et al. [23] and Schnitzer et al. [21], we provide it in the appendix for completeness sake.

Lemma 1 (influence curve for Ψ)

Suppose the mapping Ψ:m→RJ is well-defined at P, in the sense that it is a unique minimizer and hence characterized by the equation U(Ψ(P),Q,P)=0. Then, its efficient influence curve at P=(Q,g), denoted D∗(Q,g), is given by D∗Ψ(Q),Q,g, where

(15)D∗ψ,Q,g(O)≡M(ψ,Q,P)−1×∑t∈τ∑d∈dDd,t(ψ,Q,g)(O)+DL0(ψ,Q)(O),

with

M(ψ,Q,P)≡−|∂U(ψ′,Q,P)∂ψ′ψ′=ψ=EP{∑t∈τ,d∈Dh(d,t,V)ϕ(d,t,V)ϕ(d,t,V)T1−Q1d,t−1(L0)mψ(d,t,V)(1−mψ(d,t,V))},Dd,t(ψ,Q,g)(O)≡Jψ(d,t,V)∑k=1tI(Ak−1=d(Lk−1))gAk−1=d(Lk−1)|Lk−1Qk+1d,t(Lk)−Qkd,t(Lk−1),DL0(ψ,Q)(O)≡∑t∈τ,d∈dh˜(d,t,V)1−Q1d,t−1(L0)Q1d,t(L0)−Q1d,t−1(L0)1−Q1d,t−1(L0)−mψ(d,t,V).

with Jψ(d,t,V)=h˜(d,t,V)−I(t≤K)h˜(d,t+1,V)(1−mψ(d,t+1,V)).

The variance VarPD∗Ψ(Q),Q,g(O) is a generalized Cramér-Rao lower bound for the asymptotic variance of any regular and asymptotically linear estimator of Ψ(P).

Moreover, if either Q=Q0 or g=g0, then P0D∗(Ψ(Q),Q,g)=0 implies that Ψ(Q)=Ψ(Q0).

Proof. See appendix for derivation of eq. (15) and proof of double robustness; see Chapters 2 and 3 of Bickel et al. [29], Chapters 1 and 2 of van der Laan and Robins [28] and Chapter 3 of Tsiatis [31] for the statement regarding variance bounds.□

Now, we are ready to describe the implementation of a TMLE estimator using D∗.

3.4.2 The loss function, the fluctuation model, and the algorithm

In a glimpse, our strategy consists in targetedly updating given initial estimators Qn of Q0 by minimizing a pre-specified loss along a least favorable (with respect to ψ0) submodel through Qn; iterate this updating procedure until the estimating equation PnD∗(Qn∗,gn)=0 is solved at some final targeted estimate Qn∗ of Q0, and then evaluate Ψ at this Qn∗.

More specifically, for each d∈d, t=K+1,…,1 and k≤t, the relation in eq. (10) allows us to consider Qkd,t as a conditional expectation of Qk+1d,t, therefore we specify the quasi negative log likelihood loss for Qkd,t, indexed by its expectant Qk+1d,t:

(16)LQkd,t(O)≡−IAk−1=dLk−1logQkd,t(Lk−1)Qk+1d,t(Lk)1−Qkd,t(Lk−1)1−Qk+1d,t(Lk).

We suppressed the indexing by Qk+1d,t in the notation. The corresponding least favorable submodel through Qkd,t, parametrized as Qkd,t(ε):ε∈RJ (recall that J is the dimension of the parameter ψ0), is chosen to satisfy the score condition

∑k=1t|∂LQkd,t(ε)∂εε=0=Dd,tψ,Q,g.

In particular, we can choose

(17)Qkd,t(ε)=expitlogit Qkd,t+εHkd,t(ψ,g),

where Hkd,t(ψ,g)(Lk−1)≡Jψ(d,t,V)gAk−1=d(Lk−1)|Lk−1. Note that the dependency of Qkd,t(ε) on ψ and g are suppressed in the notation.

Before describing the algorithm, we make the following observation. Due to the form of the efficient influence curve, the direction of fluctuation Hkd,t(ψ,g) depends on the parameter ψ itself; consequently, the implementation of the TMLE for the hazard MSM conceals more subtleties than its counterpart for the survival MSM in Schnitzer et al. [21] and [23]. For one, in addition to non-targeted initial estimates Qn and gn, an initial estimate of ψ0 is also needed to perform the first update for Qn. One should choose a consistent initial estimator (albeit not doubly robust), such as the G-computation or the IPTW estimator. Since iterations will be performed, choosing either one of these should have the same asymptotic implications. However, it may make a difference in finite sample performance when Qn is misspecified, since the fluctuation serves to provide bias reduction over such misspecified Qn. The second subtlety is that this TMLE estimator requires iterative updates, wherein each iteration uses the previously obtained updated estimate of ψ0 to steer the direction of fluctuation. The goal of these iterations is to produce targeted estimators Qn∗ of Q0 that satisfy the efficient score equation PnD∗(Qn∗,gn)=0 in order to achieve proper bias reduction. However, once the residual term PnD∗(Qn∗,gn) becomes smaller than the standard error of the estimator, computational efforts spent to further minimize it will only yield diminishing returns on bias reduction, as can be evidenced by studying the linear expansion of the proposed estimator. Therefore, we use as stopping rule PnD∗(Qn∗,gn)/SEˆn<1/n, where SEˆ is estimated using Varˆ(D∗(Qn∗,gn))/n.

With these issues in mind, we are now ready to describe the algorithm.

Obtain initial estimators gn of g0 and Qn of Q0. For the latter, estimate the marginal distribution of L0 using the empirical distribution, and estimate the conditional means Qkd,t(P0) using the regression-based technique of Bang and Robins [18] described in Section 3.2. Obtain initial estimator ψn of ψ0 using either the IPTW estimator ΨIPTW(gn) or the G-computation estimator Ψ(Qn).
Given initial estimators ψn, gn and Qn=QnL0,Qk,nd,t,t∈τ,k≤t,d∈d, we sequentially update the conditional means Qk,nd,t in a two-level algorithm as follows:
1. Starting at t=K+1, estimate the sequence Qkd,t(P0):k≤t,d∈d in order of decreasing k as follows (recall that Qt+1d,t≡Yt):
  1. At each k=t,…,1: We have an initial estimator Qk,nd,t of Qkd,t(P0), and at the previous step, we have obtained an updated estimator Qk+1,nd,t,∗ of Qk+1d,t(P0), of which Qkd,t(P0) is the expectation. The updated estimator Qk,nd,t,∗ is given by Qk,nd,t,∗≡Qk,nd,t(εkt) where
    εk,nt≡argminPn∑d∈dLexpitlogitQk,nd,t+εJψn(d,t,V)gAk−1=d(Lk−1)|Lk−1.
    We can obtain this coefficient by regressing the expectant Qk+1,nd,t,∗(Lk) on Jψn(d,t,V) with offset logitQk,nd,t(Lk−1) and weights
    IAk−11=dLk−1/gAk−1=d(Lk−1)|Lk−1,
    among observations that remained uncensored by time k−1. For observations that had died before time k, the value of the updated estimator is deterministically set to 1.
  2. After iterating step (i) in order of decreasing k, we have obtained updated estimators Qk,nd,t,∗: k≤t,d∈d.
2. Repeat step (a) in order of decreasing t, from t=K to t=1.
At the end of the two-level algorithm in step (2) above, we have obtained targeted estimators Qn∗=QnL0,Qk,nd,t,∗(⋅): t∈τ,k≤t,d∈d. In particular, we have Q1,nd,t,∗(L0): t∈τ,d∈d for each of the n observations. Pool together these estimates Q1,nd,t,∗(L0) over all d and t, and we have one row per patient i, rule d∈d and t∈τ. We can then update the parameter estimate using ψn∗≡Ψ(Qn∗); this can be implemented by fitting a weighted logistic regression of Q1,nd,t,∗(L0)−Q1,nd,t−1,∗(L0)1−Q1,nd,t−1,∗(L0) according to the MSM, with weights h˜(d,t,V)(1−Q1,nd,t−1,∗(L0)).
Repeat step 2 and 3 using ψn∗ and Qn∗ as initial estimators. Iterate this procedure until the stopping criteria PnD∗(Qn∗,gn)/SEˆn<1/n, where SEˆn=VarˆD∗(Qn∗,gn)/n, is satisfied.
The final updates Qn∗ and ψn∗ are the targeted maximum likelihood estimators for Q0 and ψ0, respectively.

3.4.3 Statistical inference of TMLE

As we alluded to earlier, given the pre-specified loss function (16), the corresponding least favorable submodel eq. (17) is chosen so that, by design, the TMLE estimator Qn∗ satisfies PnD∗(Qn∗,gn)≈0. From this property stems the doubly robust and locally efficient properties of ψn∗.

Specifically, under regularity and empirical process conditions (e. g. van der Laan and Rose [32]), if both Qn∗ and gn are consistent estimators, then ψn∗ is asymptotically linear with influence curve D∗(Q0,g0); if gn converges to g0, but Qn∗ converges to some Q∗ (which may be correctly specified or otherwise), then the influence curve of ψn∗ equals D∗(Q0,g0) minus its projection onto the tangent space of m for g0. In either case, n(ψn∗−ψ0) converges weakly to a normal distribution with each diagonal element of the covariance matrix equal or greater than its counterpart on the covariance matrix of D∗(ψ0,Q∗,g0). Consequently, the asymptotic variance of n(ψn∗−ψ0) can be conservatively estimated using Σn∗, the sample covariance matrix of D∗(ψn∗,Qn∗,gn). We can construct asymptotically conservative confidence intervals of level (1−α) as ψn∗±ξ1−α/2(Σn∗/n)1/2. The same comments as for IPTW regarding the estimated kernel weights also apply here.

4 Simulation study

In this section, we evaluate the relative performances of the G-computation estimator, IPTW estimator and TMLE estimator for the parameters of a marginal structural model for the hazard function. For each estimator, we assess the bias, variance, mean squared error (MSE) and coverage estimates for the influence-curve based 95 % confidence intervals.

4.1 Data generating process

We use a data generating process that resembles the running example, but now including right censoring. A sample consists of n i.i.d. copies of

O=W,CD40,A02,A01,Y1,CD41,A12,A11,…,YK,CD4K,AK2,AK1,YK+1,

with K+1=5. The baseline variable W encodes sex, the baseline age and disease stage, the time varying covariate CD4t encodes the most recent CD4 count measurement at time t, and Yt is the indicator of death by time t. The intervention variables are At=At1,At2, where the treatment variable At1 is the indicator of having switched regimen by time t, and the censoring variable At2 indicates loss to follow up by time t.

We briefly summarize the data generating process here, and defer the details of the data generating distributions are given in the appendix. First, we draw the baseline time independent covariates W. At each time t≥0, if still alive and uncensored by t−1, then first draw mortality status Yt based on W, past CD4 counts CD4t−1 and past switching status At−11 (note at time t=0, Yt=0 by default). If still alive, then draw the CD4 count measurement for time t based on baseline covariate W, prior CD4 counts CD4t−1, and treatment status at previous time point At−11. After drawing CD4, draw censoring by loss to followup At2 based on W and current CD4 counts. If still uncensored and have not switched regimen, draw indicator of whether to switch regiment at time t based on W and current CD4 counts CD4t.

4.2 Target parameter

An intervention rule of interest dm switches regimen at m months after confirmed virologic failure and prevents right censoring, given the subject is still alive. The set of regimens of interest are indexed by switching times of interest, d={dm: m=0,…,K+1,∞}. Note that each dm is in fact a static rule assigning At1=atm, where atm=0 at t<m and atm=1 at t≥m. We summarize the hazard function λ(dm,t)=P(Yt(dm)=1|Yt−1(dm)=0) as

mψ(dm,t)=expitψ0+ψ1t+ψ2max(t−m,0),

and use kernel weights

h(dm,t)=P0At−1=at−1mI(t<t∗),

where t∗ is the first time point where all subjects have either died or censored. Both P0At−1=at−1m and t∗ are estimated from data, but are treated as given in our parameter definition. This weight mitigates near positivity violation by down-weighting rules with little support in data.

Under the given data-generating distribution, the probabilities of regimen switch are bounded between 0.15 and 0.57. The probabilities of right censoring are bounded between 0.11 and 0.2. The theoretical bounds for the cumulative function g for an individual across all regimens of interest are between 0.0069 and 0.22.

4.3 Estimators

We use the regression-based approach in Section 3.2 to estimate the initial estimator Qn of Q0. A correctly specified Qk,nd,t is obtained by regressing Qk+1,nd,t on W and the entire past covariate history Lk−1≡(CD40,…,CD4k−1) and treatment history Ak−11. A misspecified Qk,nd,t uses an intercept model. A correctly specified gn estimates the conditional censoring probabilities and the treatment allocation probabilities by adjusting for W and CD4t. A misspecified gn uses an intercept model. Both gn and Qn are fitted using a logistic regression model.

We implement the Gcomp and the IPTW estimators as described in their respective sections. To implement the TMLE estimator as described in 3.4, we use the IPTW estimator as an initial estimate of ψ0. In both IPTW and TMLE, the product of g in the denominator is truncated below at 0.01. The iterations in TMLE stops at either PnD∗(Qn∗,gn)/SEˆn<1/n, or at the 17th iteration, whichever comes first. For TMLE and IPTW, the asymptotic variances are estimated by the sample covariance of their respective influence curves, Σn∗ and ΣnIPTW; consequently, finite sample variance of these estimators can be approximated using the sample variance of their influence curves divided by n. As we mentioned in Section 3, there are no influence curve based variance estimators for the Gcomp. But in the spirit of exploration, we implemented a naive version of such variance estimators for the Gcomp by using the variance estimate for the TMLE but using the initial estimators for Q, instead of the updated ones. It is important to remember that the G-comp’s IC-based variances estimates are not backed by empirical process theory.

4.4 Results

We assess bias, variance, MSE and coverage probability of the confidence intervals of the estimators over 500 samples. The true values of the target parameters are approximated by generating large datasets where the treatment variables are assigned according to the regimens of interest and the covariates and outcomes are drawn according to the specified data-generating distributions. We display here the results for the TMLE implemented with IPTW as initial estimator. While we had also implemented the TMLE with Gcomp as initial estimator. The results between the two TMLEs are similar up to the displayed digits of significance, so we omit the Gcomp version here for clarity of exposition.

The TMLE implementations in all simulations ended by the stopping rule after one or two iterations, and are deemed convergent for our purposes. In these tables “IC-varEst” indicates the average of the influence-curve based variance estimates, “Coverage” indicates the empirical coverage of the influence-curve based Wald confidence intervals, and “Coverage truevarCI” indicates the empirical coverage of the Wald confidence intervals given by the true sample variance of the estimator.

The simulation results are displayed in Table 1 (for sample size n=500) and Table 2 (for sample size n=200). When both Q0 and g0 are correctly specified, theoretical results predict that Gcomp, IPTW and TMLE are all consistent. Indeed, the MSE is reduced at the rate of sample size increase for all three estimators. At both sample sizes, the efficiency of TMLE and IPTW are quite similar under this data generating distribution. When one of the nuisance parameters are misspecified, lemma 1 predicts that the double robustness properties of TMLE should provide bias reduction. Indeed, when gn is misspecified and Qn is correctly specified, TMLE provides significant bias reduction over the misspecified IPTW, with only slight increase in variance, hence leading to overall smaller MSE. The benefit of this bias-variance tradeoff is more apparent across sample sizes, since the overall MSE of the TMLE estimators will decrease at the speed of sample size growth, whereas those of IPTW will stabilize. When Qn is misspecified and gn is correct, TMLE also provides substantial bias reduction over the misspecified Gcomp. Though this TMLE may have a larger variance than the biased Gcomp, the bias reduction is significant enough to yield smaller MSE for TMLE and ensure MSE converges at the rate of sample size increase.

In addition to parameter estimation, we are also interested in providing variance estimate and confidence intervals for our estimators. When both Q0 and g0 are correctly specified, both IPTW and TMLE are asymptotically normal, and hence one should achieve the desired coverage using Wald confidence intervals based on influence curve-based variance estimates (“IC-varEst”). Firstly we see that IC-varEst tends to underestimate the true variance of the estimators, but the approximation gets better with sample size. At sample size 500, the Wald 95 % confidence intervals of TMLE have empirical coverage that is below the nominal coverage. Is this low coverage attributed to the IC-based variance estimates or to an underlying problem with achieving asymptotic normality? To answer this, we construct a confidence intervals using the true variance of the estimators (as estimated by the sample variance across the 500 simulations). The coverage of these confidence intervals is reported under “Coverage: true varCI”. We see that this second set of confidence intervals have coverage close to the 95 % confidence level, given the moderate sample size. This suggest that the suboptimal coverage of the IC-based confidence intervals is attributed to the performance of the IC-based variance estimator, not to the asymptotic normality. Indeed, at sample size 2000, the coverage of the IC-based confidence intervals of TMLE approximate the nominal coverage. This underscores the need for variance estimators with improved finite sample performance, possibly exploiting high-order derivatives of the parameter estimators, but this is beyond the scope of this paper. We remind the readers that we also report here a IC-based variance estimate and corresponding confidence intervals for the Gcomp, only for the sake of comparison, as the performance of these are not backed up by theoretical results.

Table 1:

Simulation Results for 500 simulations with n=500 and K+1=5. “IC-varEst”: average of the influence curve based variance estimates across 500 simulations; “Covarage”: coverage of the Wald 95 % confidence interval using IC-based variance estimate. “true var. CI”: coverage of the Wald confidence interval using true variance (approximated through the 500 simulations).

	Qn and gn correct			Correct Qn, misspec. gn		Misspec Qn, correct. gn
	Gcomp	IPTW	TMLE	IPTW	TMLE	Gcomp	TMLE
Bias
ψˆ0	−1.05e-02	−7.30e-03	−1.02e-02	−6.43e-02	−4.96e-02	−9.31e-02	6.93e-03
ψˆ1	−1.35e-02	−4.39e-03	−6.35e-04	−2.17e-01	2.89e-02	−2.55e-02	−1.90e-02
ψˆ2	9.19e-03	−4.52e-03	−9.63e-03	4.16e-01	−2.55e-02	7.93e-02	1.33e-03
Variance
ψˆ0	4.57e-02	5.43e-02	5.76e-02	3.66e-02	6.33e-02	4.65e-02	4.79e-02
ψˆ1	1.08e-02	1.57e-02	1.75e-02	6.29e-03	2.12e-02	1.19e-02	1.21e-02
ψˆ2	1.02e-02	1.45e-02	1.46e-02	2.75e-04	1.69e-02	1.08e-02	1.16e-02
MSE
ψˆ0	4.58e-02	5.43e-02	5.76e-02	4.06e-02	6.56e-02	5.50e-02	4.79e-02
ψˆ1	1.09e-02	1.57e-02	1.74e-02	5.33e-02	2.20e-02	1.25e-02	1.25e-02
ψˆ2	1.03e-02	1.45e-02	1.47e-02	1.73e-01	1.75e-02	1.71e-02	1.16e-02
IC-varEst
ψˆ0	4.98e-02	5.20e-02	4.75e-02	7.90e-02	4.99e-02	4.72e-02	4.09e-02
ψˆ1	1.25e-02	1.34e-02	1.13e-02	2.56e-02	1.15e-02	1.14e-02	9.86e-03
ψˆ2	1.30e-02	1.46e-02	1.23e-02	1.93e-02	1.36e-02	1.14e-02	1.08e-02
Coverage
ψˆ0	9.56e-01	9.40e-01	9.26e-01	9.86e-01	9.06e-01	9.40e-01	9.30e-01
ψˆ1	9.62e-01	9.34e-01	8.76e-01	8.52e-01	8.14e-01	9.38e-01	9.24e-01
ψˆ2	9.66e-01	9.44e-01	9.26e-01	3.20e-02	9.06e-01	8.88e-01	9.38e-01
Coverage truevarCI
ψˆ0	9.52e-01	9.56e-01	9.52e-01	9.32e-01	9.40e-01	9.28e-01	9.46e-01
ψˆ1	9.40e-01	9.50e-01	9.54e-01	2.26e-01	9.50e-01	9.38e-01	9.42e-01
ψˆ2	9.38e-01	9.48e-01	9.54e-01	0.00e+00	9.54e-01	8.86e-01	9.46e-01

Table 2:

Simulation results for 500 simulations with n=2000 and K+1=5. “IC-varEst”: average of the influence curve based variance estimates across 500 simulations; “Covarage”: coverage of the Wald 95 % confidence interval using IC-based variance estimate. “true var. CI”: coverage of the Wald confidence interval using true variance (approximated through the 500 simulations).

	Qnandgncorrect			Correct Qn, misspec. gn		Misspec Qn, correct. gn
	Gcomp	IPTW	TMLE	IPTW	TMLE	Gcomp	TMLE
Bias
ψˆ0	1.92e-03	−1.44e-02	−1.36e-02	−7.49e-02	−1.88e-02	−1.00e-01	4.25e-03
ψˆ1	−1.06e-02	5.81e-03	5.46e-03	−2.10e-01	9.12e-03	−1.79e-02	−1.19e-02
ψˆ2	2.04e-03	−4.74e-03	−4.47e-03	4.17e-01	−6.12e-03	8.07e-02	1.83e-03
Variance
ψˆ0	1.14e-02	1.28e-02	1.26e-02	9.18e-03	1.32e-02	1.19e-02	1.20e-02
ψˆ1	2.52e-03	3.30e-03	3.25e-03	1.56e-03	3.45e-03	2.78e-03	2.78e-03
ψˆ2	2.64e-03	3.44e-03	3.34e-03	7.09e-05	3.44e-03	2.76e-03	2.83e-03
MSE
ψˆ0	1.14e-02	1.30e-02	1.28e-02	1.48e-02	1.36e-02	2.19e-02	1.19e-02
ψˆ1	2.63e-03	3.32e-03	3.28e-03	4.55e-02	3.53e-03	3.10e-03	2.92e-03
ψˆ2	2.63e-03	3.45e-03	3.35e-03	1.74e-01	3.47e-03	9.26e-03	2.83e-03
IC-varEst
ψˆ0	1.25e-02	1.29e-02	1.22e-02	1.97e-02	1.25e-02	1.17e-02	1.03e-02
ψˆ1	3.19e-03	3.33e-03	3.04e-03	6.28e-03	3.05e-03	2.77e-03	2.50e-03
ψˆ2	3.26e-03	3.48e-03	3.13e-03	4.63e-03	3.41e-03	2.75e-03	2.67e-03
Coverage
ψˆ0	9.58e-01	9.48e-01	9.42e-01	9.88e-01	9.48e-01	8.40e-01	9.36e-01
ψˆ1	9.74e-01	9.52e-01	9.38e-01	6.60e-02	9.28e-01	9.42e-01	9.30e-01
ψˆ2	9.72e-01	9.58e-01	9.52e-01	0.00e+00	9.54e-01	6.46e-01	9.60e-01
Coverage truevarCI
ψˆ0	9.50e-01	9.50e-01	9.48e-01	8.66e-01	9.52e-01	8.36e-01	9.48e-01
ψˆ1	9.46e-01	9.54e-01	9.52e-01	0.00e+00	9.56e-01	9.38e-01	9.46e-01
ψˆ2	9.56e-01	9.54e-01	9.56e-01	0.00e+00	9.56e-01	6.58e-01	9.60e-01

5 Summary

In this paper, we have presented a doubly robust and efficient substitution estimator that directly targets a longitudinal dynamic (or static) marginal structural model for the hazard function. This work builds upon the pooled TMLE methodology in Petersen et al. [23] for the survival function, as well as earlier work by Robins [15], Robins [17] and Bang and Robins [18].

Unlike TMLE for MSM for survival functions or means, TMLE for MSM for a hazard function is a bona fide iterated estimator and requires an initial estimator of the parameter of interest. There is no theoretical predictions on which initial estimator is more advantageous. The simulations have also offer little evidence in favor of either. This seems to suggests that the choice of initial estimator has little effect, at least in moderate and large sample sizes. In terms of computing costs, like the TMLE for survival MSM, the main computational resources for TMLE for hazard MSM are expended to obtain the initial estimators for Qkd,t(P0); these costs are heavily dependent on sample size and the algorithms used to fit these estimators, but only have to be performed once. Beyond those, each iteration of TMLE updates fit O(K2) logistical models of dimension J. TMLE for survival MSM only performs one iteration of such updates, whereas TMLE for hazard MSM may perform more than one (though in our simulation two iterations was enough). Further simulations are needed to assess the actual added computational burden in situations with big datasets, large number of outcome times, or that require higher number of iterations.

The proposed TMLE estimator offers theoretical advantages over the popular IPTW estimator and a non-targeted substitution Gcomp estimator: 1) it offers protection against model misspecifications, 2) it is locally efficient, 3) it may also provide protection against unstable probability weights. The bias reduction over a misspecified IPTW or Gcomp estimator is clear in the simulation studies even for a moderate sample size. We also see that the influence curve based variance estimate of IPTW and TMLE tend to be conservative estimates (underestimate) of the true variance, but this variance estimate can be improved with increased sample size. As mentioned in Section 3.4, under certain empirical process conditions, these influence-curve based estimators provide conservative variances estimates when only one of the nuisance parameters are correctly specified. Such conditions depend on the rates of convergence for the nuisance parameter estimates and the complexity of the correctly specified model. In the case of IPTW, the consistencies of both the parameter estimate and the variance estimate hinge on correct specification of the treatment mechanisms. Influence curve based estimates provide a viable alternative or complement to bootstrapping when the computing burden of the latter is high. But their empirical process conditions also suggest that a more sophisticated theory-based variance estimates, as well as appropriate diagnostics, are needed for the TMLE and the IPTW estimator. Future research priorities should focus on variance estimation and inference methods that remain resilient in the face of moderate confounding and multiple time points.

Acknowlegements

We thank Josh Schwab for his help in the software implementation. This work was supported by R01 A1074345-07 (P.I. M. van der Laan).

Paper Appendix

Derivation of efficient influence curve

In order to apply the functional delta method, we first rewrite the characterizing eq. (14) as

(18)U(ψ,Q,P)=EP∑t∈τ,d∈dh˜(d,t,V)1−Q1d,t−1(L0)Q1d,t(L0)−Q1d,t−1(L0)1−Q1d,t−1(L0)−mψ(d,t,V)=EP∑t∈τ,d∈dh˜(d,t,V)Q1d,t(L0)−Q1d,t−1(L0)−mψ(d,t,V)+mψ(d,t,V)Q1d,t−1(L0)=EP∑t∈τ,d∈dJψ(d,t,V)Q1d,t(L0)−h˜(d,t,V)mψ(d,t,V).

Since 0=U(Ψ(P),Q,P), it follows from implicit differentiation that

dΨ(P)dP=−dU(Ψ(P),Q,P)dΨ(P)−1dU(ψ,Q,P)dP.

Straightforward computations shows that indeed

−dU(ψ,Q,P)dψ=EP∑d∈d∑t=1τh(d,t,V)1−Q1d,t−1(L0)mψ(d,t,V)(1−mψ(d,t,V))ϕ(d,t,V)ϕ(d,t,V)T.

This proves the expression for M(ψ,Q,P) in eq. (15).

Next, note that dU(ψ,Q,P)dP=dU(ψ,Q,P)dQL0dQL0dP+∑t∈τ∑d∈ddU(ψ,Q,P)dQ1d,tdQ1d,tdP. can be obtained by a simple application of the functional delta method method, using the efficient influence functions for QL0 and Q1d,t obtained in [23]. Therefore, we have

DL0(ψ,Q)(O)≡∑t∈τ∑d∈dh˜(d,t,V)1−Q1d,t−1(L0)Q1d,t(L0)−Q1d,t−1(L0)1−Q1d,t−1(L0)−mψ(d,t,V),

Dd,t(ψ,Q,g)(O)=Jψ(d,t,V)∑k=1tI(Ak−1=d(Lk−1))gkAk−1=d(Lk−1)|Lk−1Qk+1d,t(Lk)−Qkd,t(Lk−1).

Now, we show the robustness property. If Q=Q0, the result is trivial by definition of Ψ(Q) and Qkt. We only need to check the second case. When g=g0, at each t, the sum from k=1 to k=t forms a telescopic sum, leaving only the first and the last term. Using expression in eq. (18) to rewrite DL0(Ψ(Q),Q), we have, up to a constant normalizing matrix,

0=P0D*(Ψ(Q),Q,g0)=P0∑t=1τ∑d∈DJΨ(Q)(d,t,V)Q1,0d,t(L0)−P0∑t=1τ∑d∈DJΨ(Q)(d,t,V)Q1d,t(L0)+P0∑t=1τ∑d∈DJΨ(Q)(d,t,V)Q1d,t(L0)−h˜(d,t,V)mΨ(Q)(d,t,V)=P0∑t=1τ∑d∈DJΨ(Q)(d,t,V)Q1,0d,t(L0)−h˜(d,t,V)mΨ(Q)(d,t,V)=U(Ψ(Q),Q0,P0).

Therefore, Ψ(Q)=Ψ(Q0).

Data Generating Distribution for the Simulation Study

The baseline covariate W consists of W=(W1,W2,W3,W4), where W1=I(30≤age≤39), W2=I(age>39), W3 indicates sex, and W4=disease stage.

Let ε1,t be errors drawn from a standard normal distribution. The data generating distributions are given as follows:

W1∼Bern(0.3),(W2|W1=0)∼Bern(0.5),W3∼Bern(0.5),W4∼Bern(0.3);P(Yt=1|W,CD4t−1,At−11,Yt−1=0)={0 if t=0;expit−2−0.5W1−0.1W2+0.2W3+0.3W4−0.7CD4t−1−0.9At−11,else.CD4t=maxmin(μ(W,CD4t−1,At−11)+ε1,t,2),−2,where μ(W,CD4t−1,At−11)={−W4 if t=0,0.1W1−0.5W2−0.1W3−0.2W4+0.4CD4t−1+0.5At−11−0.1CD4t−1At−11,else;P(At2=1|W,CD4t,At−12=0,Yt=0)=1−expit1.5+0.1W1+0.2W2+0.1W3+0.1W4+0.1CD4tP(At1=1|At−11,W,CD4t,At2=0,Yt=0)={1,if t≥1 and At−11=1;expit−0.5−0.1W1−0.2W2−0.2W3+0.1W4−0.4CD4t,else.

References

1. Robins JM, Hernan MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology 2000b;11:550–60.10.1097/00001648-200009000-00011Search in Google Scholar

2. Robins JM. Marginal structural models Proceedings of the American Statistical Association. Section on Bayesian Statistical Science, 1997, 1–10.Search in Google Scholar

3. Robins J. A new approach to causal inference in mortality studies with sustained exposure periods – application to control of the healthy worker survivor effect. Math Model 1986;7:1393–512.10.1016/0270-0255(86)90088-6Search in Google Scholar

4. Pearl J. Causality: models, reasoning and inference, 2nd ed. New York: Cambridge University Press, 2009.10.1017/CBO9780511803161Search in Google Scholar

5. Petersen M, van der Laan M, Napravnik S, Eron J, Moore R, Deeks SG. Long term consequences of the delay between virologic failure of highly active antiretroviral therapy and regimen modification. AIDS 2008;22:2097–106.10.1097/QAD.0b013e32830f97e2Search in Google Scholar PubMed

6. Petersen M, Tran L, Geng E, Reynolds S, Kambugu A, Wood R, et al. Delayed switch of antiretroviral therapy after virologic failure associated with elevated mortality among HIV-infected adults in Africa. AIDS 2014b;28:2097–107.10.1097/QAD.0000000000000349Search in Google Scholar PubMed PubMed Central

7. van der Laan M, Petersen M. Causal effect models for realistic individualized treatment and intention to treat rules. Int J Biostat 2007;3:Article 3.10.2202/1557-4679.1022Search in Google Scholar PubMed PubMed Central

8. Hernan M, Brumback B, Robins J. Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. Epidemiology 2000;11:561–70.10.1097/00001648-200009000-00012Search in Google Scholar PubMed

9. Cole S, Hudgens M, Tien P, Anastos K, Kingsley L, Chmiel J, et al. Marginal structural models for case-cohort study designs to estimate the association of antiretroviral therapy initiation with incident aids or death. Am J Epidemiol 2012;175:381–90.10.1093/aje/kwr346Search in Google Scholar PubMed PubMed Central

10. Cain L, Robins JM, Lanoy E, Logan R, Costagliola D, Hernan M. When to start treatment? A systematic approach to the comparison of dynamic regimes using observational data. Int J of Biostat 2010;6:Article 18.10.2202/1557-4679.1212Search in Google Scholar PubMed PubMed Central

11. HIV-CAUSAL-Collaboration, Cain L, Logan R, Robins J, Sterne J, Sabin C, et al. When to initiate combined antiretroviral therapy to reduce mortality and aids-defining illness in HIV-infected persons in developed countries: an observational study. Ann Intern Med 2011;154:509–15.10.7326/0003-4819-154-8-201104190-00001Search in Google Scholar PubMed PubMed Central

12. Robins J. Marginal structural models versus structural nested models as tools for causal inference. In: Halloran E, Berry D. editors. Statistical models in epidemiology: the environment and clinical trials. Springer-Verlag, 1999:95–133.Search in Google Scholar

13. Robins JM, Orellana L, Rotnitzky A. Estimaton and extrapolation of optimal treatment and testing strategies. Stat Med 2008;27(23):4678–721.10.1002/sim.3301Search in Google Scholar PubMed

14. Robins J, Rotnitzky A. Recovery of information and adjustment for dependent censoring using surrogate markers. In: Jewell N, Dietz K, Farewell V, editors. AIDS epidemiology. Bikhäuser: Methodological issues, 1992:297–331.10.1007/978-1-4757-1229-2_14Search in Google Scholar

15. Robins J. Robust estimation in sequentially ignorable missing data and causal inference models. In Proceedings of the American Statistical Association, 2000.Search in Google Scholar

16. Robins J, Rotnitzky A, van der Laan M. Comment on “on profile Likelihood” by S.A. Murphy and A.W. van der Vaart. J Am Stat Assoc Theory Methods 2000a;450:431–5.10.2307/2669391Search in Google Scholar

17. Robins J. Commentary on “using inverse weighting and predictive inference to estimate the effects of time-varying treatments on the discrete-time hazard. Stat Med 2002;210:1663–80.10.1002/sim.1110Search in Google Scholar

18. Bang H, Robins JM. Doubly robust estimation in missing data and causal inference models. Biometrics 2005;61:962–72.10.1111/j.1541-0420.2005.00377.xSearch in Google Scholar PubMed

19. van der Laan M, Rubin D. Targeted maximum likelihood learning. Int J Biost 2006;2:1–40.10.2202/1557-4679.1043Search in Google Scholar

20. Gruber S, van der Laan M. A targeted maximum likelihood estimator of a causal effect on a bounded continuous outcome. Int J Biostat 2010;6:Article 26.10.2202/1557-4679.1260Search in Google Scholar PubMed PubMed Central

21. Schnitzer ME, Moodie E, van der Laan MJ, Platt R, Klein MB. Modeling the impact of hepatitis c viral clearance on end-stage liver disease in an HIV co-infected cohort with targeted maximum likelihood estimation. Biometrics 2014;70:144–52.10.1111/biom.12105Search in Google Scholar PubMed PubMed Central

22. van der Laan MJ, Gruber S. Targeted minimum loss based estimation of causal effects of multiple time point interventions. The International Journal of Biostatistics 2012;8:Article 8.10.1515/1557-4679.1370Search in Google Scholar PubMed

23. Petersen M, Schwab J, Gruber S, Blaser N, Schomaker M, van der Laan M. Targeted maximum likelihood estimation for dynamic and static longitudinal marginal structural working models. J Causal Inference 2014a;2:147–85.10.1515/jci-2013-0007Search in Google Scholar PubMed PubMed Central

24. DHHS.Guidelines for the use of antiretroviral agents in HIV-1-infected adults and adolescents. Department of health and human services, 2014. http://aidsinfo.nih.gov/contentfiles/lvguidelines/adultandadolescentgl.pdf.Search in Google Scholar

25. Neugebauer R, van der Laan M. Nonparametric causal effects based on marginal structural models. J Stat Plan Inference 2007;137:419–34.10.1016/j.jspi.2005.12.008Search in Google Scholar

26. Taubman S, Robins J, Mittleman M, Hernan MA. Intervening on risk factors for coronary heart disease: an application of the parametric g-formula. Int J Epidemiol 2009;380:1599–611.10.1093/ije/dyp192Search in Google Scholar PubMed PubMed Central

27. Young JG, Cain LE, Robins JM, OReilly EJ, Hernán MA. Comparative effectiveness of dynamic treatment regimes: an application of the parametric g-formula. Stat Biosci 2011;3:119–43.10.1007/s12561-011-9040-7Search in Google Scholar PubMed PubMed Central

28. van der Laan M, Robins J. Unified methods for censored longitudinal data and causality. New York: Springer, 2003.10.1007/978-0-387-21700-0Search in Google Scholar

29. Bickel P, Klaassen C, Ritov Y, Wellner J. Efficient and adaptive estimation for semiparametric models. New York: Springer-Verlag, 1997.Search in Google Scholar

30. van der Vaart A, Wellner J. Weak convergence and empirical processes. New York: Springer-Verlag, 1996.10.1007/978-1-4757-2545-2Search in Google Scholar

31. Tsiatis A. Semiparametric theory and missing data. New York: Springer, 2006Search in Google Scholar

32. van der Laan M, Rose S. Targeted Learning: Causal Inference for Observational and Experimental Data, Springer Series in Statistics, Springer, first edition, 2011.10.1007/978-1-4419-9782-1Search in Google Scholar

Published Online: 2016-5-26

Published in Print: 2016-5-1

Articles in the same Issue

https://doi.org/10.1515/ijb-2015-0036

Keywords for this article

targeted maximum likelihood estimation; inverse probability weighting; doubly robust; efficient; marginal structural models