Targeted maximum likelihood based estimation for longitudinal mediation analysis

Zeyi Wang; Lars van der Laan; Maya Petersen; Thomas Gerds; Kajsa Kvist; Mark van der Laan

doi:10.1515/jci-2023-0013

Artikel Open Access

Targeted maximum likelihood based estimation for longitudinal mediation analysis

Zeyi Wang , Lars van der Laan , Maya Petersen , Thomas Gerds , Kajsa Kvist und Mark van der Laan

Veröffentlicht/Copyright: 31. Januar 2025

Veröffentlicht von

Veröffentlichen auch Sie bei De Gruyter Brill

Manuskript einreichen Informationen für Autor*innen

Aus der Zeitschrift Journal of Causal Inference Band 13 Heft 1

Abstract

Causal mediation analysis with random interventions has become an area of significant interest for understanding time-varying effects with longitudinal and survival outcomes. To tackle causal and statistical challenges due to the complex longitudinal data structure with time-varying confounders, competing risks, and informative censoring, there exists a general desire to combine machine learning techniques and semiparametric theory. In this article, we focus on targeted maximum likelihood estimation (TMLE) of longitudinal natural direct and indirect effects defined with random interventions. The proposed estimators are multiply robust, locally efficient, and directly estimate and update the conditional densities that factorize data likelihoods. We utilize the highly adaptive lasso (HAL) and projection representations to derive new estimators (HAL-EIC) of the efficient influence curves (EICs) of longitudinal mediation problems and propose a fast one-step TMLE algorithm using HAL-EIC while preserving the asymptotic properties. The proposed method can be generalized for other longitudinal causal parameters that are smooth functions of data likelihoods, and thereby provides a novel and flexible statistical toolbox.

Keywords: longitudinal mediation analysis; stochastic intervention; random intervention; targeted maximum likelihood estimation; efficient influence curve; efficient estimator; highly adaptive lasso

MSC 2010: 62G05; 62G20

1 Introduction

There is an increasing need of methods that can analyze mechanisms of temporally varying effects, for example, in clinical trials and studies of electronic health record data [1–3]. For example, a weight management program may involve repeated scheduled visits where measurements of body mass index, blood pressure, cholesterol levels, and other health conditions are collected. One may want to analyze the proportion of the effect of weight loss on cholesterol (continuous outcome) or risks of cardiovascular events (time-to-event outcomes) that is mediated by how well blood pressure is under control. It may further be of interest to study how this proportion changes over time. Mediation analysis provides estimands that address these questions. However, traditional mediation analysis faces identification challenges (see, e.g., the discussion of natural effects being nonparametrically nonidentifiable in presence of “recanting witnesses” in [4]). Generally, longitudinal mediation is a challenging problem [4–9] in the presence of time-dependent confounders when the longitudinal causal mediation target parameters, such as natural direct and indirect effects (NDEs and NIEs, respectively), are defined with counterfactual interventions, in which the mediator is set (deterministically) to a counterfactual value.

Mediation analysis based on random (or stochastic) interventions has been proposed as an alternative, [7–12]. Instead of deterministically enforcing specific “cross-world” counterfactual mediator values in structural causal models [13], the random interventions define causal targets through hypothetical interventions that draw mediators from stochastic distributions. For example, in the single time point case, defining mediation parameters by enforcing the counterfactual mediator distributions provides an alternative identification approach that results in the same g-computation formula of traditionally defined NDE and NIE. This approach further provides a straightforward generalization of NDE and NIE to longitudinal settings [7,8,10,12]. We focus on these generalized longitudinal mediation target parameters (referred to as longitudinal NDE and NIE defined by random intervention within this article) as an example to illustrate our estimation method, but note that causal parameters defined using random (i.e., stochastic) interventions cover a wider range of statistical problems. Our method applies to these other corresponding g-computation formulas as well; some examples are discussed in Appendix A.1. In addition, another excellent line of literature [9,14–16] provides various related developments and in-depth discussions following the previous studies [10,12], where the definition of randomized interventional direct and indirect effects can lead to equivalent g-computation formulas as our longitudinal NDE and NIE examples in Section 3. Finally, NDE and NIE analyze the mechanistic pathways of average effects defined on the population level, which may or may not provide valid individualized sharp nulls without stringent assumptions such as cross-world assumptions; we refer interested readers to the study by Miles [17] for more details.

Such random intervention-based generalization is crucial for longitudinal mediation analysis, because it allows flexible time-varying confounding adjustment, whereas nonparametric identification fails for natural effects using static interventions with the existence any posttreatment mediator-outcome confounders [4]. In the nonlongitudinal settings without mediator-outcome confounders impacted by intervention (the so-called recanting witnesses), random intervention targets can be identified with the same g-computation formulas known from the corresponding classical mediation parameter definitions. However, random intervention targets allow us to relax the cross-world identification assumptions for NDE and NIE, i.e., the untestable conditional independence between counterfactual mediators and outcomes defined with different static interventions [18].

In this article, we develop the targeted likelihood-based method [19,20] for longitudinal mediation parameters and construct targeted maximum likelihood estimations (TMLEs). We derive conditions under which the TMLEs become consistent and asymptotically linear. We also provide a projection representation (HAL-EIC) for the efficient influence curves (EICs) for longitudinal mediation problems and use it to derive a fast one-step TMLE algorithm. Throughout the article, we will focus on the longitudinal analyses of NIE and NDE, but the methodology immediately applies to controlled direct effects (CDEs) (Appendix A.1). This flexibility is because the g-computation formulas for these target parameters can be represented as functions of the same observed data likelihood. To explain the main idea of our targeted likelihood-based method for multivariate target parameters, consider a discrete time scale with K time points. Let O = ( X 0 , X 1 , … , X K ) denote the observed data and let O 1 , … , O n ∼ P 0 be an i.i.d. sample, and let p = d P ⁄ d μ be the density function with respect to a dominating measure for a distribution P . A g-computation formula can be labeled by the counterfactual distribution P ( g ) of O under an intervention g , where P ( g ) is identified as a function of the factorized observed data likelihood p 0 ( O ) = p 0 , X 0 ( X 0 ) ∏ k = 1 K p 0 , X k ( X k ∣ X 0 , … , X k − 1 ) . Suppose that the aim of the analysis is not just a single target parameter but a set of target parameters { Ψ s ( P 0 ) , s = 1 , … , S } , where S is the total number of target parameters. This occurs when there are multiple interventions and/or multiple endpoints are of interest; each combination of an intervention and an outcome defines a target parameter. We require each target parameter to be pathwise differentiable and denote D s ( P ) for the EIC of Ψ s . The list of target parameters could contain the NIE and NDE for more than one intervention and for a sequence of evaluation time points. Our two-stage targeted likelihood-based estimation approach thus starts with an initial estimate of the full likelihood p n 0 of p 0 and then searches for an updated estimate of the likelihood p n * , which solves the EIC equations P n D s ( p n * ) = 0 , s = 1 , … , S of all target parameters simultaneously. We show that the plug-in estimators Ψ s ( p n * ) , s = 1 , … , S (TMLEs) derived from the same updated estimate of the likelihood p n * are consistent and asymptotically efficient. We also show that the TMLEs respect the parameter space of all targets simultaneously, that is, in the sense of the study by Robins et al. [21], the estimates stay in the range of all parameter mappings implied by the statistical model.

We argue that our unified likelihood-based approach for longitudinal mediation parameters has advantages compared to the recent work [9] that focuses on restricted marginal models where natural effects no longer decompose the total effect and are not readily suitable for survival outcomes, or sequential regression components [8], with iteratively defined conditional expectations similar to the study by Bang and Robins [22], where are not shared across different target parameters. As pointed out earlier, our full likelihood-based approach is natural and flexible for simultaneous analysis across multiple time points and multiple targets. In settings with time-varying confounders, survival outcomes, informative right censoring, and competing risks, the plug-in estimators derived from the same updated likelihood will respect the parameter space of all the targets. The alternative sequential regression approach may significantly reduce the computational cost of estimation and may also respect the parameter space [21], but only in each of its dimensions marginally. This makes it more difficult to conduct loss-based collaborative adjustment with a sequence of candidate propensity score models and the corresponding TMLE updates [23]. With likelihood-based estimation, the loss and adjustment procedure can be naturally defined even for multivariate targets by minimizing negative log-likelihood losses of the same set of conditional density factors. Identification assumptions and multiple robustness conditions are more communicable when the targets of inference are identified and represented as functions of the observed data distributions without additional iterated definitions and nonintuitive restrictions. We note that the conditions for multiple robustness in Section 7 involve only conditional densities of observed data, and thus our approach avoids the need to specify models for sequential regression. Simple and direct robustness conditions are more suitable for conceptual verification, which lead to more effective use of experts’ data knowledge and improved reliability of research.

This article presents a novel contribution to the field of longitudinal mediation analysis through the proposed HAL-EIC representation and the development of a fast one-step TMLE algorithm using HAL-EIC. The highly adaptive lasso (HAL) [24–26] is a general maximum likelihood estimator (MLE) for d -variate real-valued cadlag functions bounded in the sectional variation norm, which converges to the true function at a rate of n − 2 ⁄ 3 ( log n ) d in loss-based dissimilarity. Recent developments [27] have demonstrated that HAL estimators offer a promising alternative representation of EICs for many pathwise differentiable target parameters. We propose the use of the HAL-EIC representation for longitudinal mediation problems and develop a fast one-step TMLE algorithm using HAL-EIC while preserving the asymptotic properties.

The outline of this article is as follows. Section 2 introduces the notation and the data structure for discrete time mediation analysis with outcomes that may be suitable for the proposed method, such as survival outcomes, right censoring, and competing risks. Section 3 defines the causal mediation target parameters, and identifies them as statistical targets using g-computation formulas, and provides the EICs. Sections 4–5 discusses the two-stage TMLE procedure. In Section 4, we briefly discuss the steps and considerations for constructing initial density estimators. In Section 5, we give the iterative and one-step TMLE algorithms [28] and discuss the regularity conditions for local efficiency. We review a method of simultaneous inference on multivariate target parameters introduced by [29] in our setting. In Section 6, we propose a projection representation (HAL-EIC) for the EICs using the HAL [24,30] and propose the HAL-EIC-based one-step TMLE algorithm. In the following sections, we analyze the multiple robustness properties [8,31–33] using simulated data and show results of the proposed algorithms under finite-sample challenges such as near-violation of the positivity assumptions.

2 Data and model

On a discrete time scale, { 0 , 1 , … , K } , we denote for all 0 ≤ t ≤ K by X ¯ t a vector of random variables measured until time t , i.e., X ¯ t = ( X k : k ∈ { 0 , … , t } ) . Similarly, for k ≤ t , we define X ¯ k t = ( X k , … , X t ) . Throughout, we consider the following data structure:

O = ( L 0 , A 1 , R 1 , Z 1 , L 1 , … , A t , R t , Z t , L t … , A K , R K , Z K , L K ) ,

where each node is a random variable or a random vector. L 0 are the baseline covariates, A ¯ K are treatment variables, Z ¯ K are time-varying mediators, and R ¯ K and L ¯ K are time-varying variables before and after the mediator nodes. The vector ( A t , R t , Z t , L t ) is the observation at time point t . Given this time order, we denote by Pa ( X t ) the parent variables preceding X t and by Ch ( X t ) the child variables after X t . Similar notations are used for parent and child nodes of a realization o = ( l 0 , a 1 … , l K ) in the range of O . For example, Pa ( L 1 ) = ( L 0 , A 1 , R 1 , Z 1 ) , whereas Ch ( a K ) = ( r K , z K , l K ) . In the context of a static intervention a ¯ K , which sets treatment values, we denote by Pa ( X t ∣ a ¯ K ) the parent nodes of X t under the intervention A ¯ K = a ¯ K , e.g., Pa ( L k ∣ a ¯ k ) = ( L 0 , a 1 , R 1 , Z 1 , L 1 , … , a k , R k , Z k ) , which is an ordered vector of observed random variables and imputed treatment values. The outcome of interest is defined as a vector of functions of L ¯ K , that is, Y ¯ = ψ ¯ ( L ¯ K ) for a multidimensional functional ψ ¯ = ( ψ s : s = 1 , … , S ) . Examples include cumulative measures or survival status at different time points. A simple special case is a univariate variable Y measured at the last time point such that Y ∈ L K .

Let ℳ be a statistical model for the distribution of O , which is dominated by a measure μ such that for P ∈ ℳ the density is given by p = d P ⁄ d μ . For clarity of representation, we assume μ to be a counting measure so that integrals such as (2) and (3) are simplified as summations, but the results generalize to continuous probability distributions dominated by Lebesgue measures and corresponding integrals. We identify P ∈ ℳ with the corresponding density p and treat distributions with identical densities as equivalent. We consider n i.i.d. copies O 1 , … , O n of O that follow the true data generating distribution P 0 ∈ ℳ . The joint data likelihood can be factorized by the conditional densities,

(1) p ( O ) = p L 0 ( L 0 ) ∏ k = 1 K [ p A k ( A k ∣ Pa ( A k ) ) p R k ( R k ∣ Pa ( R k ) ) p Z k ( Z k ∣ Pa ( Z k ) ) p L k ( L k ∣ Pa ( L k ) ) ] .

We use the notation P f ( O ) = E P [ f ( O ) ] for the expectation with respect to P and likewise P n f ( O ) = 1 n ∑ i = 1 n f ( O i ) for the empirical distribution P n .

2.1 Structural causal model and random interventions

Consider the following structural causal model [13,34], where the randomness of the observed data O is captured by the so-called exogenous nodes U X ’s, and the f X mappings are deterministic for each of the endogenous variables:

X t = f X t ( Pa ( X t ) , U X t ) .

For some static treatment A ¯ K = a ¯ K , we define counterfactuals by static interventions on the structural causal model.

L 0 = f L 0 ( U L 0 ) , A t = a t R t ( a ¯ ) = f R t ( a ¯ t , R ¯ t − 1 ( a ¯ ) , Z ¯ t − 1 ( a ¯ ) , L ¯ t − 1 ( a ¯ ) , U R t ) Z t ( a ¯ ) = f Z t ( a ¯ t , R ¯ t ( a ¯ ) , Z ¯ t − 1 ( a ¯ ) , L ¯ t − 1 ( a ¯ ) , U Z t ) L t ( a ¯ ) = f L t ( a ¯ t , R ¯ t ( a ¯ ) , Z ¯ t ( a ¯ ) , L ¯ t − 1 ( a ¯ ) , U L t ) .

For a pair of static treatment and control interventions, a ¯ and a ¯ ′ , denote by Γ t a ¯ ′ the conditional density of the control intervention counterfactual Z t ( a ¯ ′ ) given parent nodes under a ¯ ′ , that is, Γ t a ¯ ′ ( z t ∣ r ¯ t , z ¯ t − 1 , l ¯ t − 1 ) = p ( Z t ( a ¯ ′ ) = z t ∣ R ¯ t ( a ¯ ′ ) = r ¯ t , Z ¯ t − 1 ( a ¯ ′ ) = z ¯ t − 1 , L ¯ t − 1 ( a ¯ ′ ) = l ¯ t − 1 ) . The vector Γ ¯ a ¯ ′ = ( Γ t a ¯ ′ : t = 1 , … , K ) thus represents a sequence of counterfactual conditional densities for the mediator process under the control intervention. We define random intervention counterfactuals under ( a ¯ , Γ ¯ a ¯ ′ ) by forcing mediator variables to follow the distributions of the control intervention counterfactuals, Z t ( a ¯ , Γ ¯ a ¯ ′ ) ∼ Γ t a ¯ ′ , and for nodes X t ∈ R ¯ ∪ L ¯ inserting A ¯ t = a ¯ t so that X t ( a ¯ , Γ ¯ a ¯ ′ ) = f X t ( Pa ( X t ( a ¯ , Γ ¯ a ¯ ′ ) ∣ a ¯ ) , U X ) . That is,

L 0 ( a ¯ , Γ ¯ a ¯ ′ ) = f L 0 ( U L 0 ) , A t = a t R t ( a ¯ , Γ ¯ a ¯ ′ ) = f R t ( a ¯ t , R ¯ t − 1 ( a ¯ , Γ ¯ a ¯ ′ ) , Z ¯ t − 1 ( a ¯ , Γ ¯ a ¯ ′ ) , L ¯ t − 1 ( a ¯ , Γ ¯ a ¯ ′ ) , U R t ) Z t ( a ¯ , Γ ¯ a ¯ ′ ) ∼ Γ t a ¯ ′ ( z ∣ R ¯ t ( a ¯ , Γ ¯ a ¯ ′ ) , Z ¯ t − 1 ( a ¯ , Γ ¯ a ¯ ′ ) , L ¯ t − 1 ( a ¯ , Γ ¯ a ¯ ′ ) ) L t ( a ¯ , Γ ¯ a ¯ ′ ) = f L t ( a ¯ t , R ¯ t ( a ¯ , Γ ¯ a ¯ ′ ) , Z ¯ t ( a ¯ , Γ ¯ a ¯ ′ ) , L ¯ t − 1 ( a ¯ , Γ ¯ a ¯ ′ ) , U L t ) .

In this framework, a causal target parameter that describes the mediation of treatment effects on an outcome Y is denoted by E [ Y ( a ¯ , Γ ¯ a ¯ ′ ) ] . Then the decomposition of the total effect into a NIE and a NDE is given by

E [ Y ( a ¯ , Γ ¯ a ¯ ) ] − E [ Y ( a ¯ ′ , Γ ¯ a ¯ ′ ) ] = ( E [ Y ( a ¯ , Γ ¯ a ¯ ) ] − E [ Y ( a ¯ , Γ ¯ a ¯ ′ ) ] ) + ( E [ Y ( a ¯ , Γ ¯ a ¯ ′ ) ] − E [ Y ( a ¯ ′ , Γ ¯ a ¯ ′ ) ] ) .

To explain the cross-world assumptions, we also define counterfactuals X ( a ¯ , z ¯ ) with static interventions A ¯ = a ¯ and Z ¯ = z ¯ . Note that if after a random draw under ( a ¯ , Γ ¯ a ¯ ′ ) we have that Z ¯ ( a ¯ , Γ ¯ a ¯ ′ ) = z ¯ , then the structural causal model implies X ( a ¯ , Γ ¯ a ¯ ′ ) = X ( a ¯ , z ¯ ) . However, defining NIE and NDE without random intervention usually requires additional assumptions such as conditional independence between L ( a ¯ , z ¯ ) and Z ( a ¯ ′ ) across the two counterfactual worlds. Such “cross-world” assumptions are not desirable because they are not verifiable, neither empirically nor conceptually [35]. Cross-world assumptions are also incompatible with time-varying covariates R ¯ . On the other hand, the random intervention framework does not require “cross-world” assumptions. We focus on static treatment rules A ¯ = a ¯ or a ¯ ′ , but the methodology can be generalized to dynamic treatment regimes, where the treatment and control interventions on A t are decided by deterministic functions of the available history Pa ( A t ) .

2.2 Right censored survival outcomes

Suppose T is the time point where an event of interest happens. Let Y t = I T > t ∈ L t be the monotone process of staying event free in the study at the t th time point. The counting process Y t starts with 1 at t = 0 and only jumps to 0 if an event happens at the time point t . If an event happens at the time point t , then for all X ∈ Ch ( Y t ) \ Y ¯ , conditional on Y t = 0 , we set X to be a degenerated discrete variable such that X = ∅ with conditional probability 1. The outcome of interest can be the survival beyond the study length K , in which case the target parameters take the form of

E [ Y K ( a ¯ , Γ ¯ a ¯ ′ ) ] = P ( T ( a ¯ , Γ ¯ a ¯ ′ ) > K ) .

In real applications, one typically only observes T ˜ = min { T , C } , where C is the censoring time. To incorporate censored data, we create bivariate treatment nodes A t = ( A t C , A t E ) that consist of the monotone process of remaining uncensored, A t C = I C > t , and the treatment assignment, A t E . The process A t C starts with value 1 and jumps to 0 at the time point where censoring occurs. Conditioning on C t = 0 , for all X ∈ Ch ( C t ) , we set X to be a discrete variable that equals a degenerated value ∅ with conditional probability 1; i.e., no information is available for the observed data likelihood after censoring occurs.

2.3 Competing risks

Consider a competing events framework [36–38], where at the time T where an individual reaches one of several absorbing states the event type Δ ∈ { 1 , 2 , … , J } is observed. For example, Δ = 1 may indicate the onset of a cancer, and Δ = 2 death due to other causes. One can define a multi-dimensional counting process in discrete time, e.g., when J = 2 by { Y t = ( N t ( 1 ) , N t ( 2 ) ) : t = 1 , … , K } , such that N t ( 1 ) = I T ≤ t , Δ = 1 , N t ( 2 ) = I T ≤ t , Δ = 2 . Note that 1 − N t ( 1 ) − N t ( 2 ) = I T > t now indicates that the individual is alive and event-free. Suppose Y t ∈ L t . For Y t = ( 0 , 0 ) , we set X = ∅ for any X ∈ Ch ( Y t ) \ Y ¯ with probability 1. If N t ( j ) = 1 for the j th type of event, we additionally set N t ′ ( j ) = 1 and N t ′ ( j ′ ) = ∅ for all j ′ ≠ j , t ′ ≥ t . The target parameter in a competing risk framework is typically multidimensional and can be the risks of all events under ( a ¯ , Γ ¯ a ¯ ′ ) across all time points, E [ Y ¯ ( a ¯ , Γ ¯ a ¯ ′ ) ] = ( E [ N t ( j ) ( a ¯ , Γ ¯ a ¯ ′ ) ] : t = 1 , … , K ; j = 1 , … , J ) .

3 Target parameters

3.1 NDE and NIE

We define NDE and NIEs for a multi-dimensional outcome of interest Y ¯ = ψ ¯ ( L ¯ K ) with random interventions as defined in Section 2:

NIE: E [ Y ¯ ( a ¯ , Γ ¯ a ¯ ) − Y ¯ ( a ¯ , Γ ¯ a ¯ ′ ) ] , NDE: E [ Y ¯ ( a ¯ , Γ ¯ a ¯ ′ ) − Y ¯ ( a ¯ ′ , Γ ¯ a ¯ ′ ) ] .

Examples for the choice of Y ¯ include: Y ∈ L K for an univariate outcome variable, see Section 2.1, Y K = I T > K ∈ L K for censored survival endpoints, see Section 2.2, and Y ¯ = ( N t ( 1 ) , N t ( 2 ) : t = 1 , … , K ) for competing risk outcomes, see Section 2.3.

NDE is the direct effect of a treatment while forcing mediators to have the same distribution as their control group counterfactuals. NIE is the indirect treatment effect achieved by not changing treatment values but by changing mediator distributions. The structural causal model of Section 2.1 implies that NDE and NIE decompose the total effect, which can be defined with or without random intervention:

NIE + NDE = E [ Y ¯ ( a ¯ , Γ ¯ a ¯ ) ] − E [ Y ¯ ( a ¯ ′ , Γ ¯ a ¯ ′ ) ] = E [ Y ¯ ( a ¯ ) ] − E [ Y ¯ ( a ¯ ′ ) ] .

3.2 Identification

To identify the causal mediation targets as statistical parameters of the observed data distribution, we adopt the following assumptions from [8]. For any random intervention ( a ¯ , Γ ¯ a ¯ ′ ) of interest:

(A1) Sequential exchangeability: R ¯ t K ( a ¯ ′ ) , Z ¯ t K ( a ¯ ′ ) , L ¯ t K ( a ¯ ′ ) , R ¯ t K ( a ¯ , z ¯ ) , L ¯ t K ( a ¯ , z ¯ ) ⊥ A t ∣ Pa ( A t ) .
(A2) Mediator randomization: R ¯ t + 1 K ( a ¯ , z ¯ ) , L ¯ t K ( a ¯ , z ¯ ) ⊥ Z t ∣ Pa ( Z t ) .
(A3) (Strong) Positivity: p 0 ( a t * ∣ Pa ( a t * ∣ a ¯ * ) ) > 0 if p 0 ( Pa ( a t * ∣ a ¯ * ) ) > 0 for a ¯ * = a ¯ or a ¯ ′ ; also, p 0 ( r t ∣ Pa ( r t ∣ a ¯ t ′ ) ) > 0 if p 0 ( r t ∣ Pa ( r t ∣ a ¯ t ) ) > 0 , p 0 ( l t ∣ Pa ( l t ∣ a ¯ t ′ ) ) > 0 if p 0 ( l t ∣ Pa ( l t ∣ a ¯ t ) ) > 0 , and p 0 ( z t ∣ Pa ( z t ∣ a ¯ t ) ) > 0 if p 0 ( z t ∣ Pa ( z t ∣ a ¯ t ′ ) ) > 0 .

Note that the consistency assumption [39] is implicitly made via the structural causal model. In the special case where Y ¯ = Y ∈ L k (for some k ∈ { 1 , … , K } ) is the univariate outcome variable, the assumptions are stronger than what is necessary and the necessary assumptions would be as given i the previous work. The following theorem is a direct application of Lemma 1 in the study by Zheng and van der Laan [8] to multivariate outcome Y ¯ = ( Y s : s ) on each dimension.

Theorem 3.1

(G-computation formula) Suppose that the multi-dimensional outcome of interest is Y ¯ = ( Y s : s = 1 , … , S ) , where Y s = ψ s ( L ¯ K ) is defined by measurable functions ψ s . For any two interventions a ¯ , a ¯ ′ under the identification assumptions (A1)–(A3) listed in Section 3.2, we have

(2) Ψ s a ¯ , a ¯ ′ ( P ) ≡ E [ Y s ( a ¯ , Γ ¯ a ¯ ′ ) ] = ∑ l ¯ ψ s ( l ¯ K ) p ( L ¯ K ( a ¯ , Γ ¯ a ¯ ′ ) = l ¯ K ) = ∑ l ¯ , z ¯ , r ¯ ψ s ( l ¯ K ) p ( L 0 = l 0 ) ∏ t = 1 K p L t ( l t ∣ Pa ( l t ∣ a ¯ t ) ) p Z t ( z t ∣ Pa ( z t ∣ a ¯ t ′ ) ) p R t ( r t ∣ Pa ( r t ∣ a ¯ t ) ) .

Also, for NIE and NDE with respect to each of the dimensions of the outcome,

NIE s ( P ) = Ψ s a ¯ , a ¯ ( P ) − Ψ s a ¯ , a ¯ ′ ( P ) NDE s ( P ) = Ψ s a ¯ , a ¯ ′ ( P ) − Ψ s a ¯ ′ , a ¯ ′ ( P ) .

3.3 Comparison with sequential regression

We have defined and identified our target parameters as functions of the observed data likelihood p , and we will focus on such full likelihood representation in the following sections. In this subsection, we compare with conditional expectation components. This reparametrization can be combined with density ratio estimation techniques [40–43] to achieve efficiency augmentation without estimating conditional densities.

For any pair of interventions a ¯ and a ¯ ′ , one can identify the postintervention distribution P a ¯ , a ¯ ′ of the counterfactual nodes under random interventions ( a ¯ , Γ ¯ a ¯ ′ ) given the assumptions in Section 3.2. That is, for any data realization o that enforces a value a ¯ K in the intervention nodes, we have

p a ¯ , a ¯ ′ ( O = o ) = p ( O ( a ¯ , Γ a ¯ ′ ) = o ) = p ( L ¯ K ( a ¯ , Γ a ¯ ′ ) = l ¯ K , Z ¯ K ( a ¯ , Γ a ¯ ′ ) = z ¯ K , R ¯ K ( a ¯ , Γ a ¯ ′ ) = r ¯ K ) = ∏ t = 1 K p L t ( l t ∣ Pa ( l t ∣ a ¯ t ) ) p Z t ( z t ∣ Pa ( z t ∣ a ¯ t ′ ) ) p R t ( r t ∣ Pa ( r t ∣ a ¯ t ) ) .

For each dimension s of the statistical target parameter, one can rewrite the g-computation formulas with the following sequence of regression expressions (the dependence of Q on P is suppressed except for the last equation):

Q s , R K + 1 a ¯ , a ¯ ′ = Y s = ψ s ( L ¯ K ) Q s , L t a ¯ , a ¯ ′ ( R ¯ t , Z ¯ t , L ¯ t − 1 ) = E P a ¯ , a ¯ ′ [ Q s , R t + 1 a ¯ , a ¯ ′ ∣ R ¯ t , Z ¯ t , L ¯ t − 1 ] = E P [ Q s , R t + 1 a ¯ , a ¯ ′ ∣ R ¯ t , Z ¯ t , L ¯ t − 1 , A ¯ t = a ¯ t ] Q s , Z t a ¯ , a ¯ ′ ( R ¯ t , Z ¯ t − 1 , L ¯ t − 1 ) = E P a ¯ , a ¯ ′ [ Q s , L t a ¯ , a ¯ ′ ∣ R ¯ t , Z ¯ t − 1 , L ¯ t − 1 ] = E P [ Q s , L t a ¯ , a ¯ ′ ∣ R ¯ t , Z ¯ t − 1 , L ¯ t − 1 , A ¯ t = a ¯ t ′ ] Q s , R t a ¯ , a ¯ ′ ( R ¯ t − 1 , Z ¯ t − 1 , L ¯ t − 1 ) = E P a ¯ , a ¯ ′ [ Q s , Z t a ¯ , a ¯ ′ ∣ R ¯ t − 1 , Z ¯ t − 1 , L ¯ t − 1 ] = E P [ Q s , Z t a ¯ , a ¯ ′ ∣ R ¯ t − 1 , Z ¯ t − 1 , L ¯ t − 1 , A ¯ t = a ¯ t ] Ψ s a ¯ , a ¯ ′ ( P ) = E P Q s , R 1 a ¯ , a ¯ ′ ( P ) ( L 0 ) .

Note that each of the regression expressions can also be written as a function of conditional densities by analytically carrying out the nested conditional expectations. For example, for simplicity assuming that L k is a discrete variable, we have

(3) Q s , L K a ¯ , a ¯ ′ = E P [ Q s , R K + 1 a ¯ , a ¯ ′ ∣ R ¯ K , Z ¯ K , L ¯ K − 1 , A ¯ t = a ¯ t ] = ∑ l K ψ s ( L ¯ K − 1 , l K ) p L K ( l k ∣ R ¯ K , Z ¯ K , L ¯ K − 1 , A ¯ K = a ¯ K ) .

The sequential regression formulation thus provides an alternative representation of the g-computation formula of Ψ s a ¯ , a ¯ ′ ( P ) as a functional of

Q ¯ s a ¯ , a ¯ ′ ( P ) = ( Q s , X a ¯ , a ¯ ′ ( P ) : X ∈ { L t , Z t , R t : t = 1 , … , K } ) .

While targeting the same g-computation formula, the plug-in estimation with sequential regression requires only estimating regression models, Q s , X a ¯ , a ¯ ′ ( P ) , of P instead of the factorized data likelihood (1), which leads to a potential gain in scalability when conditional density estimation is challenging. However, this comes at a cost of capturing less information from the data distribution P and only works for one target parameter at a time. When targeting multiple g-computation formulas, such as those defined in longitudinal settings for different outcomes Y s = ψ s ( L ¯ K ) across multiple time points and for different interventions defined by a ¯ , a ¯ ′ , Γ ¯ a ¯ , and Γ ¯ a ¯ ′ , the sequential regression approach has to define and estimate a different set of intermediate regression components Q ¯ s a ¯ , a ¯ ′ ( P ) for each dimension s of the outcome of interest and each random intervention ( a ¯ , Γ a ¯ ′ ) of interest. This potentially variational dependent representation is not intuitive, and it potentially leads to results which do not obey the parameter space; for example, when estimating a survival curve, sequential regression when independently applied to different time horizons may yield survival functions that are increasing. While the resulting estimates still respect the marginal parameter spaces for each dimension of the target parameter due to substitution estimation, sequential regression cannot guarantee the “boundedness” property [21] for the joint parameter space. The target parameter is usually multi-dimensional in longitudinal mediation analysis. For example, when the outcome of interest is the whole survival process Y ¯ = ( Y t = I T > t : t = 1 , … , K ) , the estimations of E [ Y t ( a ¯ , Γ ¯ a ¯ ′ ) ] are expected to be monotonic in t = 1 , 2 , … , K . Another example is the competing risks in Section 2.3, where the estimations also need to respect that E [ Y t ( 1 ) ( a ¯ , Γ ¯ a ¯ ′ ) ] + E [ Y t ( 2 ) ( a ¯ , Γ ¯ a ¯ ′ ) ] = E [ Y t * ( a ¯ , Γ ¯ a ¯ ′ ) ] ∈ [ 0 , 1 ] , increasing in t = 1 , … , K . Our TMLE algorithms on the other hand derive estimates of all target parameters from the same set of estimated likelihood factors and hence achieve the desired joint boundedness property. Finally, sequential regression relies on nested regression models starting at the final time point. While this first regression fit has the least data support (due to deviations from regimes of interest, right censoring, and potentially rare outcomes), challenges and potential biases in estimating the first sequential regression are inherited by subsequent regression fits.

3.4 Efficient influence curve (EIC)

Theorem 3.2

(EIC) For each pair of interventions a ¯ , a ¯ ′ and the sth element of the outcome vector Y ¯ , the EIC D s a ¯ , a ¯ ′ ( P ) of Ψ s a ¯ , a ¯ ′ at P is given by (some dependence on P is suppressed):

D s , L t a ¯ , a ¯ ′ = I A ¯ t = a ¯ t ∏ j = 1 t p A ( a j ∣ Pa ( A j ∣ a ¯ j − 1 ) ) ∏ j = 1 t p Z ( Z j ∣ Pa ( Z j ∣ a ¯ j ′ ) ) p Z ( Z j ∣ Pa ( Z j ∣ a ¯ j ) ) { Q s , R t + 1 a ¯ , a ¯ ′ ( R ¯ t , Z ¯ t , L ¯ t ) − Q s , L t a ¯ , a ¯ ′ ( R ¯ t , Z ¯ t , L ¯ t − 1 ) } D s , Z t a ¯ , a ¯ ′ = I A ¯ t = a ¯ t ′ ∏ j = 1 t p A ( a ′ ∣ Pa ( A J ∣ a ¯ j − 1 ′ ) ) ∏ j = 1 t − 1 p L ( L j ∣ Pa ( L j ∣ a ¯ j ) ) p L ( L j ∣ Pa ( L j ∣ a ¯ ′ ) ) ∏ j = 1 t p R ( R j ∣ Pa ( R j ∣ a ¯ j ) ) p R ( R j ∣ Pa ( R j ∣ a ¯ ′ ) ) × { Q L t a ¯ , a ¯ ′ ( R ¯ t , Z ¯ t , L ¯ t − 1 ) − Q Z t a ¯ , a ¯ ′ ( R ¯ t , Z ¯ t − 1 , L ¯ t − 1 ) } D s , R t a ¯ , a ¯ ′ = I A ¯ t = a ¯ t ∏ j = 1 t p A ( a j ∣ Pa ( A j ∣ a ¯ j − 1 ) ) ∏ j = 1 t − 1 p Z ( Z j ∣ Pa ( Z j ∣ a ¯ j ′ ) ) p Z ( Z j ∣ Pa ( Z j ∣ a ¯ j ) ) { Q s , Z t a ¯ , a ¯ ′ ( R ¯ t , Z ¯ t − 1 , L ¯ t − 1 ) − Q s , R t a ¯ , a ¯ ′ ( R ¯ t − 1 , Z ¯ t − 1 , L ¯ t − 1 ) } D s a ¯ , a ¯ ′ = D s , L 0 a ¯ , a ¯ ′ + ∑ t = 1 K D s , L t a ¯ , a ¯ ′ + D s , Z t a ¯ , a ¯ ′ + D s , R t a ¯ , a ¯ ′ .

The proof is given in Appendix A.

Note that here the Q functionals are defined as functions of conditional densities by iteratively carrying out all the integrals as in equation (3). For example,

Q s , L t a ¯ , a ¯ ′ ( R ¯ t , Z ¯ t , L ¯ t − 1 ) = ∑ l t , Ch ( l t ) \ a ¯ ψ s ( L ¯ t − 1 , l ¯ t K ) ∏ j = t K p L ( l j ∣ R ¯ t , Z ¯ t , L ¯ t − 1 , a ¯ j , r ¯ t + 1 j , z ¯ t + 1 j , l ¯ t j − 1 ) × ∏ j = t + 1 K p Z ( z j ∣ R ¯ t , Z ¯ t , L ¯ t − 1 , a ¯ j ′ , r ¯ t + 1 j , z ¯ t + 1 j − 1 , l ¯ t j − 1 ) × p R ( r j ∣ R ¯ t , Z ¯ t , L ¯ t − 1 , a ¯ j , r ¯ t + 1 j − 1 , z ¯ t + 1 j − 1 , l ¯ t j − 1 ) .

4 Initial estimation of data likelihood

In this section, we briefly describe the steps and considerations for constructing an initial estimator,

p n 0 = p n , L 0 0 ∏ t = 1 K p n , A t 0 ∏ t = 1 K p n , R t 0 p n , Z t 0 p n , L t 0 ,

for the conditional densities involved in the observed data likelihood (1) at P 0 .

For each binary variable X ∈ A ¯ ∪ R ¯ ∪ Z ¯ ∪ L ¯ , we have p X ( 1 ∣ Pa ( X ) ) = E [ X ∣ Pa ( X ) ] , and p X ( 0 ∣ Pa ( X ) ) = 1 − E [ X ∣ Pa ( X ) ] . For categorical variables, we define a sequence of binary dummy variables as X j = I { X = j } or X j = I { X = j , X ∉ { 1 , … , j − 1 } } , where we let Pa ( X j ) = Pa ( X ) ∪ { X s : s = 1 , … , j − 1 } , then the estimation of p X can be achieved by modeling the set of conditional expectations of { E [ X j ∣ Pa ( X j ) ] : j } . For continuous variables, the modeling of conditional densities is more challenging. The modeling options include parametric assumptions, as well as discretized conditional densities, which can be modeled with pooled hazard regression as specified in the study by Díaz Muñoz and van der Laan [44] where one specifies a model for E [ I { X ∈ [ α v − 1 , α v ) } ∣ A ≥ α v − 1 , Pa ( X ) ] over a grid ( α v : v = 0 , 1 , … ) , which spans the range of the continuous variable. Based on a library of models for all the variables, the super learner [45] can be applied to estimate E [ X ∣ Pa ( X ) ] either as a convex ensemble of a library of candidate learners (convex learner) or as the best estimator among the candidates (discrete learner), decided by the cross-validated loss performance. Due to the finite sample oracle inequality [45,46], the super learner is in general asymptotically equivalent to the oracle learner which is defined as the learner among the candidate learners that minimizes the loss under the true data-generating distribution.

The HAL [24,25] is a nonparametric estimator with fast convergence rates that can be applied to model the conditional expectation objects E [ X ∣ Pa ( X ) ] . It has been shown that under the assumption that the true conditional expectation is cádlàg and bounded in sectional variation norm, the corresponding HAL estimator E ˆ [ X ∣ Pa ( X ) ] has the rate of convergence ‖ E ˆ [ X ∣ Pa ( X ) ] − E [ X ∣ Pa ( X ) ] ‖ P 0 = O P ( n − 1 ⁄ 3 ) up to log ( n ) factors. Therefore, any binary or categorical density estimator p n , X 0 that is derived from a HAL (or a super learner that included HAL in the library of candidate learners) estimator of E [ X ∣ Pa ( X ) ] would preserve the same convergence rate of ‖ p 0 , X − p n , X 0 ‖ P 0 = O P ( n − 1 ⁄ 3 ) up to log ( n ) factors, where p 0 = p 0 , L 0 ∏ t = 1 K p 0 , A t ∏ t = 1 K p 0 , R t p 0 , Z t p 0 , L t is the factorized joint density at the true data distribution P 0 .

Finally, we note that recent work by Hejazi et al. [47] allows flexible estimation of conditional densities based on HAL for discrete and continuous variables. It is recommended to consider a super learner with a library that includes HAL-based estimators along with other density estimators such as kernel based or neural network-based estimators.

In addition to conditional density estimation, in Section 6.1, HAL is also used in an alternative representation of EIC (HAL-EIC).

5 Targeted maximum likelihood estimation

Given an initial density estimator p n 0 = p n , L 0 0 ∏ t = 1 K p n , A t 0 ∏ t = 1 K p n , R t 0 p n , Z t 0 p n , L t 0 , we now define the TMLE updates p n * = p n , L 0 * ∏ t = 1 K p n , A t * ∏ t = 1 K p n , R t * p n , Z t * p n , L t * such that the plug-in estimators ( Ψ s a ¯ , a ¯ ( P n * ) , Ψ s a ¯ , a ¯ ′ ( P n * ) , Ψ s a ¯ ′ , a ¯ ′ ( P n * ) : s = 1 , … , S ) are consistent and asymptotically linear under regularity assumptions.

5.1 Iterative TMLE

Define the following (locally) least favorable paths for the components of the likelihood:

p ˜ R t ( p R t , ε ¯ R t ) = 1 + ∑ s = 1 S ε s , R t a ¯ , a ¯ D s , R t a ¯ , a ¯ ( P ) + ε s , R t a ¯ , a ¯ ′ D s , R t a ¯ , a ¯ ′ ( P ) + ε s , R t a ¯ ′ , a ¯ ′ D s , R t a ¯ ′ , a ¯ ′ ( P ) p R t p ˜ Z t ( p Z t , ε ¯ Z t ) = 1 + ∑ s = 1 S ε s , Z t a ¯ , a ¯ D s , Z t a ¯ , a ¯ ( P ) + ε s , Z t a ¯ , a ¯ ′ D s , Z t a ¯ , a ¯ ′ ( P ) + ε s , Z t a ¯ ′ , a ¯ ′ D s , Z t a ¯ ′ , a ¯ ′ ( P ) p Z t p ˜ L t ( p L t , ε ¯ L t ) = 1 + ∑ s = 1 S ε s , L t a ¯ , a ¯ D s , L t a ¯ , a ¯ ′ ( P ) + ε s , L t a ¯ , a ¯ ′ D s , L t a ¯ , a ¯ ′ ( P ) + ε s , L t a ¯ ′ , a ¯ ′ D s , L t a ¯ ′ , a ¯ ′ ( P ) p L t .

Suppose that at each node X ∈ R ¯ ∪ Z ¯ ∪ L ¯ we have constructed a submodel of ℳ with a multi-dimensional parameter ε ¯ X = ( ε s , X a ¯ , a ¯ , ε s , X a ¯ , a ¯ ′ , ε s , X a ¯ ′ , a ¯ ′ : s = 1 , … , S ) , where p ˜ X ( p X , ε ¯ X = 0 ¯ ) = p X , and the scores d d ε ¯ X ∣ ε ¯ X = 0 log p ˜ X ( p X , ε ¯ X ) at ε ¯ X = 0 equal the vector of the corresponding components of the EIC, ( D s , X a ¯ , a ¯ , D s , X a ¯ , a ¯ ′ , D s , X a ¯ ′ , a ¯ ′ : s = 1 , … , S ) . The maximum likelihood solutions of the parametric submodels at p = p n 0 are for t = 1 , … , K ,

ε ¯ R t , n = argmax ε ¯ R t P n log p ˜ R t ( p n , R t 0 , ε ¯ R t ) ε ¯ Z t , n = argmax ε ¯ Z t P n log p ˜ Z t ( p n , Z t 0 , ε ¯ Z t ) ε ¯ L t , n = argmax ε ¯ L t P n log p ˜ L t ( p n , R t 0 , ε ¯ L t ) ,

which lead to the first round of TMLE updates

p ˜ n , R t = p ˜ R t ( p n , R t 0 , ε ¯ R t , n ) p ˜ n , Z t = p ˜ Z t ( p n , Z t 0 , ε ¯ Z t , n ) p ˜ n , L t = p ˜ L t ( p n , L t 0 , ε ¯ L t , n ) ,

and p ˜ n = p n , L 0 0 ∏ t = 1 K p n , A t 0 ∏ t = 1 K p ˜ n , R t p ˜ n , Z t p ˜ n , L t .

The procedure is repeated after replacing the initial estimator p n 0 with the TMLE update p ˜ n of the last iteration, till P n D s a ¯ , a ¯ ( P ˜ n ) → 0 , P n D s a ¯ , a ¯ ′ ( P ˜ n ) → 0 , P n D s a ¯ ′ , a ¯ ′ ( P ˜ n ) → 0 , for s = 1 , … , S . We now define the final TMLE update p n * as the update of the I th iteration, where I = I ( n ) is large enough so that P n D s a ¯ , a ¯ ( P n * ) , P n D s a ¯ , a ¯ ′ ( P n * ) , P n D s a ¯ ′ , a ¯ ′ ( P n * ) are all o P ( n − 1 ⁄ 2 ) . The last statement is an application of the Result 1 and Theorem 1 in the study by van der Laan and Rubin [48] under the regularity conditions that p ˜ n is in the interior of ℳ so that P n D s , X a ¯ , a ¯ ( P ˜ n ) p ˜ n , X = P n D s , X a ¯ , a ¯ ′ ( P ˜ n ) p ˜ n , X = P n D s , X a ¯ ′ , a ¯ ′ ( P ˜ n ) p ˜ n , X = 0 , s = 1 , … , S , and that ε ¯ X , n → 0 for all X ∈ R ¯ ∪ Z ¯ ∪ L ¯ as the log-likelihood components P n log p ˜ n , X increase and converge with iterations.

Algorithm 1 Iterative TMLE
1:	repeat
2:	for X ∈ R ¯ ∪ Z ¯ ∪ L ¯ do
3:	Calculate clever covariates H s , X I = D s , X I ( P n 0 ) , which are equal to the EIC components for the s th outcome s = 1 , … , S and intervention I = ( a ¯ , a ¯ ) , ( a ¯ , a ¯ ′ ) , ( a ¯ ′ , a ¯ ′ ) .
4:	Find MLE of ε ¯ X = ( ε s , X I : s , I ) along the multivariate submodel ( 1 + ε ¯ ⊺ H ¯ ) p
	ε ¯ X , n = arg max ε ¯ X P n log ( 1 + ∑ s , I ε s , X I H s , X I ) p n , X 0 .
5:	Define updates p ˜ n , X of p n , X 0 by plugging ε ¯ X , n back to the submodel.
6:	end for
7:	Replace p n 0 with updated p ˜ n .
8:	until P n D s I ( P ˜ n ) ≤ var n D s I ( P ˜ n ) n ⁄ log n for all s , I . Let P n * = P ˜ n .

5.2 One-step TMLE

We are interested to restrict the step size of ε ¯ in each iteration of the algorithm, which searches for the MLEs. This should increase the performance of the algorithm. In fact, with the Euclidian norm ‖ ε ¯ ‖ < d x for some small enough d x > 0 , due to linear approximations, we have the close form MLEs

ε ¯ X , n = ( P n D s , X a ¯ , a ¯ ( P n 0 ) , P n D s , X a ¯ , a ¯ ′ ( P n 0 ) , P n D s , X a ¯ ′ , a ¯ ′ ( P n 0 ) : s = 1 , … , S ) d x ⁄ ‖ P n D ¯ ( P n 0 ) ‖ ,

where ε ¯ = ( ε ¯ X : X ∈ R ¯ ∪ Z ¯ ∪ L ¯ ) is the vector of all elements of ε ¯ X ’s and D ¯ = ( D s , X a ¯ , a ¯ ( P n 0 ) , D s , X a ¯ , a ¯ ′ ( P n 0 ) , D s , X a ¯ ′ , a ¯ ′ ( P n 0 ) : X ∈ R ¯ ∪ Z ¯ ∪ L ¯ , s = 1 , … , S ) is the vector of all involved EIC components (Section 8.2 of [28]).

Denote by P d x the TMLE update of P under the restriction above that ‖ ε ¯ ‖ < d x , and by P 2 d x , P 3 d x , … the TMLE updates of the next iterations. Then we can design a univariate universal least favorable path P u l f m ( P , x ) with the following analytic expression:

p u l f m ( p , x ) = p exp ∫ 0 x { P n D ¯ ( P u l f m ( p , u ) ) } ⊤ D ¯ ( P u l f m ( p , u ) ) ‖ P n D ¯ ( P u l f m ( p , u ) ) ‖ d u ,

such that for all x ∈ ( 0 , a ) with some a > 0 , d d x P n log p u l f m ( p n 0 , x ) = ‖ P n D ¯ ( P u l f m ( P n 0 , x ) ) ‖ . Furthermore, { P j d x : j } approximates the universal path in the sense that P u l f m ( P n 0 , j d x ) = P j d x for j = 1 , 2 , … , which can be verified under mild regularity conditions with a Taylor expansion of p u l f m ( p n 0 , x ) at x = 0 , d x , 2 d x , … , while the log-likelihood P n log p j d x is increasing as j → ∞ .

We assume that { P u l f m ( P n 0 , x ) : x ∈ ( 0 , a ) } ⊂ ℳ is a submodel contained in ℳ and that the log-likelihood P n log p u l f m ( P n 0 , x ) , x ∈ ( 0 , a ) is nondecreasing along x = d x , 2 d x , … and achieves the closest local maximum at some x m < a , where for small enough d x and some j 0 we have j 0 d x = x m and ‖ P n D ¯ ( P j 0 d x ) ‖ = 0 . One can choose the final TMLE update to be P n * = P I d x with large enough I = I ( n ) such that ‖ P n D ¯ ( P I d x ) ‖ = o P ( n − 1 ⁄ 2 ) . This is a one-step procedure of searching for the closest local maximum x m of P n log p u l f m ( p n 0 , x ) around x = 0 , where ‖ P n D ¯ ( P ) ‖ = 0 is solved exactly at P = P u l f m ( P n 0 , x m ) ; hence the term “one-step TMLE”.

Similar to the iterative TMLE where each TMLE update P ˜ n stays in the interior of ℳ , the one-step TMLE indicates that interim updates P j d x ’s remain in the interior of ℳ . However, for one-step TMLE, it can be satisfied so long as P n 0 is an interior point of ℳ and d x is chosen small enough, whereas for iterative TMLE, it can be potentially violated due to unrestricted ε ¯ in likelihood maximization. By conceptually searching for a local likelihood maximum of a univariate submodel, and practically updating along known directions instead of iteratively solving multi-dimensional MLEs, the one-step TMLE achieves not only greater stability but also reduced computational costs.

Algorithm 2 One-step TMLE
1:	repeat
2:	for X ∈ R ¯ ∪ Z ¯ ∪ L ¯ do
3:	Calculate clever covariates H s , X I = P n D s , X I ( P n 0 ) for the s th outcome s = 1 , … , S and intervention I = ( a ¯ , a ¯ ) , ( a ¯ , a ¯ ′ ) , ( a ¯ ′ , a ¯ ′ ) .
4:	end for
5:	for X ∈ R ¯ ∪ Z ¯ ∪ L ¯ do
6:	Define known multivariate MLE of ε ¯ X = ( ε s , X I : s , I ) with small enough d x :
	ε s , X , n I = H s , X I d x ⁄ ∑ X H s , X I : s , I
7:	Define updates p ˜ n , X of p n , X 0 by plugging ε ¯ X , n back to the submodel.
8:	end for
9:	Replace p n 0 with updated p ˜ n .
10:	until P n D s I ( P ˜ n ) ≤ var n D s I ( P ˜ n ) n ⁄ log n for all s , I . Let P n * = P ˜ n .

5.3 Asymptotic efficiency and simultaneous inference

For the purpose of longitudinal mediation analysis, suppose that the target parameter is a vector Ψ ¯ F ( P ) of a differentiable transformations of ( Ψ s a ¯ , a ¯ ( P ) , Ψ s a ¯ , a ¯ ′ ( P ) , Ψ s a ¯ ′ , a ¯ ′ ( P ) : s = 1 , … , S ) , where the EIC vector D ¯ F ( P ) is calculated as functions of ( D s a ¯ , a ¯ ( P ) , D s a ¯ , a ¯ ′ ( P ) , D s a ¯ ′ , a ¯ ′ ( P ) : s = 1 , … , S ) according to the functional delta method (see A.3 of the study by van der Laan and Rose [34] and Section 3 of the study by van der Laan et al. [49]). In this section, we consider P n * to be an iterative or one-step TMLE, as described in Sections 5.1 and 5.2, which under regularity conditions achieves P n D s a ¯ , a ¯ ( P n * ) = o P ( n − 1 ⁄ 2 ) , P n D s a ¯ , a ¯ ′ ( P n * ) = o P ( n − 1 ⁄ 2 ) , P n D s a ¯ ′ , a ¯ ′ ( P n * ) = o P ( n − 1 ⁄ 2 ) for s = 1 , … , S .

We define the remainder for the target parameter as follows:

R ¯ F ( P , P 0 ) = Ψ ¯ F ( P ) − Ψ ¯ F ( P 0 ) + P 0 D ¯ F ( P ) .

The remainder for the target parameters are functions of the remainders ( R s a ¯ , a ¯ ( P , P 0 ) , R s a ¯ , a ¯ ′ ( P , P 0 ) , R s a ¯ ′ , a ¯ ′ ( P , P 0 ) : s = 1 , … , S ) for each of the corresponding targets in ( Ψ s a ¯ , a ¯ ( P ) , Ψ s a ¯ , a ¯ ′ ( P ) , Ψ s a ¯ ′ , a ¯ ′ ( P ) : s = 1 , … , S ) . The TMLE P n * satisfies the following expansion:

(4) Ψ ¯ F ( P n * ) − Ψ ¯ F ( P 0 ) = R ¯ F ( P n * , P 0 ) − P 0 D ¯ F ( P n * ) = ( P n − P 0 ) D ¯ F ( P 1 ) + ( P n − P 0 ) ( D ¯ F ( P n * ) − D ¯ F ( P 1 ) ) − P n D ¯ F ( P n * ) + R ¯ F ( P n * , P 0 ) .

Asymptotic linearity of P n * is achieved for Ψ ¯ F ( P ) under the following conditions:

The vector of EIC ( D s a ¯ , a ¯ ( P ) , D s a ¯ , a ¯ ′ ( P ) , D s a ¯ ′ , a ¯ ′ ( P ) : s = 1 , … , S ) at P = P n * converges to its limit at P = P 1 in L 2 ( P 0 ) norm on each dimension, and falls in a P 0 -Donsker class;
Elements of ( R s a ¯ , a ¯ ( P n 0 , P 0 ) , R s a ¯ , a ¯ ′ ( P n 0 , P 0 ) , R s a ¯ ′ , a ¯ ′ ( P n 0 , P 0 ) : s = 1 , … , S ) are o P ( n − 1 ⁄ 2 ) .

Asymptotic efficiency is achieved under the following additional condition:

The limit in Assumption (B1) is achieved at P 1 = P 0 .

We can then construct simultaneous confidence intervals based on the asymptotic linearity and the normal limit distribution [29,50]. Note that under the above conditions, the following empirical covariance matrix,

Σ ¯ ( P n , P n * ) = P n D ¯ F ( P n * ) D ¯ F ( P n * ) ⊤ ,

provides consistent estimation of the limit covariance matrix Σ ¯ ( P 0 , P 0 ) . Therefore, the 95 % simultaneous confidence interval can be constructed as follows:

Ψ ¯ F ( P n * ) ± q 0.95 , n σ ¯ n ⁄ n ,

where σ ¯ n is the vector of diagonal elements of { diag ( Σ ¯ ( P n , P n * ) ) } 1 ⁄ 2 , and q 0.95 , n can be a Monte-Carlo estimate of the 0.95 quantile of the maximum of element-wise absolute values of a random vector Z ¯ that follows multivariate normal with mean 0 ¯ and covariance matrix { diag ( Σ ¯ ( P n , P n * ) ) } − 1 ⁄ 2 Σ ¯ ( P n , P n * ) { diag ( Σ ¯ ( P n , P n * ) ) } − 1 ⁄ 2 .

6 Numerical improvements

One computational challenge of conducting TMLE updates for all factors of the likelihood simultaneously lies in the need of repeatedly calculating nested integrals with respect to conditional densities of the kind shown in equation (3). Here, we derive an alternative projection representation of EIC, which is called HAL-EIC, and then show that a numerical approximation of HAL-EIC can reduce the computational costs while preserving the asymptotic properties under mild conditions. Throughout this section, for notation simplicity, we focus on the EIC D a ¯ , a ¯ ′ of a single real-valued target parameter. But the methods also apply to multidimensional target parameters in our setting.

6.1 Numerical approximation of the EIC based on HAL-EIC

For any variable X ∈ R ¯ ∪ Z ¯ ∪ L ¯ let T X ( P ) = { f ∈ L 2 ( P ) : E [ f ∣ Pa ( X ) ] = 0 } denote the tangent subspace, and for any f ∈ L 2 ( P ) define the projection onto T X ( P ) with respect to L 2 ( P ) norm as ∏ ( f ∣ T X ( P ) ) = arg min h ∈ T X ( P ) P { f − h } 2 .

Lemma 6.1

(Projection representation of EIC) Define initial mappings as follows:

G L R a ¯ , a ¯ ′ = G L a ¯ , a ¯ ′ = G R a ¯ , a ¯ ′ = Y I A ¯ = a ¯ ∏ j = 1 K p A ( a j ∣ Pa ( A j ∣ a ¯ j − 1 ) ) ∏ j = 1 K p Z ( Z j ∣ Pa ( Z j ∣ a ¯ j ′ ) ) ∏ j = 1 K p Z ( Z j ∣ Pa ( Z j ∣ a ¯ j ) ) G Z a ¯ , a ¯ ′ = Y I A ¯ = a ′ ¯ ∏ j = 1 K p A ( a j ′ ∣ Pa ( A j ∣ a ¯ j − 1 ′ ) ) ∏ j = 1 K p L ( L j ∣ Pa ( L j ∣ a ¯ j ) ) ∏ j = 1 K p L ( L j ∣ Pa ( L j ∣ a ¯ j ′ ) ) ∏ j = 1 K p R ( R j ∣ Pa ( R j ∣ a ¯ j ) ) ∏ j = 1 K p R ( R j ∣ Pa ( R j ∣ a ¯ j ′ ) ) .

The following projection representation holds for t = 1 , … , K :

D L t a ¯ , a ¯ ′ ( P ) = ∏ ( G L a ¯ , a ¯ ′ ( P ) ∣ T L t ( P ) ) D R t a ¯ , a ¯ ′ ( P ) = ∏ ( G R a ¯ , a ¯ ′ ( P ) ∣ T R t ( P ) ) D Z t a ¯ , a ¯ ′ ( P ) = ∏ ( G Z a ¯ , a ¯ ′ ( P ) ∣ T Z t ( P ) ) .

The proof is given in Appendix B.

Note that the projection terms in Lemma 6.1 can be considered as true risk minimizers in tangent subspaces, if P is considered as the “true” distribution and the risk ‖ G X a ¯ , a ¯ ′ − f ‖ P 2 = P { G X a ¯ , a ¯ ′ − f } 2 is defined for all f ∈ T X ( P ) ⊂ L 2 ( P ) . Given an i.i.d. sample following P , the approximation of the projection terms can be done by empirical risk minimizers over a class of functions that contains the true EIC. Here, we adopt an additional regularity condition in order to introduce the HAL approximation of the EIC and to achieve fast convergence rates [25]:

For X ∈ R ¯ , Z ¯ , L ¯ , the corresponding EIC components at the true distribution P 0 , the initial estimator P n 0 , and at any TMLE update P ˜ n are cadlag with bounded sectional variation norm.

Under condition (B4), we construct the centered HAL basis as follows:

ϕ j , X ( P ) = I { Pa ( X ) ≥ Pa ( x ) ( μ j ) } ( I { X ≥ x ( μ j ) } − P ( X ≥ x ( μ j ) ∣ Pa ( X ) ) ) ,

where { u j : j } is the set of knot points on ( X , Pa ( X ) ) , and u j = ( x ( u j ) , Pa ( x ) ( u j ) ) are the corresponding subvectors. Note that these centered HAL bases satisfy (as we show in Appendix C): (1) T X ( P ) is spanned by the collection { ϕ j , X ( P ) : j } , and (2) given an i.i.d. sample of size N following P , the lasso regression of G X a ¯ , a ¯ ′ ( P ) uses a finite subset of { ϕ j , X ( P ) : j } with a bound on sectional variation norms decided by cross-validation can achieve a guaranteed convergence rate, where the lasso estimator

D ˜ X a ¯ , a ¯ ′ ( P ) = ∑ j = 1 J β ˆ j , X a ¯ , a ¯ ′ ( P ) ϕ j , X ( P )

approximates the EIC with ‖ D ˜ X a ¯ , a ¯ ′ ( P ) − D X a ¯ , a ¯ ′ ( P ) ‖ P = O P ( N − 1 ⁄ 3 ( log N ) d ⁄ 2 ) . We call the numerical approximation

D ˜ a ¯ , a ¯ ′ ( P ) = ∑ X ∈ R ¯ ∪ Z ¯ ∪ L ¯ D ˜ X a ¯ , a ¯ ′ ( P )

the HAL-EIC for D a ¯ , a ¯ ′ ( P ) . In what follows, we refer to the iterative and the one-step TMLE, which replaces the EIC with the HAL-EIC as iterative and one-step HAL-EIC TMLE, respectively.

Note that the approximation D ˜ X a ¯ , a ¯ ′ ( ⋅ ) ≈ D X a ¯ , a ¯ ′ ( ⋅ ) can be obtained for the initial estimate P n 0 or for a TMLE update P ˜ n . But the coefficients β ˆ j , X a ¯ , a ¯ ′ ( P ) of the LASSO estimator are functions of P , and thus, the estimation of the β coefficients requires an i.i.d. sample of N random vectors with joint distribution P . The resampling size N can be chosen to be larger than the observed sample size n .

To achieve computationally fast HAL-EIC TMLE updates, we note that data resampling of size N needs not happen for all P in real time as P changes. Instead, we define HAL-EIC with delayed coefficient estimation by

D ˜ X , P n 0 a ¯ , a ¯ ′ ( P ) = ∑ j = 1 J β ˆ j , X a ¯ , a ¯ ′ ( P n 0 ) ϕ j , X ( P ) ,

and

D ˜ P n 0 a ¯ , a ¯ ′ ( P ) = ∑ X ∈ R ¯ ∪ Z ¯ ∪ L ¯ D ˜ X , P n 0 a ¯ , a ¯ ′ ( P ) ,

where we keep the β coefficients unchanged with respect to a given initial estimate P n 0 . Resampling and re-estimation ofthese β coefficients will only happen when the value of P n 0 is changed. For the fast version of HAL-EIC TMLE, define P ˜ n as the (iterative or one-step) TMLE update of P n 0 that solves P n D ˜ X , P n 0 a ¯ , a ¯ ′ ( P ˜ n ) = o P ( n − 1 ⁄ 2 ) . Repeat the procedure for I iterations by replacing P n 0 with P ˜ n at the end of each iteration. Then, under the same regularity conditions of iterative or one-step TMLE, there exists a large enough integer I = I ( n ) such that at the I th iteration we have P n D ˜ X a ¯ , a ¯ ′ ( P ˜ n ) = P n D ˜ X , P ˜ n a ¯ , a ¯ ′ ( P ˜ n ) = o P ( n − 1 ⁄ 2 ) , and we define the final TMLE update P n * as the TMLE update P ˜ n of the I th iteration.

Algorithm 3 HAL-EIC (One-step) TMLE
1:	for X ∈ R ¯ ∪ Z ¯ ∪ L ¯ do
2:	Initialize numerical HAL-EIC D ˜ s , X , P fixed I ( P n 0 ) for X ∈ R ¯ ∪ Z ¯ ∪ L ¯ with fixed β coefficients fitted at generated i.i.d. samples from P fixed = P n 0 , for the s th outcome s = 1 , … , S and intervention I = ( a ¯ , a ¯ ) , ( a ¯ , a ¯ ′ ) , ( a ¯ ′ , a ¯ ′ ) . Calculate clever covariates H s , X I ( P n 0 ) = P n D ˜ s , X , P fixed I ( P n 0 ) .
3:	end for
4:	repeat
5:	repeat
6:	for X ∈ R ¯ ∪ Z ¯ ∪ L ¯ do
7:	Define known multivariate MLE of ε ¯ X = ( ε s , X I : s , I ) with small enough d x :
	ε s , X , n I = H s , X I ( P n 0 ) d x ⁄ ∑ X H s , X I ( P n 0 ) : s , I
8:	Define updates p ˜ n , X of p n , X 0 by plugging ε ¯ X , n back to the submodel.
9:	end for
10:	Replace p n 0 with updated p ˜ n .
11	until P n D ˜ s , P fixed I ( P ˜ n ) ≤ var n D ˜ s , P fixed I ( P ˜ n ) n ⁄ log n for all s , I .
12:	Refit β coefficients in the numerical EIC objects D ˜ s , X , P fixed I ( P ) and also update H s , X I ( P n 0 ) = D ˜ s , X , P fixed I ( P n 0 ) by replacing P fixed = P n 0 where P n 0 has just been replaced with the most recent TMLE update P ˜ n .
13:	until P n D ˜ s , P fixed I ( P ˜ n ) ≤ var n D ˜ s , P fixed I ( P ˜ n ) n ⁄ log n for all s , I with updated β coefficients.

The asymptotic linearity and efficiency of HAL-EIC TMLE are preserved under the following additional conditions on the resampling size N for the lasso estimators of the β coefficients (see Appendix D):

N = n and P 0 { p 0 − p ˜ n } 2 = o P ( n − 1 ⁄ 3 ) , or N ( n ) increases faster than n 3 ⁄ 2 , under the strong positivity condition that p 0 ( o ) > δ > 0 and p n * ( o ) > δ > 0 over the supports for some δ > 0 .

6.2 Modeling with summary covariates

Due to the curse of dimensionality and challenges of modeling conditional densities, it is of interest in practice to consider dimension reductions by introducing summary covariates. For example, for each node X ∈ { A ¯ , R ¯ , Z ¯ , L ¯ } , suppose that there exists a vector-valued deterministic function C X : Pa ( X ) ↦ C X ( Pa ( X ) ) that summarizes the information in Pa ( X ) hopefully without loosing information such that for all P ∈ ℳ :

(5) p X ( X ∣ Pa ( X ) ) = p X ( X ∣ C X ( Pa ( X ) ) ) .

Equation (5) can be considered as enforcing an extra restriction on the statistical model ℳ , where the number of independent variables in each conditional density can now be decided by the dimensionality of the summary vector C X , not necessarily increasing with the number of time points.

The summary covariates C X may be chosen in a data adaptive manner so that an analysis under assumption (5) can be conducted. For example, note that for a discretized categorical variable X with possible levels 1 , … , d X , a natural oracle choice of C X is

C X ( Pa ( X ) ) = ( P ( X = x ∣ Pa ( X ) ) : x ∈ { 1 , … , d X } ) ,

which satisfies (5) by iterated expectations. Intuitively, this observation is related to propensity score matching or covariate adjustment [51,52]. Although in practice such C X is not observed, and using estimated conditional probabilities might as well introduce bias, it is possible to augment C X with additional terms while utilizing HAL in modeling the conditional densities as functions of C X , so that the desired asymptotic properties can be preserved. We refer readers to the technical report of meta-HAL super-learners for theoretical details. Under mild conditions, the summary covariates C X can be obtained from training samples, and the resulting CV-TMLE [53,54] will still be locally efficient for the target parameter of interest.

In practice, the conditional density models may as well be achieved by applying actual knowledge of the data generating process. For example, if it is known that the propensity of prescribing a medicine only depends on recent onsets of a specific symptom and pre-existing conditions at the time of the prescription, then in this case, C A t is taking a subset of the vector Pa ( A t ) . It may be advisable to replace Pa ( A t ) with its interaction set with the most recent time points prior to t . Then the same expressions of g-computation formulas, EIC, and tangent subspaces hold, and algorithms in the previous sections apply.

7 Multiple robustness

Multiple robustness [33,55] of the proposed estimators is obtained if two of the following three sets of conditional density estimators: (1) p n , A t 0 , (2) p n , Z t 0 , (3) p n , R t 0 and p n , L t 0 , are correct or at least consistent with o P ( n − 1 ⁄ 4 ) error rates in L 2 ( P 0 ) norms. Then the remainders defined in ( R s a ¯ , a ¯ ( P , P 0 ) , R s a ¯ , a ¯ ′ ( P , P 0 ) , R s a ¯ ′ , a ¯ ′ ( P , P 0 ) : s = 1 , … , S ) are all o P ( n − 1 ⁄ 2 ) (Appendix E). However, in practice, the multiple robustness conditions for mediation analysis are not trivially satisfied even with randomized control trials, where only p n , A t 0 ’s are guaranteed to be correct or consistently estimated. Therefore, it is recommended to include HAL as one of the estimators in the super learner for P n 0 , so that the error rate conditions are all satisfied [25,45,49].

As a variant of the current setting, if one were to define a data adaptive framework so that the mediator random intervention Γ t a ¯ ′ ( z t ∣ r ¯ t , z ¯ t − 1 , l ¯ t − 1 ) = p ( Z t ( a ¯ ′ ) = z t ∣ R ¯ t ( a ¯ ′ ) = r ¯ t , Z ¯ t − 1 ( a ¯ ′ ) = z ¯ t − 1 , L ¯ t − 1 ( a ¯ ′ ) = l ¯ t − 1 ) = p Z t ( z t ∣ Pa ( z t ∣ a ¯ t ′ ) ) is replaced by, for example, an estimated control group mediator distribution Γ n , t a ¯ ′ ( z t ∣ r ¯ t , z ¯ t − 1 , l ¯ t − 1 ) = p n , Z t 0 ( z t ∣ Pa ( z t ∣ a ¯ t ′ ) ) , then the counterfactuals X ( a ¯ , Γ ¯ n a ¯ ′ ) and resulting target parameters would also become data adaptive [54]. If it is further assumed that Γ t a ¯ ′ = Γ n , t a ¯ ′ , then it is a generalization of [12] to longitudinal data. In those generalizations, the multiple robustness conditions may be reliably satisfied in randomized trials with known treatment randomization and dynamic rules, but different interpretation follows for the new targets of inference, which now depends on the choice of mediator interventions. The influence curves and implementation for such generalized stochastic direct and indirect effects are discussed in Appendix A.1.

8 Simulations

In this section, we investigate the properties of the proposed algorithms in simulated data. Throughout this section, we focus on the following data structure,

O = ( L 01 , L 02 , A 1 C , A 1 E , R 1 , Z 1 , Y 1 , A 2 C , A 2 E , R 2 , Z 2 , Y 2 ) ∼ P 0 .

We focus on the survival outcome Y ¯ = ( Y 1 , Y 2 ) , where Y 1 and Y 2 are binary such that the event { Y 1 = 1 } implies { Y 2 = 1 } . We target the multivariate parameter ( E [ Y ¯ ( 1 ¯ , Γ 1 ¯ ) ] , E [ Y ¯ ( 1 ¯ , Γ 0 ¯ ) ] , E [ Y ¯ ( 0 ¯ , Γ 0 ¯ ) ] ) . We calculate the TMLE (one-step TMLE with restricted step sizes) using both the exact EIC and the HAL-EIC, and the g-formula plug-in with different correct or misspecified initial estimators. This subsection aims to verify the consistency, asymptotic linearity, and multiple robustness properties for both exact EIC or HAL-EIC based TMLE.

Within each iteration, we generate an i.i.d. sample of size n = 1,000 with respect to the following data generating process:

L 01 ∼ Bernoulli ( 0.4 ) L 02 ∼ Bernoulli ( 0.6 ) A t C ∼ Bernoulli ( expit ( 1.5 − 0.4 L 01 − 0.8 L 02 + I t > 1 0.5 A t − 1 E ) ) A t E ∼ Bernoulli ( expit ( λ ( − 0.55 + 0.35 L 01 + 0.6 L 02 − I t > 1 0.05 A t − 1 E ) ) ) R t ∼ Bernoulli ( expit ( − 0.8 + 0.1 L 01 + I t = 1 0.3 L 02 + I t > 1 0.3 R t − 1 + A t E ) ) Z t ∼ Bernoulli ( expit ( − 0.25 + 0.4 L 02 + 0.4 A t E + 0.5 R t ) ) Y t ∼ Bernoulli ( expit ( 0.05 + 0.375 L 02 + 0.25 R t − 0.075 A t E − 0.075 Z t − I t > 1 0.025 R t − 1 ) ) ,

where we also vary the value of the propensity scaling factor λ from 1 to 5 to simulate different degrees of finite-sample near-violation of the positivity assumptions. Each scenario iterates for 1,000 times and detailed results are reported in Appendix F.

8.1 Multiple robustness

We present results of simulation study, which is a proof-of-concept study for the basic multiple robustness of exact-EIC TMLE and the comparable performance of HAL-EIC TMLE. Model misspecification was enforced to: (1) none of the conditional density estimators; (2) initial p A estimators; (3) initial p Z estimators; and (4) initial p Y estimators. Correct conditional density models were fitted with correct main-term logistic regressions. Misspecified models set conditional expectations of each of the variables given the past as observed sample means with an additional bias of 0.05 while bounded between ( 0.01 , 0.99 ) . Exact-EIC and HAL-EIC based TMLE achieved similar performance in all four scenarios (Tables A1, A2, A3, A4).

Table A1

Misspecification: None

E [ Y 1 ( 1 ¯ , Γ 1 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0001	0.0274	0.0008	0.9465	0.1064
HAL-EIC Initial	0.0001	0.0274	0.0008	0.9455	0.1060
Exact-EIC TMLE	0.0000	0.0274	0.0008	0.9465	0.1064
HAL-EIC TMLE	0.0001	0.0274	0.0008	0.9455	0.1060

E [ Y 2 ( 1 ¯ , Γ 1 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0003	0.0282	0.0008	0.9687	0.1259
HAL-EIC Initial	0.0004	0.0283	0.0008	0.9444	0.1085
Exact-EIC TMLE	0.0005	0.0323	0.0010	0.9414	0.1251
HAL-EIC TMLE	0.0004	0.0283	0.0008	0.9444	0.1085

E [ Y 1 ( 1 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0001	0.0275	0.0008	0.9455	0.1086
HAL-EIC Initial	0.0000	0.0275	0.0008	0.9434	0.1070
Exact-EIC TMLE	‒ 0.0001	0.0276	0.0008	0.9455	0.1085
HAL-EIC TMLE	0.0000	0.0275	0.0008	0.9434	0.1070

E [ Y 2 ( 1 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0003	0.0287	0.0008	0.9737	0.1327
HAL-EIC Initial	0.0003	0.0287	0.0008	0.9404	0.1108
Exact-EIC TMLE	0.0002	0.0347	0.0012	0.9303	0.1306
HAL-EIC TMLE	0.0003	0.0287	0.0008	0.9404	0.1108

E [ Y 1 ( 0 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0021	0.0255	0.0007	0.9606	0.1056
HAL-EIC Initial	0.0021	0.0255	0.0007	0.9545	0.1031
Exact-EIC TMLE	0.0020	0.0261	0.0007	0.9576	0.1055
HAL-EIC TMLE	0.0021	0.0255	0.0007	0.9545	0.1031

E [ Y 2 ( 0 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	‒ 0.0006	0.0267	0.0007	0.9808	0.1290
HAL-EIC Initial	‒ 0.0006	0.0268	0.0007	0.9414	0.1051
Exact-EIC TMLE	‒ 0.0005	0.0338	0.0011	0.9222	0.1266
HAL-EIC TMLE	‒ 0.0006	0.0268	0.0007	0.9414	0.1051

Table A2

Misspecification: A

E [ Y 1 ( 1 ¯ , Γ 1 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0002	0.0274	0.0008	0.8983	0.0894
HAL-EIC Initial	0.0002	0.0275	0.0008	0.9003	0.0889
Exact-EIC TMLE	0.0002	0.0274	0.0008	0.9003	0.0894
HAL-EIC TMLE	0.0002	0.0275	0.0008	0.9003	0.0889

E [ Y 2 ( 1 ¯ , Γ 1 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0004	0.0284	0.0008	0.8983	0.0940
HAL-EIC Initial	0.0003	0.0284	0.0008	0.8602	0.0865
Exact-EIC TMLE	0.0007	0.0318	0.0010	0.8592	0.0937
HAL-EIC TMLE	0.0003	0.0284	0.0008	0.8602	0.0865

E [ Y 1 ( 1 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0001	0.0275	0.0008	0.9075	0.0914
HAL-EIC Initial	0.0002	0.0276	0.0008	0.9034	0.0903
Exact-EIC TMLE	0.0000	0.0277	0.0008	0.9096	0.0913
HAL-EIC TMLE	0.0002	0.0276	0.0008	0.9034	0.0903

E [ Y 2 ( 1 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0003	0.0289	0.0008	0.9013	0.0990
HAL-EIC Initial	0.0003	0.0289	0.0008	0.8654	0.0890
Exact-EIC TMLE	0.0004	0.0333	0.0011	0.8602	0.0983
HAL-EIC TMLE	0.0003	0.0289	0.0008	0.8654	0.0890

E [ Y 1 ( 0 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0021	0.0255	0.0007	0.9609	0.1048
HAL-EIC Initial	0.0022	0.0255	0.0007	0.9579	0.1043
Exact-EIC TMLE	0.0021	0.0255	0.0007	0.9568	0.1048
HAL-EIC TMLE	0.0022	0.0255	0.0007	0.9579	0.1043

E [ Y 2 ( 0 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	‒ 0.0005	0.0268	0.0007	0.9712	0.1187
HAL-EIC Initial	‒ 0.0004	0.0269	0.0007	0.9353	0.1012
Exact-EIC TMLE	‒ 0.0001	0.0298	0.0009	0.9466	0.1185
HAL-EIC TMLE	‒ 0.0004	0.0269	0.0007	0.9353	0.1012

Table A3

Misspecification: Z

E [ Y 1 ( 1 ¯ , Γ 1 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	‒ 0.0000	0.0274	0.0007	0.9477	0.1064
HAL-EIC Initial	0.0001	0.0274	0.0008	0.9443	0.1061
Exact-EIC TMLE	0.0001	0.0274	0.0007	0.9477	0.1064
HAL-EIC TMLE	0.0002	0.0274	0.0008	0.9455	0.1061

E [ Y 2 ( 1 ¯ , Γ 1 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0003	0.0287	0.0008	0.9682	0.1260
HAL-EIC Initial	0.0003	0.0287	0.0008	0.9318	0.1087
Exact-EIC TMLE	0.0007	0.0329	0.0011	0.9364	0.1253
HAL-EIC TMLE	0.0005	0.0287	0.0008	0.9352	0.1086

E [ Y 1 ( 1 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0016	0.0274	0.0008	0.9511	0.1065
HAL-EIC Initial	0.0017	0.0274	0.0008	0.9455	0.1062
Exact-EIC TMLE	0.0002	0.0277	0.0008	0.9455	0.1068
HAL-EIC TMLE	0.0008	0.0274	0.0008	0.9420	0.1062

E [ Y 2 ( 1 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0017	0.0287	0.0008	0.9693	0.1261
HAL-EIC Initial	0.0017	0.0287	0.0008	0.9341	0.1094
Exact-EIC TMLE	‒ 0.0000	0.0345	0.0012	0.9273	0.1263
HAL-EIC TMLE	0.0008	0.0290	0.0008	0.9284	0.1094

E [ Y 1 ( 0 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0041	0.0257	0.0007	0.9568	0.1055
HAL-EIC Initial	0.0041	0.0258	0.0007	0.9534	0.1031
Exact-EIC TMLE	0.0029	0.0261	0.0007	0.9580	0.1055
HAL-EIC TMLE	0.0031	0.0255	0.0007	0.9591	0.1031

E [ Y 2 ( 0 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0016	0.0277	0.0008	0.9773	0.1291
HAL-EIC Initial	0.0016	0.0278	0.0008	0.9352	0.1048
Exact-EIC TMLE	0.0006	0.0342	0.0012	0.9136	0.1268
HAL-EIC TMLE	0.0006	0.0273	0.0007	0.9455	0.1048

Table A4

Misspecification: Y

E [ Y 1 ( 1 ¯ , Γ 1 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0532	0.0186	0.0032	0.5111	0.1072
HAL-EIC Initial	0.0533	0.0188	0.0032	0.5069	0.1070
Exact-EIC TMLE	0.0003	0.0273	0.0007	0.9504	0.1068
HAL-EIC TMLE	0.0001	0.0276	0.0008	0.9409	0.1063

E [ Y 2 ( 1 ¯ , Γ 1 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0494	0.0183	0.0028	0.8110	0.1281
HAL-EIC Initial	0.0494	0.0183	0.0028	0.6473	0.1105
Exact-EIC TMLE	0.0005	0.0320	0.0010	0.9525	0.1263
HAL-EIC TMLE	‒ 0.0005	0.0279	0.0008	0.9472	0.1118

E [ Y 1 ( 1 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0549	0.0186	0.0034	0.4984	0.1096
HAL-EIC Initial	0.0549	0.0187	0.0034	0.4794	0.1082
Exact-EIC TMLE	0.0005	0.0276	0.0008	0.9493	0.1090
HAL-EIC TMLE	0.0009	0.0278	0.0008	0.9440	0.1074

E [ Y 2 ( 1 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0508	0.0183	0.0029	0.8332	0.1354
HAL-EIC Initial	0.0508	0.0183	0.0029	0.6547	0.1134
Exact-EIC TMLE	0.0002	0.0342	0.0012	0.9356	0.1319
HAL-EIC TMLE	‒ 0.0001	0.0283	0.0008	0.9535	0.1147

E [ Y 1 ( 0 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0590	0.0186	0.0038	0.3749	0.1067
HAL-EIC Initial	0.0590	0.0188	0.0038	0.3601	0.1045
Exact-EIC TMLE	0.0021	0.0260	0.0007	0.9599	0.1057
HAL-EIC TMLE	0.0020	0.0255	0.0007	0.9588	0.1037

E [ Y 2 ( 0 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0536	0.0183	0.0032	0.7635	0.1330
HAL-EIC Initial	0.0537	0.0184	0.0032	0.5586	0.1105
Exact-EIC TMLE	‒ 0.0005	0.0342	0.0012	0.9166	0.1277
HAL-EIC TMLE	‒ 0.0002	0.0271	0.0007	0.9599	0.1119

Additional simulations with more time points and continuous, time-varying covariates are reported in Appendix G, to illustrate finite-sample performance with moderate dimensions and sample sizes.

8.2 Finite sample positivity challenge

As we are pushing the positivity parameter λ from 1 to 5, the true treatment propensity score gets closer to 0, and therefore in finite samples, we see that the product of the inverse estimated propensity scores at the initial estimate p n 0 and the follow-up updates p ˜ n violate the boundedness assumptions (B2), (B4), and (B5).

Interestingly, in all scenarios ( λ = 1 , 2 , 3 , 4 , 5 ) for all dimensions of the target parameter, HAL-EIC TMLE had less increase in MSE and less drop in confidence interval coverage compared to exact-EIC TMLE (Figures 1 and 2). This illustrates one potential advantage of the HAL-based projection representation: the HAL algorithm automatically searches for a bounded EIC approximation while maintaining the asymptotic linearity of the final TMLE estimate. This is a desirable property and more appealing than to arbitrarily set bounds on the IPW value or on the influence curve estimates. The latter tend to create asymptotic bias, while the HAL-EIC maintains desired multiple robustness.

Figure 1

MSE comparison with increasing finite-sample positivity violation.

Figure 2

Coverage comparison with increasing finite-sample positivity violation.

Further investigation is needed to study the performance of the HAL-EIC in more complex scenarios such as rare events in combination of finite-sample positivity violation. This and further numerical optimizations will be addressed in future venue.

9 Discussion

We construct a scalable, likelihood, and random intervention based estimation framework for longitudinal mediation. One advantage of our framework is its flexibility being a likelihood based method. This leads to an estimation procedure that is able to target multidimensional parameters simultaneously by updating estimators of the same estimate of the longitudinal likelihood process. Moreover, our HAL-EIC approximation further reduces the burden on practitioners, as the required input of the algorithms is reduced to modeling constraints of conditional distributions, which is a well-known task for data scientists and can be implemented without further understanding of either the analytic calculation of the EIC or the sequential regression equations.

An important extension of our work is the application of collaborative TMLE [23], which can be crucial in longitudinal problems where finite-sample positivity violations occur [56]. While the numerical HAL-EIC representation proposed in Section 6.1 implicitly searches for an optimal bound of the sectional variation norms on the EIC components and demonstrates robustness in the simulation, collaborative TMLE may be combined with the proposed estimators to achieve even more reliable inference. Previous results show that this is a promising direction to optimize the finite-sample performance [57].

Another idea that needs further work is a data adaptive choice of a dimension reduction algorithm for applications with high dimensional time-varying covariates. This can be conducted so that scalability is further improved while maintaining asymptotic properties for the estimators of the target parameters.

Future research should also consider generalize to continuous time TMLE [58] for applications where it is of interest to apply longitudinal mediation analysis to data structures where events are happening and are recorded in continuous time. Higher order TMLE [59] should be utilized to improve efficiency.

Acknowledgements

This research is partially funded by NIH-grant R01AI074345-10A1. Additional funding was provided by a philanthropic gift from Novo Nordisk to the Center for Targeted Machine Learning and Causal Inference at UC Berkeley.

Funding information: This research is partially funded by NIH-grant R01AI074345-10A1. Additional funding was provided by a philanthropic gift from Novo Nordisk to the Center for Targeted Machine Learning and Causal Inference at UC Berkeley.
Author contributions: All authors have accepted responsibility for the entire content of this manuscript and consented to its submission to the journal, reviewed all the results and approved the final version of the manuscript. The study and manuscript have received significant contributions from all authors.
Conflict of interest: Prof. Maya Petersen and Prof. Mark van der Laan are members of the Editorial Advisory Board of the Journal of Causal Inference but were not involved in the review process of this article.

Appendix A Efficient influence curves (EIC)

To prove the EIC representation in Section 3.4, we utilize the pathwise differentiability of Ψ s a ¯ , a ¯ ′ ( P ) and have that (suppressing its dependence on a ¯ , a ¯ ′ , s , and let Y = ψ s ( L ¯ K ) )

d Ψ ( P ε h ) d ε ε = 0 = ⟨ D ( P ) , h ⟩ P

for all submodel P ε h through P , such that h = d d ε ∣ ε = 0 log p ε h . Note that h ∈ T ( P ) can be decomposed as h = ∑ X h X where X ∈ L ¯ ∪ R ¯ ∪ Z ¯ , which corresponds to the factorization of the submodel

p ε h = ∏ X ∈ L ¯ ∪ R ¯ ∪ Z ¯ p ε , X h = ∏ X ∈ L ¯ ∪ R ¯ ∪ Z ¯ p X ( 1 + ε h X ) .

For all h ∈ T ( P ) such that h Z t = 0 for all t = 1 , … , K , we have that

d Ψ ( P ε h ) d ε ∣ ε = 0 = d Ψ ( ∏ t = 1 K p Z t ∏ X ∈ L ¯ ∪ R ¯ p ε , X h ) d ε ε = 0 = d ∑ l ¯ , z ¯ , r ¯ y ∏ X ∈ L ¯ ∪ R ¯ p ε , X h X ( x ∣ Pa ( x ∣ a ¯ ) ) ∏ t = 1 K I A t = a t ∏ t = 1 K p Z t ( z t ∣ Pa ( z t ∣ a ¯ t ′ ) ) d ε ε = 0 = ∑ X ∈ L ¯ ∪ R ¯ D X ( P ) , ∑ X ∈ L ¯ ∪ R ¯ h X P ,

which corresponds to the pathwise derivative of the treatment specific mean under interventions A t = a t and Z t ∼ p Z t ( Z t ∣ Pa ( Z t ∣ a ¯ ′ ) ) . Apply Γ t ( Z t ∣ Pa ( Z t ) ) = p Z t ( Z t ∣ Pa ( Z t ∣ a ¯ ′ ) ) in Lemma A.1. This gives the efficient influence function ∑ X ∈ L ¯ ∪ R ¯ D X ( P ) as specified in Section 3.4.

On the other hand, to calculate D Z t ( P ) , consider h ∈ T ( P ) such that h X = 0 for all X ∈ L ¯ ∪ R ¯ . Then

d Ψ ( P ε h ) d ε ε = 0 = d ∑ l ¯ , z ¯ , r ¯ y ∏ t = 1 K I { A t = a t ′ } ∏ X ∈ L ¯ ∪ R ¯ p X ( x ∣ Pa ( x ∣ a ¯ ) ) ∏ t = 1 K p ε , Z t h Z t ( z t ∣ Pa ( z t ∣ a ¯ t ′ ) ) d ε ε = 0 = ∑ t = 1 K D Z t ( P ) , ∑ t = 1 K h Z t P ,

which corresponds to the pathwise derivative of the treatment specific mean under interventions A t = a t ′ and X ∼ p X ( X ∣ Pa ( X ∣ a ¯ ) ) for all X ∈ L ¯ ∪ R ¯ . This proves the rest of the EIC components ∑ t = 1 K D Z t ( P ) in Section 3.4.

Finally, note that the orthogonal decomposition gives that ⟨ D Z t ( P ) , ∑ X ∈ L ¯ ∪ R ¯ h X ⟩ P = 0 and ⟨ D R t ( P ) + D L t ( P ) , ∑ X ∈ Z ¯ h X ⟩ P = 0 . Therefore, d Ψ ( P ε h ) d ε ∣ ε = 0 = ⟨ ∑ t = 1 K D Z t ( P ) + D R t ( P ) + D L t ( P ) , h ⟩ P for all h ∈ T ( P ) . This proves that D ( P ) = ∑ X ∈ L ¯ ∪ R ¯ ∪ Z ¯ D X ( P ) is the gradient and hence the unique canonical gradient in the nonparametric tangent space T ( P ) .

A.1 EIC of other longitudinal interventions

A.1.1 Controlled direct effect (CDE)

Suppose that we replace the random intervention Z t ( a ¯ , Γ ¯ a ¯ ′ ) ∼ Γ t a ¯ ′ in 2.1 with enforcing a random draw Z t ( a ¯ , Γ ¯ ) ∼ Γ t that does not depend on the control group counterfactuals. A typical choice is letting Γ t be a degenerated discrete density function that puts probability 1 to a certain value z t , in which case the g-computation formula for Y ( a ¯ , Γ ¯ ) is identical to that of Y ( a ¯ , z ¯ ) following static intervention. Another example is when we believe that some estimator Γ ˆ t is a satisfactory proxy for the unobserved Γ ¯ a ¯ ′ , and we focus on Y ( a ¯ , Γ ¯ = Γ ˆ ¯ ) despite the caveat that interpretation may be limited when Γ ˆ ¯ is too deviated from Γ ¯ a ¯ ′ . In general, these correspond to some joint intervention on A ¯ and Z ¯ , where the (random/stochastic) CDE,

E [ Y ( a ¯ , Γ ¯ ) − Y ( a ¯ ′ , Γ ¯ ) ] ,

becomes a standard average treatment effect (ATE) parameter under random intervention of both A ¯ and Z ¯ . For either Ψ ( P ) = E [ Y ( a ¯ , Γ ¯ ) ] or E [ Y ( a ¯ ′ , Γ ¯ ) ] , p Z t becomes nuisance in the sense that any submodel P ε h with score h ∈ T Z ( P ) leads to d Ψ ( P ε h ) d ε = 0 . This leads to D ( P ) ⊥ T Z ( P ) and D Z ( P ) = 0 . For X ∈ L ¯ ∪ R ¯ , similar proof of EIC decomposition over X ∈ L ¯ ∪ R ¯ applies by replacing Z t ∼ p Z t ( Z t ∣ Pa ( Z t ∣ a ¯ ′ ) ) with Z t ∼ Γ t ( Z t ∣ Pa ( Z t ) ) (Lemma A.1). The same methodology for NIE and NDE applies except that the EIC D ( P ) now has D Z t = 0 and that a known and fixed function, Γ t ( Z t ∣ Pa ( Z t ) ) , replacing p Z t ( Z t ∣ Pa ( Z t ∣ a ¯ ′ ) ) in D R t ( P ) and D L t ( P ) .

A.1.2 General longitudinal stochastic intervention

For the average treatment effect of a general longitudinal stochastic intervention (which could be a mixture of fixed and random interventions), suppose that A ¯ denotes the static intervention nodes and Z ¯ denotes the random intervention nodes onto the SCM in Section 2.1. At the contrast of ( a ¯ , Γ ¯ ( 1 ) ) and ( a ¯ ′ , Γ ¯ ( 2 ) ) with distinct Γ ¯ ( 1 ) and Γ ¯ ( 2 ) , one can define the generalized ATE under this mixed intervention as follows:

E [ Y ( a ¯ , Γ ¯ ( 1 ) ) − Y ( a ¯ ′ , Γ ¯ ( 2 ) ) ] .

For Γ ¯ = Γ ¯ ( 1 ) or Γ ¯ ( 2 ) , we have the following lemma for the EIC of Ψ ( P ) = E [ Y ( a ¯ , Γ ¯ ) ] .

Lemma A.1

For X ¯ = R ¯ or L ¯ and Ψ ( P ) = E [ Y ( a ¯ , Γ ¯ ) ] , where Γ ¯ is a set of known density function for Z ¯ (in the sense that Z ¯ ( O ) is a fixed function of O), assume that E [ Y ( a ¯ , Γ ¯ ) ] is identified by the similar g-computation formula as Equation (2) but replacing p Z ( Z t ∣ Pa ( Z t ∣ a ¯ ′ ) ) with Γ t ( Z t ∣ Pa ( Z t ) ) . Define

Q R K + 1 a ¯ , Γ ¯ = Y Q L t a ¯ , Γ ¯ ( R ¯ t , Z ¯ t , L ¯ t − 1 ) = E P a ¯ , Γ ¯ [ Q R t + 1 a ¯ , Γ ¯ ∣ R ¯ t , Z ¯ t , L ¯ t − 1 ] = E P [ Q R t + 1 a ¯ , Γ ¯ ∣ R ¯ t , Z ¯ t , L ¯ t − 1 , A ¯ t = a ¯ t ] Q Z t a ¯ , Γ ¯ ( R ¯ t , Z ¯ t − 1 , L ¯ t − 1 ) = E P a ¯ , Γ ¯ [ Q L t a ¯ , Γ ¯ ∣ R ¯ t , Z ¯ t − 1 , L ¯ t − 1 ] = ∑ z t Q L t a ¯ , Γ ¯ Γ t ( z t ∣ Pa ( Z t ) ) Q R t a ¯ , Γ ¯ ( R ¯ t − 1 , Z ¯ t − 1 , L ¯ t − 1 ) = E P a ¯ , Γ ¯ [ Q Z t a ¯ , Γ ¯ ∣ R ¯ t − 1 , Z ¯ t − 1 , L ¯ t − 1 ] = E P [ Q Z t a ¯ , Γ ¯ ∣ R ¯ t − 1 , Z ¯ t − 1 , L ¯ t − 1 , A ¯ t = a ¯ t ] Ψ a ¯ , Γ ¯ ( P ) = E P Q R 1 a ¯ , Γ ¯ ( P ) ( L 0 )

and

D L t a ¯ , Γ ¯ = I A ¯ t = a ¯ t ∏ j = 1 t p A ( a j ∣ Pa ( A j ∣ a ¯ j − 1 ) ) ∏ j = 1 t Γ t ( Z j ∣ Pa ( Z j ) ) p Z ( Z j ∣ Pa ( Z j ∣ a ¯ j ) ) { Q R t + 1 a ¯ , Γ ¯ ( R ¯ t , Z ¯ t , L ¯ t ) − Q L t a ¯ , Γ ¯ ( R ¯ t , Z ¯ t , L ¯ t − 1 ) } D R t a ¯ , Γ ¯ = I A ¯ t = a ¯ t ∏ j = 1 t p A ( a j ∣ Pa ( A j ∣ a ¯ j − 1 ) ) ∏ j = 1 t − 1 Γ t ( Z j ∣ Pa ( Z j ) ) p Z ( Z j ∣ Pa ( Z j ∣ a ¯ j ) ) { Q Z t a ¯ , Γ ¯ ( R ¯ t , Z ¯ t − 1 , L ¯ t − 1 ) − Q R t a ¯ , Γ ¯ ( R ¯ t − 1 , Z ¯ t − 1 , L ¯ t − 1 ) } .

Then the EIC of Ψ ( P ) is D a ¯ , Γ ¯ ( P ) = ∑ X t ∈ R ¯ ∪ L ¯ D X t a ¯ , Γ ¯ ( P ) .

B Projection representation of EIC

For Lemma 6.1, recall that in Appendix A, we proved that ∑ X ∈ R ¯ ∪ L ¯ D X ( P ) is the influence curve of the treatment specific mean (TSM) under interventions A t = a t and Z t ∼ p Z t ( Z t ∣ Pa ( Z t ∣ a ¯ ′ ) ) . For such TSM parameter, an influence curve in the semiparametric model, where p A and p Z are assumed known can be derived from the IPW estimator 1 n ∑ t = 1 K Y I A ¯ = a ¯ ∏ j = 1 K p A ( a j ∣ Pa ( A j ∣ a ¯ j − 1 ) ) ⋅ ∏ j = 1 K p Z ( Z j ∣ Pa ( Z j ∣ a ¯ j ′ ) ) ∏ j = 1 K p Z ( Z j ∣ Pa ( Z j ∣ a ¯ j ) ) [60]. Then the influence function in the nonparametric model is the projection of this semiparametric influence curve onto the tangent space. Therefore, for X ∈ R ¯ ∪ L ¯ ,

D X a ¯ , a ¯ ′ ( P ) = ∏ ( ∏ ( G L R a ¯ , a ¯ ′ ∣ L 0 2 ( P ) ) ∣ T X ( P ) ) = ∏ ( G L R a ¯ , a ¯ ′ ∣ T X ( P ) ) ,

where L 0 2 ( P ) is the collection of finite variance functions f ( O ) such that E P [ f ( O ) ] = 0 . The same applies to ∑ X ∈ Z ¯ D X ( P ) that is derived as the influence curve of the TSM under intervention A t = a t ′ , R t ∼ p R t ( R t ∣ Pa ( R t ∣ a ¯ ) ) , and L t ∼ p L t ( L t ∣ Pa ( L t ∣ a ¯ ) ) . Lemma 6.1 can also be verified with algebraic calculation using ∏ ( f ∣ T X ( P ) ) = E P [ f ∣ X , Pa ( X ) ] − E P [ f ∣ Pa ( X ) ] and expanding the integral terms of conditional expectations.

C Centered HAL basis

Suppose that D X * ( P ) is one of the EIC component listed in Section 3.4 associated with the tangent space T X ( P ) = { h ( X ∣ Pa ( X ) ) : E P { h ( X ∣ Pa ( X ) ) ∣ Pa ( X ) } = 0 } . In this section, we present an alternative representation of D X * ( P ) as a linear combination of centered HAL basis functions, and the TMLE updates based on the approximation.

If we have an initial gradient G X ( P ) such that its projection onto the tangent space ∏ ( G X ( P ) ∣ T X ( P ) ) equals the EIC component D X * ( P ) (Lemma 6.1 as an example), and if T X ( P ) is well approximated by the linear span of a set of basis functions { ϕ j , X ( P ) : j } , then we can have the following EIC representation:

D X * ( P ) = ∑ j β j , X ( P ) ϕ j , X ( P ) ,

where the coefficients β X ( P ) = ( β j , X ( P ) : j ) are defined by the least squared projection

β X ( P ) = arg min β P G X ( P ) − ∑ j β j ϕ j , X ( P ) 2 .

In practice, a large sample of size N can be generated from P , and a follow-up cross-validated lasso regression against a subset of { ϕ j , X ( P ) : j } of size J will decide at most N − 1 nonzero coefficients in the following approximated representation with a finite sectional variation norm

D ˆ X * ( P ) = ∑ j = 1 J β ˆ j , X ( P ) ϕ j , X ( P ) .

Under assumption (B4), T X ( P ) is a subspace of the space of cadlag functions of ( X , Pa ( X ) ) with bounded sectional variation norms, where the latter is the space spanned by the (uncentered) HAL basis in the following form (see Section 6.2 of the study by van der Laan et al. [26]):

ϕ j , X = I { U ( X ) ≥ μ j } ,

where U ( X ) = ( X , Pa ( X ) ) . u j is a knot point in the range of a subvector of U ( X ) , where each element of u j takes a value in R ∪ { − ∞ } . Conducting the projection onto T X ( P ) in two steps gives

D X * ( P ) = ∑ j β j , X ( P ) ϕ j , X ( P ) ,

where ϕ j , X ( P ) = Π ( ϕ j , X ∣ T X ( P ) ) = ϕ j , X − E P [ ϕ j , X ∣ Pa ( X ) ] = I { Pa ( X ) ≥ Pa ( x ) ( μ j ) } ( I { X ≥ x ( μ j ) } − P ( X ≥ x ( μ j ) ∣ Pa ( X ) ) ) . By the construction of finite-sample HAL estimator [26], using the regenerated i.i.d. sample following P of size N , a finite subset of { ϕ j , X ( P ) : j } of size J can be chosen such that the corresponding cross-validated lasso estimator D ˆ X * ( P ) = ∑ j J β ˆ j , X ( P ) ϕ j , X ( P ) satisfies the convergence rate ‖ D ˆ X * ( P ) − D X * ( P ) ‖ P = O P ( N − 1 ⁄ 3 ( log N ) d ⁄ 2 ) .

D HAL-EIC TMLE

Note that there is only one additional term imposed to the expansion (4),

P n { D F ( P n * ) − D ˆ F ( P n * ) } = ( P n − P 0 ) { D F ( P n * ) − D ˆ F ( P n * ) } + ( P 0 − P n * ) { D F ( P n * ) − D ˆ F ( P n * ) } .

To maintain the asymptotic efficiency achieved under the assumptions listed in Section 5.3, (1) the first part converges under the similar Donsker condition, and (2) with the known HAL error rate [25] of ‖ D X * ( P ) − D ˆ X * ( P ) ‖ P = O P ( N − 1 ⁄ 3 ( log N ) d ⁄ 2 ) , where ‖ g ( O ) ‖ P = ⟨ g , g ⟩ P = P g 2 = ∫ g ( o ) 2 p ( o ) d μ is the L 2 ( P ) norm, the second part (assuming the L 2 ( μ ) norm below is applied element-wise for vectors) by the Cauchy Schwarz inequality

‖ ( P 0 − P n * ) { D F ( P n * ) − D ˆ F ( P n * ) } ‖ ≤ ‖ ‖ p 0 − p n * ‖ μ ⋅ ‖ D F ( P n * ) − D ˆ F ( P n * ) ‖ μ ‖ ,

only requires (B5), that is, either ‖ p 0 − p n * ‖ P 0 = o P ( n − 1 ⁄ 6 ) when we select N = n , or requires no additional condition when N ( n ) increases at a faster rate than n 3 ⁄ 2 , under the assumptions that p 0 ( o ) > δ > 0 and p n * ( o ) > δ > 0 over the supports for some δ > 0 .

In practice, we can simulate with N > n to further improve the finite sample performance, and HAL can be included as one of the estimators in the super learner for P n 0 so that ‖ p 0 − p n * ‖ P 0 = O P ( n − 1 ⁄ 3 ( log n ) d ⁄ 2 ) is guaranteed.

E Multiple robustness and exact remainders

First, focus on one of the dimensions Ψ ( P ) = Ψ s a ¯ , a ¯ ′ ( P ) and its corresponding exact remainder as R ( P , P 0 ) = R s a ¯ , a ¯ ′ ( P , P 0 ) . Then by definition of Section 5.3,

R ( P , P 0 ) = Ψ ( P ) − Ψ ( P 0 ) + P 0 D ( P ) = P Q R 1 ( P ) − P 0 Q R 1 ( P 0 ) + ∑ X ∈ R ¯ ∪ Z ¯ ∪ L ¯ P 0 D X ( P ) .

Define the following generalized propensity terms:

π A k = ∏ t = 1 k p A ( A t ∣ Pa ( A t ) ) , π 0 , A k = ∏ t = 1 k p 0 , A ( A t ∣ Pa ( A t ) ) π R k = ∏ t = 1 k p R ( R t ∣ Pa ( R t ) ) , π 0 , R k = ∏ t = 1 k p 0 , R ( R t ∣ Pa ( R t ) ) π Z k = ∏ t = 1 k p Z ( Z t ∣ Pa ( Z t ) ) , π 0 , Z k = ∏ t = 1 k p 0 , Z ( Z t ∣ Pa ( Z t ) ) π L k = ∏ t = 1 k p L ( L t ∣ Pa ( L t ) ) , π 0 , L k = ∏ t = 1 k p 0 , L ( L t ∣ Pa ( L t ) ) π A * k , a ¯ = ∏ t = 1 k I A t = a t π A * k , a ¯ ′ = ∏ t = 1 k I A t = a t ′ π R * k = ∏ t = 1 k p R ( R t ∣ Pa ( R t ∣ a ¯ t ) ) , π 0 , R * k = ∏ t = 1 k p 0 , R ( R t ∣ Pa ( R t ∣ a ¯ t ) ) π Z * k = ∏ t = 1 k p Z ( Z t ∣ Pa ( Z t ∣ a ¯ t ′ ) ) , π 0 , Z * k = ∏ t = 1 k p 0 , Z ( Z t ∣ Pa ( Z t ∣ a ¯ t ′ ) ) π L * k = ∏ t = 1 k p L ( L t ∣ Pa ( L t ∣ a ¯ t ) ) , π 0 , L * k = ∏ t = 1 k p 0 , L ( L t ∣ Pa ( L t ∣ a ¯ t ) ) ,

then

D L t ( P ) = π A * t , a ¯ π Z * t π A t π Z t ( Q R t + 1 − Q L t ) D Z t ( P ) = π A * t , a ¯ ′ π R * t π L * , t − 1 π A t π R t π L t − 1 ( Q L t − Q Z t ) D R t ( P ) = π A * t , a ¯ π Z * , t − 1 π A t π Z t − 1 ( Q Z t − Q R t ) .

Plug in to the exact remainder above (let Q X = Q X ( P ) when the dependence is not specified), and note that

P 0 Q R 1 ( P 0 ) = P 0 G L ( P 0 ) = P 0 π A * K , a ¯ π 0 , Z * K π 0 , A K π 0 , Z K Y P 0 D R 1 ( P ) = P 0 ( Q Z 1 ( P ) − Q R 1 ( P ) ) P 0 Q R 1 ( P 0 ) − P 0 Q R 1 ( P ) = P 0 π A * K , a ¯ π 0 , Z * K π 0 , A K π 0 , Z K ( Y − Q L K ( P ) ) + P 0 π A * K , a ¯ ′ π 0 , R * K π 0 , L * , K − 1 π 0 , A K π 0 , R K π 0 , L K − 1 ( Q L K ( P ) − Q Z K ( P ) ) + … + P 0 ( Q Z 1 ( P ) − Q R 1 ( P ) ) ,

and therefore (still let Q X = Q X ( P ) for clarity)

R ( P , P 0 ) = − P 0 Q R 1 ( P 0 ) + P 0 Q R 1 ( P ) + ∑ t = 1 K P 0 π A * t , a ¯ π Z * t π A t π Z t ( Q R t + 1 − Q L t ) + P 0 π A * t , a ¯ ′ π R * t π L * , t − 1 π A t π R t π L t − 1 ( Q L t − Q Z t ) + P 0 π A * t , a ¯ π Z * , t − 1 π A t π Z t − 1 ( Q Z t − Q R t ) = ∑ t = 1 K P 0 π A * t , a ¯ π Z * t π A t π Z t − π A * t , a ¯ π 0 , Z * t π 0 , A t π 0 , Z t ( Q R t + 1 − Q L t ) + P 0 π A * t , a ¯ ′ π R * t π L * , t − 1 π A t π R t π L t − 1 − π A * t , a ¯ ′ π 0 , R * t π 0 , L * , t − 1 π 0 , A t π 0 , R t π 0 , L t − 1 ( Q L t − Q Z t ) + P 0 π A * t , a ¯ π 0 , Z * , t − 1 π A t π Z t − 1 − π A * t , a ¯ π Z * , t − 1 π 0 , A t π 0 , Z t − 1 ( Q Z t − Q R t ) .

Due to the sequential definition of { Q X : X ∈ R ¯ , Z ¯ , L ¯ } as functions of ( p R , p Z , p L ) , one can check that p R t = p 0 , R t leads to P 0 { Q Z t − Q R t } = 0 , p Z t = p 0 , Z t leads to P 0 { Q L t − Q Z t } = 0 , and p L t = p 0 , L t leads to P 0 { Q R t + 1 − Q L t } = 0 . Under positivity assumptions, this proves the statement that under one of the following three scenarios we have R ( P , P 0 ) = 0 : (1) p A = p 0 , A and p Z = p 0 , Z , (2) p A = p 0 , A and p R = p 0 , R , p L = p 0 , L , or (3) p Z = p 0 , Z and p R = p 0 , R , p L = p 0 , L .

Furthermore, under strong positivity and bounded variation norm assumptions as specified in (B4) and (B5), Cauchy-Schwarz inequality applies such that the aforementioned conditions are relaxed such that only ‖ p X − p 0 , X ‖ P = o P ( n − 1 ⁄ 4 ) is required for (1) X ∈ A ¯ ∪ Z ¯ , (2) X ∈ A ¯ ∪ R ¯ ∪ L ¯ , or (3) X ∈ Z ¯ ∪ R ¯ ∪ L ¯ . This proves the multiple robustness statement in Section 7.

F Numerical results

Tables A5, A6, A7, A8.

Table A5

Misspecification: None, positivity scaling factor λ = 2

E [ Y 1 ( 1 ¯ , Γ 1 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	‒ 0.0000	0.0251	0.0006	0.9394	0.0988
HAL-EIC Initial	‒ 0.0001	0.0251	0.0006	0.9414	0.0978
Exact-EIC TMLE	‒ 0.0001	0.0252	0.0006	0.9455	0.0988
HAL-EIC TMLE	‒ 0.0001	0.0251	0.0006	0.9414	0.0978

E [ Y 2 ( 1 ¯ , Γ 1 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	‒ 0.0001	0.0254	0.0006	0.9697	0.1142
HAL-EIC Initial	‒ 0.0002	0.0254	0.0006	0.9404	0.0988
Exact-EIC TMLE	0.0004	0.0301	0.0009	0.9273	0.1129
HAL-EIC TMLE	‒ 0.0002	0.0254	0.0006	0.9404	0.0988

E [ Y 1 ( 1 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0001	0.0254	0.0006	0.9424	0.1010
HAL-EIC Initial	0.0001	0.0254	0.0006	0.9424	0.0990
Exact-EIC TMLE	‒ 0.0000	0.0256	0.0007	0.9414	0.1009
HAL-EIC TMLE	0.0001	0.0254	0.0006	0.9424	0.0990

E [ Y 2 ( 1 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	‒ 0.0000	0.0260	0.0007	0.9768	0.1209
HAL-EIC Initial	‒ 0.0000	0.0260	0.0007	0.9485	0.1036
Exact-EIC TMLE	0.0006	0.0330	0.0011	0.9253	0.1185
HAL-EIC TMLE	‒ 0.0000	0.0260	0.0007	0.9485	0.1036

E [ Y 1 ( 0 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0032	0.0296	0.0009	0.9677	0.1274
HAL-EIC Initial	0.0032	0.0296	0.0009	0.9515	0.1171
Exact-EIC TMLE	0.0035	0.0321	0.0010	0.9475	0.1270
HAL-EIC TMLE	0.0032	0.0296	0.0009	0.9515	0.1171

E [ Y 2 ( 0 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0007	0.0312	0.0010	0.9869	0.1896
HAL-EIC Initial	0.0008	0.0312	0.0010	0.9323	0.1227
Exact-EIC TMLE	0.0004	0.0528	0.0028	0.7869	0.1613
HAL-EIC TMLE	0.0008	0.0312	0.0010	0.9323	0.1227

Table A6

Misspecification: None, positivity scaling factor λ = 3

E [ Y 1 ( 1 ¯ , Γ 1 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0010	0.0239	0.0006	0.9391	0.0948
HAL-EIC Initial	0.0012	0.0240	0.0006	0.9359	0.0930
Exact-EIC TMLE	0.0008	0.0243	0.0006	0.9391	0.0948
HAL-EIC TMLE	0.0012	0.0240	0.0006	0.9359	0.0930

E [ Y 2 ( 1 ¯ , Γ 1 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0012	0.0235	0.0006	0.9695	0.1100
HAL-EIC Initial	0.0013	0.0235	0.0006	0.9475	0.0931
Exact-EIC TMLE	0.0015	0.0289	0.0008	0.9233	0.1077
HAL-EIC TMLE	0.0013	0.0235	0.0006	0.9475	0.0931

E [ Y 1 ( 1 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0011	0.0242	0.0006	0.9370	0.0970
HAL-EIC Initial	0.0013	0.0243	0.0006	0.9296	0.0947
Exact-EIC TMLE	0.0007	0.0247	0.0006	0.9370	0.0971
HAL-EIC TMLE	0.0013	0.0243	0.0006	0.9296	0.0947

E [ Y 2 ( 1 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0011	0.0239	0.0006	0.9737	0.1183
HAL-EIC Initial	0.0011	0.0240	0.0006	0.9559	0.1049
Exact-EIC TMLE	‒ 0.0008	0.0346	0.0012	0.8897	0.1172
HAL-EIC TMLE	0.0011	0.0240	0.0006	0.9559	0.1049

E [ Y 1 ( 0 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0023	0.0343	0.0012	0.9800	0.1650
HAL-EIC Initial	0.0023	0.0344	0.0012	0.9443	0.1334
Exact-EIC TMLE	0.0010	0.0430	0.0019	0.9349	0.1624
HAL-EIC TMLE	0.0023	0.0344	0.0012	0.9443	0.1334

E [ Y 2 ( 0 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0004	0.0354	0.0013	0.9706	0.2595
HAL-EIC Initial	0.0004	0.0354	0.0013	0.8845	0.1411
Exact-EIC TMLE	0.0028	0.0802	0.0064	0.6113	0.1732
HAL-EIC TMLE	0.0004	0.0354	0.0013	0.8845	0.1411

Table A7

Misspecification: None, positivity scaling factor λ = 4

E [ Y 1 ( 1 ¯ , Γ 1 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0009	0.0230	0.0005	0.9619	0.0927
HAL-EIC Initial	0.0009	0.0231	0.0005	0.9499	0.0902
Exact-EIC TMLE	0.0008	0.0236	0.0006	0.9459	0.0927
HAL-EIC TMLE	0.0009	0.0231	0.0005	0.9499	0.0902

E [ Y 2 ( 1 ¯ , Γ 1 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0008	0.0227	0.0005	0.9709	0.1094
HAL-EIC Initial	0.0008	0.0228	0.0005	0.9379	0.0894
Exact-EIC TMLE	0.0008	0.0294	0.0009	0.8998	0.1054
HAL-EIC TMLE	0.0008	0.0228	0.0005	0.9379	0.0894

E [ Y 1 ( 1 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0009	0.0234	0.0005	0.9619	0.0952
HAL-EIC Initial	0.0009	0.0234	0.0005	0.9519	0.0928
Exact-EIC TMLE	0.0005	0.0249	0.0006	0.9439	0.0963
HAL-EIC TMLE	0.0009	0.0234	0.0005	0.9519	0.0928

E [ Y 2 ( 1 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0007	0.0232	0.0005	0.9749	0.1221
HAL-EIC Initial	0.0006	0.0233	0.0005	0.9449	0.1082
Exact-EIC TMLE	‒ 0.0043	0.0392	0.0016	0.8347	0.1473
HAL-EIC TMLE	0.0006	0.0233	0.0005	0.9449	0.1082

E [ Y 1 ( 0 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0025	0.0369	0.0014	0.9950	0.2265
HAL-EIC Initial	0.0025	0.0369	0.0014	0.9629	0.1506
Exact-EIC TMLE	0.0028	0.0605	0.0037	0.8707	0.2123
HAL-EIC TMLE	0.0025	0.0369	0.0014	0.9629	0.1506

E [ Y 2 ( 0 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0021	0.0393	0.0015	0.9509	0.3478
HAL-EIC Initial	0.0020	0.0393	0.0015	0.8307	0.1562
Exact-EIC TMLE	0.0105	0.1101	0.0122	0.4559	0.2778
HAL-EIC TMLE	0.0020	0.0393	0.0015	0.8307	0.1562

Table A8

Misspecification: None, positivity scaling factor λ = 5

E [ Y 1 ( 1 ¯ , Γ 1 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0013	0.0226	0.0005	0.9604	0.0919
HAL-EIC Initial	0.0013	0.0228	0.0005	0.9451	0.0886
Exact-EIC TMLE	0.0010	0.0234	0.0005	0.9502	0.0918
HAL-EIC TMLE	0.0013	0.0228	0.0005	0.9451	0.0886

E [ Y 2 ( 1 ¯ , Γ 1 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0010	0.0222	0.0005	0.9787	0.1111
HAL-EIC Initial	0.0010	0.0223	0.0005	0.9360	0.0871
Exact-EIC TMLE	0.0010	0.0303	0.0009	0.8770	0.1054
HAL-EIC TMLE	0.0010	0.0223	0.0005	0.9360	0.0871

E [ Y 1 ( 1 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0014	0.0229	0.0005	0.9644	0.0945
HAL-EIC Initial	0.0013	0.0231	0.0005	0.9543	0.0925
Exact-EIC TMLE	0.0004	0.0273	0.0007	0.9360	0.0970
HAL-EIC TMLE	0.0013	0.0231	0.0005	0.9543	0.0925

E [ Y 2 ( 1 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0009	0.0226	0.0005	0.9817	0.1211
HAL-EIC Initial	0.0009	0.0227	0.0005	0.9502	0.1141
Exact-EIC TMLE	‒ 0.0089	0.0451	0.0021	0.7764	0.1370
HAL-EIC TMLE	0.0009	0.0227	0.0005	0.9502	0.1141

E [ Y 1 ( 0 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0016	0.0413	0.0017	0.9919	0.2520
HAL-EIC Initial	0.0015	0.0414	0.0017	0.9512	0.1623
Exact-EIC TMLE	0.0017	0.0897	0.0080	0.7083	0.2234
HAL-EIC TMLE	0.0015	0.0414	0.0017	0.9512	0.1623

E [ Y 2 ( 0 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width
Exact-EIC Initial	0.0019	0.0421	0.0018	0.9350	0.3276
HAL-EIC Initial	0.0019	0.0421	0.0018	0.7683	0.1672
Exact-EIC TMLE	0.0077	0.1348	0.0182	0.3780	0.2262
HAL-EIC TMLE	0.0019	0.0421	0.0018	0.7683	0.1672

G Additional simulations

We extend the simulation setup of Section 8.1 to more time points and continuous time-varying covariates ( R c t ). The data-generating process is adjusted accordingly to achieve realistic probabilities of having uncensored and survival events and balanced treatment allocation at later time points. Specifically, the probability of being uncensored ( A t C = 1 ) or treated ( A t E = 1 ) is increased, while the probability of death ( Y = 1 ) is reduced.

L 01 ∼ Bernoulli ( 0.4 ) L 02 ∼ Bernoulli ( 0.6 ) A t C ∼ Bernoulli ( expit ( 2 ⋅ ( 1.5 − 0.4 L 01 − 0.8 L 02 + I { t > 1 } 0.5 A t − 1 E ) ) ) A t E ∼ Bernoulli ( expit ( ( − 0.05 + 0.35 L 01 + 0.6 L 02 − I { t > 1 } 0.05 A t − 1 E ) ) ) R t ∼ Bernoulli ( expit ( − 0.8 + 0.1 L 01 + I { t = 1 } 0.3 L 02 + I { t > 1 } 0.3 R t − 1 + A t E ) ) R c t ∼ N ( 0.2 ⋅ ( I { t = 1 } 0.3 L 02 + I { t > 1 } 0.3 R c t − 1 + A t E ) , 1 ) Z t ∼ Bernoulli ( expit ( − 0.25 + 0.4 L 02 + 0.4 A t E + 0.5 R t + 0.5 R c t ) ) Y t ∼ Bernoulli ( expit ( − 3 + 2 ⋅ ( 0.05 + 0.375 L 02 + 0.25 R t + 0.25 R c t − 0.075 A t E − 0.075 Z t − I { t > 1 } 0.025 R t − 1 ) ) ) .

In total, we generate data up to t = 5 , and we target fifteen parameters for outcome variables Y ¯ = ( Y 1 , … , Y 5 ) and interventions ( 1 ¯ , Γ 1 ¯ ) , ( 1 ¯ , Γ 0 ¯ ) , ( 1 ¯ , Γ 1 ¯ ) . In this simulation, we also assume that the outcome at time point t only depends on the most recent k visits ( k = 2 ). This constructs the summary covariates (Section 6.2) as the observations within the recent visits. The results validate similar multiple robustness properties as Section 8.1 (Tables A9, A10, A11, A12).

Remark

The final plug-in estimation of the target parameters Ψ ( P ) through g-computation still involves integral terms as (2) even though the computation of other similar integrals (the Q functionals discussed in Section 3.4) have been avoided in the updating steps. This final integral can be conducted fully numerically by Monte Carlo integration, and on the finite sample, one can choose arbitrarily large Monte Carlo sample sizes until the plug-in estimation stabilizes. In simulations, to control computational resources, we limit the Monte Carlo sample size to 8,000 and use bootstrap SE to construct Wald-type confidence intervals, which are then compared to the oracle confidence intervals constructed by the SD of the repeatedly generated estimations. This validates the asymptotic normality for multivariate parameter estimation. Nonetheless, in practice, a larger Monte Carlo sample size can be chosen to avoid the need of bootstrapping to control the independent approximation error in g-computation.

Table A9

Total time points: 5. Misspecification: None

E [ Y 1 ( 1 ¯ , Γ 1 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	0.0019	0.0177	0.0003	0.9300	0.0646	0.9300	0.0695
HAL-EIC TMLE	0.0018	0.0177	0.0003	0.9300	0.0645	0.9300	0.0693

E [ Y 2 ( 1 ¯ , Γ 1 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	0.0005	0.0270	0.0007	0.9700	0.1081	0.9500	0.1057
HAL-EIC TMLE	0.0004	0.0269	0.0007	0.9700	0.1077	0.9500	0.1053

E [ Y 3 ( 1 ¯ , Γ 1 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	0.0022	0.0350	0.0012	0.9700	0.1541	0.9400	0.1370
HAL-EIC TMLE	0.0019	0.0347	0.0012	0.9800	0.1530	0.9400	0.1359

E [ Y 4 ( 1 ¯ , Γ 1 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	0.0036	0.0499	0.0025	0.9700	0.2261	0.9500	0.1956
HAL-EIC TMLE	0.0033	0.0493	0.0024	0.9700	0.2237	0.9500	0.1934

E [ Y 5 ( 1 ¯ , Γ 1 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	0.0064	0.0608	0.0037	0.9800	0.3417	0.9500	0.2383
HAL-EIC TMLE	0.0059	0.0600	0.0036	0.9800	0.3368	0.9500	0.2352

E [ Y 1 ( 1 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	0.0015	0.0175	0.0003	0.9400	0.0640	0.9500	0.0686
HAL-EIC TMLE	0.0015	0.0175	0.0003	0.9400	0.0639	0.9500	0.0685

E [ Y 2 ( 1 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	0.0008	0.0263	0.0007	0.9500	0.1046	0.9600	0.1030
HAL-EIC TMLE	0.0007	0.0262	0.0007	0.9500	0.1042	0.9600	0.1027

E [ Y 3 ( 1 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	0.0008	0.0309	0.0009	0.9700	0.1457	0.9500	0.1211
HAL-EIC TMLE	0.0006	0.0306	0.0009	0.9700	0.1447	0.9500	0.1201

E [ Y 4 ( 1 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	0.0021	0.0445	0.0020	0.9600	0.2094	0.9700	0.1746
HAL-EIC TMLE	0.0017	0.0441	0.0019	0.9600	0.2072	0.9700	0.1727

E [ Y 5 ( 1 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	0.0049	0.0557	0.0031	0.9900	0.3074	0.9600	0.2183
HAL-EIC TMLE	0.0044	0.0550	0.0030	0.9900	0.3030	0.9600	0.2156

E [ Y 1 ( 0 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	‒ 0.0009	0.0184	0.0003	0.9400	0.0777	0.9300	0.0722
HAL-EIC TMLE	‒ 0.0010	0.0184	0.0003	0.9300	0.0775	0.9300	0.0722

E [ Y 2 ( 0 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	‒ 0.0055	0.0344	0.0012	0.9300	0.1321	0.9400	0.1349
HAL-EIC TMLE	‒ 0.0057	0.0343	0.0012	0.9300	0.1315	0.9400	0.1345

E [ Y 3 ( 0 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	‒ 0.0059	0.0476	0.0023	0.9000	0.1802	0.9300	0.1864
HAL-EIC TMLE	‒ 0.0062	0.0473	0.0022	0.9000	0.1787	0.9200	0.1853

E [ Y 4 ( 0 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	‒ 0.0053	0.0597	0.0036	0.9500	0.2539	0.9300	0.2338
HAL-EIC TMLE	‒ 0.0058	0.0590	0.0035	0.9400	0.2503	0.9300	0.2312

E [ Y 5 ( 0 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	‒ 0.0088	0.0777	0.0060	0.9800	0.3764	0.9500	0.3044
HAL-EIC TMLE	‒ 0.0093	0.0767	0.0059	0.9800	0.3686	0.9500	0.3006

Table A10

Total time points: 5. Misspecification: A

E [ Y 1 ( 1 ¯ , Γ 1 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	0.0001	0.0163	0.0003	0.9100	0.0661	0.9100	0.0638
HAL-EIC TMLE	0.0000	0.0162	0.0003	0.9100	0.0659	0.9200	0.0637

E [ Y 2 ( 1 ¯ , Γ 1 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	0.0039	0.0236	0.0006	0.9900	0.1087	0.9600	0.0926
HAL-EIC TMLE	0.0038	0.0235	0.0006	0.9900	0.1083	0.9600	0.0923

E [ Y 3 ( 1 ¯ , Γ 1 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	‒ 0.0000	0.0318	0.0010	0.9800	0.1512	0.9400	0.1245
HAL-EIC TMLE	‒ 0.0003	0.0315	0.0010	0.9800	0.1502	0.9400	0.1237

E [ Y 4 ( 1 ¯ , Γ 1 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	‒ 0.0019	0.0496	0.0024	0.9700	0.2263	0.9700	0.1942
HAL-EIC TMLE	‒ 0.0022	0.0490	0.0024	0.9700	0.2239	0.9700	0.1922

E [ Y 5 ( 1 ¯ , Γ 1 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	‒ 0.0040	0.0749	0.0056	0.9800	0.3593	0.9600	0.2936
HAL-EIC TMLE	‒ 0.0044	0.0738	0.0054	0.9800	0.3542	0.9600	0.2893

E [ Y 1 ( 1 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	0.0004	0.0160	0.0003	0.9400	0.0648	0.9400	0.0626
HAL-EIC TMLE	0.0003	0.0159	0.0003	0.9400	0.0646	0.9400	0.0624

E [ Y 2 ( 1 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	0.0042	0.0231	0.0005	0.9800	0.1058	0.9600	0.0906
HAL-EIC TMLE	0.0040	0.0230	0.0005	0.9800	0.1054	0.9600	0.0903

E [ Y 3 ( 1 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	‒ 0.0002	0.0310	0.0009	0.9700	0.1414	0.9500	0.1215
HAL-EIC TMLE	‒ 0.0004	0.0308	0.0009	0.9700	0.1404	0.9500	0.1207

E [ Y 4 ( 1 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	‒ 0.0018	0.0475	0.0022	0.9600	0.2072	0.9500	0.1861
HAL-EIC TMLE	‒ 0.0021	0.0470	0.0022	0.9500	0.2051	0.9500	0.1843

E [ Y 5 ( 1 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	‒ 0.0031	0.0662	0.0043	0.9800	0.3199	0.9300	0.2594
HAL-EIC TMLE	‒ 0.0034	0.0652	0.0042	0.9800	0.3153	0.9300	0.2558

E [ Y 1 ( 0 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	‒ 0.0008	0.0196	0.0004	0.9700	0.0771	0.9600	0.0769
HAL-EIC TMLE	‒ 0.0009	0.0196	0.0004	0.9700	0.0769	0.9600	0.0768

E [ Y 2 ( 0 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	‒ 0.0012	0.0348	0.0012	0.9100	0.1291	0.9600	0.1366
HAL-EIC TMLE	‒ 0.0015	0.0347	0.0012	0.9100	0.1287	0.9500	0.1361

E [ Y 3 ( 0 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	‒ 0.0033	0.0529	0.0028	0.8600	0.1804	0.9400	0.2072
HAL-EIC TMLE	‒ 0.0036	0.0526	0.0027	0.8600	0.1791	0.9400	0.2061

E [ Y 4 ( 0 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	‒ 0.0053	0.0637	0.0040	0.9400	0.2490	0.9400	0.2496
HAL-EIC TMLE	‒ 0.0057	0.0630	0.0040	0.9400	0.2457	0.9400	0.2470

E [ Y 5 ( 0 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	‒ 0.0174	0.0774	0.0062	0.9500	0.3708	0.9300	0.3034
HAL-EIC TMLE	‒ 0.0176	0.0761	0.0060	0.9500	0.3635	0.9300	0.2985

Table A11

Total time points: 5. Misspecification: Z

E [ Y 1 ( 1 ¯ , Γ 1 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	‒ 0.0021	0.0172	0.0003	0.9400	0.0625	0.9300	0.0674
HAL-EIC TMLE	‒ 0.0016	0.0172	0.0003	0.9600	0.0630	0.9200	0.0674

E [ Y 2 ( 1 ¯ , Γ 1 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	‒ 0.0046	0.0253	0.0007	0.9700	0.1023	0.9600	0.0993
HAL-EIC TMLE	‒ 0.0042	0.0256	0.0007	0.9800	0.1042	0.9600	0.1004

E [ Y 3 ( 1 ¯ , Γ 1 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	‒ 0.0015	0.0330	0.0011	0.9800	0.1413	0.9600	0.1295
HAL-EIC TMLE	0.0009	0.0337	0.0011	0.9700	0.1467	0.9400	0.1319

E [ Y 4 ( 1 ¯ , Γ 1 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	‒ 0.0036	0.0436	0.0019	0.9800	0.1958	0.9500	0.1711
HAL-EIC TMLE	‒ 0.0028	0.0429	0.0018	0.9800	0.2046	0.9500	0.1681

E [ Y 5 ( 1 ¯ , Γ 1 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	‒ 0.0056	0.0613	0.0038	0.9600	0.2902	0.9700	0.2405
HAL-EIC TMLE	‒ 0.0078	0.0639	0.0041	0.9800	0.3129	0.9500	0.2504

E [ Y 1 ( 1 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	‒ 0.0011	0.0172	0.0003	0.9400	0.0625	0.9300	0.0674
HAL-EIC TMLE	‒ 0.0010	0.0171	0.0003	0.9200	0.0622	0.9400	0.0672

E [ Y 2 ( 1 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	‒ 0.0027	0.0253	0.0006	0.9700	0.1023	0.9700	0.0993
HAL-EIC TMLE	‒ 0.0025	0.0243	0.0006	0.9700	0.1021	0.9500	0.0954

E [ Y 3 ( 1 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	0.0011	0.0330	0.0011	0.9800	0.1413	0.9600	0.1295
HAL-EIC TMLE	0.0006	0.0310	0.0009	0.9700	0.1413	0.9500	0.1214

E [ Y 4 ( 1 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	‒ 0.0005	0.0436	0.0019	0.9800	0.1958	0.9500	0.1711
HAL-EIC TMLE	‒ 0.0015	0.0430	0.0018	0.9700	0.1939	0.9700	0.1687

E [ Y 5 ( 1 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	‒ 0.0021	0.0613	0.0037	0.9700	0.2902	0.9700	0.2405
HAL-EIC TMLE	‒ 0.0043	0.0586	0.0034	0.9900	0.2911	0.9600	0.2297

E [ Y 1 ( 0 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	‒ 0.0018	0.0200	0.0004	0.9600	0.0786	0.9700	0.0783
HAL-EIC TMLE	‒ 0.0020	0.0196	0.0004	0.9700	0.0777	0.9600	0.0768

E [ Y 2 ( 0 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	‒ 0.0066	0.0345	0.0012	0.9000	0.1314	0.9500	0.1354
HAL-EIC TMLE	‒ 0.0064	0.0342	0.0012	0.9100	0.1300	0.9200	0.1340

E [ Y 3 ( 0 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	‒ 0.0019	0.0479	0.0023	0.9000	0.1833	0.9500	0.1877
HAL-EIC TMLE	‒ 0.0027	0.0466	0.0022	0.9000	0.1778	0.9400	0.1827

E [ Y 4 ( 0 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	‒ 0.0079	0.0598	0.0036	0.9800	0.2524	0.9700	0.2345
HAL-EIC TMLE	‒ 0.0087	0.0560	0.0032	0.9600	0.2412	0.9500	0.2194

E [ Y 5 ( 0 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	‒ 0.0117	0.0896	0.0081	0.9200	0.3619	0.9400	0.3514
HAL-EIC TMLE	‒ 0.0162	0.0818	0.0069	0.9600	0.3393	0.9500	0.3207

Table A12

Total time points: 5. Misspecification: Y

E [ Y 1 ( 1 ¯ , Γ 1 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	0.0523	0.0161	0.0030	0.1099	0.0584	0.1099	0.0633
HAL-EIC TMLE	0.0014	0.0178	0.0003	0.9560	0.0633	0.9341	0.0699

E [ Y 2 ( 1 ¯ , Γ 1 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	0.0961	0.0293	0.0101	0.0659	0.1051	0.0879	0.1149
HAL-EIC TMLE	0.0010	0.0288	0.0008	0.9451	0.1112	0.9560	0.1128

E [ Y 3 ( 1 ¯ , Γ 1 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	0.1417	0.0439	0.0220	0.1209	0.1651	0.1209	0.1722
HAL-EIC TMLE	0.0038	0.0365	0.0013	0.9670	0.1582	0.9451	0.1432

E [ Y 4 ( 1 ¯ , Γ 1 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	0.1816	0.0624	0.0368	0.2088	0.2731	0.1758	0.2446
HAL-EIC TMLE	0.0051	0.0511	0.0026	0.9670	0.2275	0.9341	0.2003

E [ Y 5 ( 1 ¯ , Γ 1 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	0.2221	0.0834	0.0562	0.4945	0.4477	0.2747	0.3270
HAL-EIC TMLE	0.0100	0.0634	0.0041	0.9780	0.3372	0.9451	0.2486

E [ Y 1 ( 1 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	0.0527	0.0156	0.0030	0.0989	0.0567	0.0879	0.0613
HAL-EIC TMLE	0.0015	0.0175	0.0003	0.9341	0.0628	0.9451	0.0686

E [ Y 2 ( 1 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	0.0977	0.0271	0.0103	0.0330	0.0997	0.0220	0.1064
HAL-EIC TMLE	0.0022	0.0277	0.0008	0.9341	0.1080	0.9560	0.1085

E [ Y 3 ( 1 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	0.1431	0.0366	0.0218	0.0440	0.1506	0.0220	0.1436
HAL-EIC TMLE	0.0044	0.0333	0.0011	0.9560	0.1509	0.9451	0.1305

E [ Y 4 ( 1 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	0.1835	0.0527	0.0364	0.1319	0.2486	0.0879	0.2064
HAL-EIC TMLE	0.0066	0.0460	0.0021	0.9560	0.2165	0.9560	0.1804

E [ Y 5 ( 1 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	0.2226	0.0727	0.0548	0.3956	0.3931	0.1648	0.2850
HAL-EIC TMLE	0.0114	0.0586	0.0035	0.9780	0.3058	0.9560	0.2298
E [ Y 1 ( 0 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	0.0553	0.0155	0.0033	0.0769	0.0574	0.0549	0.0606
HAL-EIC TMLE	‒ 0.0009	0.0176	0.0003	0.9451	0.0764	0.9341	0.0689
E [ Y 2 ( 0 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	0.1040	0.0287	0.0116	0.0659	0.1058	0.0440	0.1125
HAL-EIC TMLE	‒ 0.0021	0.0330	0.0011	0.9560	0.1347	0.9451	0.1295
E [ Y 3 ( 0 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	0.1514	0.0429	0.0248	0.0659	0.1610	0.0769	0.1684
HAL-EIC TMLE	‒ 0.0001	0.0458	0.0021	0.9451	0.1793	0.9231	0.1796
E [ Y 4 ( 0 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	0.1961	0.0616	0.0422	0.1758	0.2736	0.0989	0.2416
HAL-EIC TMLE	0.0077	0.0559	0.0031	0.9780	0.2617	0.9341	0.2191
E [ Y 5 ( 0 ¯ , Γ 0 ¯ ) ]	Bias	SD	MSE	Coverage	Width	Oracle coverage	Oracle width
HAL-EIC Initial	0.2358	0.0912	0.0638	0.4615	0.4562	0.2747	0.3575
HAL-EIC TMLE	0.0249	0.0783	0.0067	0.9780	0.3794	0.9231	0.3071

H List of notations

Data and model

X ¯	the available history of X such as ( X t : t = 0 , … K )
X ¯ t	the available history of X till t such as ( X s : s = 0 , … t )
X ¯ s t	the available history of X from s to t (assuming s ≤ t )
x	a realization value in the range of a random variable X
Pa ( X )	the parent nodes prior to X given a variable ordering
Ch ( X )	the child nodes after X given a variable ordering
Pa ( X ∣ a ¯ )	the vector of parent nodes of X but intervening Pa ( X ) ∩ A ¯ following A ¯ = a ¯ ; for example, Pa ( R 1 ∣ a ¯ ) = ( L 0 , a 1 ) in Section 2
ℳ	statistical model, a collection of data distributions
P	data distribution
P n	empirical distribution of an i.i.d. sample O 1 , … , O n
P f	expectation P f = E P ( f ) = ∫ f ( o ) d P ( o ) of f ( O ) under P ; for example, P n f = 1 n ∑ i = 1 n f ( O i )
p	density p = d P ⁄ d μ of P with respect to some measure μ
p X ( X ∣ Pa ( X ) )	conditional density of variable X given Pa ( X )
P 0	true data distribution
X ( a ¯ )	the counterfactual of X under intervention A ¯ = a ¯ given a structural causal model
Γ t a ¯ ′	conditional density of Z t ( a ¯ ′ ) given R ¯ t ( a ¯ ′ ) , Z ¯ t − 1 ( a ¯ ′ ) , L ¯ t − 1 ( a ¯ ′ )
X ( a ¯ , Γ ¯ a ¯ ′ )	the counterfactual of X under intervention A ¯ = a ¯ and Z t ∼ Γ t a ¯ ′ ( z t ∣ R ¯ t ( a ¯ , Γ a ¯ ′ ) , Z ¯ t − 1 ( a ¯ , Γ a ¯ ′ ) , L ¯ t − 1 ( a ¯ , Γ a ¯ ′ ) ) for t = 1 , … , K ;

Statistical estimation

Ψ	a general k dimensional target parameter ℳ → R k
Ψ s a ¯ , a ¯ ′ ( P )	E [ Y s ( a ¯ , Γ ¯ a ¯ ′ ) ] where Y s = ψ ( L ¯ K ) is the s th outcome; ψ is a function of L ¯ K
P a ¯ , a ¯ ′	the counterfactual of distribution P under intervention ( a ¯ , Γ ¯ a ¯ ′ )
Q s , X a ¯ , a ¯ ′ ( Pa ( X ) \ A ¯ )	E P a ¯ , a ¯ ′ [ Y s ∣ Pa ( X ) \ A ¯ ] , the conditional mean (following P a ¯ , a ¯ ′ ) of Y s given the parent nodes of X excluding A ¯
D s a ¯ , a ¯ ′ ( P )	the canonical gradient of Ψ s a ¯ , a ¯ ′ ( P )
D s , X a ¯ , a ¯ ′ ( P )	∏ ( D s a ¯ , a ¯ ′ ( P ) ∣ T X ( P ) ) the gradient component of X defined as the projection onto the tangent subspace T X ( P ) = { f ∈ L 2 ( P ) : E [ f ∣ Pa ( X ) ] = 0 }
p ˜ X ( p X , ε ¯ X )	a (multivariate) locally least favorable path through p X
P n 0	an initial distribution estimator for P 0
P ˜ n	a TMLE update of P n 0
P n *	the final TMLE update

References

[1] Vansteelandt S, Linder M, Vandenberghe S, Steen J, Madsen J. Mediation analysis of time-to-event endpoints accounting for repeatedly measured mediators subject to time-varying confounding. Stat Med. 2019;38(24):4828–40. 10.1002/sim.8336Suche in Google Scholar PubMed PubMed Central

[2] Buse JB, Bain SC, Mann JF, Nauck MA, Nissen SE, Pocock S, et al. Cardiovascular risk reduction with liraglutide: an exploratory mediation analysis of the LEADER trial. Diabetes Care. 2020;43(7):1546–52. 10.2337/dc19-2251Suche in Google Scholar PubMed PubMed Central

[3] Lai ET, Schlüter DK, Lange T, Straatmann V, Andersen AMN, Strandberg-Larsen K, et al. Understanding pathways to inequalities in child mental health: a counterfactual mediation analysis in two national birth cohorts in the UK and Denmark. BMJ Open. 2020;10(10):e040056. 10.1136/bmjopen-2020-040056Suche in Google Scholar PubMed PubMed Central

[4] Avin C, Shpitser I, Pearl J. Identifiability of path-specific effects. In Proceedings of International Joint Conference on Artificial Intelligence. Edinburgh, Schotland. 2005. Suche in Google Scholar

[5] VanderWeele TJ, Vansteelandt S. Conceptual issues concerning mediation, interventions and composition. Stat Interface. 2009;2(4):457–68. 10.4310/SII.2009.v2.n4.a7Suche in Google Scholar

[6] Tchetgen EJT, Shpitser I. Semiparametric theory for causal mediation analysis: efficiency bounds, multiple robustness, and sensitivity analysis. Ann Stat. 2012;40(3):1816. 10.1214/12-AOS990Suche in Google Scholar PubMed PubMed Central

[7] Zheng W, van der Laan MJ. Causal mediation in a survival setting with time-dependent mediators. U.C. Berkeley Division of Biostatistics Working Paper Series. 2012. Suche in Google Scholar

[8] Zheng W, van der Laan M. Longitudinal mediation analysis with time-varying mediators and exposures, with application to survival outcomes. J Causal Inference. 2017;5(2):20160006. 10.1515/jci-2016-0006Suche in Google Scholar PubMed PubMed Central

[9] VanderWeele TJ, Tchetgen EJT. Mediation analysis with time varying exposures and mediators. J R Stat Soc Ser B Stat Meth. 2017;79(3):917. 10.1111/rssb.12194Suche in Google Scholar PubMed PubMed Central

[10] Didelez V, Dawid A, Geneletti S. Direct and indirect effects of sequential treatments. In: 23rd Annual Conference on Uncertainty in Artifical Intelligence; 2006. Suche in Google Scholar

[11] Petersen ML, Sinisi SE, van der Laan MJ. Estimation of direct causal effects. Epidemiology. 2006;17(3):276–84. 10.1097/01.ede.0000208475.99429.2dSuche in Google Scholar PubMed

[12] van der Laan MJ, Petersen ML. Direct effect models. Int J Biostat. 2008;4(1). 10.2202/1557-4679.1064.Suche in Google Scholar PubMed

[13] Pearl J. Causal inference in statistics: An overview. Statistics Surveys. 2009;3:96–146. 10.1214/09-SS057Suche in Google Scholar

[14] VanderWeele TJ. A unification of mediation and interaction: a four-way decomposition. Epidemiology (Cambridge, Mass). 2014;25(5):749. 10.1097/EDE.0000000000000121Suche in Google Scholar PubMed PubMed Central

[15] VanderWeele TJ, Vansteelandt S, Robins JM. Effect decomposition in the presence of an exposure-induced mediator-outcome confounder. Epidemiology (Cambridge, Mass). 2014;25(2):300. 10.1097/EDE.0000000000000034Suche in Google Scholar PubMed PubMed Central

[16] Díaz I, Williams N, Rudolph KE. Efficient and flexible mediation analysis with time-varying mediators, treatments, and confounders. J Causal Inference. 2023;11(1):20220077. https://doi.org/10.1515/jci-2022-0077. Suche in Google Scholar

[17] Miles CH. On the causal interpretation of randomised interventional indirect effects. J R Stat Soc Ser B Stat Meth. 2023 June;85(4):1154–1172. https://doi.org/10.1093/jrsssb/qkad066. Suche in Google Scholar

[18] Robins JM, Richardson TS. Alternative graphical causal models and the identification of direct effects. In Causality and psychopathology: Finding the determinants of disorders and their cures. 2010. p. 103–58. 10.1093/oso/9780199754649.003.0011Suche in Google Scholar

[19] van der Laan MJ. Targeted maximum likelihood based causal inference: Part I. Int J Biostat. 2010;6(2). 10.2202/1557-4679.1211.Suche in Google Scholar PubMed PubMed Central

[20] van der Laan M. Targeted maximum likelihood based causal inference: Part II. Int J Biostat. 2010;6(2). 10.2202/1557-4679.1241.Suche in Google Scholar PubMed PubMed Central

[21] Robins J, Sued M, Lei-Gomez Q, Rotnitzky A. Comment: Performance of double-robust estimators when “inverse probability” weights are highly variable. Stat Sci. 2007;22(4):544–59. 10.1214/07-STS227DSuche in Google Scholar

[22] Bang H, Robins JM. Doubly robust estimation in missing data and causal inference models. Biometrics. 2005;61(4):962–73. 10.1111/j.1541-0420.2005.00377.xSuche in Google Scholar PubMed

[23] van der Laan MJ, Gruber S. Collaborative double robust targeted maximum likelihood estimation. Int J Biostat. 2010;6(1). 10.2202/1557-4679.1181.Suche in Google Scholar PubMed PubMed Central

[24] Benkeser D, van der Laan M. The highly adaptive lasso estimator. In: 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA). IEEE; 2016. p. 689–96. 10.1109/DSAA.2016.93Suche in Google Scholar PubMed PubMed Central

[25] Bibaut AF, van der Laan MJ. Fast rates for empirical risk minimization over c/adl/ag functions with bounded sectional variation norm. 2019. arXiv: https://arxiv.org/abs/1907.09244. Suche in Google Scholar

[26] van der Laan MJ, Rose S, van der Laan MJ, Benkeser D. Highly adaptive lasso (hal). In Targeted learning in data science: Causal inference for complex longitudinal studies. 2018. p. 77–94. 10.1007/978-3-319-65304-4_6Suche in Google Scholar

[27] van der Laan MJ, Rose S, van der Laan MJ. HAL estimator of the efficient influence curve. In Targeted learning in data science: Causal inference for complex longitudinal studies. 2018. p. 103–23. Suche in Google Scholar

[28] van der Laan M, Gruber S. One-step targeted minimum loss-based estimation based on universal least favorable one-dimensional submodels. Int J Biostat. 2016;12(1):351–78. 10.1515/ijb-2015-0054Suche in Google Scholar PubMed PubMed Central

[29] Dudoit S, van der Laan MJ. Multiple testing procedures with applications to genomics. Springer; 2008. 10.1007/978-0-387-49317-6Suche in Google Scholar

[30] Hejazi NS, Coyle JR, van der Laan MJ. hal9001: Scalable highly adaptive lasso regression in R. J Open Source Software. 2020;5(53):2526. 10.21105/joss.02526Suche in Google Scholar

[31] Tchetgen EJT. A commentary on G. Molenberghs’s review of missing data methods. Drug Inform J. 2009;43(4):433–5. 10.1177/009286150904300406Suche in Google Scholar

[32] Molina J, Rotnitzky A, Sued M, Robins J. Multiple robustness in factorized likelihood models. Biometrika. 2017;104(3):561–81. 10.1093/biomet/asx027Suche in Google Scholar PubMed PubMed Central

[33] Luedtke AR, Sofrygin O, van der Laan MJ, Carone M. Sequential double robustness in right-censored longitudinal models. 2017. arXiv: https://arXiv.org/abs/1705.02459. Suche in Google Scholar

[34] van der Laan MJ, Rose S. Targeted learning: causal inference for observational and experimental data. New York: Springer; 2011. 10.1007/978-1-4419-9782-1Suche in Google Scholar

[35] Andrews RM, Didelez V. Insights into the cross-world independence assumption of causal mediation analysis. Epidemiology. 2020;32(2):209–19. 10.1097/EDE.0000000000001313Suche in Google Scholar PubMed

[36] Andersen PK, Borgan O, Gill RD, Keiding N. Statistical models based on counting processes. Springer Series in Statistics. New York: Springer; 1993. 10.1007/978-1-4612-4348-9Suche in Google Scholar

[37] Benkeser D, Carone M, Gilbert PB. Improved estimation of the cumulative incidence of rare outcomes. Stat Med. 2018;37(2):280–93. 10.1002/sim.7337Suche in Google Scholar PubMed PubMed Central

[38] Rytgaard HC, van der Laan MJ. Targeted maximum likelihood estimation for causal inference in survival and competing risks analysis. Lifetime Data Analysis. 2022;30:4–33.10.1007/s10985-022-09576-2Suche in Google Scholar PubMed

[39] Cole SR, Frangakis CE. The consistency statement in causal inference: a definition or an assumption? Epidemiology. 2009;20(1):3–5. 10.1097/EDE.0b013e31818ef366Suche in Google Scholar PubMed

[40] Díaz I, Williams N, Hoffman KL, Schenck EJ. Nonparametric causal effects based on longitudinal modified treatment policies. J Amer Stat Assoc. 2023;118(542):846–57. 10.1080/01621459.2021.1955691Suche in Google Scholar

[41] van der Laan MJ, Rose S, van der Laan MJ, Chambaz A, Lendle S. Online targeted learning for time series. In Targeted learning in data science: Causal inference for complex longitudinal studies. Cham: Springer; 2018. p. 317–46. 10.1007/978-3-319-65304-4_19Suche in Google Scholar

[42] Cheng KF, Chu CK. Semiparametric density estimation under a two-sample density ratio model. Bernoulli. 2004;10(4):583–604. 10.3150/bj/1093265631Suche in Google Scholar

[43] Qin J. Inferences for case-control and semiparametric two-sample density ratio models. Biometrika. 1998;85(3):619–30. 10.1093/biomet/85.3.619Suche in Google Scholar

[44] Díaz Muñoz I, van der Laan MJ. Super learner based conditional density estimation with application to marginal structural models. Int J Biostat. 2011;7(1):0000102202155746791356. 10.2202/1557-4679.1356Suche in Google Scholar PubMed

[45] van der Laan MJ, Polley EC, Hubbard AE. Super learner. Stat Appl Genetics Mol Biol. 2007;6(1). 10.2202/1544-6115.1309.Suche in Google Scholar PubMed

[46] Van Der Laan MJ, Dudoit S. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: Finite sample oracle inequalities and examples. U.C. Berkeley Division of Biostatistics Working Paper Series. 2003. Suche in Google Scholar

[47] Hejazi NS, Benkeser DC, van der Laan MJ. Haldensify: Highly adaptive lasso conditional density estimation; 2022. J Open Source Softw. 2022;7(77):4522. 10.21105/joss.04522.Suche in Google Scholar

[48] van der Laan MJ, Rubin D. Targeted maximum likelihood learning. Int J Biostat. 2006;2(1). 10.2202/1557-4679.1043.Suche in Google Scholar

[49] van der Laan MJ, Dudoit S, Keles S. Asymptotic optimality of likelihood-based cross-validation. Stat Appl Genetics Mol Biol. 2004;3(1). 10.2202/1544-6115.1036.Suche in Google Scholar

[50] Rose S, van der Laan MJ. LTMLE. In: Targeted Learning in Data Science.Springer; 2018. p. 35–47. 10.1007/978-3-319-65304-4_4Suche in Google Scholar

[51] Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70(1):41–55. 10.1093/biomet/70.1.41Suche in Google Scholar

[52] D’Agostino Jr RB. Propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control group. Stat Med. 1998;17(19):2265–81. 10.1002/(SICI)1097-0258(19981015)17:19<2265::AID-SIM918>3.0.CO;2-BSuche in Google Scholar

[53] Zheng W, van der Laan MJ. Cross-validated targeted minimum-loss-based estimation. In: Targeted Learning. Springer; 2011. p. 459–74. 10.1007/978-1-4419-9782-1_27Suche in Google Scholar

[54] Hubbard AE, Kennedy CJ, van der Laan MJ. Data-adaptive target parameters. In: Targeted Learning in Data Science.Springer; 2018. p. 125–42. 10.1007/978-3-319-65304-4_9Suche in Google Scholar

[55] Diiaz I, van der Laan MJ. Doubly robust inference for targeted minimum loss-based estimation in randomized trials with missing outcome data. Stat Med. 2017;36(24):3807–19. 10.1002/sim.7389Suche in Google Scholar

[56] Petersen ML, Porter KE, Gruber S, Wang Y, van der Laan MJ. Positivity. In: Targeted Learning.Springer; 2011. p. 161–84. 10.1007/978-1-4419-9782-1_10Suche in Google Scholar

[57] Ju C, Schwab J, van der Laan MJ. On adaptive propensity score truncation in causal inference. Stat Meth Med Res. 2019;28(6):1741–60. 10.1177/0962280218774817Suche in Google Scholar

[58] Rytgaard HC, Gerds TA, van der Laan MJ. Continuous-time targeted minimum loss-based estimation of intervention-specific mean outcomes.2021. arXiv: https://arXiv.org/abs/2105.02088. 10.1214/21-AOS2114Suche in Google Scholar

[59] van der Laan M, Wang Z, van der Laan L. Higher Order Targeted Maximum Likelihood Estimation. 2021. arXiv: https://arXiv.org/abs/2101.06290. Suche in Google Scholar

[60] van der Laan MJ. HAL estimator of the efficient influence curve. In: Targeted Learning in Data Science. Cham: Springer; 2018. p. 103–23. 10.1007/978-3-319-65304-4_8Suche in Google Scholar

Received: 2023-03-13

Revised: 2024-04-30

Accepted: 2024-07-13

Published Online: 2025-01-31

This work is licensed under the Creative Commons Attribution 4.0 International License.

Artikel in diesem Heft

https://doi.org/10.1515/jci-2023-0013

Schlagwörter für diesen Artikel

longitudinal mediation analysis; stochastic intervention; random intervention; targeted maximum likelihood estimation; efficient influence curve; efficient estimator; highly adaptive lasso

Creative Commons

BY 4.0