Article Open Access

Efficient and flexible mediation analysis with time-varying mediators, treatments, and confounders

  • Iván Díaz, Nicholas Williams and Kara E. Rudolph
Published/Copyright: July 5, 2023

Abstract

Understanding the mechanisms of action of interventions is a major general goal of scientific inquiry. The collection of statistical methods that use data to achieve this goal is referred to as mediation analysis. Natural direct and indirect effects provide a definition of mediation that matches scientific intuition, but they are not identified in the presence of time-varying confounding. Interventional effects have been proposed as a solution to this problem, but existing estimation methods are limited to assuming simple (e.g., linear) and unrealistic relations between the mediators, treatments, and confounders. We present an identification result for interventional effects in a general longitudinal data structure that allows flexibility in the specification of treatment-outcome, treatment-mediator, and mediator-outcome relationships. Identification is achieved under the standard no-unmeasured-confounders and positivity assumptions. In this article, we study semi-parametric efficiency theory for the functional identifying the mediation parameter, including the non-parametric efficiency bound, and use it to propose non-parametrically efficient estimators. Implementation of our estimators only relies on the availability of regression algorithms, and we develop the estimators in a general framework that allows the analyst to use arbitrary regression machinery. The estimators are doubly robust, $\sqrt{n}$-consistent, and asymptotically Gaussian under slow convergence rates for the regression algorithms used. This allows the use of flexible machine learning for regression while permitting uncertainty quantification through confidence intervals and $p$-values. A free and open-source R package implementing the methods is available on GitHub.
We apply the proposed estimator to a motivating example from a trial of two medications for opioid-use disorder, where we estimate the extent to which differences between the two treatments on risk of opioid use are mediated by craving symptoms.

MSC 2010: 62D20

1 Introduction

Mediation analyses are paramount to scientific inquiry, as they provide a way of understanding the mechanisms through which effects operate [1]. For example, recent mediation analyses have helped to uncover the types of immune response that coronavirus disease 2019 vaccines trigger in order to prevent disease [2].

Multiple methods have been developed for mediation analysis in the setting of a mediator measured at a single time point, using a counterfactual or potential outcome framework. For example, the natural direct effect is defined as the effect of exposure/treatment that would have been seen in a hypothetical world where the effect operating through the mediator is disabled. Conditions for identifiability for the natural direct effect and its indirect counterpart have been given in [3] and [4].

Though the definition of the natural (in)direct effects is scientifically interesting, their identification requires assumptions about the real world, which are guaranteed not to hold in many practical situations, such as with time-varying confounding (cf. recanting witness, [5]). This restriction limits the applicability of natural direct and indirect effects in practice, as intermediate confounders are expected to be present in many applications, for example, when the effect of an intervention operates through adherence [6]. Several methods have been proposed to do away with the cross-world independence assumption and/or the assumption of no intermediate confounders, for example, partial identification methods [7–9], so-called separable effects [7,10], and effects defined in terms of stochastic interventions on the exposure [11,12]. In this article, we tackle this problem focusing on mediation effects defined in terms of contrasts between counterfactuals in hypothetical worlds in which the treatment is set to some value deterministically, whereas the mediator is drawn stochastically from its counterfactual distribution under interventions on treatment [13–15]. Efficient non-parametric estimators that leverage machine learning to alleviate model misspecification bias have been recently proposed for these parameters [16,17], but they are limited to single time-point exposures. These effects have been called interventional effects [18].

Mediation analysis methodologies for time-varying exposures and mediators are limited. Non-parametric identification formulas for interventional effects in the case of mediators and treatments measured longitudinally, assuming that the time-dependent covariates are measured either before or after the mediator, but not both, are given in refs [18,19]. The authors propose estimation methods that rely on the unlikely ability to correctly specify parametric models for the distribution of the unobservable counterfactual outcomes. Similar interventional effects where the mediator is drawn from its counterfactual distribution conditional on all the past are proposed in ref. [20], together with non-parametric efficient estimators that rely on data-adaptive regression to alleviate model misspecification bias. However, the “direct effect” defined in ref. [20] does not capture the pathway from treatment through intermediate confounder to outcome and is thus not a direct effect in the sense that we are interested in this article. Similar methods for survival analysis and treatment at a single time point have also been proposed [21]. Marginal structural models for longitudinally measured mediators under treatment at a single time point and no loss-to-follow-up are proposed in ref. [22]. More recent work includes the development of a 4-way decomposition of immediate and delayed effects using an infinite time-horizon in a reinforcement learning framework [23], and other methods considering time-varying mediators but a single exposure at baseline [21,24–27].

In this article, we develop a general longitudinal causal mediation approach that fills several gaps in the aforementioned literature. The method we propose satisfies the following: (i) the direct effect is defined in terms of the effects operating through all the pathways that do not include the mediator at any time point, (ii) the methods allow for longitudinally measured mediators and treatments, with confounders possibly measured before and after treatment, (iii) the methods allow the use of data-adaptive regression to alleviate concerns of model misspecification bias, and (iv) the methods are endowed with formulas to construct efficient estimators and computation of valid standard errors and confidence intervals, even under the use of data-adaptive regression. One limitation of prior work remains: our proposed methods can only handle categorical mediators, and the computational complexity increases with the number of categories.

The remainder of this article is organized as follows. In Section 2, we introduce the parameters of interest as well as the identification result; in Section 3, we discuss simple estimators based on inverse probability weighting and sequential regression; in Section 4, we discuss efficiency theory for the interventional mediation functional, presenting estimating equations and the efficiency bound in the non-parametric model; and in Section 5, we discuss the proposed efficient estimator as well as its asymptotic properties. In Section 6, we present the results of numerical studies, and finally, in Section 7, we present the results of an illustrative study on estimating the longitudinal effect of initiating treatment for opioid-use disorder (OUD) with extended-release naltrexone (XR-NTX) vs buprenorphine-naloxone (BUP-NX) on risk of illicit opioid use during the fourth week of treatment that operates through the mediator of craving symptoms, using symptoms of depression and withdrawal as time-varying covariates. We include weekly measures for each of the first 4 weeks of treatment.

2 Notation and definition of interventional (in)direct effects

Let $X_1, \ldots, X_n$ denote a sample of i.i.d. observations with $X = (L_1, A_1, Z_1, M_1, L_2, \ldots, A_\tau, Z_\tau, M_\tau, L_{\tau+1} = Y) \sim P$, where $A_t$ denotes a vector of intervention variables such as treatment and/or loss-to-follow-up, $Z_t$ denotes intermediate confounders, $M_t$ denotes a mediator of interest, and $L_t$ denotes the time-varying covariates. The outcome of interest is a variable $Y = L_{\tau+1}$ measured at the end of the study. We let $P f = \int f(x)\, dP(x)$ for a given function $f(x)$. We use $P_n$ to denote the empirical distribution of $X_1, \ldots, X_n$ and assume $P$ is an element of the non-parametric statistical model defined as all continuous densities on $X$ with respect to a dominating measure $\nu$. We let $E$ denote the expectation with respect to $P$, i.e., $E\{f(X)\} = \int f(x)\, dP(x)$. We also let $\|f\|^2$ denote the $L^2(P)$ norm $\int f^2(x)\, dP(x)$. We use $\bar W_t = (W_1, \ldots, W_t)$ to denote the past history of a variable $W$, use $\underline W_t = (W_t, \ldots, W_\tau)$ to denote the future of a variable, and use $H_{A,t} = (\bar L_t, \bar M_{t-1}, \bar Z_{t-1}, \bar A_{t-1})$ to denote the history of all variables up until just before $A_t$. The random variables $H_{Z,t}$, $H_{M,t}$, and $H_{L,t}$ are defined similarly as $H_{Z,t} = (A_t, H_{A,t})$, $H_{M,t} = (Z_t, H_{Z,t})$, and $H_{L,t} = (M_{t-1}, H_{M,t-1})$. For the complete history and future of a random variable, we sometimes simplify $\bar W_\tau$ and $\underline W_1$ as $\bar W$ and $\underline W$. By convention, variables with an index $t \leq 0$ are defined as null, expectations conditioning on a null set are marginal, products over empty index sets (such as $\prod_{t=k}^{k-1} b_t$) are equal to one, and sums over empty index sets (such as $\sum_{t=k}^{k-1} b_t$) are equal to zero. We let $g_{A,t}(a_t \mid h_{A,t})$ denote the probability mass function of $A_t$ conditional on $H_{A,t} = h_{A,t}$ and assume $A_t$ takes values on a finite set. The function $g_{M,t}(m_t \mid h_{M,t})$ is defined similarly, and we also assume that $M_t$ takes values on a finite set.
The variables $L_t$ and $Z_t$ are allowed to take values on any set; i.e., they can be multivariate, continuous, etc.
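To make the temporal ordering concrete, the following sketch simulates data with the assumed structure $(L_t, A_t, Z_t, M_t)$ for $t = 1, \ldots, \tau$ and $Y = L_{\tau+1}$. All variables are binary and all coefficients are made up for illustration; they are not taken from any example in this article, and for brevity each draw conditions only on the most recent variables, although the model allows dependence on the full history.

```python
import numpy as np

rng = np.random.default_rng(0)
n, tau = 1000, 3

# Columns 0..tau of L store L_1, ..., L_{tau+1}; the last column is Y.
L = np.zeros((n, tau + 1))
A = np.zeros((n, tau))
Z = np.zeros((n, tau))
M = np.zeros((n, tau))

L[:, 0] = rng.binomial(1, 0.5, n)
for t in range(tau):
    A[:, t] = rng.binomial(1, 0.3 + 0.3 * L[:, t])                  # draw given (part of) H_{A,t}
    Z[:, t] = rng.binomial(1, 0.2 + 0.3 * A[:, t])                  # draw given (part of) H_{Z,t}
    M[:, t] = rng.binomial(1, 0.2 + 0.2 * A[:, t] + 0.2 * Z[:, t])  # draw given (part of) H_{M,t}
    L[:, t + 1] = rng.binomial(1, 0.2 + 0.2 * M[:, t])              # draw given (part of) H_{L,t+1}

Y = L[:, tau]   # Y = L_{tau+1}, the outcome at the end of the study
```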

We formalize the definition of the causal effects using a non-parametric structural equation model (NPSEM) [28]. Specifically, for each time point $t$, we assume the existence of deterministic functions $f_{A,t}$, $f_{Z,t}$, $f_{M,t}$, and $f_{L,t}$ such that $A_t = f_{A,t}(H_{A,t}, U_{A,t})$, $Z_t = f_{Z,t}(H_{Z,t}, U_{Z,t})$, $M_t = f_{M,t}(H_{M,t}, U_{M,t})$, and $L_t = f_{L,t}(H_{L,t}, U_{L,t})$. Here, $U = (U_{A,t}, U_{Z,t}, U_{M,t}, U_{L,t}, U_Y : t \in \{1, \ldots, \tau\})$ is a vector of exogenous variables with unrestricted joint distribution. This model can be expressed in the form of a directed acyclic graph (DAG) as in Figure 1. In this article, we are concerned with the definition and estimation of the causal effect of an intervention on $\bar A$ on $Y$, as well as its decomposition in terms of the effect through all paths involving the mediator $\bar M$ versus effects through all other mechanisms that do not involve any component of $\bar M$. Mediation effects will be defined in terms of hypothetical interventions where the equations $A_t = f_{A,t}(H_{A,t}, U_{A,t})$ and $M_t = f_{M,t}(H_{M,t}, U_{M,t})$ are removed from the structural model, and the treatment and mediator nodes are externally assigned as follows. Let $\bar M(\bar a)$ denote the counterfactual mediator vector observed in a hypothetical world where $\bar A = \bar a$. For a value $\bar m$ in the range of $\bar M$, we also define a counterfactual outcome $Y(\bar a, \bar m)$ as the outcome in a hypothetical world where $\bar A = \bar a$ and $\bar M = \bar m$. These variables are defined in terms of our assumed NPSEM in the supplementary materials.

Figure 1: DAG. For simplicity, the symbol $\mapsto$ is used to indicate arrows from all nodes on its left to all nodes on its right.

Let $\bar a = (a_1, \ldots, a_\tau)$ and $\bar a' = (a'_1, \ldots, a'_\tau)$ denote two user-specified values in the range of $\bar A$, and let $\bar G(\bar a)$ denote a random draw from the distribution of the counterfactual variable $\bar M(\bar a)$ conditional on $V \subseteq L_1$. We define the conditional interventional effect as $E[Y(\bar a, \bar G(\bar a)) - Y(\bar a', \bar G(\bar a'))]$ and decompose it as

(1) $E[Y(\bar a, \bar G(\bar a)) - Y(\bar a', \bar G(\bar a'))] = \underbrace{E[Y(\bar a, \bar G(\bar a)) - Y(\bar a, \bar G(\bar a'))]}_{\text{Indirect effect (through } \bar M)} + \underbrace{E[Y(\bar a, \bar G(\bar a')) - Y(\bar a', \bar G(\bar a'))]}_{\text{Direct effect (not through } \bar M)}.$

This is the definition of interventional effect in a longitudinal setting given by ref. [18]. In what follows, we focus on the identification and estimation of the parameters $E[Y(\bar a, \bar G(\bar a'))]$ for fixed $\bar a$, $\bar a'$, from which we can obtain the aforementioned effects. In a slight abuse of language, we will also refer to parameters of the type $E[Y(\bar a, \bar G(\bar a'))]$ as effects.
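The decomposition in (1) is a telescoping identity obtained by adding and subtracting the cross term $E[Y(\bar a, \bar G(\bar a'))]$. A toy numeric check, where the three counterfactual means below are arbitrary made-up values:

```python
# t[(x, y)] stands in for E[Y(x, G(y))]; the values are hypothetical.
t = {("a", "a"): 0.62, ("a", "a'"): 0.55, ("a'", "a'"): 0.40}

total = t[("a", "a")] - t[("a'", "a'")]
indirect = t[("a", "a")] - t[("a", "a'")]   # through M-bar
direct = t[("a", "a'")] - t[("a'", "a'")]   # not through M-bar

# The cross term cancels, so indirect + direct recovers the total effect.
assert abs(total - (indirect + direct)) < 1e-12
```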

The aforementioned setup allows for the definition of causal effects for time-to-event outcomes subject to loss-to-follow-up or censoring as follows. Let $A_t = (A_{1,t}, A_{2,t})$, where $A_{1,t}$ denotes the exposure at time $t$, $A_{2,t}$ is equal to one if the unit remains uncensored at time $t+1$ and zero otherwise, and $Y = L_{\tau+1}$ denotes the event-free status at the end of study follow-up. Assume monotone loss-to-follow-up so that $A_{2,t} = 0$ implies $A_{2,k} = 0$ for all $k > t$, in which case all the data for $k > t$ become degenerate. In this case, we could define the effects as above with $\bar a = ((a_{1,1}, 1), \ldots, (a_{1,\tau}, 1))$ and $\bar a' = ((a'_{1,1}, 1), \ldots, (a'_{1,\tau}, 1))$, contrasting regimes where treatment at time $t$ is set to $A_{1,t} = a_{1,t}$ vs $A_{1,t} = a'_{1,t}$ while setting censoring status to $A_{2,t} = 1$, i.e., “not censored,” for everyone. This definition of causal effect for time-to-event outcomes in terms of interventional effects bypasses some (but not all) problems that occur with natural effects due to the fact that a counterfactual longitudinal mediator may be truncated by death, which may render the counterfactual survival time undefined [29]. Furthermore, recent work has uncovered limitations of these effects as a tool to unveil mechanisms [30]. These and other issues related to the interpretation of these direct and indirect effects are discussed at length elsewhere; interested readers are encouraged to consult the cited articles.
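A small sketch of this encoding, with illustrative made-up values: monotone loss-to-follow-up can be enforced by carrying $A_{2,t} = 0$ forward and marking all later data as degenerate.

```python
import numpy as np

rng = np.random.default_rng(2)
n, tau = 6, 3

# A_t = (A1_t, A2_t): exposure and remaining-uncensored indicator.
A1 = rng.binomial(1, 0.5, (n, tau)).astype(float)
A2 = rng.binomial(1, 0.8, (n, tau)).astype(float)

for t in range(1, tau):
    A2[:, t] = np.minimum(A2[:, t], A2[:, t - 1])  # A2 stays 0 once it hits 0
    censored = A2[:, t - 1] == 0
    A1[censored, t] = np.nan                       # data degenerate after censoring

# Monotonicity: the uncensored indicator never switches back from 0 to 1.
assert np.all(np.diff(A2, axis=1) <= 0)
```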

In what follows, we will use the notation $H_{A,t}$, $H_{Z,t}$, $H_{M,t}$, and $H_{L,t}$ to refer to the intervened histories of the variables under an intervention setting $\bar A = \bar a$. For example, $H_{L,t} = (\bar L_{t-1}, \bar M_{t-1}, \bar Z_{t-1}, \bar A_{t-1} = \bar a_{t-1})$. The histories $H'_{A,t}$, $H'_{Z,t}$, $H'_{M,t}$, and $H'_{L,t}$ are defined analogously, e.g., $H'_{L,t} = (\bar L_{t-1}, \bar M_{t-1}, \bar Z_{t-1}, \bar A_{t-1} = \bar a'_{t-1})$. The following assumptions will be sufficient to prove the identification of the parameter $E[Y(\bar a, \bar G(\bar a'))]$:

Assumption 1

(Conditional exchangeability of treatment and mediator) Assume:

  1. $A_t \perp\!\!\!\perp Y(\bar a, \bar m) \mid \bar L_t, \bar M_{t-1} = \bar m_{t-1}, \bar Z_{t-1}, \bar A_{t-1} = \bar a_{t-1}$ for all $t, \bar m$,

  2. $M_t \perp\!\!\!\perp Y(\bar a, \bar m) \mid \bar L_t, \bar M_{t-1} = \bar m_{t-1}, \bar Z_t, \bar A_t = \bar a_t$ for all $t, \bar m$,

  3. $A_t \perp\!\!\!\perp \underline M_t(\bar a') \mid \bar L_t, \bar M_{t-1}, \bar Z_{t-1}, \bar A_{t-1} = \bar a'_{t-1}$ for all $t$.

Assumption 2

(Positivity of treatment and mediator assignment) Assume:

  1. $\prod_{s=1}^{t} p(l_s \mid h_{L,s}) \prod_{s=1}^{t-1} p(z_s \mid h_{Z,s}) > 0$ implies $g_{A,t}(a_t \mid h_{A,t}) > 0$ for all $t$,

  2. $\prod_{s=1}^{t} p(l_s \mid h_{L,s})\, p(z_s \mid h_{Z,s}) > 0$ implies $g_{M,t}(m_t \mid h_{M,t}) > 0$ for all $t$ and $m_t$,

  3. $\prod_{s=1}^{t} p(l_s \mid h'_{L,s}) \prod_{s=1}^{t-1} p(m_s \mid h'_{M,s})\, p(z_s \mid h'_{Z,s}) > 0$ implies $g_{A,t}(a'_t \mid h'_{A,t}) > 0$ for all $t$ and all $m_t$.

Assumptions 1(i), 1(ii), and 1(iii), together with the assumed DAG in Figure 1, state that $H_{A,t}$ contains all the common causes of $A_t$ and $Y$, $H_{M,t}$ contains all the common causes of $M_t$ and $Y$, and $H_{A,t}$ contains all the common causes of $A_t$ and $M_t$, respectively. These are the standard assumptions of no-unmeasured confounders for the treatment-outcome, mediator-outcome, and treatment-mediator relations. Assumptions 2(i) and 2(ii) allow the identification of $E[Y(\bar a, \bar m) \mid V = v]$ for all $\bar m$. Assumption 2(i) states that the value $a_t$ has positive probability conditional on values $(l_s, z_{s-1} : s \leq t)$ that have themselves a positive conditional probability of occurring. Assumption 2(ii) states that the value $m_t$ has positive probability conditional on values $(l_s, z_s : s \leq t)$ that have a positive conditional probability of occurring. Assumption 2(iii) allows us to identify $P(\bar M(\bar a') = \bar m \mid V = v)$ and states that $a'_t$ has a positive probability conditional on values $(l_s, m_{s-1}, z_{s-1} : s \leq t)$ that have positive conditional probability of occurring. Under these assumptions, we have the following identification result.

Theorem 1

(Identification) Under Assumptions 1 and 2, the interventional effect $\theta = E[Y(\bar a, \bar G(\bar a'))]$ is identified as follows. Let

$\varphi(\bar m, v) = \int l_{\tau+1} \prod_{t=1}^{\tau} p[l_{t+1} \mid h_{L,t+1}]\, p[z_t \mid h_{Z,t}]\, p(l_1 \mid v)\, d\nu(\bar l, \bar z), \qquad \lambda(\bar m, v) = \int \prod_{t=1}^{\tau} p[l_{t+1}, m_t, z_t \mid h'_{Z,t}]\, p(l_1 \mid v)\, d\nu(\bar l, \bar z).$

Then, $\theta$ is identified as $\theta = E\left[\sum_{\bar m} \varphi(\bar m, V)\, \lambda(\bar m, V)\right]$.

The aforementioned identification result generalizes several identification formulas in the literature. When $\tau = 1$, $Z = \emptyset$, and $V = L_1$, this identification formula yields (by computing the corresponding contrasts) the identification formula for the natural direct and indirect effects as derived by ref. [4] under an additional cross-world counterfactual assumption. When $\tau = 1$ and $V = L_1$, and the confounder $Z$ is present, this identification formula yields the identification formula for the interventional effect described by ref. [15]. If $\tau > 1$, $V = L_1$, and $Z_t = \emptyset$ for all $t$, $\theta$ in Theorem 1 reduces to Formula (1) in ref. [18]. If $\tau > 1$ and $L_t = \emptyset$ for $t > 1$, $\theta$ in Theorem 1 reduces to the identification results for Figure 5 given on page 926 of ref. [18]. Our result is a particular case of the identification results of ref. [31].

The choice of the variable $V$ in defining the effect should be guided by the total effect one wishes to decompose. For instance, if $V = L_1$, then it may be the case that the total effect in (1) will be closer to the average treatment effect $E[Y(\bar a) - Y(\bar a')]$ than with $V \subset L_1$. In what follows, we assume for simplicity of presentation that $V = \emptyset$, but the estimators presented can be easily adapted to other cases.

3 Inverse probability weighting and sequential regression

The identification formula in Theorem 1 is a complex functional of the distribution $P$ of the observed data, and it is therefore hard to estimate. Consider, for example, a plug-in estimation strategy where the relevant components of the likelihood are estimated and then plugged into the identifying functional. This would involve finding estimates of the densities of $L_t$ and $Z_t$ conditional on the past data $H_{L,t}$ and $H_{Z,t}$. If $L_t$ and $Z_t$ are continuous or multivariate, or if $H_{L,t}$ and $H_{Z,t}$ are high-dimensional, these densities may be hard to estimate, complicating the construction of a plug-in estimator. Furthermore, the consistency of such an estimator would depend on our ability to consistently estimate the densities involved. In general, this would require the use of flexible estimators that allow automatic selection of non-linearities, interactions, etc. However, there is no statistical theory that allows the study of the sampling distribution (both asymptotic and finite-sample) of such an estimator. This results in a lack of a theoretical foundation for the construction of confidence intervals, standard errors, $p$-values, and other quantities that are often of interest to quantify the uncertainty of statistical estimates.

To address this problem, we propose an estimator based on the theory of doubly robust unbiased transformations [32], targeted minimum loss-based estimation [33,34], and double machine learning [35]. We use concepts from these frameworks to construct estimators that allow the use of machine learning to address model misspecification bias, while also achieving asymptotic normality, which in turn yields formulas for the computation of standard errors, confidence intervals, and $p$-values. The asymptotically normal estimators will be constructed in Section 5. This section focuses on describing the two main building blocks: an inverse probability weighted estimator and a sequential regression estimator.

3.1 Inverse probability weighted estimation

We start with the description of our proposed estimators by observing that an alternative representation of the identifying functional allows inverse probability-weighted estimation without reliance on estimating the conditional density of L t and Z t . Specifically, define the following random variables:

$G_{A,t}(H_{A,t}) = \dfrac{1\{A_t = a_t\}}{g_{A,t}(a_t \mid H_{A,t})}, \qquad G'_{A,t}(H_{A,t}) = \dfrac{1\{A_t = a'_t\}}{g_{A,t}(a'_t \mid H_{A,t})}, \qquad G_{M,t}(H_{M,t}, m_t) = \dfrac{1\{M_t = m_t\}}{g_{M,t}(M_t \mid H_{M,t})},$

as well as $K_{l,u} = \prod_{r=l}^{u} G_{A,r}$, $K'_{l,u} = \prod_{r=l}^{u} G'_{A,r}$, and $H_{l,u} = \prod_{r=l}^{u} G_{M,r}$. Note that we have omitted the dependence on $H_{A,t}$, $H_{M,t}$, and $m_t$, and we will do so whenever there is no ambiguity. Then, the identifying functionals in Theorem 1 can also be written as $\varphi(\bar m) = E\{K_{1,\tau} H_{1,\tau} Y\}$ and $\lambda(\bar m) = E\{K'_{1,\tau} 1\{\bar M = \bar m\}\}$.

The aforementioned formulas reveal a simple way to construct an estimator of $\theta$ using re-weighting. First, models for the probability distributions of $A_t$ and $M_t$ could be constructed, e.g., using flexible regression or classification methods. Then, these models are used to compute estimates of the weights $K_{1,\tau}$, $H_{1,\tau}$, and $K'_{1,\tau}$. Then, these weights are used to estimate $\varphi$ and $\lambda$, where the outer expectations in the aforementioned display are estimated using empirical means of weighted outcomes $Y$ and $1\{\bar M = \bar m\}$. These estimators of $\varphi$ and $\lambda$ can then be plugged into the formula for $\theta$. While this estimator is attractive for its simplicity, it has two major drawbacks. First, it relies on the ability to correctly estimate the treatment and mediator conditional distributions. If flexible estimators from the regression or classification literature (e.g., generalized linear models with $\ell_1$ penalties, or random forests) are used to increase the chances of correct specification, there are no general theoretical results that allow the computation of standard errors or confidence intervals with correctness guarantees. Second, this estimator is generally inefficient in the non-parametric model.
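For intuition, here is a minimal sketch of this re-weighting recipe for $\tau = 1$ with binary variables, using the true treatment and mediator probabilities in place of fitted models; the data-generating process below is made up for illustration and is not from the article.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Illustrative DGP: L -> A -> Z -> M -> Y, all binary.
L = rng.binomial(1, 0.5, n)
pA = 0.3 + 0.4 * L                        # g_A(1 | L)
A = rng.binomial(1, pA)
Z = rng.binomial(1, 0.2 + 0.3 * A)
pM = 0.2 + 0.3 * A + 0.2 * Z              # g_M(1 | A, Z, L)
M = rng.binomial(1, pM)
Y = rng.binomial(1, 0.1 + 0.3 * A + 0.3 * M)

a, a_prime = 1, 0                          # estimate E[Y(a, G(a'))]

gA = np.where(A == 1, pA, 1 - pA)          # density of the observed A
gM = np.where(M == 1, pM, 1 - pM)          # density of the observed M

theta_hat = 0.0
for m in (0, 1):
    # phi(m) = E[ 1{A=a}/g_A(a|H_A) * 1{M=m}/g_M(m|H_M) * Y ]
    phi_m = np.mean((A == a) / gA * (M == m) / gM * Y)
    # lambda(m) = E[ 1{A=a'}/g_A(a'|H_A) * 1{M=m} ]
    lam_m = np.mean((A == a_prime) / gA * (M == m))
    theta_hat += phi_m * lam_m
```

In practice the probabilities $g_A$ and $g_M$ would be estimated, which is exactly where the lack of theory for flexible estimators becomes a problem.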

Due to the aforementioned issues, the use of the inverse probability-weighted estimator is generally discouraged in practice. However, the weights $K$ and $H$ will be fundamental to the construction of efficient and asymptotically normal estimators in Section 5, and to the derivation of the efficiency bound in Section 4.

3.2 Sequential regression representation of the mediational g-computation formula

The construction of efficient and asymptotically normal estimators in Section 5 will rely on sequential regression procedures to estimate φ and λ . Those estimators are motivated by an alternative representation of θ in Theorem 1 in the form of sequential regressions, which we present in this section. This approach was first proposed by Bang and Robins [36] and has become standard in the estimation of total causal effects in longitudinal studies [34,37].

The sequential regression procedure computes the integrals in Theorem 1 through an alternative representation that iterates in time, starting at the last time point of the study $t = \tau + 1$ and ending at $t = 0$. To start the iteration, set $Q_{Z,\tau+1} = Y$. For fixed values $\bar a$ and $\bar m$, and for $t = \tau, \ldots, 0$, recursively define the random variables

(2) $Q_{L,t}(H_{M,t}, \underline m_t) = E[Q_{Z,t+1}(H_{A,t+1}, \underline m_{t+1}) \mid M_t = m_t, H_{M,t}],$

(3) $Q_{Z,t}(H_{A,t}, \underline m_t) = E[Q_{L,t}(H_{M,t}, \underline m_t) \mid A_t = a_t, H_{A,t}],$

where we note that $Q_{L,t}$ is a random variable through its dependence on $H_{M,t}$, but that $\underline m_t$ is a non-random user-given value. Likewise, $Q_{Z,t}$ is a random variable only through its dependence on $H_{A,t}$. To simplify notation, we will sometimes omit the dependence of the aforementioned functions on $H_{A,t}$, $H_{M,t}$, and $\bar m$. In the proof of Proposition 1 (available in the Supplementary Materials), we show that $\varphi(\bar m) = Q_{L,0}(\bar m)$. The counterfactual distribution $\lambda(\bar m)$ may be identified as follows. Let $Q_{M,\tau+1} = 1$. For $t = \tau, \ldots, 0$, recursively define

$Q_{M,t}(H'_{A,t}, \underline m_t) = E[1\{M_t = m_t\}\, Q_{M,t+1}(H'_{A,t+1}, \underline m_{t+1}) \mid A_t = a'_t, H'_{A,t}].$

Then, we have $\lambda(\bar m) = Q_{M,0}(\bar m)$. This leads to the following alternative expression for the mediational g-computation formula in terms of sequential regressions.
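A minimal sketch of the two recursions for $\tau = 1$ with binary variables, replacing the regressions in (2) and (3) by saturated subgroup means; the data-generating process is made up for illustration and is not from the article.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Illustrative tau = 1 DGP: L -> A -> Z -> M -> Y.
L = rng.binomial(1, 0.5, n)
A = rng.binomial(1, 0.3 + 0.4 * L)
Z = rng.binomial(1, 0.2 + 0.3 * A)
M = rng.binomial(1, 0.2 + 0.3 * A + 0.2 * Z)
Y = 0.1 + 0.3 * A + 0.3 * M + rng.normal(0, 0.1, n)

a, a_prime = 1, 0                         # estimate theta = E[Y(a, G(a'))]

def cond_mean(y, cond):
    # saturated "regression": mean of y within the subgroup `cond`
    return y[cond].mean()

theta_hat = 0.0
for m in (0, 1):
    # Backward pass for phi(m): Q_{L,1}(z; m) = E[Y | M=m, Z=z, A=a, L=l],
    # then Q_{Z,1}(l; m) = E[Q_{L,1} | A=a, L=l], and phi(m) = E[Q_{Z,1}].
    QZ1 = np.empty(n)
    for l in (0, 1):
        QL1 = {z: cond_mean(Y, (M == m) & (Z == z) & (A == a) & (L == l))
               for z in (0, 1)}
        pz = cond_mean(Z, (A == a) & (L == l))   # P(Z=1 | A=a, L=l)
        QZ1[L == l] = (1 - pz) * QL1[0] + pz * QL1[1]
    phi_m = QZ1.mean()
    # Backward pass for lambda(m): Q_{M,1}(l; m) = E[1{M=m} | A=a', L=l].
    QM1 = np.empty(n)
    for l in (0, 1):
        QM1[L == l] = cond_mean((M == m).astype(float),
                                (A == a_prime) & (L == l))
    lam_m = QM1.mean()
    theta_hat += phi_m * lam_m
```

With non-binary or high-dimensional histories, the subgroup means above would be replaced by fitted (possibly data-adaptive) regressions, which is the setting the rest of the article addresses.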

Proposition 1

(Sequential regression representation of the longitudinal mediation g-formula) For $\theta$ defined in Theorem 1, we have $\theta = \sum_{\bar m} Q_{L,0}(\bar m)\, Q_{M,0}(\bar m)$.

We illustrate the aforementioned result for the case $\tau = 2$ in the supplementary materials. This alternative expression of the longitudinal mediation formula allows the construction of an estimator by sequential regression, estimating the parameters $Q_{L,t}(\bar m)$ and $Q_{Z,t}(\bar m)$ sequentially for $t = \tau, \ldots, 0$ to obtain an estimate of $Q_{L,0}(\bar m)$, and then estimating $Q_{M,t}(\bar m)$ sequentially for $t = \tau, \ldots, 0$ to obtain an estimate of $Q_{M,0}(\bar m)$. The algorithm requires the estimation of sequential regressions as in (2) and (3). There are at least two alternatives to estimate these sequential regressions. The first is to perform sequential regression separately for each $\bar m$ in the range of $\bar M$. The second is to construct a pooled dataset where we pool all values $\bar m$ in the range of $\bar M$ to obtain the estimates for all $\bar m$ from a single sequential regression procedure. In this article, we pursue the second approach. In this setup, the values $\bar m$ play the role of another variable to be included in the regression algorithms. Specifically, note that estimating $Q_{L,0}(\bar m)$ for each $\bar m$ would, in principle, require fitting $\prod_{k=1}^{\tau} J_k$ sets of $\tau$ sequential regressions as in formula (2), one set of sequential regressions for each of the possible values of $\bar m$. This can be computationally prohibitive. For example, estimating $Q_{L,0}(\bar m)$ in a study with $\tau = 5$ and a mediator taking on three different values involves $3^5 = 243$ sets of five sequential regressions, each in a dataset of size $n$. To alleviate this computational complexity, we propose to fit regressions in pooled datasets constructed sequentially from $t = \tau$ to $t = 1$, where at time point $t$, each observation in the dataset at time point $t + 1$ is repeated $J_t$ times.
Under this data pooling approach, estimating $Q_{L,0}$ in a study with $\tau = 5$ and a mediator taking on three different values involves five sequential regressions, each in datasets of size $3n$, $3^2 n$, $3^3 n$, $3^4 n$, and $3^5 n$, respectively, where the predictor set decreases in size as the dataset increases in size. In this setting, smoothing over the mediator values $\bar m$ may be carried out through any regression procedure, such as a parametric model or a more flexible regression approach from the machine or statistical learning literature. A detailed algorithm with pseudo-code for the computation of a sequential regression estimator is given in Algorithm S1 in the supplementary materials.
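The pooled-dataset construction can be sketched as follows: starting from one row per unit at $t = \tau$, each backward step repeats every row $J_t$ times and appends a candidate mediator value, so the dataset sizes grow as $Jn, J^2 n, \ldots$ The sizes below are toy values chosen only to make the growth visible.

```python
import numpy as np

n, tau, J = 4, 3, 3   # toy sizes: n units, tau time points, J mediator levels

# Start from the time-tau dataset: one row per unit (toy covariate ids).
data = np.arange(n).reshape(-1, 1)          # columns: [unit_id]
sizes = []
for t in range(tau, 0, -1):
    # repeat every current row J times, appending a candidate value m_t
    m_col = np.tile(np.arange(J), data.shape[0]).reshape(-1, 1)
    data = np.hstack([np.repeat(data, J, axis=0), m_col])
    sizes.append(data.shape[0])

# sizes == [J*n, J^2*n, J^3*n] == [12, 36, 108]
```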

If the pooled sequential regressions are performed within a priori correctly specified parametric models, then estimators of $\theta$ based on Proposition 1 may be shown to be consistent and asymptotically normal (CAN), and the delta method or the non-parametric bootstrap may be used to construct confidence intervals. However, positing correct parametric models for the sequential regressions involved is generally unattainable a priori. This is especially difficult to do in the longitudinal setting. Even if it were known that the data followed a given parametric model, correctly specifying models for the sequential regressions such that they are congenial with that model requires specialized models and might be impossible [38]. Furthermore, in most cases, especially with a large number of variables, the data may not be accurately modeled using parametric models; data-adaptive regression (e.g., machine learning) tools that offer flexibility in modeling non-linearities and interactions are then necessary to attain consistency of the sequential regressions, and therefore consistency of the estimator of $\theta$.

Under model selection or data-adaptive regression, the sampling distribution of the aforementioned sequential regression estimator is generally unknown, which hinders computation of confidence intervals and other uncertainty measures. In the next section, we discuss efficiency theory for estimation of θ , which will allow the use of data-adaptive regression techniques while also allowing the computation of valid (under assumptions) standard errors and confidence intervals.

4 Efficiency theory

The foundations of our proposed efficient estimation approach are in semi-parametric estimation theory [e.g., 39–42] and in the theory for doubly robust estimation of causal effects using sequential regression [e.g., 33,34,36,37,43–47]. Central to this theory is the study of the non-parametric efficient influence function (EIF), or canonical gradient, which characterizes the efficiency bound of the longitudinal mediation functional given in Theorem 1 and allows the development of estimators under slow convergence rates for the nuisance parameters involved [41]. Specifically, our estimators will involve finding von-Mises-type approximations for the parameters $Q_{L,t}(\bar m)$, $Q_{Z,t}(\bar m)$, and $Q_{M,t}(\bar m)$, which can be intuitively understood as first-order expansions with second-order error remainder terms. Because the errors in the expansion are second-order, the resulting estimator $\hat\theta$ will be consistent and asymptotically normal at rate $n^{-1/2}$ as long as the second-order error terms converge to zero at rate $n^{-1/2}$. This convergence rate would be satisfied, for example, if all the regression functions used for estimation converge at rate $n^{-1/4}$. We will elaborate on this discussion in Section 5 when we present the asymptotic normality theorems for the proposed estimators.

Let $\eta = \{G_{A,t}, G'_{A,t}, G_{M,t}, Q_{Z,t}, Q_{M,t}, Q_{L,t} : t = 1, \ldots, \tau\}$. For $t = 0, \ldots, \tau$, define

(4) $D_{L,t}(\underline X_t, \underline m_t) = \sum_{s=t}^{\tau} K_{t+1,s} H_{t,s} \{Q_{Z,s+1} - Q_{L,s}\} + \sum_{s=t+1}^{\tau} K_{t+1,s} H_{t,s-1} \{Q_{L,s} - Q_{Z,s}\} + Q_{L,t},$

(5) $D_{Z,t}(\underline X_t, \underline m_t) = \sum_{s=t}^{\tau} K_{t,s} H_{t,s} \{Q_{Z,s+1} - Q_{L,s}\} + \sum_{s=t}^{\tau} K_{t,s} H_{t,s-1} \{Q_{L,s} - Q_{Z,s}\} + Q_{Z,t},$

(6) $D_{M,t}(\underline X_t, \underline m_t) = \sum_{s=t}^{\tau} K'_{t,s} \prod_{k=t}^{s-1} 1(M_k = m_k) \{1(M_s = m_s)\, Q_{M,s+1} - Q_{M,s}\} + Q_{M,t}.$

Whenever necessary, we make explicit the dependence of these functions on $\eta$ using notation such as $D_{Z,t}(\eta)$ or $D_{Z,t}(\underline X_t, \underline m_t; \eta)$.

Lemma 1

(von-Mises-type approximation for $Q_{Z,t}$, $Q_{L,t}$, and $Q_{M,t}$) Let $\tilde\eta$ denote an arbitrary value of $\eta$. For second-order terms $R_{L,t}(\eta, \tilde\eta)$, $R_{Z,t}(\eta, \tilde\eta)$, and $R_{M,t}(\eta, \tilde\eta)$, we have the following first-order expansions:

Q_{L,t} = E[D_{Z,t+1}(η̃) | M_t = m_t, H_{M,t}] + R_{L,t}(η, η̃),  Q_{Z,t} = E[D_{L,t}(η̃) | A_t = a_t, H_{A,t}] + R_{Z,t}(η, η̃),

where we let D_{Z,τ+1}(η̃) = Y. For H_{A,t} such that M̄_{t−1} = m̄_{t−1}, we also have

Q_{M,t} = E[1{M_t = m_t} D_{M,t+1}(η̃) | A_t = a_t, H_{A,t}] + R_{M,t}(η, η̃),

where we let D_{M,τ+1}(η̃) = 1.

The terms R_{L,t}(η, η̃), R_{Z,t}(η, η̃), and R_{M,t}(η, η̃) are second-order error terms involving expectations of products of errors such as (G̃_{M,s} − G_{M,s})(Q̃_{L,s} − Q_{L,s}), (G̃_{A,s} − G_{A,s})(Q̃_{Z,s} − Q_{Z,s}), and (G̃_{A,s} − G_{A,s})(Q̃_{M,s} − Q_{M,s}); their explicit form is given in the supplementary materials. This lemma, and specifically the second-order form of these remainder terms, has important implications for estimation. Specifically, the lemma says that for any value η̃, which could represent an inconsistent estimator, regressing D_{Z,t+1}(η̃) on H_{M,t} among units with M_t = m_t yields a consistent estimator of Q_{L,t}, as long as R_{L,t}(η, η̃) is small. Because R_{L,t}(η, η̃) is a sum of products of errors, it may be reasonable to assume that this term is small for data-adaptive regression estimators of η. Specifically, in Section 5, consistency and asymptotic normality of the estimators will require that R_{L,t}(η, η̂) converges to zero in probability at rate n^{-1/2} or faster. The second-order structure of this remainder term means that this fast convergence rate can be achieved under slower convergence rates for each of the components of η̂. For example, it is achievable if all the components of η̂ converge in probability to their correct limits at rate n^{-1/4}. This kind of rate is attainable by several data-adaptive regression methods, such as ℓ1 regularization [48], regression trees [49], neural networks [50], or the highly adaptive lasso [51]. Analogous considerations apply to the estimation of Q_{Z,t} and Q_{M,t}.
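To see why products of errors help, consider a toy single-time-point analogue in the spirit of the doubly robust censoring unbiased transformation of [32]. The simulation below is our illustration, not the proposed estimator: a pseudo-outcome built with a deliberately inconsistent outcome regression still recovers the correct mean because the observation mechanism is correct, whereas the complete-case mean is biased.

```python
import random

# Toy illustration (not the paper's estimator): Y is observed only when
# A = 1, with P(A = 1 | X) depending on X, so complete cases are biased.
random.seed(1)
n = 200_000
g = lambda x: 0.3 + 0.4 * x          # true P(A = 1 | X = x), bounded away from 0

dr_sum = cc_sum = cc_n = 0.0
for _ in range(n):
    x = random.random()              # X ~ Uniform(0, 1)
    a = 1.0 if random.random() < g(x) else 0.0
    y = 1 + 2 * x + random.gauss(0, 1)   # E[Y] = 2
    q_wrong = 0.0                    # deliberately inconsistent outcome model
    # Doubly robust pseudo-outcome: its mean equals E[Y] if either the
    # outcome model or g is consistent; here q is wrong but g is right,
    # and the second-order (product-of-errors) remainder still vanishes.
    dr_sum += q_wrong + a / g(x) * (y - q_wrong)
    cc_sum += a * y                  # complete-case pieces, for contrast
    cc_n += a

dr = dr_sum / n
cc = cc_sum / cc_n                   # complete-case mean: selection bias
assert abs(dr - 2.0) < 0.05 and abs(cc - 2.0) > 0.08
```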

Remark 1

(Generalization to V ) When V = ∅ (as we have assumed throughout), Lemma 1 can be used to prove that the function D_{Z,1} is the uncentered EIF of φ(m̄). If V is discrete, the EIF of φ(m̄, v) will have an additional term that accounts for conditioning on V. When V is continuous or high-dimensional, φ(m̄, v) is not pathwise differentiable, and the EIF does not exist. Nonetheless, the lemma shows that D_{Z,1} is a doubly robust estimating function for φ(m̄, v) (meaning that the corresponding estimators will have second-order errors; see [32]), regardless of the dimension of V, which implies that estimators of θ that use this lemma for estimating φ(m̄, v) will satisfy an appropriately modified version of the asymptotic properties outlined below.

An application of the delta method along with Lemma 1 yields the following EIF for the estimation of θ in the non-parametric model (see the supplementary materials for a proof).

Theorem 2

(EIF for θ) The EIF for θ in the non-parametric model is given by S(X, η) = \sum_{m̄} [ \{D_{Z,1}(X, m̄; η) − φ(m̄)\} λ(m̄) + \{D_{M,1}(X, m̄; η) − λ(m̄)\} φ(m̄) ].

This implies that the non-parametric efficiency bound for the estimation of θ is Var[S(X, η)] and that an efficient estimator of θ will satisfy \sqrt{n}(θ̂ − θ) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} S(X_i, η) + o_P(1), licensing the construction of Wald-type confidence intervals and hypothesis tests based on the central limit theorem. In the following section, we describe an algorithm to construct such an estimator.

5 Efficient estimation using sequential doubly robust regression

Motivated by Lemma 1, we construct a sequentially doubly robust estimator as follows. Consider pseudo-outcomes D̂_{Z,t+1}, D̂_{L,t}, and D̂_{M,t+1}. The estimation algorithm proceeds sequentially for t = τ, …, 1, regressing these pseudo-outcomes in datasets D_t^+ that pool over values of m_t. The specifics of this sequentially doubly robust estimator are presented in Algorithm 1.

Algorithm 1: Sequentially doubly robust estimator of θ
1: D̂_{Z,τ+1} ← Y;
2: D̂_{M,τ+1} ← 1;
3: for t = τ, …, 1 do
4:   D_t^+ ← D_{t+1}^+ × ℳ_t;
5:   Q̂_{L,t} ← RegressAndPredict(outcome = D̂_{Z,t+1}, predictors = (\underline{m}_t, H_{M,t}), training set = Subset(D_t^+, M_t = m_t), prediction set = D_t^+);
6:   ĝ_t ← RegressAndPredict(A_t, H_{A,t}, D, D_t^+);
7:   ĝ_{M,t} ← RegressAndPredict(M_t, H_{M,t}, D, D_t^+);
8:   Ĝ_{A,t} ← 1{A_t = a_t} / ĝ_t(A_t | H_{A,t});
9:   Ĝ′_{A,t} ← 1{A_t = a′_t} / ĝ_t(A_t | H_{A,t});
10:  Ĝ_{M,t} ← 1{M_t = m_t} / ĝ_{M,t}(M_t | H_{M,t});
11:  D̂_{L,t} ← D_{L,t}(η̂) using formula (4) and data D_t^+;
12:  Q̂_{Z,t} ← RegressAndPredict(D̂_{L,t}, (\underline{m}_t, H_{A,t}), Subset(D_t^+, A_t = a_t), D_t^+);
13:  Q̂_{M,t} ← RegressAndPredict(1{M_t = m_t} D̂_{M,t+1}, (\underline{m}_t, H_{A,t}), Subset(D_t^+, A_t = a_t), D_t^+);
14:  D̂_{Z,t} ← D_{Z,t}(η̂) using formula (5) and data D_t^+;
15:  D̂_{M,t} ← D_{M,t}(η̂) using formula (6) and data D_t^+;
16: end
17: φ̂(m̄) ← Mean(D̂_{Z,1}(m̄, L_1), data = Subset(D_1^+, M̄ = m̄));
18: λ̂(m̄) ← Mean(D̂_{M,1}(m̄, L_1), data = Subset(D_1^+, M̄ = m̄));
19: θ̂ ← \sum_{m̄} φ̂(m̄) λ̂(m̄)
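The backward flow of the algorithm can be traced schematically. The sketch below is ours and does no actual estimation (RegressAndPredict and the D formulas are stand-in strings); it only illustrates that the pseudo-outcome produced at time t becomes the regression outcome at time t − 1:

```python
def sequential_loop(tau):
    """Structure-only trace of the backward recursion in Algorithm 1:
    pseudo-outcomes computed at time t feed the regressions at time t - 1."""
    trace = []
    d_z = "Y"                          # D_Z,tau+1 <- Y
    d_m = "1"                          # D_M,tau+1 <- 1
    for t in range(tau, 0, -1):
        q_l = f"regress({d_z} | M_t, H_M,{t})"   # uses last D_Z pseudo-outcome
        d_l = f"D_L,{t}(formula 4)"
        q_z = f"regress({d_l} | A_t, H_A,{t})"
        d_z = f"D_Z,{t}(formula 5)"              # handed to step t - 1
        d_m = f"D_M,{t}(formula 6)"
        trace.append((t, q_l, q_z))
    return d_z, d_m, trace

d_z, d_m, trace = sequential_loop(2)
assert [t for t, _, _ in trace] == [2, 1]   # loop runs backward in time
assert "Y" in trace[0][1]                   # at t = tau the regression uses Y
```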

Sequential doubly robust estimators decouple estimation of the conditional expectations at time t from consistent estimation of the sequences of estimators at times t + 1, …, τ, thereby achieving extra robustness to model misspecification. To introduce sequential double robustness, define the data-dependent parameters Q̌_{L,t} = E[D_{Z,t+1}(η̂) | M_t = m_t, H_{M,t}], Q̌_{Z,t} = E[D_{L,t}(η̂) | A_t = a_t, H_{A,t}], and Q̌_{M,t} = E[1{M_t = m_t} D_{M,t+1}(η̂) | A_t = a_t, H_{A,t}], where the expectation is with respect to the distribution of X only, i.e., the estimator η̂ is treated as fixed. Then, we have the following result.

Lemma 2

(Sequential double robustness) Assume that, at each time point t, we have

  1. ‖Ĝ_{A,t} − G_{A,t}‖ = o_P(1) or ‖Q̂_{Z,t} − Q̌_{Z,t}‖ = o_P(1),

  2. ‖Ĝ_{M,t} − G_{M,t}‖ = o_P(1) or ‖Q̂_{L,t} − Q̌_{L,t}‖ = o_P(1), and

  3. ‖Ĝ′_{A,t} − G′_{A,t}‖ = o_P(1) or ‖Q̂_{M,t} − Q̌_{M,t}‖ = o_P(1),

then we have θ ˆ = θ + o P ( 1 ) , for θ ˆ defined in Algorithm 1.

Here, we emphasize that the error terms Q̂_{Z,t} − Q̌_{Z,t}, Q̂_{L,t} − Q̌_{L,t}, and Q̂_{M,t} − Q̌_{M,t} depend only on the consistency of the regression procedure used at time point t, and not on the consistency of the estimators at any time point s > t. To illustrate why this is important, consider an alternative estimation strategy in which the parameters Q_{L,t}, Q_{Z,t}, and Q_{M,t} are estimated directly (i.e., based on (2) and (3)) using flexible regression techniques, and the estimators of φ(m̄) and λ(m̄) are constructed using a one-step estimator [42] by taking the empirical mean of D_{L,1}(X, m̄; η̂) and D_{M,1}(X, m̄; η̂). Note that the functions Q_{L,t}, Q_{Z,t}, and Q_{M,t} are functions of the sequences {Q_{L,t+1}, …, Q_{L,τ}}, {Q_{Z,t+1}, …, Q_{Z,τ}}, and {Q_{M,t+1}, …, Q_{M,τ}}. It therefore appears that consistent estimation of these parameters at time t requires consistent estimation of these future sequences. Indeed, an analysis of such a one-step estimator would yield error terms of the form Q̂_{Z,t} − Q_{Z,t}. Unlike our error terms (which are stated in terms of convergence to Q̌_{Z,t}), convergence of these error terms would require consistency of Q̂_{L,s} and Q̂_{Z,s+1} for all s ≥ t. The sequential regression construction using pseudo-outcomes avoids this problem. This lemma follows from an application of Lemma 4 in [52]; more discussion of sequential double robustness may be found in [34,37,53].

5.1 Cross-fitting and asymptotic normality

Although the estimator of Algorithm 1 is sequentially doubly robust, thereby offering protection against model misspecification, it is not in general asymptotically normal or efficient. Proving asymptotic normality and efficiency of this estimator would require the use of regression estimators that take values in Donsker classes, which can be roughly understood as classes of functions with bounded complexity [54]. This would importantly limit the ability to use flexible regression estimators that can further help mitigate model misspecification bias.

To solve the aforementioned problem, we propose a cross-fitted sequentially doubly robust estimator, which avoids assumptions on the complexity of the regression estimators used [35,55–57]. Cross-fitting allows us to fully exploit the EIF given in Theorem 2 to construct estimators that are efficient and consistent and asymptotically normal (CAN) while using flexible regression algorithms from the statistical and machine learning literature. These regression estimators are often better than parametric models at capturing the true functional form of the regression functions, therefore making the error terms R_{L,0}(η̂, η), R_{Z,0}(η̂, η), and R_{M,0}(η̂, η) small.

The cross-fitting algorithm is described in Algorithm S2, and the cross-fitted sequential regression algorithm using pooled datasets is described in Algorithm S3 in the supplementary materials. An R package implementing the algorithm is available on GitHub, and simulations testing the correctness of our implementation are given in the supplementary materials.

Our proposed cross-fitted estimation algorithm satisfies sequential double robustness in the sense of Lemma 2. Furthermore, it satisfies the following weak convergence result.

Theorem 3

(Weak convergence of θ ˆ ) Assume that, at each time point t, we have

  1. ‖Ĝ_{A,t} − G_{A,t}‖ · ‖Q̂_{Z,t} − Q̌_{Z,t}‖ = o_P(n^{−1/2}),

  2. ‖Ĝ_{M,t} − G_{M,t}‖ · ‖Q̂_{L,t} − Q̌_{L,t}‖ = o_P(n^{−1/2}),

  3. ‖Ĝ′_{A,t} − G′_{A,t}‖ · ‖Q̂_{M,t} − Q̌_{M,t}‖ = o_P(n^{−1/2}), and

  4. P(G_{A,t} < c) = P(G′_{A,t} < c) = P(G_{M,t} < c) = 1 for some c < ∞.

Then, we have \sqrt{n}(θ̂ − θ) ⇝ N(0, σ²), where σ² = Var[S(X; η)] is the non-parametric efficiency bound.

Note that the weak convergence of θ̂ requires consistent estimation of the sequential regression functions at the rates stated in the theorem. Data-adaptive regression methods avoid reliance on parametric assumptions to achieve such consistency. Theorem 3 allows the construction of confidence intervals of the form θ̂ ± z_{α/2} σ̂/\sqrt{n}, where σ̂² is the empirical variance of S(X; η̂) and z_{α/2} is the corresponding quantile of the standard normal distribution.
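Concretely, the interval can be computed directly from the estimated influence-function values; a minimal sketch (`wald_ci` and its inputs are our illustrative names, not the package API):

```python
import math
from statistics import NormalDist, pstdev

def wald_ci(theta_hat, if_vals, alpha=0.05):
    """Wald-type interval theta_hat +/- z_{alpha/2} * sigma_hat / sqrt(n),
    where sigma_hat^2 is the empirical variance of the influence function."""
    n = len(if_vals)
    se = pstdev(if_vals) / math.sqrt(n)        # sigma_hat / sqrt(n)
    z = NormalDist().inv_cdf(1 - alpha / 2)    # standard normal quantile
    return theta_hat - z * se, theta_hat + z * se

# Toy influence-function values with empirical variance 1 and n = 100.
lo, hi = wald_ci(0.10, [-1.0, 1.0] * 50)
assert lo < 0.10 < hi
```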

The main differences between the construction in Algorithm 1 and Algorithm S3 are the introduction of sample splitting and of formulas to compute standard errors. Sample splitting requires careful handling of the different training sets (denoted with a T in the algorithm) and prediction sets (denoted with a P in the algorithm) to ensure that the data for unit i are not used to train the regression functions used to make predictions for that unit. This allows analyzing the estimators by conditioning on the training sample and then “averaging” across training samples to obtain the final result.
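In a sketch (ours, with a sample mean standing in for an arbitrary regression), the cross-fitting logic looks as follows:

```python
def crossfit_predictions(y, n_folds=2):
    """Out-of-fold predictions: the 'model' for fold k (here just a mean,
    standing in for an arbitrary regression) is trained on all other folds,
    so unit i's own data never enter the fit used to predict for unit i."""
    n = len(y)
    folds = [i % n_folds for i in range(n)]            # round-robin assignment
    preds = [0.0] * n
    for k in range(n_folds):
        train = [y[i] for i in range(n) if folds[i] != k]   # training set T
        fit = sum(train) / len(train)                       # fit on T only
        for i in range(n):
            if folds[i] == k:                               # prediction set P
                preds[i] = fit
    return preds

# Fold 0 holds value 0 and fold 1 holds 10: each unit's prediction is
# computed entirely from the other fold.
p = crossfit_predictions([0, 10, 0, 10], n_folds=2)
assert p == [10.0, 0.0, 10.0, 0.0]
```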

6 Numerical studies

In this section, we perform a numerical simulation study with three narrow but important objectives:

  1. To test the implementation of the algorithms in the R package lcm and demonstrate that the package produces correct answers, in the sense that the estimator performance matches what is expected from theory.

  2. To provide an illustration of the performance of the estimator under the practical use case where machine learning is used for the estimation of the nuisance quantities.

  3. To illustrate the sequential double robustness of the estimator, which allows the algorithm to recover from inconsistent estimation of the outcome regression.

We achieve these three aims by constructing a Monte Carlo simulation as follows. First, we generate datasets with τ = 2 time points from a distribution with binary A_t and Y, and where L_t, Z_t, and M_t are categorical variables that can take on three levels each. All probability distributions are functions of the type

P(W = k \mid H_W = h_W) = 0.8 \times \frac{\exp(h_{2,W}^{\top} \beta_k)}{\sum_{j=1}^{K} \exp(h_{2,W}^{\top} \beta_j)} + 0.1,

where W is one of A_t, Z_t, M_t, L_t, and Y, and h_{2,W} is a vector containing all main terms and two-way interactions of h_W. For A_t and Y, we set K = 2, and for L_t, Z_t, and M_t, we set K = 3. The coefficients β_k used in the simulation may be found on GitHub along with all other simulation code.
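A data-generating mechanism of this type can be sketched as a softmax mixed with a uniform floor. The β values below are arbitrary placeholders (the paper's coefficients are on GitHub), and since the displayed mixture sums to 0.8 + 0.1K, we renormalize for K = 3 — an assumption on our part:

```python
import math, random

def category_probs(h2, betas, floor=0.1, scale=0.8):
    """P(W = k | h) proportional to scale * softmax_k(h2' beta_k) + floor,
    renormalized so the K probabilities sum to one (exact for K = 2)."""
    scores = [sum(b * x for b, x in zip(beta_k, h2)) for beta_k in betas]
    mx = max(scores)                            # numerically stable softmax
    exps = [math.exp(s - mx) for s in scores]
    soft = [e / sum(exps) for e in exps]
    raw = [scale * p + floor for p in soft]     # floor keeps all probs > 0
    return [p / sum(raw) for p in raw]

def draw(h2, betas, rng):
    """Sample one categorical W given the covariate vector h2."""
    u, cum = rng.random(), 0.0
    for k, p in enumerate(category_probs(h2, betas)):
        cum += p
        if u < cum:
            return k
    return len(betas) - 1

betas = [[0.5, -0.2], [-0.3, 0.4], [0.1, 0.1]]  # placeholder coefficients, K = 3
probs = category_probs([1.0, 2.0], betas)
assert abs(sum(probs) - 1.0) < 1e-9 and min(probs) > 0.09
```

The floor term guarantees positivity: every category keeps probability at least 0.1 (before renormalization), which is what makes the positivity assumption hold by design in the simulation.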

In principle, estimation could be carried out using a non-parametric maximum likelihood estimator (NPMLE) because the data are categorical. However, the data-generating mechanism comprises 3^6 × 2^3 = 5,832 categories, which would require prohibitive sample sizes to obtain an NPMLE with good properties. We therefore use our developed estimators to compute estimates of the direct and indirect effects with ā = 1 and ā′ = 0, using flexible regression methods. We generated 1,000 datasets of sizes n ∈ {500; 1,000; 5,000} and applied the proposed estimators. We then approximated quantities such as bias and coverage probabilities as empirical averages across the simulated datasets.

To achieve the first objective of this numerical study, all of the estimators of nuisance parameters in η consist of saturated logistic regressions with an ℓ1 regularization penalty, where the regularization parameter is chosen with cross-validation. This estimator is a particular case of the highly adaptive lasso estimator (HAL [51]), which has been proven to converge at the rates required by Theorem 3. This implies that, under this estimation strategy, a correct implementation of our estimator should yield bias that converges to zero at \sqrt{n}-rate, as well as confidence intervals with nominal coverage. The results of this simulation are presented in Table 1.

Table 1

Simulation results for HAL estimation of nuisance parameters η

n Estimand Bias √n × Bias n × Var. n × MSE 95% CI coverage
500 Direct 0.004 0.080 2.912 2.908 0.96
Indirect 0.004 0.098 0.180 0.189 0.99
Total 0.008 0.178 3.165 3.186 0.95
1,000 Direct 0.002 0.070 2.340 2.337 0.97
Indirect 0.001 0.042 0.111 0.113 0.99
Total 0.001 0.028 2.488 2.481 0.97
5,000 Direct 0.000 0.002 2.129 2.122 0.95
Indirect 0.001 0.056 0.059 0.061 0.95
Total 0.001 0.058 2.204 2.200 0.94

The results in Table 1 illustrate that the bias, indeed, converges to zero at \sqrt{n}-rate, as expected. Nonetheless, the coverage of the confidence intervals is larger than the nominal value. This is due to the use of a non-targeted estimator of the variance of the estimator; this phenomenon has been observed before for this type of estimator (e.g., [58,59]). Alternative variance estimators that alleviate this problem have been proposed in other contexts [60]; the extension of those estimators to our problem will be the subject of future work.

To achieve the second objective of illustrating the performance of the methods under typical data-adaptive regression methods from the statistics and machine learning literature, we fit the regressions in η with an ensemble method known as the super-learner [61], which builds a combination of algorithms from a user-supplied library. The weights in the combination are chosen using cross-validation, and the super-learner has been shown to enjoy certain desirable oracle optimality properties [62]. As candidate learners, we chose two estimators: (i) a logistic or linear regression model (depending on whether the outcome takes values in {0, 1}) with only main terms and (ii) a tree ensemble built using gradient boosting machines with hyper-parameters tuned using cross-validation. The results are presented in Table 2 and show good performance of the estimators in terms of both bias and coverage of the confidence intervals.
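The weight-selection step can be sketched as stacking: score each candidate out-of-fold and pick the convex combination minimizing cross-validated error. This toy (ours, with a mean-only and a one-covariate linear candidate) is an illustration of the idea, not the implementation used in the paper:

```python
import random

def linfit(xs, ys):
    """Closed-form least-squares fit for the main-terms linear candidate."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return lambda x: a + b * x

random.seed(0)
xs = [random.random() for _ in range(200)]
ys = [2 * x + random.gauss(0, 0.3) for x in xs]   # truth is linear in x

# Candidate learners: intercept-only mean and main-terms linear regression.
folds = [i % 5 for i in range(200)]
cv_mse = {w / 10: 0.0 for w in range(11)}         # grid of weights on linear
for k in range(5):
    tr = [i for i in range(200) if folds[i] != k]
    te = [i for i in range(200) if folds[i] == k]
    mean_fit = sum(ys[i] for i in tr) / len(tr)
    lin_fit = linfit([xs[i] for i in tr], [ys[i] for i in tr])
    for w in cv_mse:
        # Convex combination w * linear + (1 - w) * mean, scored out-of-fold.
        cv_mse[w] += sum((w * lin_fit(xs[i]) + (1 - w) * mean_fit - ys[i]) ** 2
                         for i in te)

best_w = min(cv_mse, key=cv_mse.get)
assert best_w >= 0.8   # CV puts (nearly) all weight on the linear candidate
```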

Table 2

Simulation results for super-learner estimation of nuisance parameters η

n Estimand Bias √n × Bias n × Var. n × MSE 95% CI coverage
500 Direct 0.007 0.159 2.441 2.458 0.97
Indirect 0.000 0.001 0.223 0.223 0.98
Total 0.007 0.157 2.405 2.422 0.98
1,000 Direct 0.003 0.104 2.394 2.397 0.94
Indirect 0.001 0.019 0.136 0.136 0.99
Total 0.003 0.085 2.264 2.264 0.95
5,000 Direct 0.003 0.185 2.071 2.098 0.95
Indirect 0.000 0.025 0.076 0.077 0.96
Total 0.002 0.160 2.040 2.059 0.95

Note that neither of the candidates in the super-learner used in this simulation contains the true data-generating mechanism. However, the candidates in the library are the type of flexible estimators that we encourage users of our approach to use in practice. Therefore, this table serves as an illustration of estimator consistency in practical situations where the true sequential regressions are unknown but are estimated using flexible regression. We caution, however, that the performance of confidence intervals can vary wildly under model misspecification, that our theoretical results do not guarantee correct coverage under general forms of model misspecification, and that our simulation results for coverage and √n-bias should not be taken as indicative of general performance.

To achieve the third objective of illustrating the sequential double robustness of the estimator, we fit the regression for the outcome Y using a logistic regression with main terms (inconsistent), the models for A_2 and M_2 using HAL (consistent), the models for L_2 and Z_2 using HAL (consistent), and the models for A_1 and M_1 using logistic regression with main terms (inconsistent). The results in Table 3 illustrate that this scenario leads to estimators with low large-sample bias (possibly consistent) but higher √n-bias (possibly not √n-consistent), as predicted by theory. As with the results of Table 2, the approximately correct coverage observed in this simulation should not be expected in general misspecification situations.

Table 3

Simulation results under misspecification of the outcome regression

n Estimand Bias √n × Bias n × Var. n × MSE 95% CI coverage
500 Direct 0.005 0.104 2.201 2.204 0.96
Indirect 0.000 0.003 0.161 0.160 1.00
Total 0.005 0.107 2.210 2.214 0.98
1,000 Direct 0.001 0.045 2.092 2.087 0.96
Indirect 0.000 0.004 0.117 0.117 0.99
Total 0.001 0.041 2.117 2.111 0.98
5,000 Direct 0.004 0.305 2.049 2.136 0.95
Indirect 0.000 0.001 0.063 0.062 0.98
Total 0.004 0.304 2.031 2.117 0.96

7 Illustrative application

We applied our proposed estimators to a longitudinal mediation question from a comparative effectiveness trial of XR-NTX vs BUP-NX for the treatment of OUD [63]. Specifically, we were interested in estimating the extent to which differences in use of illicit opioids during the first month of treatment between the two medications were due to mediation by self-reported craving of opioids, among those completing the detoxification requirement and initiating treatment. This involved estimating the interventional direct effect of being treated with XR-NTX vs BUP-NX on risk of using illicit opioids during the first 4 weeks of treatment, not operating through differences in craving of opioids, and the interventional indirect effect of being treated with XR-NTX vs BUP-NX on risk of using illicit opioids during the first 4 weeks of treatment that did operate through differences in craving. Patients report less opioid use when on XR-NTX vs BUP-NX (as well as methadone) [64,65], but the reasons underlying this difference are not well understood. In this study, patients were randomized to receive XR-NTX or BUP-NX. At the time of randomization, a large number (over 30) of baseline covariates, denoted L_1 (listed in the supplemental materials), were measured. Although patient assignment to XR-NTX vs BUP-NX was randomized, we are estimating effects only among those who initiated treatment. Treatment initiation is not randomized and likely depends on patient characteristics. Initiation is also more difficult for those assigned to XR-NTX, because it requires complete detoxification [63]. Our adjustment for an extensive set of possibly confounding variables (L_1) helps address the lack of randomization in the exposure groups. We use A_1 = 1 to denote initiation with XR-NTX and A_1 = 0 to denote initiation with BUP-NX. The outcome of this study is opioid use as detected by weekly urine drug screen or timeline followback interview [66].
We use L_t to denote opioid use measured at week t + 1 for t ∈ {1, 2, 3}, which detects use in the several days prior (via urine drug screen) to a week prior (via interview), where t represents the number of weeks since randomization. We hypothesized that if XR-NTX reduces craving more than BUP-NX, this could provide a partial explanation of the lower opioid use while on XR-NTX. We use M_t to denote craving during week t + 1 since randomization, for t ∈ {1, 2, 3}. There may also be differences in depressive symptoms [67] and withdrawal symptoms [68] between the two medications, so we incorporated measures of each (the Hamilton depression scale and the subjective opioid withdrawal scale; see [69,70]) as time-varying confounders. We use Z_t to denote these confounders measured during week t + 1 for t ∈ {1, 2, 3}.

In addition, patients could drop out of treatment or otherwise be lost to follow-up starting in week 3 after randomization. Importantly, the outcomes L t are always observed, as it is assumed that a patient who has missing opioid-use data (most likely due to a missed visit) would have been positive for opioid use [71,72]. For t { 2 , 3 } , we use A t = 1 to denote that a patient’s Z t and M t have been measured, and we let A t = 0 otherwise. Thus, our intervention variable is given by a combination of treatment and censoring, A ¯ = { A 1 , A 2 , A 3 } . In summary, we can write our observed data for this example as O = ( L 1 , A 1 , Z 1 , M 1 , L 2 , A 2 , Z 2 , M 2 , L 3 , A 3 , Z 3 , M 3 , L 4 ) .

Note that the assumption of a negative (or positive) outcome for patients lost to follow-up may not be sensible in all applications. In such cases, we encourage the use of a loss-to-follow-up indicator variable and an intervention on loss to follow-up, as described in Section 2.

We estimated the interventional direct effect of initiating treatment with XR-NTX vs BUP-NX on illicit opioid use during week 4 of treatment, not operating through craving, under a hypothetical intervention to prevent dropout. That is, for ā = (1, 1, 1) and ā′ = (0, 1, 1), we estimated E(Y(ā, Ḡ(ā′)) − Y(ā′, Ḡ(ā′))). We also estimated the interventional indirect effect of initiating treatment with XR-NTX vs BUP-NX on illicit opioid use during week 4 of treatment operating through craving, had those who dropped out not dropped out. That is, we estimated E(Y(ā, Ḡ(ā)) − Y(ā, Ḡ(ā′))). We used an ensemble of machine learning algorithms to fit the nuisance parameters. The ensemble included an intercept-only model, the lasso, multiple additive regression splines, and extreme gradient boosting. The weights in the ensemble were chosen using super-learning [61]. We used cross-fitting with five folds.

For the interventional direct effect, we estimated that initiating treatment with XR-NTX vs BUP-NX, not operating through craving, would reduce the risk of using illicit opioids during week 4 of treatment by 8.8 percentage points (risk difference: −0.088; 95% CI: −0.129, −0.048). For the interventional indirect effect, we estimated that initiating treatment with XR-NTX vs BUP-NX, operating through craving, would not meaningfully decrease the risk of using illicit opioids during week 4 of treatment (risk difference: −0.004; 95% CI: −0.019, 0.011). Thus, we conclude that reductions in the risk of illicit opioid use during treatment with XR-NTX vs BUP-NX are not due to differences that operate through the treatments' effects on craving.

8 Discussion

Our approach generalizes interventional causal effects to allow for high-dimensional time-dependent variables measured pre- and post-treatment. We present an estimation algorithm that leverages machine learning to alleviate misspecification bias while retaining statistical properties such as \sqrt{n}-consistency, non-parametric efficiency, and asymptotic normality.

While this approach allows great flexibility in the data structure and estimation method, some limitations remain. We assume that the mediator is categorical, and computational tractability of our proposed estimator requires that it take values in a small set. This limitation seems fundamental and hard to overcome within an interventional effect framework, as all estimators will require estimation of the density of the counterfactual variable M̄(ā). We know of no method that can do this non-parametrically in the case of a continuous or high-dimensional variable M̄, although recent approaches to the estimation of counterfactual densities in single time-point settings are promising [73].

In addition, interventional effects have some limitations. First, they do not decompose the average treatment effect E[Y(ā) − Y(ā′)]. Second, the interventional indirect effect can be non-zero even when there are no paths of the type A → M → Y in the DAG in Figure 1 [30]. Solving this limitation would require a different framework for mediation analysis. We proposed such a framework for single time-point exposures in ref. [74], and future work will include generalizing that framework to the case of time-varying exposures and mediators.

Acknowledgement

This work was supported through a Patient-Centered Outcomes Research Institute (PCORI) Project Program Award (ME-2021C2-23636-IC).

Conflict of interest: Prof. Iván Díaz is a member of the Editorial Board of the Journal of Causal Inference but was not involved in the review process of this article.

References

[1] Vander Weele TJ. Mediation and mechanism. European J Epidemiol. 2009;24(5):217–24. 10.1007/s10654-009-9331-1Search in Google Scholar PubMed

[2] Gilbert PB, Montefiori DC, McDermott A, Fong Y, Benkeser DC, Deng W, et al. Immune correlates analysis of the mRNA-1273 COVID-19 vaccine efficacy trial. MedRxiv. 2021. 10.1101/2021.08.09.21261290Search in Google Scholar PubMed PubMed Central

[3] Robins JM, Greenland S. Identifiability and exchangeability for direct and indirect effects. Epidemiology. 1992;3:143–55. 10.1097/00001648-199203000-00013Search in Google Scholar PubMed

[4] Pearl J. Direct and indirect effects. In: Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence. UAI ’01. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.; 2001. p. 411–20. http://dl.acm.org/citation.cfm?id=647235.720084. Search in Google Scholar

[5] Avin C, Shpitser I, Pearl J. Identifiability of path-specific effects. In: IJCAI International Joint Conference on Artificial Intelligence. 2005. p. 357–63. Search in Google Scholar

[6] Rudolph KE, Gimbrone C, Díaz I. Helped into harm: mediation of a housing voucher intervention on mental health and substance use in boys. Epidemiology. 2021;32(3):336–46. 10.1097/EDE.0000000000001334Search in Google Scholar PubMed PubMed Central

[7] Robins JM, Richardson TS. Alternative graphical causal models and the identification of direct effects. Causality and Psychopathology: Finding the Determinants of Disorders and Their Cures. 2010. p. 103–58. 10.1093/oso/9780199754649.003.0011Search in Google Scholar

[8] Tchetgen Tchetgen EJ, Phiri K. Bounds for pure direct effect. Epidemiology (Cambridge, Mass). 2014;25(5):775. 10.1097/EDE.0000000000000154Search in Google Scholar PubMed PubMed Central

[9] Miles CH, Kanki P, Meloni S, Tchetgen Tchetgen EJ. On partial identification of the pure direct effect. 2015. arXiv: http://arXiv.org/abs/arXiv:150901652. Search in Google Scholar

[10] Stensrud MJ, Young JG, Didelez V, Robins JM, Hernán MA. Separable effects for causal inference in the presence of competing events. J Amer Stat Assoc. 2022;117(537):175–83. 10.1080/01621459.2020.1765783Search in Google Scholar

[11] Díaz I, Hejazi NS. Causal mediation analysis for stochastic interventions. J R Stat Soc Ser B (Stat Methodol). 2020;82(3):661–83. 10.1111/rssb.12362Search in Google Scholar

[12] Hejazi NS, Rudolph KE, van der Laan MJ, Díaz I. Nonparametric causal mediation analysis for stochastic interventional (in) direct effects. 2020. arXiv: http://arXiv.org/abs/arXiv:200906203. Search in Google Scholar

[13] Petersen ML, Sinisi SE, van der Laan MJ. Estimation of direct causal effects. Epidemiology. 2006;17(3):276–84. 10.1097/01.ede.0000208475.99429.2dSearch in Google Scholar PubMed

[14] van der Laan MJ, Petersen ML. Direct effect models. Int J Biostat. 2008;4(1). 10.2202/1557-4679.1064Search in Google Scholar PubMed

[15] Vander Weele TJ, Vansteelandt S, Robins JM. Effect decomposition in the presence of an exposure-induced mediator-outcome confounder. Epidemiology (Cambridge, Mass). 2014;25(2):300. 10.1097/EDE.0000000000000034Search in Google Scholar PubMed PubMed Central

[16] Zheng W, van der Laan MJ. Targeted maximum likelihood estimation of natural direct effects. Int J Biostat. 2012;8(1):1–40. 10.2202/1557-4679.1361Search in Google Scholar PubMed PubMed Central

[17] Díaz I, Hejazi NS, Rudolph KE, van Der Laan MJ. Nonparametric efficient causal mediation with intermediate confounders. Biometrika. 2021;108(3):627–41. 10.1093/biomet/asaa085Search in Google Scholar

[18] Vander Weele TJ, Tchetgen Tchetgen EJ. Mediation analysis with time varying exposures and mediators dl. J R Stat Soc Ser B Stat Methodol. 2017;79(3):917. 10.1111/rssb.12194Search in Google Scholar PubMed PubMed Central

[19] Tai AS, Lin SH, Chu YC, Yu T, Puhan MA, VanderWeele T. Causal mediation analysis with multiple time-varying mediators. Epidemiology. 2022;34(1):8–19. doi:10.1097/EDE.0000000000001555.

[20] Zheng W, van der Laan M. Longitudinal mediation analysis with time-varying mediators and exposures, with application to survival outcomes. J Causal Inference. 2017;5(2). doi:10.1515/jci-2016-0006.

[21] Vansteelandt S, Linder M, Vandenberghe S, Steen J, Madsen J. Mediation analysis of time-to-event endpoints accounting for repeatedly measured mediators subject to time-varying confounding. Stat Med. 2019;38(24):4828–40. doi:10.1002/sim.8336.

[22] Mittinty MN, Vansteelandt S. Longitudinal mediation analysis using natural effect models. Am J Epidemiol. 2020;189(11):1427–35. doi:10.1093/aje/kwaa092.

[23] Ge L, Wang J, Shi C, Wu Z, Song R. A reinforcement learning framework for dynamic mediation analysis. 2023. arXiv:2301.13348.

[24] Loh WW, Moerkerke B, Loeys T, Vansteelandt S. Nonlinear mediation analysis with high-dimensional mediators whose causal structure is unknown. Biometrics. 2022;78(1):46–59. doi:10.1111/biom.13402.

[25] Tai AS, Lin PH, Huang YT, Lin SH. Path-specific effects in the presence of a survival outcome and causally ordered multiple mediators with application to genomic data. Stat Methods Med Res. 2022;31(10):1916–33. doi:10.1177/09622802221104239.

[26] Tanner KT, Sharples LD, Daniel RM, Keogh RH. Methods of analysis for survival outcomes with time-updated mediators, with application to longitudinal disease registry data. Stat Methods Med Res. 2022;31(10):1959–75. doi:10.1177/09622802221107104.

[27] Tai AS, Lin SH. Complete effect decomposition for an arbitrary number of multiple ordered mediators with time-varying confounders: a method for generalized causal multi-mediation analysis. Stat Methods Med Res. 2023;32(1):100–17. doi:10.1177/09622802221130580.

[28] Pearl J. Causality: models, reasoning, and inference. Cambridge: Cambridge University Press; 2000.

[29] Didelez V. Defining causal mediation with a longitudinal mediator and a survival outcome. Lifetime Data Anal. 2019;25(4):593–610. doi:10.1007/s10985-018-9449-0.

[30] Miles CH. On the causal interpretation of randomized interventional indirect effects. 2022. arXiv:2203.00245.

[31] Richardson TS, Robins JM. Single world intervention graphs (SWIGs): a unification of the counterfactual and graphical approaches to causality. Center for Statistics and the Social Sciences, University of Washington, Working Paper 128; 2013.

[32] Rubin D, van der Laan MJ. A doubly robust censoring unbiased transformation. Int J Biostat. 2007;3(1). doi:10.2202/1557-4679.1052.

[33] van der Laan MJ, Rose S. Targeted learning: causal inference for observational and experimental data. New York: Springer; 2011. doi:10.1007/978-1-4419-9782-1.

[34] Luedtke AR, Sofrygin O, van der Laan MJ, Carone M. Sequential double robustness in right-censored longitudinal models. 2017. arXiv:1705.02459.

[35] Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey W, et al. Double/debiased machine learning for treatment and structural parameters. Econom J. 2018;21(1):C1–68. doi:10.1111/ectj.12097.

[36] Bang H, Robins JM. Doubly robust estimation in missing data and causal inference models. Biometrics. 2005;61(4):962–73. doi:10.1111/j.1541-0420.2005.00377.x.

[37] Rotnitzky A, Robins J, Babino L. On the multiply robust estimation of the mean of the g-functional. 2017. arXiv:1705.08582.

[38] Dukes O, Martinussen T, Tchetgen Tchetgen EJ, Vansteelandt S. On doubly robust estimation of the hazard difference. Biometrics. 2019;75(1):100–9. doi:10.1111/biom.12943.

[39] von Mises R. On the asymptotic distribution of differentiable statistical functions. Ann Math Stat. 1947;18(3):309–48. doi:10.1214/aoms/1177730385.

[40] van der Vaart AW. Asymptotic statistics. Cambridge: Cambridge University Press; 1998. doi:10.1017/CBO9780511802256.

[41] Robins J, Li L, Tchetgen Tchetgen E, van der Vaart AW. Quadratic semiparametric von Mises calculus. Metrika. 2009;69(2–3):227–47. doi:10.1007/s00184-008-0214-3.

[42] Bickel PJ, Klaassen CA, Ritov Y, Wellner JA. Efficient and adaptive estimation for semiparametric models. Berlin: Springer-Verlag; 1997.

[43] Robins JM. Robust estimation in sequentially ignorable missing data and causal inference models. In: Proceedings of the American Statistical Association; 2000.

[44] Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. J Am Stat Assoc. 1994;89(427):846–66. doi:10.1080/01621459.1994.10476818.

[45] van der Laan MJ, Robins JM. Unified methods for censored longitudinal data and causality. New York: Springer; 2003. doi:10.1007/978-0-387-21700-0.

[46] van der Laan MJ, Rubin D. Targeted maximum likelihood learning. Int J Biostat. 2006;2(1). doi:10.2202/1557-4679.1043.

[47] van der Laan MJ, Rose S. Targeted learning in data science: causal inference for complex longitudinal studies. New York: Springer; 2018. doi:10.1007/978-3-319-65304-4.

[48] Bickel PJ, Ritov Y, Tsybakov AB. Simultaneous analysis of Lasso and Dantzig selector. Ann Stat. 2009;37(4):1705–32. doi:10.1214/08-AOS620.

[49] Wager S, Walther G. Adaptive concentration of regression trees, with application to random forests. 2015. arXiv:1503.06388.

[50] Chen X, White H. Improved rates and asymptotic normality for nonparametric neural network estimators. IEEE Trans Inf Theory. 1999;45(2):682–91. doi:10.1109/18.749011.

[51] Benkeser D, van der Laan M. The highly adaptive lasso estimator. In: 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA). IEEE; 2016. p. 689–96. doi:10.1109/DSAA.2016.93.

[52] Díaz I, Williams N, Hoffman KL, Schenck EJ. Nonparametric causal effects based on longitudinal modified treatment policies. J Am Stat Assoc. 2021:1–16.

[53] Díaz I, Hoffman KL, Hejazi NS. Causal survival analysis under competing risks using longitudinal modified treatment policies. 2022. arXiv:2202.03513. doi:10.1007/s10985-023-09606-7.

[54] van der Vaart AW, Wellner JA. Weak convergence and empirical processes. New York: Springer-Verlag; 1996. doi:10.1007/978-1-4757-2545-2.

[55] Bickel PJ. On adaptive estimation. Ann Stat. 1982;10(3):647–71. doi:10.1214/aos/1176345863.

[56] Klaassen CA. Consistent estimation of the influence function of locally asymptotically linear estimators. Ann Stat. 1987;15(4):1548–62. doi:10.1214/aos/1176350609.

[57] Zheng W, van der Laan MJ. Cross-validated targeted minimum-loss-based estimation. In: Targeted learning. New York: Springer; 2011. p. 459–74. doi:10.1007/978-1-4419-9782-1_27.

[58] Benkeser D, Carone M, van der Laan M, Gilbert P. Doubly robust nonparametric inference on the average treatment effect. Biometrika. 2017;104(4):863–80. doi:10.1093/biomet/asx053.

[59] Balzer L, Ahern J, Galea S, van der Laan M. Estimating effects with rare outcomes and high dimensional covariates: knowledge is power. Epidemiol Methods. 2016;5(1):1–18. doi:10.1515/em-2014-0020.

[60] Tran L, Petersen M, Schwab J, van der Laan MJ. Robust variance estimation and inference for causal effect estimation. 2018. arXiv:1810.03030.

[61] van der Laan MJ, Polley E, Hubbard A. Super learner. Stat Appl Genet Mol Biol. 2007;6(25):25. doi:10.2202/1544-6115.1309.

[62] van der Laan MJ, Dudoit S, van der Vaart AW. The cross-validated adaptive epsilon-net estimator. Division of Biostatistics, University of California, Berkeley; 2004.

[63] Lee JD, Nunes Jr EV, Novo P, Bachrach K, Bailey GL, Bhatt S, et al. Comparative effectiveness of extended-release naltrexone versus buprenorphine-naloxone for opioid relapse prevention (X:BOT): a multicentre, open-label, randomised controlled trial. Lancet. 2018;391(10118):309–18. doi:10.1016/S0140-6736(17)32812-X.

[64] Greiner MG, Shulman M, Choo TH, Scodes J, Pavlicova M, Campbell AN, et al. Naturalistic follow-up after a trial of medications for opioid use disorder: medication status, opioid use, and relapse. J Subst Abuse Treat. 2021;131:108447. doi:10.1016/j.jsat.2021.108447.

[65] Solli KK, Latif ZeH, Opheim A, Krajci P, Sharma-Haase K, Benth JS, et al. Effectiveness, safety and feasibility of extended-release naltrexone for opioid dependence: a 9-month follow-up to a 3-month randomized trial. Addiction. 2018;113(10):1840–9. doi:10.1111/add.14278.

[66] Sobell L, Sobell M. Alcohol timeline followback users' manual. Toronto, Canada: Addiction Research Foundation; 1995.

[67] Rudolph KE, Díaz I, Hejazi NS, van der Laan MJ, Luo SX, Shulman M, et al. Explaining differential effects of medication for opioid use disorder using a novel approach incorporating mediating variables. Addiction. 2021;116(8):2094–103. doi:10.1111/add.15377.

[68] SAMHSA. Medications for opioid use disorder for healthcare and addiction professionals, policymakers, patients, and families: treatment improvement protocol TIP 63. 2021. https://store.samhsa.gov/sites/default/files/SAMHSA_Digital_Download/PEP21-02-01-002.pdf.

[69] Hamilton M. A rating scale for depression. J Neurol Neurosurg Psychiatry. 1960;23:56–62. doi:10.1136/jnnp.23.1.56.

[70] Cooper ZD, Johnson KW, Pavlicova M, Glass A, Vosburg SK, Sullivan MA, et al. The effects of ibudilast, a glial activation inhibitor, on opioid withdrawal symptoms in opioid-dependent volunteers. Addict Biol. 2016;21(4):895–903. doi:10.1111/adb.12261.

[71] Hser YI, Evans E, Huang D, Weiss R, Saxon A, Carroll KM, et al. Long-term outcomes after randomization to buprenorphine/naloxone versus methadone in a multi-site trial. Addiction. 2016;111(4):695–705. doi:10.1111/add.13238.

[72] Weiss RD, Potter JS, Griffin ML, Provost SE, Fitzmaurice GM, McDermott KA, et al. Long-term outcomes from the national drug abuse treatment clinical trials network prescription opioid addiction treatment study. Drug Alcohol Depend. 2015;150:112–9. doi:10.1016/j.drugalcdep.2015.02.030.

[73] Kennedy EH, Balakrishnan S, Wasserman L. Semiparametric counterfactual density estimation. 2021. arXiv:2102.12034.

[74] Díaz I. Causal influence, causal effects, and path analysis in the presence of intermediate confounding. 2022. arXiv:2205.08000.

Received: 2022-11-16
Revised: 2023-04-12
Accepted: 2023-04-13
Published Online: 2023-07-05

© 2023 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
