Mediation Analysis with Attributable Fractions

Arvid Sjölander

Published/Copyright: April 13, 2018

Abstract

The attributable fraction is a common measure in epidemiological research, which quantifies the public health impact of a particular exposure on a particular outcome. Often, the exposure effect may be mediated through a third variable, which lies on the causal pathway between the exposure and the outcome. To assess the role of such mediators we propose a decomposition of the attributable fraction into a direct component and a mediated component. We show how these components can be estimated in cross-sectional, cohort and case-control studies, using either maximum likelihood or doubly robust estimation methods. We illustrate the proposed methods by an application to a study of physical activity, overweight and CVD. In an Appendix we provide R-code, which implements the proposed methods.

1 Introduction

The attributable fraction (AF) is a common measure in epidemiological research, which quantifies the public health impact of a particular exposure on a particular outcome. It is defined as the proportion of outcome events that would be eliminated if the exposure was hypothetically eliminated from the population (Levin 1953). The AF has been the focus of intensive research, and various estimation techniques have been developed for all common study designs (e.g. Miettinen 1974; Sturmans, Mulder, and Valkenburg 1977; Deubner et al. 1980; Bruzzi et al. 1985; Greenland and Drescher 1993; Sjölander and Vansteelandt 2011).

However, less attention has been given to scenarios where there is more than one exposure of interest. For such scenarios, Eide and Gefeller (1995) proposed to estimate ‘sequential’ AFs, by considering a sequence of hypothetical scenarios where the exposures are eliminated one at a time. Obviously, the values of these sequential AFs may depend on the order in which we eliminate the exposures. Eide and Gefeller (1995) proposed to consider all possible orderings and, for each exposure, take the average over its order-specific sequential AFs, thus obtaining an ‘average’ AF for each exposure.

The average AF has the virtue of being simple, since it is one scalar measure independent of exposure ordering. However, it may not be a particularly relevant measure when the ordering is etiologically/scientifically important. This would typically be the case when one of the exposures acts as a mediator for another exposure. For instance, it has been shown (Mendis, Puska, and Norrving 2011) that both physical inactivity and overweight are strong risk factors for cardiovascular disease (CVD). Being physically inactive also increases the risk of becoming obese, so part of the effect of physical inactivity on CVD may be mediated through overweight. However, the average AF does not provide information about the existence or magnitude of such a mediated effect.

Like the AF, mediation analysis has been the focus of intensive research. Several definitions of direct and indirect (i.e. mediated) effects have been developed (Robins and Greenland 1992; Pearl 2001; Rubin 2004), as well as estimation techniques for scenarios where the effects are identifiable (see Vansteelandt (2009) and the references therein), and non-parametric bounds for scenarios where the effects are not identifiable (e.g. Cai et al. 2008; Sjölander 2009). In this paper we develop methods for mediation analysis with attributable fractions. We focus on scenarios with two exposures, where the effect of the first exposure (e.g. physical inactivity) on the outcome (e.g. CVD) is potentially mediated through the second exposure (e.g. overweight). We propose a decomposition of the AF into a mediated component and a direct component, which measure the proportion of outcome events that can be attributed to a mediated effect and direct effect, respectively, of the first exposure. These components sum up to the total AF for the first exposure.

The paper is organized as follows. In Section 2 we present basic notation and definitions, and our proposed decomposition of the AF. In Section 3 we show how the decomposed AF can be estimated with either maximum likelihood (ML) or doubly robust (DR) estimation methods. Whereas ML methods rely on one regression model, and are generally inconsistent if this model is incorrect, DR methods use two regression models and give consistent estimates if either model is correct, not necessarily both. We consider estimation in cross-sectional, cohort and case-control studies. In Section 4 we present the results from a simulation study, and in Section 5 we illustrate the proposed methods by an application to a study of physical activity, overweight and CVD. In an Appendix we provide R-code, which implements the proposed methods.

2 Notation and definitions

Let A and Y be the binary (0/1) exposure and outcome of interest, respectively, and let YA=0 be the counterfactual outcome for a given subject when A is set to 0 for that subject. The AF for exposure A, with respect to outcome Y, is defined as

(1) AF = {p(Y=1) − p(YA=0=1)} / p(Y=1),

where p(Y=1) is the factual outcome prevalence, and p(YA=0=1) is the counterfactual outcome prevalence when A is set to 0 for everyone. For instance, if the factual outcome prevalence is 10%, and setting A to 0 for everyone reduces the outcome prevalence to 5%, then (0.1 − 0.05)/0.1 = 50% of all outcome events can be attributed to a causal effect of A on Y.

Let M be a mediator of the effect of A on Y. We allow for M to be binary, multi-level categorical or continuous. We propose to decompose the ‘total’ AF for A in eq. (1), into the ‘natural direct AF’, defined as

(2) NDAF = {p(Y=1) − p(YA=0,M=1)} / p(Y=1),

and the ‘natural indirect AF’, defined as

(3) NIAF = {p(YA=0,M=1) − p(YA=0=1)} / p(Y=1).

Here, YA=0,M is the counterfactual outcome for a given subject, when A is set to 0 and M is simultaneously set to its naturally observed value (i.e. the observed value of M when no intervention is made on A). We note that YA=0 and YA=0,M may also be written as YA=0,M0 and YA=0,MA, respectively, where M0 is the counterfactual value of M when A is set to 0, and MA is the value of M when A is set to its naturally observed value. In graphical jargon, such interventions have been referred to as ‘node interventions’ and ‘edge interventions’, respectively (Shpitser and Tchetgen Tchetgen 2016).

By setting M to its naturally observed value in eq. (2), while setting A to 0, which may be counterfactual, we break the influence of A on M. Thus, NDAF measures the proportion of outcomes that can be attributed to a direct (i.e. not mediated through M) effect of A on Y. Similarly, NIAF measures the proportion of outcomes that can be attributed to an indirect (i.e. mediated through M) effect of A on Y. If there is no direct effect of A on Y, then setting A to 0 has no effect on the outcome when the influence of A on M is broken, so that the counterfactual outcome YA=0,M is equal to the factual outcome Y for every subject. It follows that p(YA=0,M=1)=p(Y=1) and NDAF=0. Likewise, if there is no indirect effect of A on Y, then setting M to its naturally observed value has no effect on the outcome when A is set to 0, so that the counterfactual outcomes YA=0,M and YA=0 are equal. It follows that p(YA=0,M=1)=p(YA=0=1) and NIAF=0. Trivially, we have that the natural direct and indirect AF’s sum up to the total AF, i.e. NDAF+NIAF=AF.
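Spelled out, the last claim follows because the numerators in eqs. (2) and (3) telescope:

NDAF + NIAF = [{p(Y=1) − p(YA=0,M=1)} + {p(YA=0,M=1) − p(YA=0=1)}] / p(Y=1) = {p(Y=1) − p(YA=0=1)} / p(Y=1) = AF.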

We note that the natural direct and indirect AFs in eqs. (2) and (3) are close in spirit to the natural direct and indirect effects proposed by Robins and Greenland (1992) and Pearl (2001). An important difference though, is that the natural direct and indirect effects utilize nested counterfactuals where A is set to some fixed value a and M is set to the value it would have had if A would have been set to some other fixed value a (Pearl 2001). In contrast, the natural direct and indirect AFs utilize counterfactuals where A is set to 0 and M is set to the value it would have had, if A would not have been set at all. The latter guarantees that NDAF=0 and NIAF=0 in the absence of direct and indirect effects, respectively.

3 Model-based estimation

3.1 Counterfactual independence assumptions

The natural direct and indirect AFs are causal (counterfactual) parameters, and consistent estimation of these from observational data requires that we link the counterfactual variables to the observed variables, and that we make appropriate control for confounders. We thus make the standard ‘consistency assumption’ (Pearl 2009) that an intervention to set variable(s) V to value v does not affect the outcome, if V is already observed to be equal to v:

(4) V = v ⟹ YV=v = Y.

We further assume that data have been collected on a set of variables C, which is sufficient for confounding control. Formally, we assume that YA=0 is conditionally independent of A, given C:

(5) YA=0 ⊥ A | C,

and that YA=0,M is conditionally independent of A, given (M,C):

(6) YA=0,M ⊥ A | (M,C).

Assumption (5) requires that C is sufficient for control of A-Y confounding, and assumption (6) requires that C is sufficient for control of both A-Y confounding and M-Y confounding. In addition, assumption (6) requires that there are no confounders for M and Y that are affected by A. Assumptions (5) and (6) would thus, for instance, hold under the causal diagram in Figure 1. We refer to Pearl (2009) for general techniques to evaluate whether assumptions (5) and (6) hold under other causal diagrams. We note that the graphical identification criteria proposed by Shpitser and VanderWeele (2011) for natural direct and indirect effects also suffice for identification of our parameters of interest.

Figure 1: A causal diagram under which assumptions (5) and (6) hold.

Under assumptions (4)-(6) we can rewrite the natural direct and indirect AFs, so that they are free from counterfactual probabilities. Towards this end we define ψ=E{p(Y=1|A=0,M,C)}, where the outer expectation is taken over the marginal distribution of (M,C). We can write the counterfactual probability p(YA=0,M=1) as

(7) p(YA=0,M=1) = E{p(YA=0,M=1|M,C)} = E{p(YA=0,M=1|A=0,M,C)} = ψ,

where the first equality follows from the law of iterated expectations, the second equality follows from assumption (6) and the third equality follows from assumption (4). Similarly, define ϕ=E{p(Y=1|A=0,C)}, where the outer expectation is taken over the marginal distribution of C. We can write the counterfactual probability p(YA=0=1) as

(8) p(YA=0=1) = E{p(YA=0=1|C)} = E{p(YA=0=1|A=0,C)} = ϕ,

where the first equality follows from the law of iterated expectations, the second equality follows from assumption (5) and the third equality follows from assumption (4).

Define π=p(Y=1). It follows that we can write the natural direct and indirect AFs as

(9) NDAF = (π − ψ)/π

and

(10) NIAF = (ψ − ϕ)/π.

The expressions in (9) and (10) are both free from counterfactual variables, which enables us to estimate these from observational data, as described in the following two sections.

We end this section by emphasizing the importance of appropriate confounding control, when applying the estimation techniques developed in the following sections. Arguably, there is always unmeasured confounding in real observational studies. However, with a careful study design, the degree of unmeasured confounding may be reasonably low, in which case the crucial assumptions (5) and (6) may hold as an approximation.

3.2 Cross-sectional/cohort studies

In cross-sectional studies and cohort studies, data consist of an iid sample of size n from p(Y,A,M,C). Under this sampling scheme we may estimate π, ψ and ϕ separately, and plug the estimates into eqs. (9) and (10).

Trivially, we may use the sample prevalence πˆ as an estimate of the population prevalence π. A simple way to estimate ψ and ϕ would be to use regression models for p(Y|A,M,C) and p(Y|A,C), respectively. Since Y is binary it is natural to use logistic regression models, which can be fitted with standard ML methods. From the fitted models we may obtain predictions of p(Y=1|A=0,M,C) and p(Y=1|A=0,C) for each subject (i.e. for each observed value of (M,C) and C, respectively). By averaging these predictions we obtain ML estimates of ψ and ϕ, respectively.
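As a minimal sketch, assuming a data frame d with a binary outcome y, binary exposure a, mediator m and confounders c1 and c2 (all names hypothetical), the ML procedure amounts to:

# Fit outcome models and average predictions with the exposure set to 0
fitYAM <- glm(y ~ a + m + c1 + c2, family = binomial, data = d)    # model for p(Y|A,M,C)
fitYA  <- glm(y ~ a + c1 + c2,     family = binomial, data = d)    # model for p(Y|A,C)
d0 <- d
d0$a <- 0                                                          # set A to 0 for everyone
pi.hat  <- mean(d$y)                                               # estimate of pi = p(Y=1)
psi.hat <- mean(predict(fitYAM, newdata = d0, type = "response"))  # estimate of psi
phi.hat <- mean(predict(fitYA,  newdata = d0, type = "response"))  # estimate of phi
NDAF.hat <- (pi.hat - psi.hat) / pi.hat                            # eq. (9)
NIAF.hat <- (psi.hat - phi.hat) / pi.hat                           # eq. (10)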

A disadvantage of this ML method is that p(Y|A,M,C) and p(Y|A,C) are not variation independent, but related through the integral

(11) p(Y|A,C) = ∫ p(Y|A,M=m,C) p(M=m|A,C) dm.

This feature is somewhat awkward, since it implies that it may be difficult to postulate models for p(Y|A,M,C) and p(Y|A,C) that are simultaneously correct. For instance, a logistic form for p(Y|A,M,C) generally implies a more complex functional form for p(Y|A,C), and vice versa.

One solution to this problem would be to replace the model for p(Y|A,C) with a model for p(M|A,C), then use the relation in eq. (11) to derive the implied model for p(Y|A,C). Since p(Y|A,M,C) and p(M|A,C) are variation independent, the implied model for p(Y|A,C) is guaranteed to be compatible with the model for p(Y|A,M,C). However, this method has important disadvantages for non-binary mediators. First, the integral in eq. (11) requires a model for the full conditional distribution of M (e.g. a mean model does not suffice), which may be difficult to specify well. Second, the integral in eq. (11) does not generally have an analytic solution, which makes the method computationally demanding.
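To illustrate the computational burden, here is a sketch of the integral in eq. (11) for a continuous mediator, assuming a logistic model for p(Y|A,M,C) (fitYAM and d from the sketch above) and a normal model for M given (A,C) with constant variance; all object names are hypothetical:

fitM <- lm(m ~ a + c1 + c2, data = d)        # mean model for M given (A,C)
sigma.m <- summary(fitM)$sigma               # assumed constant residual sd
p.y.ac <- function(a, c1, c2) {              # implied p(Y=1|A=a,C=c) via eq. (11)
  integrand <- function(m) {
    nd <- data.frame(a = a, m = m, c1 = c1, c2 = c2)
    mu <- predict(fitM, newdata = data.frame(a = a, c1 = c1, c2 = c2))
    predict(fitYAM, newdata = nd, type = "response") * dnorm(m, mean = mu, sd = sigma.m)
  }
  integrate(integrand, lower = -Inf, upper = Inf)$value
}

Note that one numerical integration is required per covariate value, which is what makes the approach demanding in practice.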

To reduce sensitivity to model misspecification, while avoiding having to model the mediator, we instead propose to use DR estimators for both ψ and ϕ. These estimators use one model for the outcome and one model for the exposure, and are consistent if either of these are correctly specified, not necessarily both.

Specifically, to estimate ψ we use one model for p(Y|A,M,C) and one model for p(A|M,C). After having fitted these models we estimate ψ as

ψˆ = Σi=1n [ pˆ(Y=1|A=0,Mi,Ci) + (1−Ai) pˆ−1(Ai|Mi,Ci) {Yi − pˆ(Y=1|A=0,Mi,Ci)} ] / n,

where pˆ(Y=1|A=0,Mi,Ci) and pˆ(Ai|Mi,Ci) are model-based predictions for subject i. It follows from more general results in Robins (2000) that ψˆ is DR, in the sense that it is consistent if the model for p(Y|A,M,C) or the model for p(A|M,C) is correct, not necessarily both. Similarly, to estimate ϕ we use one model for p(Y|A,C) and one model for p(A|C). After having fitted these models we estimate ϕ as

ϕˆ = Σi=1n [ pˆ(Y=1|A=0,Ci) + (1−Ai) pˆ−1(Ai|Ci) {Yi − pˆ(Y=1|A=0,Ci)} ] / n,

where pˆ(Y=1|A=0,Ci) and pˆ(Ai|Ci) are model based predictions for subject i. Again, it follows from Robins (2000) that ϕˆ is DR, in the sense that it is consistent if the model for p(Y|A,C) or the model for p(A|C) is correct, not necessarily both.

We note that p(Y|A,M,C) and p(A|C) are variation independent. This means that the DR estimators give the researcher a ‘genuine chance’ of having a correct model when estimating ψ, while simultaneously having a correct model when estimating ϕ.
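A minimal sketch of these DR estimators, reusing d, d0, fitYAM, fitYA and pi.hat from the ML sketch above (column names remain hypothetical):

fitAM <- glm(a ~ m + c1 + c2, family = binomial, data = d)     # model for p(A|M,C)
fitA  <- glm(a ~ c1 + c2,     family = binomial, data = d)     # model for p(A|C)
pred.psi <- predict(fitYAM, newdata = d0, type = "response")   # p(Y=1|A=0,M,C)
pred.phi <- predict(fitYA,  newdata = d0, type = "response")   # p(Y=1|A=0,C)
w.psi <- (1 - d$a) / (1 - fitted(fitAM))                       # (1-A)/p(A=0|M,C)
w.phi <- (1 - d$a) / (1 - fitted(fitA))                        # (1-A)/p(A=0|C)
psi.dr <- mean(pred.psi + w.psi * (d$y - pred.psi))            # DR estimate of psi
phi.dr <- mean(pred.phi + w.phi * (d$y - pred.phi))            # DR estimate of phi
NDAF.dr <- (pi.hat - psi.dr) / pi.hat
NIAF.dr <- (psi.dr - phi.dr) / pi.hat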

In Appendix A we use the theory for M-estimators (Stefanski and Boos 2002) to show that (πˆ,ψˆ,ϕˆ) has an asymptotic normal distribution, and we derive the asymptotic variance-covariance matrix for (πˆ,ψˆ,ϕˆ). We subsequently use the delta method to derive the asymptotic variances for NDAFˆ and NIAFˆ, obtained by plugging (πˆ,ψˆ,ϕˆ) into eqs. (9) and (10), respectively.
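For instance, given the point estimates above and an estimate V of the 3×3 variance-covariance matrix of (πˆ,ψˆ,ϕˆ) (obtained, e.g., from the M-estimation theory in Appendix A), delta-method standard errors can be sketched as:

grad.ndaf <- c(psi.hat / pi.hat^2, -1 / pi.hat, 0)                       # dNDAF/d(pi,psi,phi)
grad.niaf <- c(-(psi.hat - phi.hat) / pi.hat^2, 1 / pi.hat, -1 / pi.hat) # dNIAF/d(pi,psi,phi)
se.ndaf <- sqrt(drop(t(grad.ndaf) %*% V %*% grad.ndaf))
se.niaf <- sqrt(drop(t(grad.niaf) %*% V %*% grad.niaf))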

Table 1 summarizes what models are required to be correct, to guarantee consistency of the ML and DR estimators of NDAF, NIAF and AF. In this table we have explicitly spelled out NDAF, NIAF and AF as functions of the parameters being estimated by the models.

Table 1:

Model dependence of the ML and DR estimators for cross-sectional/cohort studies.

The ML estimator of …    …requires correct model(s) for…
NDAF(π,ψ)                 p(Y|A,M,C)
NIAF(π,ψ,ϕ)               p(Y|A,M,C) and p(Y|A,C)
AF(π,ϕ)                   p(Y|A,C)

The DR estimator of …    …requires correct model(s) for…
NDAF(π,ψ)                 p(Y|A,M,C) or p(A|M,C)
NIAF(π,ψ,ϕ)               {p(Y|A,M,C) or p(A|M,C)} and {p(Y|A,C) or p(A|C)}
AF(π,ϕ)                   p(Y|A,C) or p(A|C)

3.3 Case-control studies

In case-control studies, data consist of n1 iid cases sampled from p(A,M,C|Y=1), and n0 iid controls sampled from p(A,M,C|Y=0). Under this sampling scheme, it is not possible to obtain consistent estimates of π, ψ and ϕ, since this sampling scheme mutilates the outcome distribution. Instead, we use results in Bruzzi et al. (1985) to approximate NDAF and NIAF as functions of conditional odds ratios, which can be consistently estimated from case-control data (Prentice and Pyke 1979).

Let OR(M,C) = {p(Y=1,A=1|M,C) p(Y=0,A=0|M,C)} / {p(Y=0,A=1|M,C) p(Y=1,A=0|M,C)} be the conditional odds ratio for A and Y, given (M,C), and define ψ˜ = E{OR−A(M,C)|Y=1}, where the outer expectation is taken over the conditional distribution of (A,M,C), given Y=1. Similarly, let OR(C) = {p(Y=1,A=1|C) p(Y=0,A=0|C)} / {p(Y=0,A=1|C) p(Y=1,A=0|C)} be the conditional odds ratio for A and Y, given C, and define ϕ˜ = E{OR−A(C)|Y=1}, where the outer expectation is taken over the conditional distribution of (A,C), given Y=1. It follows from results in Bruzzi et al. (1985) that ψ/π ≈ ψ˜ and ϕ/π ≈ ϕ˜, provided that the outcome is rare. We thus have that

(12) NDAF ≈ 1 − ψ˜

and

(13) NIAF ≈ ψ˜ − ϕ˜.

A simple way to estimate ψ˜ and ϕ˜ is to use logistic regression models for p(Y|A,M,C) and p(Y|A,C), respectively, which can be fitted with standard ML techniques. From the fitted models we may obtain predictions of OR−A(M,C) and OR−A(C) for each subject (i.e. for each observed value of (A,M,C) and (A,C), respectively). By averaging these predictions among the cases we obtain ML estimates of ψ˜ and ϕ˜, respectively.
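As a minimal sketch, assuming a case-control data frame cc with the same hypothetical column names as before:

fitYAM <- glm(y ~ a + m + c1 + c2, family = binomial, data = cc)
fitYA  <- glm(y ~ a + c1 + c2,     family = binomial, data = cc)
cc0 <- cc
cc0$a <- 0
# subject-specific A x log OR, as the difference in linear predictors at observed A and at A=0
logor.m <- predict(fitYAM, newdata = cc) - predict(fitYAM, newdata = cc0)
logor.c <- predict(fitYA,  newdata = cc) - predict(fitYA,  newdata = cc0)
psi.tilde <- sum(cc$y * exp(-logor.m)) / sum(cc$y)   # average of OR^(-A)(M,C) among cases
phi.tilde <- sum(cc$y * exp(-logor.c)) / sum(cc$y)   # average of OR^(-A)(C) among cases
NDAF.cc <- 1 - psi.tilde                             # eq. (12)
NIAF.cc <- psi.tilde - phi.tilde                     # eq. (13)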

However, to protect against model misspecification we may also use DR estimators. Specifically, to estimate ψ˜ we may use one logistic regression model for p(Y|A,M,C) and one logistic regression model for p(A|Y,M,C), which are constrained by having the same conditional odds ratio OR(M,C). We let θ be the parameter (vector) indexing OR(M,C) in these models. A standard choice of models would be logit{p(Y=1|A,M,C)} = θA + α0 + α1C + α2M and logit{p(A=1|Y,M,C)} = θY + β0 + β1C + β2M, in which log{OR(M,C;θ)} = θ is assumed to be constant across levels of M and C. Other examples are logit{p(Y=1|A,M,C)} = θ1A + θ2AM + α0 + α1C + α2M and logit{p(A=1|Y,M,C)} = θ1Y + θ2YM + β0 + β1C + β2M, in which log{OR(M,C;θ)} = θ1 + θ2M is allowed to depend linearly on M. The two logistic regression models can be fitted with ML techniques, and the fitted models may be combined, as described by Tchetgen Tchetgen, Robins, and Rotnitzky (2010), to produce an estimate of θ that is consistent if either model is correct, not necessarily both. We use the DR estimate of θ to obtain a prediction of OR−A(M,C;θ) for each subject. Finally, we obtain a DR estimate of ψ˜ by averaging these predictions among the cases. We obtain a DR estimate of ϕ˜ in a similar fashion, by using models for p(Y|A,C) and p(A|Y,C) instead of models for p(Y|A,M,C) and p(A|Y,M,C).
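A sketch of the corresponding DR estimate of ψ˜, mirroring the internals of the mediationAF.casectrl function in Appendix D (this relies on the development version of the drgee package described there; formulas and column names are hypothetical):

library(drgee)
fit <- drgee(oformula = y ~ m + c1 + c2,        # outcome nuisance model
             eformula = a ~ m + c1 + c2,        # exposure nuisance model
             iaformula = ~ 1,                   # log OR(M,C) assumed constant
             olink = "logit", elink = "logit",
             data = cc, estimation.method = "dr")
logor.dr <- colSums(t(fit$drgee.data$ax) * fit$coefficients)  # A_i x log OR(M_i,C_i) at theta-hat
psi.tilde.dr <- sum(cc$y * exp(-logor.dr)) / sum(cc$y)        # DR estimate of psi-tilde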

In Appendix A we use the theory for M-estimators (Stefanski and Boos 2002) to show that (ψ˜ˆ,ϕ˜ˆ) has an asymptotic normal distribution, and we derive the asymptotic variance-covariance matrix for (ψ˜ˆ,ϕ˜ˆ). We subsequently use the delta method to derive the asymptotic variances for NDAFˆ and NIAFˆ, obtained by plugging (ψ˜ˆ,ϕ˜ˆ) into eqs. (12) and (13), respectively.

Table 2 summarizes what models are required to be correct, to guarantee consistency of the ML and DR estimators of NDAF, NIAF and AF. In this table we have explicitly spelled out NDAF, NIAF and AF as functions of the parameters being estimated by the models.

Table 2:

Model dependence of the ML and DR estimators for case-control studies

The ML estimator of …    …requires correct model(s) for…
NDAF(ψ˜)                  p(Y|A,M,C)
NIAF(ψ˜,ϕ˜)               p(Y|A,M,C) and p(Y|A,C)
AF(ϕ˜)                    p(Y|A,C)

The DR estimator of …    …requires correct model(s) for…
NDAF(ψ˜)                  p(Y|A,M,C) or p(A|Y,M,C)
NIAF(ψ˜,ϕ˜)               {p(Y|A,M,C) or p(A|Y,M,C)} and {p(Y|A,C) or p(A|Y,C)}
AF(ϕ˜)                    p(Y|A,C) or p(A|Y,C)

We end this section with a technical remark. The validity of the OR-based estimators developed above depends on the approximate relations ψ/π ≈ ψ˜ and ϕ/π ≈ ϕ˜; the better the approximations, the smaller the (asymptotic) bias in the estimators. It can be shown (Bruzzi et al. 1985) that these approximations are exact, if the odds ratios OR(M,C) and OR(C) in ψ˜ and ϕ˜, respectively, are replaced by the corresponding risk ratios. Indirectly, the validity of the OR-based estimators is thus related to the concept of non-collapsibility, which is a well-known feature of the odds ratio (Greenland, Robins, and Pearl 1999). When the outcome is rare, the odds ratio approximates the risk ratio, which is collapsible. As the outcome becomes more common, the odds ratio becomes increasingly non-collapsible, and increasingly different from the risk ratio.

4 Simulations

4.1 Cross-sectional/cohort studies

To investigate the performance of the proposed estimators for cross-sectional/cohort studies (Section 3.2) we generated 10,000 samples, of n=1000 iid observations each, from the model

(14) C1 ∼ N(0,1)
     C2 ∼ Ber(0.5)
     A ∼ Ber{expit(−2 + C1 + C2 − 2C1C2)}
     M ∼ N(A + C1 + C2, 1)
     Y ∼ Ber{expit(−5 + A + 0.5M − C1 − C2 + 2C1C2)}

For this model we have that p(Y=1)=0.02, NDAF=0.2, NIAF=0.05 and AF=0.25. Each sample was analyzed with the working models

logit{p(Y=1|A,M,C)} = α0 + α1C1 + α2C2 + α12C1C2 + α3A + α4M
logit{p(A=1|M,C)} = β0 + β1C1 + β2C2 + β12C1C2 + β3M
logit{p(Y=1|A,C)} = γ0 + γ1C1 + γ2C2 + γ12C1C2 + γ3A
logit{p(A=1|C)} = δ0 + δ1C1 + δ2C2 + δ12C1C2

When analyzing the simulated data, four scenarios were considered. In scenario I, no constraints were imposed on the working model parameters. In scenario II, the interaction terms α12 and γ12 were omitted from the working models. In scenario III, the interaction terms β12 and δ12 were omitted. In scenario IV, all interaction terms were omitted. For this analysis scheme, there are two types of model misspecification. First, the functional form for the probabilities p(A|M,C) and p(Y|A,C) is somewhat misspecified across all scenarios, since the logistic forms for p(Y|A,M,C) and p(A|C) in eq. (14), together with the normal distribution for M|(A,C), do not imply logistic forms for p(A|M,C) and p(Y|A,C). Second, the linear predictor for a particular working model is misspecified in a given scenario, if the interaction term between C1 and C2 is omitted from the working model in that scenario.
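For concreteness, a minimal R sketch of generating one sample of n=1000 observations from a model of the form in eq. (14) (the exact signs of the coefficients should be read as an assumption):

expit <- function(x) 1 / (1 + exp(-x))
n <- 1000
c1 <- rnorm(n)                                                  # C1 ~ N(0,1)
c2 <- rbinom(n, 1, 0.5)                                         # C2 ~ Ber(0.5)
a  <- rbinom(n, 1, expit(-2 + c1 + c2 - 2 * c1 * c2))           # exposure
m  <- rnorm(n, mean = a + c1 + c2, sd = 1)                      # mediator
y  <- rbinom(n, 1, expit(-5 + a + 0.5 * m - c1 - c2 + 2 * c1 * c2))  # outcome
d  <- data.frame(y, a, m, c1, c2)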

For each of the four scenarios we used the working models to calculate the ML estimators and DR estimators of NDAF, NIAF and AF=NDAF+NIAF, as described in Section 3.2, together with their standard errors, as described in Appendix A. We finally calculated the mean (over the 10,000 samples) estimates, the mean standard errors and the standard deviation of the estimates. We note that, from the perspective of the ML estimators, scenarios I and III are identical and scenarios II and IV are identical, since the ML estimators only use the models for p(Y|A,M,C) and p(Y|A,C).

Table 3 displays the results. As expected, the ML estimator of NDAF is unbiased in scenarios I and III, where the model for p(Y|A,M,C) is correct. Interestingly, the ML estimators of NIAF and AF are unbiased in these scenarios as well, even though the functional form for p(Y|A,C) is misspecified for these (and all other) scenarios. The ML estimators of NDAF, NIAF and AF are biased in scenarios II and IV though, where the linear predictors for p(Y|A,M,C) and p(Y|A,C) are misspecified. We observe a similar pattern for the DR estimators. The DR estimator of NDAF is unbiased in scenarios I, II and III, where the linear predictor for either (or both) p(Y|A,M,C) and p(A|M,C) is (are) correct, even though the functional form for p(A|M,C) is misspecified. The DR estimators of NIAF and AF are also unbiased in scenarios I, II and III, where the linear predictor for either (or both) p(Y|A,C) and p(A|C) is (are) correct, even though the functional form for p(Y|A,C) is misspecified. In scenario IV, the DR estimators of NDAF, NIAF and AF are all biased. These results indicate that the DR estimators are indeed more robust to model misspecification than the ML estimators, and that both estimators have some robustness against misspecification of the functional form of the modeled probabilities. The mean standard errors agree well with the standard deviations of the estimates, for both the ML and DR estimators and across all scenarios. The mean standard error for the DR estimator is close to the mean standard error for the ML estimator across all scenarios, which indicates that the additional robustness of the DR estimator comes at a relatively minor loss in precision.

We caution the reader that the observed robustness against functional form misspecification is not guaranteed by theory, and may depend heavily on the data generating model. In a setting similar to ours, VanderWeele and Vansteelandt (2010) showed that functional form misspecification is minor when the outcome is rare and the mediator is continuous with constant variance, which is precisely the model we have simulated from above. Thus, one may suspect that bias due to functional form misspecification would be more pronounced if the outcome was more common and/or the mediator had a skewed distribution. We investigated this in an additional simulation (see Appendix B), but observed no more bias than in the simulation above. This gives some further support to the conclusion that, while misspecification of the linear predictor can give rise to serious bias, the impact of functional form misspecification may be less severe.

Much of the literature on mediation has focused on the scenario where there are exposure-mediator interactions (see, for instance, VanderWeele (2015) and the references therein). We have added a simulation in Appendix B, in which exposure-mediator interaction is present. The conclusions from this simulation are again similar to those from the simulation above.

Table 3:

Simulation results for cross-sectional/cohort studies. The table reports mean estimates (est), mean standard errors (se) and standard deviations (sd) of the estimates, over 10,000 simulated samples. The models for p(Y|A,M,C) and p(A|C) have correct functional form in all scenarios. The models for p(Y|A,C) and p(A|M,C) have incorrect functional forms in all scenarios. The models for p(A|M,C) and p(A|C) have correct linear predictors in scenarios I and II, and incorrect linear predictors in Scenarios III and IV. The models for p(Y|A,M,C) and p(Y|A,C) have correct linear predictors in scenarios I and III, and incorrect linear predictors in Scenarios II and IV. True values are NDAF=0.20, NIAF=0.05 and AF=0.25. Mean estimates marked with ‘*’ are expected to be close to the true values, according to theory (see Table 1).

       ML                     DR
       est    s.e.    s.d.    est    s.e.    s.d.
I
NDAF   0.20   0.12    0.13    0.20   0.13    0.14
NIAF   0.06   0.04    0.04    0.05   0.06    0.07
AF     0.25   0.11    0.11    0.25   0.11    0.11
II
NDAF   0.14   0.14    0.14    0.19   0.14    0.15
NIAF   0.08   0.05    0.05    0.05   0.07    0.08
AF     0.22   0.12    0.12    0.25   0.11    0.12
III
NDAF   0.20   0.12    0.13    0.20   0.14    0.15
NIAF   0.06   0.04    0.04    0.05   0.08    0.09
AF     0.25   0.11    0.11    0.25   0.11    0.11
IV
NDAF   0.14   0.14    0.14    0.16   0.15    0.16
NIAF   0.08   0.05    0.05    0.07   0.09    0.10
AF     0.22   0.12    0.12    0.23   0.12    0.12

4.2 Case-control studies

To investigate the performance of the proposed estimators for case-control studies (Section 3.3) we generated a population of 10,000,000 subjects from the model in eq. (14). From this population we drew 10,000 samples of n1=500 cases and n0=500 controls. Each sample was analyzed with the working models

logit{p(Y=1|A,M,C)} = θA + α0 + α1C1 + α2C2 + α12C1C2 + α3M
logit{p(A=1|Y,M,C)} = θY + β0 + β1C1 + β2C2 + β12C1C2 + β3M
logit{p(Y=1|A,C)} = ηA + γ0 + γ1C1 + γ2C2 + γ12C1C2
logit{p(A=1|Y,C)} = ηY + δ0 + δ1C1 + δ2C2 + δ12C1C2

In these working models, OR(M,C)=exp(θ) and OR(C)=exp(η). We considered the same four scenarios as in the previous section. Thus, only the functional form for p(Y|A,M,C) is correct, since logistic forms for p(Y|A,M,C) and p(A|C), together with the normal distribution for M|(A,C), do not imply logistic forms for p(A|Y,M,C), p(Y|A,C) or p(A|Y,C). For each scenario we used the working models to calculate the ML estimators and DR estimators of NDAF, NIAF and AF=NDAF+NIAF, as described in Section 3.3, together with their standard errors, as described in Appendix A. We finally calculated the mean (over the 10,000 samples) estimates, the mean standard errors and the standard deviation of the estimates. Again we note that, from the perspective of the ML estimators, scenarios I and III are identical and scenarios II and IV are identical, since the ML estimators only use the models for p(Y|A,M,C) and p(Y|A,C).

Table 4 displays the results. We observe the same pattern as in the previous section. The ML estimators are unbiased in scenarios I and III, where the linear predictors for p(Y|A,M,C) and p(Y|A,C) are correct, but biased in scenarios II and IV where these linear predictors are misspecified. The DR estimator of NDAF is unbiased in scenarios I, II and III, where the linear predictor for either (or both) p(Y|A,M,C) and p(A|Y,M,C) is (are) correct, and the DR estimator of NIAF is also unbiased in scenarios I, II and III, where the linear predictor for either (or both) p(Y|A,C) and p(A|Y,C) is (are) correct. Misspecification of the functional form for the modeled probabilities does not seem to give bias for any of the ML or DR estimators.

Table 4:

Simulation results for case-control studies. The table reports mean estimates (est), mean standard errors (se) and standard deviations (sd) of the estimates, over 10,000 simulated samples. The model for p(Y|A,M,C) has correct functional form in all scenarios. The models for p(Y|A,C), p(A|Y,M,C) and p(A|Y,C) have incorrect functional forms in all scenarios. The models for p(A|Y,M,C) and p(A|Y,C) have correct linear predictors in scenarios I and II, and incorrect linear predictors in scenarios III and IV. The models for p(Y|A,M,C) and p(Y|A,C) have correct linear predictors in scenarios I and III, and incorrect linear predictors in scenarios II and IV. True values are NDAF=0.20, NIAF=0.05 and AF=0.25. Mean estimates marked with ‘*’ are expected to be close to the true values, according to theory (see Table 2).

       ML                     DR
       est    s.e.    s.d.    est    s.e.    s.d.
I
NDAF   0.21   0.04    0.04    0.19   0.04    0.04
NIAF   0.05   0.01    0.01    0.05   0.02    0.02
AF     0.26   0.03    0.03    0.25   0.03    0.03
II
NDAF   0.12   0.05    0.05    0.21   0.03    0.03
NIAF   0.09   0.02    0.02    0.04   0.01    0.01
AF     0.20   0.03    0.03    0.26   0.03    0.03
III
NDAF   0.21   0.04    0.04    0.19   0.04    0.04
NIAF   0.05   0.01    0.01    0.05   0.02    0.02
AF     0.26   0.03    0.03    0.24   0.03    0.03
IV
NDAF   0.12   0.05    0.05    0.09   0.05    0.05
NIAF   0.09   0.02    0.02    0.10   0.03    0.03
AF     0.20   0.03    0.03    0.19   0.03    0.03

5 Applied example

In this section we illustrate the proposed methods by an application to a study of physical activity, overweight and CVD. Sjölander (2011) and Sjölander and Vansteelandt (2011) aimed to estimate the fraction of CVD cases, during a 10-year period, that can be attributed to overweight. They used data from the National March Cohort, which was established in 1997 during a fund-raising event organized by the Swedish Cancer Society, and included 300,000 individuals. Questionnaire data on known or suspected risk factors for CVD were obtained on a subset of 43,880 individuals. The cohort was followed until 2006, and each CVD event recorded. The binary exposure was defined as the indicator of BMI being below 18.5 kg/m2 or above 25 kg/m2, and the binary outcome was defined as the indicator of any CVD event during follow up. In the analysis, Sjölander (2011) and Sjölander and Vansteelandt (2011) controlled for age and physical activity at baseline, where the latter was obtained from the questionnaire and summarized into a scalar integer ranging from 0 (minimal level of activity) to 10 (maximal level of activity).

We used the same data as Sjölander (2011) and Sjölander and Vansteelandt (2011), but considered physical activity as the exposure and BMI as the mediator. Specifically, we wished to estimate the fraction of CVD cases that can be attributed to a direct effect of physical inactivity on CVD, and the fraction of CVD cases that can be attributed to an indirect effect, mediated through overweight. As attributable fractions require the exposure to be binary, we defined a binary exposure A as the indicator of having a physical activity level below a fixed threshold k. Below, we present results for k=1,2,3. With this dichotomization of physical activity, the attributable fractions should be interpreted as fractions of CVD cases that would be prevented, if the distribution of physical activity levels among those who are factually inactive (activity level below k) would be mutilated to resemble the distribution of physical activity levels among those that are factually active (activity level k or above). Since our proposed estimation methods do not require the mediator to be binary, we used BMI as a continuous mediator M. As in Sjölander (2011) and Sjölander and Vansteelandt (2011) we defined the binary outcome Y as the indicator of any CVD event during follow up.

In all analysis we controlled for the potential confounders C = (age, sex). Arguably, there could be other important confounders for both the exposure-outcome relationship and the mediator-outcome relationship, such as socio-economic status and ethnicity. Thus, the estimated AFs should be taken with a pinch of salt, as these may suffer from moderate-to-severe bias due to violations of assumptions (5) and (6).

For each of the thresholds k=1,2,3, we calculated the ML and DR estimates of NDAF, NIAF and AF=NDAF+NIAF, as described in Section 3.2, using logistic regression models for p(Y|A,M,C), p(A|M,C), p(Y|A,C) and p(A|C). Most real studies use models with only main effects. To mimic this common practice we only included main effects of physical activity, BMI, age and sex in the logistic regression models. A possible concern is that the ML estimator may suffer from bias due to model misspecification, if the data generating mechanism includes, for instance, interaction terms and non-linearities. The DR estimator offers a simple way to partly protect against this type of bias.
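As an illustration, a call of the mediationAF function from Appendix D along these lines could look as follows; the data frame nmc and the variable names cvd, pa.binary, bmi, age and sex are assumptions suggested by the regression outputs in Appendix C:

est <- mediationAF(formulaYAM = cvd ~ pa.binary + bmi + age + sex,  # p(Y|A,M,C)
                   formulaYA  = cvd ~ pa.binary + age + sex,        # p(Y|A,C)
                   formulaAM  = pa.binary ~ bmi + age + sex,        # p(A|M,C)
                   formulaA   = pa.binary ~ age + sex,              # p(A|C)
                   data = nmc, A = "pa.binary", Y = "cvd")

Supplying only formulaYAM and formulaYA gives the ML estimates; supplying all four models gives the DR estimates.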

In Appendix C we provide outputs for all fitted regression models. Table 5 displays the estimates of NDAF, NIAF and AF, together with 95% Wald confidence intervals (estimate ± 1.96 × standard error). The ML estimates for k=1 indicate that about 4.07 and 0.48 out of 1000 CVD cases would be eliminated, through a direct and indirect effect of physical activity on CVD, respectively, if all subjects with activity level below 1 would be forced to an activity level of 1 or above. For k=2 and k=3 the corresponding ML estimates are 10.71 and 2.63, and 22.05 and 5.52, respectively. The DR estimates of both NDAF and NIAF are close to the corresponding ML estimates for all k. This gives some (informal) evidence against severe model misspecification. All estimates are statistically significant, at the 5% significance level, except the DR estimate of NIAF at k=1.

Table 5:

Analysis results for the National March Cohort.

            ML                        DR
            est      95% CI           est      95% CI
k=1
1000×NDAF   4.07     (0.54, 7.61)     4.20     (0.67, 7.74)
1000×NIAF   0.48     (0.25, 0.71)     0.35     (−0.06, 0.77)
1000×AF     4.56     (0.99, 8.12)     4.55     (0.99, 8.12)
k=2
1000×NDAF   10.71    (0.80, 20.62)    11.14    (1.22, 21.07)
1000×NIAF   2.63     (1.68, 3.58)     2.19     (0.92, 3.45)
1000×AF     13.33    (3.45, 23.22)    13.33    (3.46, 23.21)
k=3
1000×NDAF   22.05    (5.73, 38.36)    22.42    (6.08, 38.76)
1000×NIAF   5.52     (3.66, 7.38)     4.97     (2.52, 7.43)
1000×AF     27.56    (11.42, 43.71)   27.39    (11.27, 43.52)

6 Discussion

In this paper we have proposed a decomposition of the AF into a direct component and a mediated component. For a given exposure, outcome and mediator, these components measure the proportion of outcome events that can be attributed to a direct effect and a mediated effect, respectively, of the exposure. We have shown how the direct and mediated AFs can be estimated with ML and DR methods, in cross-sectional, cohort and case-control studies. In simulations we observed that the DR estimators were almost as efficient as the ML estimators. Furthermore, the ML estimators were often biased when the underlying model was incorrectly specified. These results speak in favor of the DR estimators.

We have considered unmatched case-control studies, but the proposed methods may also be extended to matched case-control studies. If ordinary logistic regression is used, and the matching variables are controlled for by explicitly adding them to the model, then the proposed estimators in Section 3.3 apply without modification. To account for the correlation of subjects within matched strata, a minor modification is required for the calculation of standard errors; we explain this in Appendix A. If conditional logistic regression is used, then the ML estimation method from Section 3.3 can be used without modification. However, DR estimation becomes more difficult, since the DR estimation methods developed by Tchetgen Tchetgen, Robins, and Rotnitzky (2010) do not apply to odds ratios estimated with conditional logistic regression models.

A common obstacle for practitioners who wish to use novel methodology is the lack of software implementation. To facilitate the use of our proposed methods we have written two R-functions, which implement the estimators developed in Sections 3.2 and 3.3, respectively. These functions use analytic expressions for all standard errors, as derived in Appendix A, and thus avoid time consuming bootstrap resampling techniques. We provide the code for these functions in Appendix D.

Funding statement: This work was supported by The Swedish Research Council (Grant No. 2016-01267).

Appendix

A Asymptotic distribution of estimators

We derive the asymptotic distribution of the proposed DR estimators; the distribution of the ML estimators can be derived in a similar way.

We first consider the estimators for cross-sectional studies and cohort studies. Assume parametric models p(Y|A,M,C;ν1), p(A|M,C;ν2), p(Y|A,C;ν3) and p(A|C;ν4), and let Si(ν1), Si(ν2), Si(ν3) and Si(ν4) be the contributions from subject i to the ML score functions for these models, respectively. Define ν=(ν1T,ν2T,ν3T,ν4T)T, and let Si(ν) be the estimating function obtained by stacking Si(ν1), Si(ν2), Si(ν3) and Si(ν4). Define ξ=(π,ψ,ϕ,νT)T. The DR estimators of ψ and ϕ, proposed in Section 3.2, are obtained by solving the equation system

(15) Σi=1n Ui(ξ) = Σi=1n ( Yi − π, Hi,ψ(ψ,ν), Hi,ϕ(ϕ,ν), Si(ν) )T = 0,

where

Hi,ψ(ψ,ν) = p(Y=1|A=0,Mi,Ci;ν1) + (1−Ai) p−1(Ai|Mi,Ci;ν2) {Yi − p(Y=1|A=0,Mi,Ci;ν1)} − ψ

and

Hi,ϕ(ϕ,ν) = p(Y=1|A=0,Ci;ν3) + (1−Ai) p−1(Ai|Ci;ν4) {Yi − p(Y=1|A=0,Ci;ν3)} − ϕ.

It follows from more general results in Robins (2000) that E{Hψ(ψ,ν)}=0 if either model p(Y|A,M,C;ν1) or model p(A|M,C;ν2) is correct, which implies that the estimator of ψ is consistent if either of these models is correct. Similarly, it follows from more general results in Robins (2000) that E{Hϕ(ϕ,ν)}=0 if either model p(Y|A,C;ν3) or model p(A|C;ν4) is correct, which implies that the estimator of ϕ is consistent if either of these models is correct. Let ξˆ be the solution to eq. (15) and let ξ be the solution to E{U(ξ)}=0. From the theory for M-estimators (Stefanski and Boos 2002) it follows that n1/2(ξˆ − ξ) has an asymptotic normal distribution, with mean 0 and variance-covariance matrix

Σ(ξ) = E{∂U(ξ)/∂ξT}−1 var{U(ξ)} [E{∂U(ξ)/∂ξT}−1]T.

Define

(16) λ(ξ) = ( (π−ψ)/π, (ψ−ϕ)/π, (π−ϕ)/π )T.

Using the delta method we have that n1/2{λ(ξˆ) − λ(ξ)} has an asymptotic normal distribution, with mean 0 and variance-covariance matrix

(17) ∂λ(ξ)/∂ξT Σ(ξ) {∂λ(ξ)/∂ξT}T.

A consistent estimate of the variance of λ(ξˆ) is obtained by replacing the true value of ξ with ξˆ, and the population moments with their sample counterparts.

We next consider case-control studies. Assume parametric models p(Y|A,M,C;θ,ν1) and p(A|Y,M,C;θ,ν2), which are constrained to have the same conditional odds ratio OR(M,C;θ), and parametric models p(Y|A,C;η,ν3) and p(A|Y,C;η,ν4), which are constrained to have the same conditional odds ratio OR(C;η). Let Si(θ,ν1), Si(θ,ν2), Si(η,ν3) and Si(η,ν4) be the contributions from subject i to the ML score functions for these models. Define ν=(θT,ν1T,θT,ν2T,ηT,ν3T,ηT,ν4T)T, and let Si(ν) be the estimating function obtained by stacking Si(θ,ν1), Si(θ,ν2), Si(η,ν3) and Si(η,ν4). Define ξ=(ψ˜,ϕ˜,θ,η,νT)T. The DR estimators of ψ˜ and ϕ˜, proposed in Section 3.3, are obtained by solving the equation system

(18) Σi=1n Ui(ξ) = Σi=1n ( Yi{OR−Ai(Mi,Ci;θ) − ψ˜}, Yi{OR−Ai(Ci;η) − ϕ˜}, Hi,θ(θ,ν), Hi,η(η,ν), Si(ν) )T = 0,

where

Hi,θ(θ,ν) = {Yi − p(Y=1|Ai,Mi,Ci;θ,ν1)} × ∂log{OR(Mi,Ci;θ)}/∂θ × {Ai − dθ(Mi,Ci;θ,ν1,ν2)},

dθ(Mi,Ci;θ,ν1,ν2) = [1 + {p(A=0|Y=0,Mi,Ci;ν2) p(Y=1|A=0,Mi,Ci;ν1)} / {p(A=1|Y=0,Mi,Ci;ν2) p(Y=1|A=1,Mi,Ci;θ,ν1)}]−1,

Hi,η(η,ν) = {Yi − p(Y=1|Ai,Ci;η,ν3)} × ∂log{OR(Ci;η)}/∂η × {Ai − dη(Ci;η,ν3,ν4)},

and

dη(Ci;η,ν3,ν4) = [1 + {p(A=0|Y=0,Ci;ν4) p(Y=1|A=0,Ci;ν3)} / {p(A=1|Y=0,Ci;ν4) p(Y=1|A=1,Ci;η,ν3)}]−1.

It follows from more general results in Tchetgen Tchetgen, Robins, and Rotnitzky (2010) that E{Hθ(θ,ν)}=0 if either model p(Y|A,M,C;θ,ν1) or model p(A|Y,M,C;θ,ν2) is correct, which implies that the estimator of θ, and thus also the estimator of ψ˜, is consistent if either of these models is correct. Similarly, it follows from more general results in Tchetgen Tchetgen, Robins, and Rotnitzky (2010) that E{Hη(η,ν)}=0 if either model p(Y|A,C;η,ν3) or model p(A|Y,C;η,ν4) is correct, which implies that the estimator of η, and thus also the estimator of ϕ˜, is consistent if either of these models is correct. Let ξˆ be the solution to eq. (18) and let ξ be the solution to E{U(ξ)}=0. From the theory for M-estimators (Stefanski and Boos 2002) it follows that n1/2(ξˆ − ξ) has an asymptotic normal distribution, with mean 0 and variance-covariance matrix

Σ(ξ) = E{∂U(ξ)/∂ξT}−1 var{U(ξ)} [E{∂U(ξ)/∂ξT}−1]T.

Define

(19) λ(ξ) = ( 1 − ψ˜, ψ˜ − ϕ˜, 1 − ϕ˜ )T.

Using the delta method we have that n1/2{λ(ξˆ) − λ(ξ)} has an asymptotic normal distribution, with mean 0 and variance-covariance matrix

(20) ∂λ(ξ)/∂ξT Σ(ξ) {∂λ(ξ)/∂ξT}T.

A consistent estimate of the variance of λ(ξˆ) is obtained by replacing the true value of ξ with ξˆ, and the population moments with their sample counterparts.

Sometimes, data consist of clusters in which subjects are correlated. This occurs, for instance, in family studies and in matched studies. A simple way to account for this correlation is to let n be the total number of clusters, and redefine Ui(ξ) in eqs. (15) and (18) as the sum of the subject-specific contributions to the estimating equation within cluster i. Provided that the clusters are independent, the theory for M-estimators (Stefanski and Boos 2002) still applies after this minor modification.

B Additional simulation results

B.1 Common outcome and skewed mediator

In this simulation we generated 10,000 samples, of n=1000 iid observations each, from the model

(21) C1 ∼ N(0,1)
     C2 ∼ Ber(0.5)
     A ∼ Ber{expit(−2 + C1 + C2 − 2C1C2)}
     M ∼ Exp(e^(A+C1+C2))
     Y ∼ Ber{expit(−2 + A + 0.5M − C1 − C2 + 2C1C2)}

For this model we have that p(Y=1)=0.19, which makes the outcome much more common than in Section 4.1. Furthermore, by generating the mediator from an exponential distribution we ensure that we are far from the scenario considered by VanderWeele and Vansteelandt (2010). The true values of NDAF, NIAF and AF are 0.12, −0.03 and 0.09, respectively.

We analyzed the samples in the same way as in Section 4.1. Table 6 displays the results. We observe the same pattern as in Section 4.1; misspecification of the linear predictor gives substantial bias, whereas misspecification of the functional form does not seem to have any major impact.

Table 6:

Simulation results for cross-sectional/cohort studies. The table reports mean estimates (est), mean standard errors (se) and standard deviations (sd) of the estimates, over 10,000 simulated samples. The models for p(Y|A,M,C) and p(A|C) have correct functional form in all scenarios. The models for p(Y|A,C) and p(A|M,C) have incorrect functional forms in all scenarios. The models for p(A|M,C) and p(A|C) have correct linear predictors in scenarios I and II, and incorrect linear predictors in Scenarios III and IV. The models for p(Y|A,M,C) and p(Y|A,C) have correct linear predictors in scenarios I and III, and incorrect linear predictors in Scenarios II and IV. True values are NDAF=0.12, NIAF=−0.03 and AF=0.09. Mean estimates marked with ‘*’ are expected to be close to the true values, according to theory (see Table 1).

       ML                      DR
       est     s.e.    s.d.    est     s.e.    s.d.
I
NDAF   0.12    0.03    0.03    0.12    0.03    0.03
NIAF   −0.02   0.01    0.01    −0.03   0.02    0.02
AF     0.09    0.03    0.03    0.08    0.04    0.04
II
NDAF   0.06    0.03    0.03    0.12    0.04    0.03
NIAF   −0.10   0.01    0.01    −0.03   0.03    0.02
AF     −0.04   0.04    0.04    0.08    0.05    0.04
III
NDAF   0.12    0.03    0.03    0.12    0.03    0.03
NIAF   −0.02   0.01    0.01    −0.02   0.01    0.01
AF     0.09    0.03    0.03    0.09    0.03    0.03
IV
NDAF   0.06    0.03    0.03    0.06    0.04    0.04
NIAF   −0.10   0.01    0.01    −0.09   0.02    0.02
AF     −0.04   0.04    0.04    −0.03   0.04    0.04

B.2 Exposure-mediator interactions

In this simulation we generated 10,000 samples, of n=1000 iid observations each, from the model

(22) C1 ∼ N(0,1)
     C2 ∼ Ber(0.5)
     A ∼ Ber{expit(−2 + C1 + C2 − 2C1C2)}
     M ∼ N(A + C1 + C2, 1)
     Y ∼ Ber{expit(−5 + A + 0.5M − C1 − C2 + 2C1C2 + A×M)}

For this model we have that p(Y=1)=0.06, NDAF=0.67, NIAF=0.02 and AF=0.69.

We analyzed the samples in the same way as in Section 4.1, with the exception that an interaction term was added for A and M in the model for p(Y|A,M,C). Table 7 displays the results. Again, we observe the same pattern as in Section 4.1; misspecification of the linear predictor gives substantial bias, whereas misspecification of the functional form does not seem to have any major impact.

Table 7:

Simulation results for cross-sectional/cohort studies. The table reports mean estimates (est), mean standard errors (se) and standard deviations (sd) of the estimates, over 10,000 simulated samples. The models for p(Y|A,M,C) and p(A|C) have correct functional form in all scenarios. The models for p(Y|A,C) and p(A|M,C) have incorrect functional forms in all scenarios. The models for p(A|M,C) and p(A|C) have correct linear predictors in scenarios I and II, and incorrect linear predictors in Scenarios III and IV. The models for p(Y|A,M,C) and p(Y|A,C) have correct linear predictors in scenarios I and III, and incorrect linear predictors in Scenarios II and IV. True values are NDAF=0.67, NIAF=0.02 and AF=0.69. Mean estimates marked with ‘*’ are expected to be close to the true values, according to theory (see Table 1).

       ML                     DR
       est    s.e.    s.d.    est    s.e.    s.d.
I
NDAF   0.67   0.07    0.07    0.67   0.08    0.08
NIAF   0.02   0.01    0.02    0.02   0.03    0.04
AF     0.69   0.07    0.06    0.69   0.07    0.06
II
NDAF   0.64   0.08    0.08    0.67   0.08    0.08
NIAF   0.04   0.02    0.02    0.02   0.03    0.04
AF     0.68   0.07    0.07    0.69   0.07    0.06
III
NDAF   0.67   0.07    0.07    0.67   0.08    0.08
NIAF   0.02   0.01    0.02    0.02   0.03    0.05
AF     0.69   0.07    0.06    0.69   0.07    0.06
IV
NDAF   0.64   0.08    0.08    0.65   0.08    0.09
NIAF   0.04   0.02    0.02    0.03   0.04    0.05
AF     0.68   0.07    0.07    0.68   0.07    0.07

C Regression outputs for the applied example

C.1 Threshold k=1

Table 8:

The logistic regression model for p(Y=1|A,M,C).

             Estimate    Std. Error   z value   Pr(>|z|)
(Intercept)  −9.959659   0.226483     −43.975   < 2e-16 ***
pa.binary     0.622893   0.230872       2.698    0.00698 **
bmi           0.046875   0.006499       7.213   5.47e-13 ***
age           0.093932   0.002138      43.935   < 2e-16 ***
sexmale       0.855787   0.045208      18.930   < 2e-16 ***
Table 9:

The logistic regression model for p(Y=1|A,C).

             Estimate    Std. Error   z value   Pr(>|z|)
(Intercept)  −8.726691   0.141025     −61.880   < 2e-16 ***
pa.binary     0.719856   0.230968       3.117    0.00183 **
age           0.093035   0.002103      44.234   < 2e-16 ***
sexmale       0.867404   0.045194      19.193   < 2e-16 ***
Table 10:

The logistic regression model for p(A=1|M,C).

             Estimate    Std. Error   z value   Pr(>|z|)
(Intercept)  −7.295189   0.358801     −20.332   < 2e-16 ***
bmi           0.111044   0.012622       8.798   < 2e-16 ***
age          −0.017876   0.003662      −4.882   1.05e-06 ***
sexmale       0.780565   0.122179       6.389   1.67e-10 ***
Table 11:

The logistic regression model for p(A=1|C).

             Estimate    Std. Error   z value   Pr(>|z|)
(Intercept)  −4.76036    0.18398      −25.874   < 2e-16 ***
age          −0.01263    0.00349       −3.618   0.000296 ***
sexmale       0.81437    0.12205        6.672   2.52e-11 ***

C.2 Threshold k=2

Table 12:

The logistic regression model for p(Y=1|A,M,C).

             Estimate    Std. Error   z value   Pr(>|z|)
(Intercept)  −9.969788   0.226375     −44.041   < 2e-16 ***
pa.binary     0.198366   0.088017       2.254     0.0242 *
bmi           0.046479   0.006510       7.140   9.33e-13 ***
age           0.094103   0.002142      43.934   < 2e-16 ***
sexmale       0.857242   0.045197      18.967   < 2e-16 ***
Table 13:

The logistic regression model for p(Y=1|A,C).

             Estimate    Std. Error   z value   Pr(>|z|)
(Intercept)  −8.753871   0.141819     −61.726   < 2e-16 ***
pa.binary     0.252032   0.087708       2.874    0.00406 **
age           0.093275   0.002108      44.243   < 2e-16 ***
sexmale       0.868590   0.045184      19.224   < 2e-16 ***
Table 14:

The logistic regression model for p(A=1|M,C).

             Estimate    Std. Error   z value   Pr(>|z|)
(Intercept)  −3.990596   0.125507     −31.796   < 2e-16 ***
bmi           0.079951   0.004814      16.609   < 2e-16 ***
age          −0.012737   0.001179     −10.800   < 2e-16 ***
sexmale       0.170377   0.039355       4.329    1.5e-05 ***
Table 15:

The logistic regression model for p(A=1|C).

             Estimate    Std. Error   z value   Pr(>|z|)
(Intercept)  −2.203133   0.058720     −37.520   < 2e-16 ***
age          −0.008853   0.001129      −7.838   4.56e-15 ***
sexmale       0.203585   0.039221       5.191   2.10e-07 ***

C.3 Threshold k=3

Table 16:

The logistic regression model for p(Y=1|A,M,C).

             Estimate    Std. Error   z value   Pr(>|z|)
(Intercept)  −9.966975   0.226333     −44.037   < 2e-16 ***
pa.binary     0.165470   0.059751       2.769    0.00562 **
bmi           0.045779   0.006527       7.014   2.32e-12 ***
age           0.094108   0.002139      43.990   < 2e-16 ***
sexmale       0.858640   0.045199      18.997   < 2e-16 ***
Table 17:

The logistic regression model for p(Y=1|A,C).

             Estimate    Std. Error   z value   Pr(>|z|)
(Intercept)  −8.774193   0.142165     −61.72    < 2e-16 ***
pa.binary     0.210291   0.059397       3.54    0.000399 ***
age           0.093307   0.002106      44.30    < 2e-16 ***
sexmale       0.869998   0.045184      19.25    < 2e-16 ***
Table 18:

The logistic regression model for p(A=1|M,C).

             Estimate     Std. Error   z value   Pr(>|z|)
(Intercept)  −2.9506692   0.0902247    −32.704   < 2e-16 ***
bmi           0.0721846   0.0035431     20.373   < 2e-16 ***
age          −0.0093996   0.0008165    −11.512   < 2e-16 ***
sexmale       0.0994121   0.0273967      3.629   0.000285 ***
Table 19:

The logistic regression model for p(A=1|C).

             Estimate     Std. Error   z value   Pr(>|z|)
(Intercept)  −1.3447446   0.0412541    −32.597   < 2e-16 ***
age          −0.0059798   0.0007839     −7.629   2.37e-14 ***
sexmale       0.1319002   0.0272459      4.841   1.29e-06 ***

D R-code

The code below requires a not yet released version of the package drgee, which contains functions that were written for this paper specifically. A binary version of this package for windows is available as supplementary material online for this paper.

The code contains two functions; mediationAF and mediationAF.casectrl. The function mediationAF implements the methods proposed in Section 3.2 and the function mediationAF.casectrl implements the methods proposed in Section 3.3.

The function mediationAF has arguments formulaYAM, formulaYA, formulaAM, formulaA, data, A and Y. The arguments formulaYAM, formulaYA, formulaAM and formulaA specify logistic regression models for p(Y=1|A,M,C), p(Y=1|A,C), p(A=1|M,C) and p(A=1|C), respectively, on the usual R-format. If only formulaYAM and formulaYA are given, then ML estimation is performed. If additionally formulaAM and formulaA are given, then DR estimation is performed. The argument data is the data frame containing the variables in the models. The arguments A and Y are the names of the exposure and outcome, respectively, as strings.

The function mediationAF.casectrl has arguments formulaYAM, formulaAYM, formulaYA, formulaAY, iaformulaM, iaformula, data, A and Y. The arguments formulaYAM, formulaAYM, formulaYA and formulaAY specify logistic regression models for p(Y=1|A=0,M,C), p(A=1|Y=0,M,C), p(Y=1|A=0,C) and p(A=1|Y=0,C), respectively, on the usual R-format. If only formulaYAM and formulaYA are given, then ML estimation is performed. If additionally formulaAYM and formulaAY are given, then DR estimation is performed. The argument iaformulaM specifies the odds ratio OR(M,C). If this argument is not given, then OR(M,C) is assumed constant across levels of M and C. The argument iaformula specifies the odds ratio OR(C). If this argument is not given, then OR(C) is assumed constant across levels of C. We refer to the manual for the drgee package for further details on how to interpret and specify the arguments iaformulaM and iaformula. The arguments data, A and Y are similar to those for the function mediationAF.

library(drgee)                mediationAF <- function(formulaYAM, formulaYA, formulaAM, formulaA, data, A, Y){                    n <- nrow(data)    data0 <- data    data0[, A] <- 0    p <- mean(data[, Y])    Up <- data[, Y] - p    dUp <- -1    #ML for mu1 = E{p(Y=1|A=0,M,C)}    if(!missing(formulaYAM) & missing(formulaAM)){        fitYAM <- glm(formula=formulaYAM, family="binomial", data=data)        pred <- predict(object=fitYAM, newdata=data0, type="response")        mu1 <- mean(pred)        Umu1 <- pred - mu1        UbetaYAM <- residuals(object=fitYAM, type="response") *          model.matrix(object=formulaYAM, data=data)        U1 <- cbind(Umu1, UbetaYAM)        dUmu1.dmu1 <- -1        g <- family(fitYAM)$mu.eta        dm.deta <- g(predict(object=fitYAM, newdata=data0))        deta.dbeta <- model.matrix(object=formulaYAM, data=data0)        dm.dbeta <- dm.deta * deta.dbeta        dUmu1.dbetaYAM <- colMeans(dm.dbeta)        dUbetaYAM.dbetaYAM <- -solve(summary(object=fitYAM)$cov.unscaled) / n        dUmu1 <- c(dUmu1.dmu1, dUmu1.dbetaYAM)        dUbetaYAM <- cbind(0, dUbetaYAM.dbetaYAM)        dU1 <- rbind(dUmu1, dUbetaYAM)     }                    #DR for mu1 = E{p(Y=1|A=0,M,C)}    if(!missing(formulaYAM) & !missing(formulaAM)){        fitYAM <- glm(formula=formulaYAM, family="binomial", data=data)        fitAM <- glm(formula=formulaAM, family="binomial", data=data)        pred <- predict(object=fitYAM, newdata=data0, type="response")        p0 <- 1-fitAM$fitted.values        w <- (1-data[, A]) / p0        mu1 <- mean(pred + w*(data[, Y] - pred))        Umu1 <- pred + w*(data[, Y] - pred) - mu1        UbetaYAM <- residuals(object=fitYAM, type="response") *          model.matrix(object=formulaYAM, data=data)        UbetaAM <- residuals(object=fitAM, type="response") *          model.matrix(object=formulaAM, data=data)        U1 <- cbind(Umu1, UbetaYAM, UbetaAM)        dUmu1.dmu1 <- -1        g <- family(fitYAM)$mu.eta        dm.deta <- g(predict(object=fitYAM, newdata=data0))        deta.dbeta <- model.matrix(object=formulaYAM, data=data0)        dm.dbeta <- dm.deta * deta.dbeta        dUmu1.dbetaYAM <- colMeans(dm.dbeta * (1-w))        g <- family(fitAM)$mu.eta        dm.deta <- g(fitAM$linear.predictors)        deta.dbeta <- model.matrix(object=formulaAM, data=data)        dm.dbeta <- dm.deta * deta.dbeta        dUmu1.dbetaAM <- colMeans(-(1-data[, A]) / p0^2 *          dm.dbeta * (data[, Y] - pred))        dUbetaYAM.dbetaYAM <- -solve(summary(object=fitYAM)$cov.unscaled) / n        dUbetaAM.dbetaAM <- -solve(summary(object=fitAM)$cov.unscaled) / n        dUmu1 <- c(dUmu1.dmu1, dUmu1.dbetaYAM, dUmu1.dbetaAM)        npar.YAM <- length(fitYAM$coef)        npar.AM <- length(fitAM$coef)        dUbetaYAM <- cbind(0, dUbetaYAM.dbetaYAM, matrix(0, nrow=npar.YAM,            ncol=npar.AM))        dUbetaAM <- cbind(matrix(0, nrow=npar.AM, ncol=npar.YAM+1),            dUbetaAM.dbetaAM)        dU1 <- rbind(dUmu1, dUbetaYAM, dUbetaAM)     }                    #ML for mu2 = E{p(Y=1|A=0,C)}    if(!missing(formulaYA) & missing(formulaA)){        fitYA <- glm(formula=formulaYA, family="binomial", data=data)        pred <- predict(object=fitYA, newdata=data0, type="response")        mu2 <- mean(pred)        Umu2 <- pred - mu2        UbetaYA <- residuals(object=fitYA, type="response") *            model.matrix(object=formulaYA, data=data)        U2 <- cbind(Umu2, UbetaYA)        dUmu2.dmu2 <- -1        g <- family(fitYA)$mu.eta        dm.deta <- g(predict(object=fitYA, newdata=data0))        deta.dbeta 
<- model.matrix(object=formulaYA, data=data0)
        dm.dbeta <- dm.deta * deta.dbeta
        dUmu2.dbetaYA <- colMeans(dm.dbeta)
        dUbetaYA.dbetaYA <- -solve(summary(object=fitYA)$cov.unscaled) / n
        dUmu2 <- c(dUmu2.dmu2, dUmu2.dbetaYA)
        dUbetaYA <- cbind(0, dUbetaYA.dbetaYA)
        dU2 <- rbind(dUmu2, dUbetaYA)
    }

    #DR for mu2 = E{p(Y=1|A=0,C)}
    if(!missing(formulaYA) & !missing(formulaA)){
        fitYA <- glm(formula=formulaYA, family="binomial", data=data)
        fitA <- glm(formula=formulaA, family="binomial", data=data)
        pred <- predict(object=fitYA, newdata=data0, type="response")
        p0 <- 1-fitA$fitted.values
        w <- (1-data[, A]) / p0
        mu2 <- mean(pred + w*(data[, Y] - pred))
        Umu2 <- pred + w*(data[, Y] - pred) - mu2
        UbetaYA <- residuals(object=fitYA, type="response") *
            model.matrix(object=formulaYA, data=data)
        UbetaA <- residuals(object=fitA, type="response") *
            model.matrix(object=formulaA, data=data)
        U2 <- cbind(Umu2, UbetaYA, UbetaA)
        dUmu2.dmu2 <- -1
        g <- family(fitYA)$mu.eta
        dm.deta <- g(predict(object=fitYA, newdata=data0))
        deta.dbeta <- model.matrix(object=formulaYA, data=data0)
        dm.dbeta <- dm.deta * deta.dbeta
        dUmu2.dbetaYA <- colMeans(dm.dbeta * (1-w))
        g <- family(fitA)$mu.eta
        dm.deta <- g(fitA$linear.predictors)
        deta.dbeta <- model.matrix(object=formulaA, data=data)
        dm.dbeta <- dm.deta * deta.dbeta
        dUmu2.dbetaA <- colMeans(-(1-data[, A]) / p0^2 *
            dm.dbeta * (data[, Y] - pred))
        dUbetaYA.dbetaYA <- -solve(summary(object=fitYA)$cov.unscaled) / n
        dUbetaA.dbetaA <- -solve(summary(object=fitA)$cov.unscaled) / n
        dUmu2 <- c(dUmu2.dmu2, dUmu2.dbetaYA, dUmu2.dbetaA)
        npar.YA <- length(fitYA$coef)
        npar.A <- length(fitA$coef)
        dUbetaYA <- cbind(0, dUbetaYA.dbetaYA, matrix(0, nrow=npar.YA, ncol=npar.A))
        dUbetaA <- cbind(matrix(0, nrow=npar.A, ncol=npar.YA+1), dUbetaA.dbetaA)
        dU2 <- rbind(dUmu2, dUbetaYA, dUbetaA)
    }

    #bread
    n1 <- ncol(U1)
    n2 <- ncol(U2)
    dUp <- c(dUp, rep(0, n1+n2))
    dU1 <- cbind(0, dU1, matrix(0, nrow=n1, ncol=n2))
    dU2 <- cbind(matrix(0, nrow=n2, ncol=1+n1), dU2)
    I <- rbind(dUp, dU1, dU2)

    #meat
    U <- cbind(Up, U1, U2)
    J <- var(U)

    #variance-covariance matrix
    V <- (solve(I) %*% J %*% t(solve(I)) / n)[c(1,2,n1+2), c(1,2,n1+2)]

    #af
    ndaf <- (p - mu1) / p
    niaf <- (mu1 - mu2) / p
    af <- ndaf+niaf
    dndaf <- c(mu1/p^2, -1/p, 0)
    dniaf <- c(-(mu1-mu2)/p^2, 1/p, -1/p)
    daf <- c(mu2/p^2, 0, -1/p)
    d <- cbind(dndaf, dniaf, daf)
    vcov <- t(d) %*% V %*% d

    #output
    out <- list(est=c(ndaf, niaf, af), vcov=vcov)
    class(out) <- "medAF"
    return(out)
}

mediationAF.casectrl <- function(formulaYAM, formulaAYM, formulaYA, formulaAY,
    iaformulaM, iaformula, data, A, Y){
    n <- nrow(data)

    #ML for mu1 = E{OR(M,C)^(-A)|Y=1}
    if(!missing(formulaYAM) & missing(formulaAYM)){
        fitYAM <- glm(formula=formulaYAM, family="binomial", data=data)
        data0 <- data
        data0[, A] <- 0
        lp <- predict(object=fitYAM, newdata=data)
        lp0 <- predict(object=fitYAM, newdata=data0)
        logor <- lp - lp0
        mu1 <- sum(data[, Y] * exp(-logor)) / sum(data[, Y])
        Umu1 <- data[, Y] * (exp(-logor) - mu1)
        UbetaYAM <- residuals(object=fitYAM, type="response") *
            model.matrix(object=formulaYAM, data=data)
        U1 <- cbind(Umu1, UbetaYAM)
        dUmu1.dmu1 <- -mean(data[, Y])
        design <- model.matrix(object=formulaYAM, data=data)
        design0 <- model.matrix(object=formulaYAM, data=data0)
        dUmu1.dbetaYAM <- colMeans(-data[, Y] * exp(-logor) * (design - design0))
        dUbetaYAM.dbetaYAM <- -solve(summary(object=fitYAM)$cov.unscaled) / n
        dUmu1 <- c(dUmu1.dmu1, dUmu1.dbetaYAM)
        dUbetaYAM <- cbind(0, dUbetaYAM.dbetaYAM)
        dU1 <- rbind(dUmu1, dUbetaYAM)
    }

    #DR for mu1 = E{OR(M,C)^(-A)|Y=1}
    if(!missing(formulaYAM) & !missing(formulaAYM)){
        if(missing(iaformulaM))
            iaformulaM <- formula(~1)
        fit <- drgee(oformula=formulaYAM, eformula=formulaAYM, iaformula=iaformulaM,
            olink="logit", elink="logit", data=data, estimation.method="dr")
        logor <- colSums(t(fit$drgee.data$ax)*fit$coefficients)
        mu1 <- sum(data[, Y] * exp(-logor)) / sum(data[, Y])
        Umu1 <- data[, Y] * (exp(-logor) - mu1)
        Ubeta <- fit$U
        U1 <- cbind(Umu1, Ubeta)
        dUmu1.dmu1 <- -mean(data[, Y])
        dUmu1.dbetaDR <- colMeans(-data[, Y] * exp(-logor) * fit$drgee.data$ax)
        dUmu1 <- c(dUmu1.dmu1, dUmu1.dbetaDR,
            rep(0, length(fit$coefficients.all) - length(fit$coefficients)))
        dUbeta <- cbind(0, fit$d.U.sum / n)
        dU1 <- rbind(dUmu1, dUbeta)
    }

    #ML for mu2 = E{OR(C)^(-A)|Y=1}
    if(!missing(formulaYA) & missing(formulaAY)){
        fitYA <- glm(formula=formulaYA, family="binomial", data=data)
        data0 <- data
        data0[, A] <- 0
        lp <- predict(object=fitYA, newdata=data)
        lp0 <- predict(object=fitYA, newdata=data0)
        logor <- lp - lp0
        mu2 <- sum(data[, Y] * exp(-logor)) / sum(data[, Y])
        Umu2 <- data[, Y] * (exp(-logor) - mu2)
        UbetaYA <- residuals(object=fitYA, type="response") *
            model.matrix(object=formulaYA, data=data)
        U2 <- cbind(Umu2, UbetaYA)
        dUmu2.dmu2 <- -mean(data[, Y])
        design <- model.matrix(object=formulaYA, data=data)
        design0 <- model.matrix(object=formulaYA, data=data0)
        dUmu2.dbetaYA <- colMeans(-data[, Y] * exp(-logor) * (design - design0))
        dUbetaYA.dbetaYA <- -solve(summary(object=fitYA)$cov.unscaled) / n
        dUmu2 <- c(dUmu2.dmu2, dUmu2.dbetaYA)
        dUbetaYA <- cbind(0, dUbetaYA.dbetaYA)
        dU2 <- rbind(dUmu2, dUbetaYA)
    }

    #DR for mu2 = E{OR(C)^(-A)|Y=1}
    if(!missing(formulaYA) & !missing(formulaAY)){
        if(missing(iaformula))
            iaformula <- formula(~1)
        fit <- drgee(oformula=formulaYA, eformula=formulaAY, iaformula=iaformula,
            olink="logit", elink="logit", data=data, estimation.method="dr")
        logor <- colSums(t(fit$drgee.data$ax)*fit$coefficients)
        mu2 <- sum(data[, Y] * exp(-logor)) / sum(data[, Y])
        Umu2 <- data[, Y] * (exp(-logor) - mu2)
        Ubeta <- fit$U
        U2 <- cbind(Umu2, Ubeta)
        dUmu2.dmu2 <- -mean(data[, Y])
        dUmu2.dbetaDR <- colMeans(-data[, Y] * exp(-logor) * fit$drgee.data$ax)
        dUmu2 <- c(dUmu2.dmu2, dUmu2.dbetaDR,
            rep(0, length(fit$coefficients.all) - length(fit$coefficients)))
        dUbeta <- cbind(0, fit$d.U.sum / n)
        dU2 <- rbind(dUmu2, dUbeta)
    }

    #bread
    n1 <- ncol(U1)
    n2 <- ncol(U2)
    dU1 <- cbind(dU1, matrix(0, nrow=n1, ncol=n2))
    dU2 <- cbind(matrix(0, nrow=n2, ncol=n1), dU2)
    I <- rbind(dU1, dU2)

    #meat
    U <- cbind(U1, U2)
    J <- var(U)

    #variance-covariance matrix
    V <- (solve(I) %*% J %*% t(solve(I)) / n)[c(1,n1+1), c(1,n1+1)]

    #af
    ndaf <- 1 - mu1
    niaf <- mu1 - mu2
    af <- ndaf+niaf
    dndaf <- c(-1, 0)
    dniaf <- c(1, -1)
    daf <- c(0, -1)
    d <- cbind(dndaf, dniaf, daf)
    vcov <- t(d) %*% V %*% d

    #output
    out <- list(est=c(ndaf, niaf, af), vcov=vcov)
    class(out) <- "medAF"
    return(out)
}
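The following is a minimal usage sketch of mediationAF.casectrl, intended only as illustration; the data frame d and the variable names cvd, inactive, overweight and age are hypothetical and not part of the study data. Supplying only the two outcome formulas, as below, invokes the maximum likelihood branches of the function.

#Illustrative example (hypothetical data): case-control data frame "d" with
#binary outcome "cvd", binary exposure "inactive", binary mediator
#"overweight" and a confounder "age"
fit <- mediationAF.casectrl(formulaYAM=cvd~inactive*overweight+age,
    formulaYA=cvd~inactive+age, data=d, A="inactive", Y="cvd")
fit$est                 #direct, mediated and total AF
sqrt(diag(fit$vcov))    #corresponding standard errors

Supplying formulaAYM and formulaAY as well would instead invoke the doubly robust branches, which additionally require the drgee package.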

References

Bruzzi, P., Green, S., Byar, D., Brinton, L., Schairer, C. (1985). Estimating the population attributable risk for multiple risk factors using case-control data. American Journal of Epidemiology, 122:904–914. doi:10.1093/oxfordjournals.aje.a114174

Cai, Z., Kuroki, M., Pearl, J., Tian, J. (2008). Bounds on direct effects in the presence of confounded intermediate variables. Biometrics, 64:695–701. doi:10.1111/j.1541-0420.2007.00949.x

Deubner, D., Wilkinson, W., Helms, M., Herman, T., Curtis, G. (1980). Logistic model estimation of death attributable to risk factors for cardiovascular disease in Evans County, Georgia. American Journal of Epidemiology, 112:135–143. doi:10.1093/oxfordjournals.aje.a112963

Eide, G., Gefeller, O. (1995). Sequential and average attributable fractions as aids in the selection of preventive strategies. Journal of Clinical Epidemiology, 48:645–655. doi:10.1016/0895-4356(94)00161-I

Greenland, S., Drescher, K. (1993). Maximum likelihood estimation of the attributable fraction from logistic models. Biometrics, 49:865–872. doi:10.2307/2532206

Greenland, S., Robins, J., Pearl, J. (1999). Confounding and collapsibility in causal inference. Statistical Science, 14:29–46. doi:10.1214/ss/1009211805

Levin, M. (1953). The occurrence of lung cancer in man. Acta Unio Internationalis Contra Cancrum, 9:531–541.

Mendis, S., Puska, P., Norrving, B., et al. (2011). Global Atlas on Cardiovascular Disease Prevention and Control. Geneva: World Health Organization.

Miettinen, O. (1974). Proportion of disease caused or prevented by a given exposure, trait or intervention. American Journal of Epidemiology, 99:325–332. doi:10.1093/oxfordjournals.aje.a121617

Pearl, J. (2001). Direct and indirect effects. In: Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, 411–420. Morgan Kaufmann Publishers Inc.

Pearl, J. (2009). Causality: Models, Reasoning, and Inference, 2nd Edition. New York: Cambridge University Press. doi:10.1017/CBO9780511803161

Prentice, R., Pyke, R. (1979). Logistic disease incidence models and case-control studies. Biometrika, 66:403–411. doi:10.1093/biomet/66.3.403

Robins, J. (2000). Robust estimation in sequentially ignorable missing data and causal inference models. In: Proceedings of the American Statistical Association, 1999, 6–10.

Robins, J., Greenland, S. (1992). Identifiability and exchangeability for direct and indirect effects. Epidemiology, 3:143–155. doi:10.1097/00001648-199203000-00013

Rubin, D. (2004). Direct and indirect causal effects via potential outcomes. Scandinavian Journal of Statistics, 31:161–170. doi:10.1111/j.1467-9469.2004.02-123.x

Shpitser, I., Tchetgen Tchetgen, E. (2016). Causal inference with a graphical hierarchy of interventions. The Annals of Statistics, 44:2433–2466. doi:10.1214/15-AOS1411

Shpitser, I., VanderWeele, T. (2011). A complete graphical criterion for the adjustment formula in mediation analysis. The International Journal of Biostatistics, 7:1–24. doi:10.2202/1557-4679.1297

Sjölander, A. (2009). Bounds on natural direct effects in the presence of confounded intermediate variables. Statistics in Medicine, 28:558–571. doi:10.1002/sim.3493

Sjölander, A. (2011). Estimation of attributable fractions using inverse probability weighting. Statistical Methods in Medical Research, 20:415–428. doi:10.1177/0962280209349880

Sjölander, A., Vansteelandt, S. (2011). Doubly robust estimation of attributable fractions. Biostatistics, 12:112–121. doi:10.1093/biostatistics/kxq049

Stefanski, L., Boos, D. (2002). The calculus of M-estimation. The American Statistician, 56:29–38. doi:10.1198/000313002753631330

Sturmans, F., Mulder, P., Valkenburg, H. (1977). Estimation of the possible effect of interventive measures in the area of ischemic heart diseases by the attributable risk percentage. American Journal of Epidemiology, 105:281–289. doi:10.1093/oxfordjournals.aje.a112384

Tchetgen Tchetgen, E., Robins, J., Rotnitzky, A. (2010). On doubly robust estimation in a semiparametric odds ratio model. Biometrika, 97:171–180. doi:10.1093/biomet/asp062

VanderWeele, T. (2015). Explanation in Causal Inference: Methods for Mediation and Interaction. Oxford University Press.

VanderWeele, T., Vansteelandt, S. (2010). Odds ratios for mediation analysis for a dichotomous outcome. American Journal of Epidemiology, 172:1339–1348. doi:10.1093/aje/kwq332

Vansteelandt, S. (2009). Estimating direct effects in cohort and case-control studies. Epidemiology, 20:851–860. doi:10.1097/EDE.0b013e3181b6f4c9

Received: 2017-07-29
Revised: 2017-12-21
Accepted: 2017-12-21
Published Online: 2018-04-13

© 2018 Walter de Gruyter GmbH, Berlin/Boston
