Article Open Access

Potential outcome and decision theoretic foundations for statistical causality

  • Thomas S. Richardson and James M. Robins
Published/Copyright: October 25, 2023

Abstract

In a recent work published in this journal, Philip Dawid has described a graphical causal model based on decision diagrams. This article describes how single-world intervention graphs (SWIGs) relate to these diagrams. In this way, a correspondence is established between Dawid's approach and those based on potential outcomes, such as Robins' finest fully randomized causally interpreted structured tree graphs. In more detail, a reformulation of Dawid's theory is given that is essentially equivalent to his proposal and isomorphic to SWIGs.

MSC 2010: 62A01; 62D20; 62H22

1 Introduction

In his recent article, Decision Theoretic Foundations for Causality, Philip Dawid elaborates on an earlier theory that he advanced previously [1]. We welcome Dawid’s efforts to build a foundation for causal models that aims to develop a graphical framework, while placing an emphasis on making assumptions that are both transparent and testable. Similar concerns have also motivated much of our previous work on potential outcome models represented in terms of finest fully randomized causally interpreted structured tree graphs (FFRCISTGs) [2] and single-world intervention graphs (SWIGs) [3].

Indeed, like Dawid, we have argued that, in contrast, the assumption of independent errors that is typically adopted by users of Pearl’s non-parametric structural equations (also called structural causal models) is untestable and also imposes (superexponentially) many assumptions that are unnecessary for most purposes; furthermore, the independent error assumptions allow the identification of causal quantities that cannot be identified via any randomized experiment on the observed variables [4]. Thus, this assumption contradicts the dictum “no causation without manipulation” and severs the connection between experimentation and causal inference that has been central to much of the conceptual progress during the last century. We also note that [5] cites the move to specifying causal models using potential outcomes rather than error terms as underpinning the “credibility revolution” in Econometrics.

In our view, Dawid’s updated theory represents a marked advance on his earlier proposal in that it requires stronger ontological commitments, specifically, the existence of an “intent-to-treat” (ITT) variable, before a model may be called causal. ITT variables are necessary and important in order to encode the notion of ignorability and the effect of treatment on the treated.

In addition, as noted by Dawid, the ITT variables make it possible to connect his approach to that based on potential outcomes[1] and SWIGs. The connection between the two approaches may help to illuminate the strengths and weaknesses of each formalism. We also present a reformulation of Dawid's theory that is essentially equivalent to his proposal and isomorphic to SWIGs.

We thank Philip Dawid for helpful feedback on our article; in particular, for pointing out a significant omission regarding our proposed definition of distributional consistency for SWIGs. We also thank him for his patience regarding the completion of this manuscript.

2 Relating observational and experimental worlds

At a high level, every approach to causal inference relates a model describing a factual, passively observed world to models describing hypothetical "interventional" worlds in which a treatment (or exposure) variable takes on a specific value.

In both the current and previous decision-theoretic conceptions advocated by Dawid, these worlds “exist” at least hypothetically, as different distributions. The relation is then created by the assertion of equalities linking different parts of these distributions. In Dawid’s formalism, the set of distributions is represented using a single kernel object in which non-random regime indicators (also called “policy variables” by [6]) index the different distributions; there is no requirement that these distributions live on the same probability space. Dawid encodes the equalities between the observational and interventional worlds via extended conditional independence (ECI) relations, including independence from (and conditional on) regime indicators.

In the standard presentation of the potential outcome approach, random variables corresponding to the outcomes for an individual under all possible interventions[2] are assumed to exist, living on a common probability space. The consistency assumption then serves to construct the factual variables as a deterministic function of the potential outcomes. Owing to the fundamental problem of causal inference, the resulting factual distribution is consistent with many different intervention distributions. However, under additional Markov restrictions on the joint distribution of the potential outcomes, the interventional distribution becomes identified from the joint distribution of the factuals under a positivity assumption. Notwithstanding this, often in practice, data are obtained on a subset of the factual variables in which case some or even all interventional distributions become only partially identified from the available (i.e., the observed) data.
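The construction described in this paragraph can be illustrated with a small simulation (a toy model of our own; the particular distributions are arbitrary): both potential outcomes live on a common probability space, and the factual outcome is then defined from them via consistency.

```python
import random

def draw_unit(rng):
    """Draw one unit: both potential outcomes Y(0), Y(1) exist on a
    common probability space; the factual outcome Y is a deterministic
    function of them (toy illustration, arbitrary distributions)."""
    u = rng.random()                      # unobserved individual "type"
    y0, y1 = int(u > 0.7), int(u > 0.3)   # potential outcomes Y(0), Y(1)
    a = int(rng.random() < 0.5)           # randomized treatment A
    y = y1 if a == 1 else y0              # consistency: Y = Y(A)
    return a, y0, y1, y

rng = random.Random(0)
units = [draw_unit(rng) for _ in range(10_000)]
# Only (A, Y) is factual; Y(1 - A) is never observed for any unit --
# the "fundamental problem of causal inference" noted in the text.
```

The analyst sees only the pairs (A, Y); many joint laws for (Y(0), Y(1)) are compatible with that factual distribution, which is the partial-identification point made above.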

3 SWIGs

The SWIG approach is designed to provide a simple way to relate graphs representing joint distributions over the observed variables and those representing joint distributions over potential outcomes. The approach is “single world” in that each of the constraints defining the model concerns a set of potential outcomes corresponding to a single joint intervention on the target variables.[3]

Following [2,3], we will assume throughout that there is a set of variables indexed by V = {1, …, p} and that a pre-specified (possibly strict) subset A ⊆ V of these variables are targets for intervention. Often, with a slight abuse of notation, we will also refer to the corresponding sets of random variables as V and A, respectively.

However, for proofs and formal statements, it is sometimes necessary to distinguish between the random variables and the sets that index them. For this purpose, we introduce the following notation: we define X_B ≔ {X_i : i ∈ B} for B ⊆ V, so that the complete set of factual variables is X_V and the subset that are targets for intervention is X_A. We use 𝒳_i to denote the state space for the variable X_i, and we let 𝒳_V ≔ ×_{i∈V} 𝒳_i and 𝒳_A ≔ ×_{i∈A} 𝒳_i be the state spaces for the variables with indices in V and A, respectively. Similarly, given an assignment x_V to the variables (with indices) in V, we let x_i and x_B refer to the values assigned to X_i and to the set X_B. We also make use of the usual shorthand, using, for example, A_i to refer to X_{A_i}, A for X_A, and a_i to denote x_{A_i}.

Definition 1

Given a directed acyclic graph (DAG) G with vertex set V, the SWIG G(a) corresponding to an intervention that sets the variables in A = {A_1, …, A_k} ⊆ V to a = (a_1, …, a_k) ∈ 𝒳_A is constructed as follows:

  1. Every vertex A_i ∈ A is split into two halves, a "random half" and a "fixed half."

  2. The random half contains A_i and inherits all of the edges directed into A_i in the original graph.

  3. The fixed half inherits all of the edges directed out of A_i in the original graph and is labeled with the value a_i.

  4. Random vertices in the resulting graph are then re-labeled according to one of the schemes described below.

There are three labeling schemes that may be employed in step (4):

  1. Uniform labeling: Every random vertex Y in the SWIG G(a) is labeled with the full vector Y(a_1, …, a_k).

  2. Temporal labeling: Given a total ordering of the vertices in the original graph, each random vertex Y is labeled Y(a_1, …, a_i), with the values corresponding to those vertices A_1, …, A_i that are ordered prior to Y.

  3. Ancestral labeling: Each random vertex Y is labeled Y(a_{an_{G(a)}(Y)}), where a_{an_{G(a)}(Y)} corresponds to those fixed vertices a_i that remain ancestors of Y after splitting the nodes in A.

Temporal labeling may be seen as encoding the assumption that interventions in the future do not affect outcomes in the past. Thus, the potential outcome Y(a_1, …, a_k), in a world in which there is an intervention on A_1, …, A_k, is a function only of those interventions A_1, …, A_i that took place (temporally) before Y, so that Y(a_1, …, a_k) = Y(a_1, …, a_i). This is the natural labeling scheme to apply in contexts where all variables are temporally ordered and missing edges correspond (solely) to the absence of population-level direct effects.

Ancestral labeling encodes the assumption that the potential outcome Y(a_1, …, a_k) is solely a function of those interventions that are (still) causally antecedent to Y in the context of the other interventions being carried out. Thus, for example, in Figure 1(d), the vertex for C is labeled C(b) and not C(a, b) because, after intervention on B, there is no directed path from A to C. This labeling corresponds to interpreting missing edges in the graph as the absence of individual-level direct effects, so that, for example, C(a, b) = C(b) in Figure 1(d). [3, §7] also discusses more general schemes that assume a time order but allow some missing edges to be interpreted at the individual level and others at the population (or distribution) level; in that article, ancestral labeling is termed "minimal labeling."
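The node-splitting construction of Definition 1 can be sketched programmatically. The following is a minimal illustration of our own (not code from the article): a DAG is encoded as a dictionary mapping each node to its set of parents, and the chain A → B → C is used as a stand-in consistent with the discussion of Figure 1; relabeling of the random halves is omitted.

```python
def split_swig(dag, values):
    """Sketch of the node-splitting construction of Definition 1.
    `dag` maps each node to its set of parents; `values` maps each
    intervention target A_i to its assigned value a_i.  Returns the
    SWIG as the same kind of parent map, with fixed halves named
    like "a=0".  (Relabeling of the random halves is omitted.)"""
    fixed = {t: f"{t.lower()}={v}" for t, v in values.items()}
    out = {}
    for node, parents in dag.items():
        # Rule 2: the random half keeps every incoming edge; Rule 3:
        # an edge out of an intervened parent now leaves its fixed half.
        out[node] = {fixed.get(p, p) for p in parents}
    for t in values:
        out[fixed[t]] = set()   # fixed halves have no incoming edges
    return out

# A chain DAG A -> B -> C, intervening to set A = 0 and B = 1:
g = split_swig({"A": set(), "B": {"A"}, "C": {"B"}}, {"A": 0, "B": 1})
assert g == {"A": set(), "B": {"a=0"}, "C": {"b=1"},
             "a=0": set(), "b=1": set()}
```

Note how the random half of B keeps its incoming edge (now from the fixed half a=0), while B's outgoing edge to C now leaves the fixed half b=1, exactly as in rules 2 and 3.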

Figure 1

Illustration of SWIG labeling schemes. (a) DAG G representing the observed joint distribution p(A, B, C); (b) SWIG G(a, b) with uniform labeling; (c) SWIG G(a, b) with temporal labeling; and (d) SWIG G(a, b) with ancestral labeling. (These and other figures were created using the swigs TikZ package, available on CTAN.)

Uniform labeling corresponds to the absence of any assumption regarding equality of potential outcomes (as random variables) across different interventions.[4] In the potential outcome framework, this would often appear somewhat unnatural. However, in this article, we will use this labeling to show that although we may wish to adopt the additional equalities between potential outcomes that are implied by temporal and/or causal relationships, our results do not require them. In addition, SWIGs with this labeling scheme are essentially isomorphic to the augmented decision diagrams proposed in [8]. In particular, note that under the uniform labeling scheme, the sets of random variables appearing in two SWIGs G(a) and G(a′), where a, a′ ∈ 𝒳_A, have no overlap; this will continue to hold when, in Section 3.7, we consider SWIGs G(b) in which we intervene on a (possibly empty) subset B ⊆ A.

3.1 Distributional consistency for SWIGs

In order to relate passively observed distributions to those under intervention, we introduce a consistency assumption relating sets of counterfactual distributions. For this purpose, we introduce the following notation:

(1) 𝒫_A ≔ { p(V(a)) : a ∈ 𝒳_A },

(2) 𝒫_{⊆A} ≔ ⋃_{D ⊆ A} 𝒫_D.

Thus, 𝒫_A is the set of counterfactual distributions over V that arise from all possible joint interventions setting the variables in A to a value a ∈ 𝒳_A. Likewise, 𝒫_{⊆A} is the set of counterfactual distributions over V resulting from all possible joint interventions on subsets D of A; this includes the case D = ∅, corresponding to the observed distribution, so p(V) ∈ 𝒫_{⊆A}.

We make the following consistency assumption.[5]

Definition 2

(Distributional consistency for SWIGs) The set of distributions 𝒫_{⊆A} will be said to obey distributional consistency if, given B_i ∈ A and C ⊆ A \ {B_i}, where C may be empty, for all y, b, c:

(3) p(Y(b, c) = y, B_i(b, c) = b) = p(Y(c) = y, B_i(c) = b),

where Y = V \ {B_i}. As a special case, if C is empty, then for all y, b:

(4) p(Y(b) = y, B_i(b) = b) = p(Y = y, B_i = b).

Equalities (3) and (4) simply state that the probability of the event { Y = y , B i = b } , where B i is the “natural” or (in Dawid’s terminology) ITT variable, remains the same whether or not there is (subsequently) an intervention that targets B i and sets it to b .

(4) implies that p(B_i(b) = b) = p(B_i = b),[6] and thus p(Y(b) = y | B_i(b) = b) = p(Y = y | B_i = b). This has the interpretation that an intervention on B_i setting it to b is "ideal" in the sense that, for the remaining variables Y, the intervention does not change the distribution of Y given B_i = b. That p(B_i(b) = b) = p(B_i = b) can be seen as following from the fact that B_i and B_i(b) represent, respectively, the natural value taken by B_i in the absence of an intervention and the natural value of B_i immediately prior to an intervention.
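For readers who find a numerical check helpful, the following sketch (a toy structural model of our own construction, not from the article) verifies equality (4) exactly by enumeration: since intervening on the treatment does not change its natural value, on the event B = b the intervened and factual joint distributions agree atom by atom.

```python
from fractions import Fraction as F
from itertools import product

# Toy structural model (our own): U ~ Bern(1/2), e1 ~ Bern(1/5),
# e2 ~ Bern(1/10), mutually independent.  The natural (ITT) value of
# the treatment is B = U xor e1; the potential outcome is
# Y(b) = b xor U xor e2; by consistency, the factual Y = Y(B).
pU, pe1, pe2 = F(1, 2), F(1, 5), F(1, 10)

def atoms():
    """Enumerate the probability atoms (u, e1, e2) with their masses."""
    for u, e1, e2 in product((0, 1), repeat=3):
        pr = ((pU if u else 1 - pU)
              * (pe1 if e1 else 1 - pe1)
              * (pe2 if e2 else 1 - pe2))
        yield pr, u, e1, e2

for b, y in product((0, 1), repeat=2):
    # LHS of (4): p(Y(b) = y, B(b) = b); intervening on B does not
    # change its natural value, so B(b) = B = u xor e1.
    lhs = sum(pr for pr, u, e1, e2 in atoms()
              if (u ^ e1) == b and (b ^ u ^ e2) == y)
    # RHS of (4): p(Y = y, B = b), with the factual Y = Y(B).
    rhs = sum(pr for pr, u, e1, e2 in atoms()
              if (u ^ e1) == b and ((u ^ e1) ^ u ^ e2) == y)
    assert lhs == rhs  # equation (4) holds exactly
```

Exact rational arithmetic is used so the equality is checked symbolically rather than up to floating-point error.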

Under a standard potential outcome model that includes equalities between random variables, (3) follows directly from the consistency assumption and recursive substitution:

B_i(b, c) = b ⟹ B_i(c) = B_i(b, c) = b ⟹ Y(b, c) = Y(B_i(c), c) = Y(c).

As with the discussion of labeling earlier, in a potential outcome theory, it is natural to assume consistency at the level of random variables. Our motivation here for formulating consistency via (3) as a relation between distributions is solely to make clear that we do not require the stronger assumption for our results. However, proceeding in this way makes the notation more cumbersome since every potential outcome variable is labeled with every intervention.

Distributional consistency may also be formulated in terms of a dynamic regime. Let g_i denote the dynamic regime[7] on B_i that "intervenes" to set the intervention target to the "natural" value that the variable B_i would take in the absence of an intervention. Let V(g_i, c) be the set of potential outcomes that would arise under g_i in conjunction with an intervention setting C to c. We may then re-express (3) as:

(5) p(V(g_i, c)) = p(V(c)).

In words, in the context of an intervention setting C to c, a dynamic regime that intervenes to set B_i to the value that it would have taken anyway has no effect on the distribution of V.[8]

Though the distributional consistency assumption involves a single variable B_i, repeated applications imply the same conclusion for a set B.

Lemma 3

If 𝒫_{⊆A} obeys distributional consistency, and B and C are disjoint subsets of A, where C may be empty, then for all y, b, c:

(6) p(Y(b, c) = y, B(b, c) = b) = p(Y(c) = y, B(c) = b),

where Y = V \ B.

Proof

We prove this by induction on the size of B. The base case follows by the definition of distributional consistency. Let B_i be a variable in B, and let B_{−i} ≔ B \ {B_i}.

p(Y(b, c) = y, B(b, c) = b)
  = p(Y(b_i, b_{−i}, c) = y, B_i(b_i, b_{−i}, c) = b_i, B_{−i}(b_i, b_{−i}, c) = b_{−i})
  = p(Y(b_{−i}, c) = y, B_i(b_{−i}, c) = b_i, B_{−i}(b_{−i}, c) = b_{−i})
  = p(Y(c) = y, B_i(c) = b_i, B_{−i}(c) = b_{−i})
  = p(Y(c) = y, B(c) = b).

Here, the second equality applies distributional consistency, taking "C" to be B_{−i} ∪ C; the third applies the induction hypothesis, taking "Y" to be Y ∪ {B_i} and "B" to be B_{−i}.□

The next lemma relates equality of conditional distributions with and without an intervention on B .

Lemma 4

Suppose 𝒫_{⊆A} obeys distributional consistency. Let B and C be disjoint subsets of A, where C may be empty, and let Y and W be disjoint subsets of V \ B. It then follows that:

(7) p(Y(b, c) = y | B(b, c) = b, W(b, c) = w) = p(Y(c) = y | B(c) = b, W(c) = w).

Proof

This follows by applying Lemma 3 to p ( Y ( b , c ) , B ( b , c ) , W ( b , c ) ) , and p ( B ( b , c ) , W ( b , c ) ) .□

In addition, we have the following:

Lemma 5

Suppose 𝒫_{⊆A} obeys distributional consistency, and let B and C be disjoint subsets of A, where C may be empty. If B ⊆ W ⊆ V and p(W(b, c)) is not a function of b, then it follows from distributional consistency that p(W(b, c)) = p(W(c)).

Proof

p(X_W(b, c) = w)
  = p(X_{W\B}(b, c) = w_{W\B}, X_B(b, c) = w_B)
  = p(X_{W\B}(w_B, c) = w_{W\B}, X_B(w_B, c) = w_B)
  = p(X_{W\B}(c) = w_{W\B}, X_B(c) = w_B)
  = p(X_W(c) = w).

Here, we use that p ( W ( b , c ) ) is not a function of b in the second equality and distributional consistency via Lemma 3 in the third.□

Note that distributional consistency (3) does not imply the analogous result for conditional distributions. In particular, it is possible to have B_i ∈ Y, p(Y(b) | M(b)) not be a function of b, and yet p(Y(b) | M(b)) ≠ p(Y | M). This is because even if p(Y(b) | M(b)) is not a function of b, both p(Y(b), M(b)) and p(M(b)) may still be functions of b, in which case there is no way to apply (3) to relate them to distributions in which B is not intervened on.
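A concrete counterexample (our own construction, not from the article) may make this failure vivid. Take U ~ Bernoulli(1/2) and e ~ Bernoulli(1/10) independent; let the natural value of the treatment be B = U, let M(b) = b ⊕ e under an intervention setting B to b (factually M = B ⊕ e), and take Y to be the natural value B itself, so Y(b) = B(b) = B does not depend on b. Then p(Y(b) = 1 | M(b) = 1) = 1/2 for both values of b, yet the factual p(Y = 1 | M = 1) = 9/10:

```python
from fractions import Fraction as F
from itertools import product

# U ~ Bern(1/2) and e ~ Bern(1/10), independent; natural treatment
# value B = U; mediator M(b) = b xor e under intervention, M = B xor e
# factually; Y is the natural value B itself, so Y(b) = B for all b.
pU, pe = F(1, 2), F(1, 10)

def prob(event):
    """Exact probability of an event over the atoms (u, e)."""
    return sum((pU if u else 1 - pU) * (pe if e else 1 - pe)
               for u, e in product((0, 1), repeat=2) if event(u, e))

# p(Y(b) = 1 | M(b) = 1) is constant in b ...
cond = [prob(lambda u, e, b=b: u == 1 and (b ^ e) == 1)
        / prob(lambda u, e, b=b: (b ^ e) == 1) for b in (0, 1)]
assert cond[0] == cond[1] == F(1, 2)

# ... and yet differs from the factual p(Y = 1 | M = 1):
factual = (prob(lambda u, e: u == 1 and (u ^ e) == 1)
           / prob(lambda u, e: (u ^ e) == 1))
assert factual == F(9, 10)
```

Here both p(M(b)) and p(Y(b), M(b)) are functions of b even though their ratio is not, which is precisely why (3) cannot be applied.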

However, when the conditioning set contains B , we have the following:

Lemma 6

Suppose 𝒫_{⊆A} obeys distributional consistency, with B and C disjoint subsets of A, where C may be empty. Further, let Y and W be disjoint sets with B ⊆ W. If p(Y(b, c) | W(b, c)) is not a function of b, then

(8) p(Y(b, c) | W(b, c)) = p(Y(c) | W(c)).

Proof

p(X_Y(b, c) = y | X_W(b, c) = w)
  = p(X_Y(b, c) = y | X_{W\B}(b, c) = w_{W\B}, X_B(b, c) = w_B)
  = p(X_Y(w_B, c) = y | X_{W\B}(w_B, c) = w_{W\B}, X_B(w_B, c) = w_B)
  = p(X_Y(c) = y | X_{W\B}(c) = w_{W\B}, X_B(c) = w_B).

Here, the second equality uses the fact that p(X_Y(b, c) = y | X_W(b, c) = w) is not a function of b, while the third follows from distributional consistency via Lemma 4.□

3.2 Local Markov property defining the SWIG model

Although we derive a SWIG graphically from the original DAG by node splitting, we will define the model by associating a local Markov property with the SWIG and the potential outcome distribution. The resulting model corresponds to the FFRCISTG model of [2] (see [3, Appendix C]). We will then derive the Markov property for the original DAG and the observed distribution from these by applying distributional consistency.

Given a DAG G with vertices V = {1, …, p}, we will use pa_G(i) to indicate the (index) set of the parents of X_i in the original DAG G, and let pre(i) indicate {1, …, i − 1}, the predecessors of i under a total ordering that is consistent with the edges in G. We will drop the subscript when the DAG or ordering is clear from context.

The SWIG local Markov property is defined on the set of distributions 𝒫_A ≔ {p(V(a)) : a ∈ 𝒳_A}, where A ⊆ V is the maximal set of variables that may be intervened on; see [10, §1.2.4] and [11].

Definition 7

A set of potential outcome distributions 𝒫_A obeys the SWIG ordered local Markov property for DAG G under the given ordering if, for all i ∈ V, a ∈ 𝒳_A, and w ∈ 𝒳_{pre(i)},

(9) p(X_i(a) | X_{pre(i)}(a) = w)

is a function only of a_{pa_G(i) ∩ A} and w_{pa_G(i) \ A}.

In words, (9) states that after intervening on A , the distribution of X i ( a ) given its predecessors depends solely on the values taken by intervention targets in A that are parents of i , and by any other (random) variables that are parents of i but that are not intervened on, and hence are not in A .[9]

Though the function of the local property is to define and characterize the potential outcome model, intuition may be gained by observing that the local property follows from d -separation applied to the SWIG G ( a ) .[10] Specifically, the condition (9) corresponds to two sets of d -separations.

d-separation from fixed nodes: That p(X_i(a) | X_{pre(i)}(a)) does not depend on a_{A \ pa_G(i)} is encoded in the SWIG G(a) by the d-separation of X_i(a) from the fixed nodes a_j corresponding to vertices A_j that are not parents of X_i in G, given the parents of X_i(a) in G(a), both random and fixed (see [7,10,12]). Specifically, we have:

(10) X_i(a) ⊥_d a_{A \ pa(i)} | a_{A ∩ pa(i)}, X_{pa(i) \ A}(a),

where we use ⊥_d to indicate d-separation[11] in the SWIG G(a) and lowercase letters, e.g., a_{A \ pa(i)}, to refer to fixed nodes. We may further decompose the set of fixed nodes a_{A \ pa(i)}:

(11) X_i(a) ⊥_d a_{A \ pre(i)} [time order], a_{(A ∩ pre(i)) \ pa(i)} [causal Markov prop.] | a_{A ∩ pa(i)} [fixed parents], X_{pa(i) \ A}(a) [random parents].

The fixed nodes in a_{A \ pre(i)} correspond to interventions on variables that occur after X_i and thus do not change p(X_i(a) | X_{pre(i)}(a)). Likewise, the effects of the fixed nodes in a_{(A ∩ pre(i)) \ pa(i)} are screened off by the random and fixed nodes that are parents of X_i(a).

d-separation from random nodes: That p(X_i(a) | X_{pre(i)}(a) = w) does not depend on w_{pre(i) \ (pa_G(i) \ A)} is encoded in G(a) by the d-separation of X_i(a) from X_{pre(i) \ (pa(i) \ A)}(a) conditional on the parents of X_i(a) in G(a), both random and fixed:

(12) X_i(a) ⊥_d X_{pre(i) \ (pa(i) \ A)}(a) | a_{A ∩ pa(i)}, X_{pa(i) \ A}(a).

The random vertices X_{pre(i) \ (pa(i) \ A)}(a) may be further decomposed:

(13) X_i(a) ⊥_d X_{pre(i) \ pa(i)}(a) [assoc. Markov prop.], X_{pa(i) ∩ A}(a) [ignorability] | a_{A ∩ pa(i)} [fixed parents], X_{pa(i) \ A}(a) [random parents].

The d-separation of X_i(a) from nodes representing the natural values of variables that are in A and are parents of X_i in G corresponds to ignorability. On the other hand, the d-separation of X_i(a) from variables that are predecessors, but not parents, of X_i in G can be regarded as an associational Markov property.

3.3 Example

The d-separations given by (11) and (13) can be stated as a single graphical condition for each random vertex V_i(a) in G(a). In Tables 1 and 2, we give the SWIG local Markov property corresponding to the SWIG G(x) ≔ G(x_0, x_1), shown in Figure 2(b),[12] under the ordering (H, X_0, Z, X_1, Y): Table 1 in terms of factorization; Table 2 via d-separation. Note that for each V_i, the number of arguments on which p(V_i(x) | V_{pre(i)}(x)) depends corresponds exactly to the number of parents (random and fixed) of the corresponding random variable in G(x) in Figure 2(b): zero for H(x) and X_0(x), and two for Z(x), X_1(x), and Y(x). This is also the number of terms listed to the right of the conditioning bar in Table 2. Here, as elsewhere in this article, we use the uniform labeling because we wish to emphasize that our results do not require any equalities between random variables.

Table 1

Defining properties for the SWIG G(x) in Figure 2(b), expressed via factorization

Local Markov property for G(x_0, x_1) via factorization terms:

  p(H(x))
  p(X_0(x) | H(x)) — does not depend on H(x)
  p(Z(x) | H(x), X_0(x)) — depends only on H(x) and the fixed value x_0
  p(X_1(x) | H(x), X_0(x), Z(x)) — depends only on H(x) and Z(x)
  p(Y(x) | H(x), X_0(x), Z(x), X_1(x)) — depends only on Z(x) and the fixed value x_1

Arguments in p(V_i(x) | V_{pre(i)}(x)) on which a term does not depend are indicated after the dash. The arguments on which each term does depend correspond to the parents of V_i(x) in G(x). For example, for the term corresponding to V_i = Y, the arguments are x_1 and Z(x), and these are the parents of Y(x) in G(x).

Table 2

d-separation relations corresponding to the SWIG local Markov property in the SWIG G(x_0, x_1) in Figure 2(b)

Local Markov property for G(x_0, x_1) via d-separation:

  H(x_0, x_1) ⊥_d x_0, x_1
  X_0(x_0, x_1) ⊥_d H(x_0, x_1), x_0, x_1
  Z(x_0, x_1) ⊥_d X_0(x_0, x_1), x_1 | H(x_0, x_1), x_0
  X_1(x_0, x_1) ⊥_d X_0(x_0, x_1), x_0, x_1 | H(x_0, x_1), Z(x_0, x_1)
  Y(x_0, x_1) ⊥_d X_0(x_0, x_1), X_1(x_0, x_1), H(x_0, x_1), x_0 | Z(x_0, x_1), x_1

Here, x_0 and x_1 refer to the fixed nodes, and ⊥_d indicates d-separation in the SWIG (see also footnote 11 regarding the formal inclusion of fixed nodes on the RHS of the conditioning bar).
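These d-separations can also be checked mechanically. Below is a stdlib-only sketch (our own code, not from the article) of the standard moralization criterion for d-separation, applied to an encoding of the SWIG of Figure 2(b) in which fixed nodes are treated as ordinary parentless vertices, in the spirit of footnote 11.

```python
from itertools import combinations

def ancestors(pa, nodes):
    """All ancestors of `nodes` (inclusive) in a DAG given as a
    child -> set-of-parents map."""
    out, stack = set(nodes), list(nodes)
    while stack:
        for q in pa[stack.pop()]:
            if q not in out:
                out.add(q)
                stack.append(q)
    return out

def d_separated(pa, xs, ys, zs):
    """Test xs _|_d ys | zs via the moralized ancestral graph."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    keep = ancestors(pa, xs | ys | zs)
    adj = {v: set() for v in keep}
    for v in keep:
        ps = pa[v] & keep
        for q in ps:                      # undirected child-parent edges
            adj[v].add(q); adj[q].add(v)
        for q, r in combinations(ps, 2):  # "marry" co-parents
            adj[q].add(r); adj[r].add(q)
    seen = set(xs) - zs                   # reachability avoiding zs
    stack = list(seen)
    while stack:
        for q in adj[stack.pop()] - zs:
            if q not in seen:
                seen.add(q)
                stack.append(q)
    return not (seen & ys)

# The SWIG G(x_0, x_1) of Figure 2(b): random halves H, X0, Z, X1, Y
# and fixed halves "x0", "x1", with parent sets read off from the
# node-splitting rules of Definition 1.
pa_swig = {"H": set(), "X0": set(), "x0": set(), "x1": set(),
           "Z": {"H", "x0"}, "X1": {"H", "Z"}, "Y": {"Z", "x1"}}

# The five rows of Table 2:
assert d_separated(pa_swig, {"H"}, {"x0", "x1"}, set())
assert d_separated(pa_swig, {"X0"}, {"H", "x0", "x1"}, set())
assert d_separated(pa_swig, {"Z"}, {"X0", "x1"}, {"H", "x0"})
assert d_separated(pa_swig, {"X1"}, {"X0", "x0", "x1"}, {"H", "Z"})
assert d_separated(pa_swig, {"Y"}, {"X0", "X1", "H", "x0"}, {"Z", "x1"})
```

Because all five assertions pass, the table's relations are exactly the ones read off from the graph by the splitting construction; a connected pair, such as Z and its fixed parent x0 with nothing conditioned on, correctly fails the test.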

Figure 2 
                  (a) The DAG 
                        
                           
                           
                              G
                           
                           {\mathcal{G}}
                        
                      originally considered in Ex. 11.3.3, Fig. 11.12 in [15, p. 353]; here, 
                        
                           
                           
                              H
                           
                           H
                        
                      is unobserved; (b) the SWIG 
                        
                           
                           
                              G
                              
                                 (
                                 
                                    x
                                 
                                 )
                              
                           
                           {\mathcal{G}}\left({\bf{x}})
                        
                      under uniform labeling, 
                        
                           
                           
                              x
                              ≡
                              
                                 (
                                 
                                    
                                       
                                          x
                                       
                                       
                                          1
                                       
                                    
                                    ,
                                    
                                       
                                          x
                                       
                                       
                                          2
                                       
                                    
                                 
                                 )
                              
                           
                           {\bf{x}}\equiv \left({x}_{1},{x}_{2})
                        
                     ; Figure 15 in [8] shows the SWIG with ancestral labeling; (c) the reformulated augmented graph 
                        
                           
                           
                              
                                 
                                    G
                                 
                                 
                                    ∗
                                 
                              
                           
                           {{\mathscr{G}}}^{\ast }
                        
                      (this corresponds to Dawid’s ITT DAG 
                        
                           
                           
                              
                                 
                                    G
                                 
                                 
Figure 2

(a) The DAG G originally considered in Ex. 11.3.3, Fig. 11.12 in [15, p. 353]; here, H is unobserved; (b) the SWIG G ( x ) under uniform labeling, x = ( x 1 , x 2 ) ; Figure 15 in [8] shows the SWIG with ancestral labeling; (c) the reformulated augmented graph G ∗ (this corresponds to Dawid’s ITT DAG G ∗ shown in Figure 13 of [8, p. 62] after marginalizing X 0 , X 1 and then removing ∗ from the ITT variables); and (d) the resulting graph under the regime F 0 = x 0 , F 1 = x 1 ; this is the graph that encodes the reformulated Markov property; see Definition 18.

3.4 Consequences of the local Markov property

Under distributional consistency, it follows from the SWIG local Markov property that whether or not future interventions occur has no effect on the distribution of prior variables.

Lemma 8

If P A obeys distributional consistency and P A obeys the SWIG ordered local Markov property for DAG G under ≺ , then for all k ∈ V and a ∈ X A :

(14) p ( X 1 ( a ) , … , X k ( a ) ) = p ( X 1 ( a pre ( k ) ∩ A ) , … , X k ( a pre ( k ) ∩ A ) ) .

Proof

First observe that since

p ( X 1 ( a ) , … , X k ( a ) ) = ∏ i = 1 k p ( X i ( a ) ∣ X pre ( i ) ( a ) ) ,

and the local Markov property implies that p ( X i ( a ) ∣ X pre ( i ) ( a ) ) does not depend on a A ∖ pre ( i ) , it follows that p ( X 1 ( a ) , … , X k ( a ) ) does not depend on a A ∖ pre ( k ) .

We now prove the claim by reverse induction on the ordering of the vertices in V .

For the base case, suppose k is the maximal vertex in V . If k ∉ A , then (14) holds trivially since A = pre ( k ) ∩ A . If k ∈ A , then since k ∉ pre ( k ) , p ( X 1 ( a ) , … , X k ( a ) ) does not depend on a k , and thus, by Lemma 5, p ( X 1 ( a ) , … , X k ( a ) ) = p ( X 1 ( a A ∖ { k } ) , … , X k ( a A ∖ { k } ) ) .

Our inductive hypothesis is that (14) holds for k = j + 1 , so that

p ( X 1 ( a ) , … , X j + 1 ( a ) ) = p ( X 1 ( a pre ( j + 1 ) ∩ A ) , … , X j + 1 ( a pre ( j + 1 ) ∩ A ) ) .

Summing both sides over x j + 1 , we obtain:

(15) p ( X 1 ( a ) , … , X j ( a ) ) = p ( X 1 ( a pre ( j + 1 ) ∩ A ) , … , X j ( a pre ( j + 1 ) ∩ A ) ) .

If j ∉ A , then (15) establishes the claim since pre ( j + 1 ) ∩ A = pre ( j ) ∩ A . If j ∈ A , then note that we have already established earlier that the left-hand side (LHS) of (15) is not a function of a j . Consequently, the right-hand side (RHS) is also not a function of a j . It then follows from Lemma 5 that

p ( X 1 ( a pre ( j + 1 ) ∩ A ) , … , X j ( a pre ( j + 1 ) ∩ A ) ) = p ( X 1 ( a pre ( j ) ∩ A ) , … , X j ( a pre ( j ) ∩ A ) ) .

This completes the proof.□
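The content of Lemma 8 can be illustrated numerically. The following is a minimal sketch using a hypothetical three-variable structural model (not from the paper): in a chain X1 → X2 → X3, setting the later variable X3 to any value leaves the joint distribution of the earlier variables (X1, X2) unchanged, just as (14) asserts that p(X1(a), …, Xk(a)) depends on a only through a pre(k) ∩ A.

```python
from itertools import product

# Hypothetical noise distribution over (U1, U2, U3).
P_U = {(u1, u2, u3): p1 * p2 * p3
       for (u1, p1) in [(0, 0.6), (1, 0.4)]
       for (u2, p2) in [(0, 0.8), (1, 0.2)]
       for (u3, p3) in [(0, 0.5), (1, 0.5)]}

def joint_X1_X2(x3_intervention):
    """Joint of (X1, X2) when X3 is set to x3 (None = no intervention)."""
    d = {}
    for (u1, u2, u3), p in P_U.items():
        x1 = u1
        x2 = x1 ^ u2  # X2 depends on its parent X1 and noise U2.
        # X3 would be x3_intervention if intervened, else x2 ^ u3;
        # either way X3 cannot influence the earlier variables.
        d[(x1, x2)] = d.get((x1, x2), 0.0) + p
    return d

# p(X1, X2) is identical whether X3 is left alone or set to 0 or 1.
obs = joint_X1_X2(None)
assert obs == joint_X1_X2(0) == joint_X1_X2(1)
```

In this sketch the invariance holds by construction of the structural equations; the point of Lemma 8 is that it already follows from distributional consistency plus the SWIG local Markov property, without positing such equations.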

The next lemma gives a simple characterization of the consequences of the SWIG local Markov property in conjunction with distributional consistency.

Lemma 9

If P A obeys distributional consistency and P A obeys the SWIG ordered local Markov property for DAG G under ≺ , then:

(16) p ( X i ( a ) ∣ X pre ( i ) ( a ) )

(17) = p ( X i ( a pre ( i ) ∩ A ) ∣ X pre ( i ) ( a pre ( i ) ∩ A ) )

(18) = p ( X i ( a pa ( i ) ∩ A ) ∣ X pre ( i ) ( a pa ( i ) ∩ A ) )

(19) = p ( X i ( a pa ( i ) ∩ A ) ∣ X pa ( i ) ( a pa ( i ) ∩ A ) )

(20) = p ( X i ( a pa ( i ) ∩ A ) ∣ X pa ( i ) ∖ A ( a pa ( i ) ∩ A ) ) .

Since the SWIG local Markov property (9) states that (16) is not a function of a A ∖ pa G ( i ) , the equality of (16) and (18) may appear to follow immediately. However, as noted in the discussion prior to Lemma 6, the fact that a counterfactual conditional distribution p ( Y ( a j ) ∣ W ( a j ) ) does not depend on the specific value, a j , of an intervention on A j does not imply that p ( Y ( a j ) ∣ W ( a j ) ) = p ( Y ∣ W ) .
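The caveat above admits a trivial numerical illustration (hypothetical numbers, with distributional consistency deliberately absent): a family of counterfactual conditionals can be constant in the intervention value a j and yet differ from the observational conditional.

```python
# One counterfactual conditional for each intervention value a_j in {0, 1};
# the numbers are purely illustrative.
p_Y1_given_W1 = {0: 0.5, 1: 0.5}   # p(Y(a_j)=1 | W(a_j)=1), constant in a_j
p_obs_Y1_given_W1 = 0.9            # p(Y=1 | W=1) in the observational law

assert p_Y1_given_W1[0] == p_Y1_given_W1[1]    # no dependence on a_j ...
assert p_Y1_given_W1[0] != p_obs_Y1_given_W1   # ... yet != observational
```

Distributional consistency is exactly what rules out such a collection of distributions.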

Proof

Here, (17) follows since by Lemma 8

p ( X i ( a ) , X pre ( i ) ( a ) ) = p ( X i ( a pre ( i ) ∩ A ) , X pre ( i ) ( a pre ( i ) ∩ A ) ) .

(18) follows from Definition 7 and Lemma 6. Finally, (19) and (20) follow from the SWIG local Markov property via (8) since p ( X i ( a ) ∣ X pre ( i ) ( a ) = x pre ( i ) ) does not depend on x pre ( i ) ∖ ( pa ( i ) ∖ A ) = ( x pre ( i ) ∖ pa ( i ) , x pa ( i ) ∩ A ) .□

3.5 Markov property for the observed distribution

We now show that distributional consistency together with the SWIG local Markov property implies the usual local Markov property [13] for the observed distribution.

Theorem 10

If P A obeys distributional consistency and P A obeys the SWIG ordered local Markov property for G and ≺ , then p ( V ) obeys the usual DAG ordered local Markov property w.r.t. G and ≺ .

Proof

Let v i ∈ X i and v ∈ X pre ( i ) .

(21) p ( X i = v i ∣ X pre ( i ) = v ) = p ( X i ( v pre ( i ) ∩ A ) = v i ∣ X pre ( i ) ( v pre ( i ) ∩ A ) = v ) = p ( X i ( v pa ( i ) ∩ A ) = v i ∣ X pa ( i ) ∖ A ( v pa ( i ) ∩ A ) = v pa ( i ) ∖ A ) .

Here, the first equality follows from distributional consistency via Lemma 4. The second follows directly from the equality of (17) and (20) in Lemma 9. Since the last line is not a function of v pre ( i ) ∖ pa ( i ) , the ordered local Markov property for the DAG holds.□
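A small check of the conclusion of Theorem 10 in a hypothetical structural model for the DAG X1 → X2 → X3 (all names and probabilities are illustrative assumptions): the observational joint should satisfy the ordered local Markov property, here the single nontrivial constraint that X3 is independent of X1 given X2.

```python
from itertools import product

# Noise distributions for the three variables; U3 is a fair coin.
p_u = [0.6, 0.4], [0.7, 0.3], [0.5, 0.5]

# Build the observational joint p(X1, X2, X3) by enumeration.
joint = {}
for u1, u2, u3 in product([0, 1], repeat=3):
    x1 = u1
    x2 = x1 ^ u2            # parent: X1
    x3 = x2 ^ u3            # parent: X2 only (no X1 -> X3 edge)
    p = p_u[0][u1] * p_u[1][u2] * p_u[2][u3]
    joint[(x1, x2, x3)] = joint.get((x1, x2, x3), 0.0) + p

def p_x3_given(x3, x1, x2):
    num = joint.get((x1, x2, x3), 0.0)
    den = sum(joint.get((x1, x2, v), 0.0) for v in [0, 1])
    return num / den

# p(X3 | X1, X2) does not depend on X1, as the Markov property requires.
for x2, x3 in product([0, 1], repeat=2):
    assert abs(p_x3_given(x3, 0, x2) - p_x3_given(x3, 1, x2)) < 1e-12
```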

3.5.1 Discussion of relation to Dawid

Dawid takes the reverse approach to ours: he proposes additional extended Markovian conditions that, when added to the usual Markov property for the observable law, imply the Markov property for his extended graph. However, as we describe in detail below, our approach appears to be simpler in that, given distributional consistency, it requires only one property per variable, giving ∣ V ∣ constraints in total; in contrast, Dawid requires one property for every observed variable in V , together with two additional properties for each intervention target in A , for a total of ( ∣ V ∣ + 2 ∣ A ∣ ).
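The arithmetic of this comparison can be made concrete with hypothetical sizes (the numbers below are purely illustrative, not from either paper):

```python
# With |V| = 4 observed variables and |A| = 2 intervention targets,
# the SWIG formulation needs one defining property per variable,
# whereas Dawid's needs one per variable plus two per target.
n_V, n_A = 4, 2
swig_count = n_V
dawid_count = n_V + 2 * n_A
assert (swig_count, dawid_count) == (4, 8)
```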

In addition, our approach captures context-specific independences, corresponding to “dashed” edges in Dawid’s diagrams; these are not captured directly in Dawid’s A + B formulation. We show that, by restating the SWIG local property in Dawid’s notation, we can characterize the (extended) Markov properties for the augmented graph and the original graph using only one constraint per variable, plus distributional consistency.

Dawid incorporates distributional consistency into his defining independences, whereas we state it as a separate property that precedes the definition of the model. However, as we have shown earlier, distributional consistency may be seen as a tautologous property, the truth of which is implicit in the notion of an ideal intervention: distributional consistency states that if B would naturally take the value b , then an ideal intervention that sets B to b has no effect on the distribution of (all) the other variables. For this reason, we believe it is natural to distinguish consistency from the other properties used to define the model.

However, in the spirit of Dawid’s approach, in Appendix A.1, we show that if P A obeys distributional consistency, then P A will obey the SWIG local Markov property corresponding to G if: (i) p ( V ) is positive and obeys the (ordinary) local Markov property for the graph G ; and (ii) P A obeys the SWIG local Markov property corresponding to G ¯ , a complete supergraph of G . This formulation requires 2 ∣ V ∣ restrictions.

3.6 Identification of the potential outcome distribution p ( V ( a ) ) from p ( V )

We show that, under the SWIG local Markov property, p ( V ( a ) ) is identified from the distribution over the observables, provided that the relevant conditional distributions are themselves identified from p ( V ) .

Theorem 11

Suppose that P A obeys distributional consistency and P A obeys the SWIG ordered local Markov property for G and ≺ . Let a ∈ X A be an assignment to the intervention targets in A , and let v ∈ X V . Then, for all i :

(22) p ( X i ( a ) = v i ∣ X pre ( i ) ( a ) = v pre ( i ) ) = p ( X i = v i ∣ X pa ( i ) ∖ A = v pa ( i ) ∖ A , X pa ( i ) ∩ A = a pa ( i ) ∩ A ) .

Consequently, p ( V ( a ) ) is identified from p ( V ) and obeys d-separation in the SWIG G ( a ) , whenever the conditional distributions on the RHS of (22) are identified by p ( V ) .

The equality (22) here corresponds to the property referred to as “modularity” in [3]; this is also an instance of the extended g-formula of [2,14].

Proof

(23) p ( X i ( a ) = v i ∣ X pre ( i ) ( a ) = v pre ( i ) ) = p ( X i ( a pa ( i ) ∩ A ) = v i ∣ X pa ( i ) ∩ A ( a pa ( i ) ∩ A ) = v pa ( i ) ∩ A , X pa ( i ) ∖ A ( a pa ( i ) ∩ A ) = v pa ( i ) ∖ A ) = p ( X i ( a pa ( i ) ∩ A ) = v i ∣ X pa ( i ) ∩ A ( a pa ( i ) ∩ A ) = a pa ( i ) ∩ A , X pa ( i ) ∖ A ( a pa ( i ) ∩ A ) = v pa ( i ) ∖ A ) = p ( X i = v i ∣ X pa ( i ) ∩ A = a pa ( i ) ∩ A , X pa ( i ) ∖ A = v pa ( i ) ∖ A ) .

Here, the first equality follows from the equality of (16) and (19); the second follows from the equality of (19) and (20); the third follows from distributional consistency via (7).□
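The identification in Theorem 11 can be sketched numerically. The following uses a hypothetical model L → T → Y with L → Y and the single intervention target T (all probabilities are illustrative assumptions): the interventional distribution p(Y(t)) computed directly from the model should match the g-formula computed from the observed joint alone, i.e. the sum over l of p(l) p(y ∣ t, l).

```python
from itertools import product

# Hypothetical conditional distributions defining the model.
p_L = {0: 0.7, 1: 0.3}
p_T_given_L = lambda t, l: (0.2 + 0.5 * l) if t == 1 else 1 - (0.2 + 0.5 * l)
p_Y_given_TL = lambda y, t, l: (0.1 + 0.3 * t + 0.4 * l) if y == 1 else 1 - (0.1 + 0.3 * t + 0.4 * l)

# Observational joint p(L, T, Y).
joint = {(l, t, y): p_L[l] * p_T_given_L(t, l) * p_Y_given_TL(y, t, l)
         for l, t, y in product([0, 1], repeat=3)}

def p_y_do_t_truth(y, t):
    # Direct computation: set T := t, draw L and Y from the model.
    return sum(p_L[l] * p_Y_given_TL(y, t, l) for l in [0, 1])

def p_y_do_t_gformula(y, t):
    # Identification from the observed joint only: sum_l p(l) p(y | t, l).
    total = 0.0
    for l in [0, 1]:
        pl = sum(joint[(l, tt, yy)] for tt, yy in product([0, 1], repeat=2))
        num = joint[(l, t, y)]
        den = sum(joint[(l, t, yy)] for yy in [0, 1])
        total += pl * num / den
    return total

for y, t in product([0, 1], repeat=2):
    assert abs(p_y_do_t_truth(y, t) - p_y_do_t_gformula(y, t)) < 1e-12
```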

3.7 Distributions resulting from fewer interventions

Finally, we show that if P A obeys distributional consistency and the SWIG local Markov property for G , then intervening on B ⊆ A yields a distribution p ( V ( b ) ) that obeys the SWIG local Markov property for G with respect to this reduced set of intervention targets. The two previous theorems can be seen as the special case in which B = ∅ .

Theorem 12

Suppose that P A obeys distributional consistency and P A obeys the SWIG ordered local Markov property for G and ≺ . Let b ∈ X B be an assignment to the intervention targets in B ⊆ A , and let v ∈ X V . Then for all i :

(24) p ( X i ( b ) = v i ∣ X pre ( i ) ( b ) = v pre ( i ) ) = p ( X i = v i ∣ X pa ( i ) ∩ B = b pa ( i ) ∩ B , X pa ( i ) ∖ B = v pa ( i ) ∖ B ) .

Consequently, every p ( V ( b ) ) ∈ P B obeys the Markov property for the SWIG G ( b ) and is identified whenever the conditional distributions on the RHS of (24) are identified by p ( V ) .

Proof

p ( X i ( b ) = v i ∣ X pre ( i ) ( b ) = v pre ( i ) ) = p ( X i ( b pre ( i ) ∩ B ) = v i ∣ X pre ( i ) ( b pre ( i ) ∩ B ) = v pre ( i ) ) = p ( X i ( b pre ( i ) ∩ B , v pre ( i ) ∩ ( A ∖ B ) ) = v i ∣ X pre ( i ) ( b pre ( i ) ∩ B , v pre ( i ) ∩ ( A ∖ B ) ) = v pre ( i ) ) = p ( X i = v i ∣ X pa ( i ) ∖ A = v pa ( i ) ∖ A , X pa ( i ) ∩ B = b pa ( i ) ∩ B , X pa ( i ) ∩ ( A ∖ B ) = v pa ( i ) ∩ ( A ∖ B ) ) = p ( X i = v i ∣ X pa ( i ) ∖ B = v pa ( i ) ∖ B , X pa ( i ) ∩ B = b pa ( i ) ∩ B ) .

Here, the first equality is by Lemma 5; the second is distributional consistency via Lemma 4; the third follows from Theorem 11 applied to G ( a ) ; the fourth is a simplification.□
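Theorem 12 can likewise be sketched numerically. The following hypothetical model has two intervention targets A = {T1, T2} in the order T1, L, T2, Y (all names and probabilities are illustrative assumptions); intervening only on the subset B = {T2} yields a distribution that is still identified from the observed joint, consistent with (24): p(Y(t2) = y) equals the sum over (t1, l) of p(t1, l) p(y ∣ t2, l).

```python
from itertools import product

# Hypothetical conditional distributions defining the model.
p_T1 = {0: 0.5, 1: 0.5}
p_L_given_T1 = lambda l, t1: (0.3 + 0.4 * t1) if l == 1 else 1 - (0.3 + 0.4 * t1)
p_T2_given_L = lambda t2, l: (0.2 + 0.6 * l) if t2 == 1 else 1 - (0.2 + 0.6 * l)
p_Y_given_T2L = lambda y, t2, l: (0.1 + 0.5 * t2 + 0.2 * l) if y == 1 else 1 - (0.1 + 0.5 * t2 + 0.2 * l)

# Observational joint p(T1, L, T2, Y).
joint = {(t1, l, t2, y):
         p_T1[t1] * p_L_given_T1(l, t1) * p_T2_given_L(t2, l) * p_Y_given_T2L(y, t2, l)
         for t1, l, t2, y in product([0, 1], repeat=4)}

def p_y_set_t2_truth(y, t2):
    # T1 and L remain random; only T2 is set to t2.
    return sum(p_T1[t1] * p_L_given_T1(l, t1) * p_Y_given_T2L(y, t2, l)
               for t1, l in product([0, 1], repeat=2))

def p_y_set_t2_identified(y, t2):
    # sum_{t1,l} p(t1, l) * p(y | t1, l, t2), from the observed joint only.
    total = 0.0
    for t1, l in product([0, 1], repeat=2):
        p_t1_l = sum(joint[(t1, l, tt, yy)] for tt, yy in product([0, 1], repeat=2))
        den = sum(joint[(t1, l, t2, yy)] for yy in [0, 1])
        total += p_t1_l * joint[(t1, l, t2, y)] / den
    return total

for y, t2 in product([0, 1], repeat=2):
    assert abs(p_y_set_t2_truth(y, t2) - p_y_set_t2_identified(y, t2)) < 1e-12
```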

4 Critique of Dawid’s proposal

We have four main issues, which we describe in detail below:

  1. The inclusion of ITT variables within Dawid’s theory appears necessary in order to distinguish causal relationships from happenstance agreement between observational and (“fat hand”) intervention distributions. However, including all three of T (the “actual” treatment), T ∗ (the ITT variable), and F T (the regime indicator) introduces deterministically related variables and thereby obscures the content of Dawid’s defining conditional independences A and B .

  2. Related to the previous point, d-separation is no longer a complete criterion for determining conditional independence in a graph in which there are definitional deterministic relationships between the variables.[13]

  3. Dawid’s ITT augmented diagrams incorporate context-specific independence (via dashed edges), but his results do not establish that the resulting distribution obeys all of the implied context-specific independences; these are not implied by his defining conditional independences A + B and will not hold without additional information concerning the relation of T ∗ to T and F T that is not captured in A + B .

  4. Dawid makes use of what he terms “fictitious” independence relations, arguing that these are assumptions that can be made without loss of generality. This is not the case in general, although, as we show, in the context of his arguments the resulting logical “gap” can be filled.

We show that all of these issues may be avoided by re-formulating his theory in two simple ways:

  1. Marginalizing out the post-intervention treatment variable T while keeping the ITT variable T ∗ .[14]

  2. Formulating the defining extended independence relations in terms of distributional consistency and the augmented ITT diagram (after marginalizing T ) and intervening on all the variables in A ; the local Markov property for the original variables is then implied.

The resulting theory is formally isomorphic to the SWIG theory described earlier; the augmented ITT graph can be viewed as containing the union of the nodes and edges in the original DAG G and the SWIG G ( a ) , with the fixed nodes in the SWIG corresponding to the (non-idle) regime indicators in the augmented DAG.

4.1 The simplest setting

Consider the setting in which there is a single exposure T and an outcome Y ; suppose that T takes values in a finite set of states T . Dawid’s augmented causal graph with the intention-to-treat variable T ∗ is shown in Figure 3(d). Here, T ∗ represents the natural value of treatment which an individual is “selected to receive” [8, p. 52] in the absence of an intervention that would override this. This is distinct from T , the “treatment actually applied” [8, p. 54, Def. 1]; F T is a regime indicator taking values in T ∪ { ∅ } . Under Dawid’s proposal, the graph in Figure 3(d) represents the kernel p ( T ∗ , T , Y ∣ F T ) ; F T = ∅ indicates the observational regime, in which case T = T ∗ (see Figure 3(e)), where we have used a colored edge, T ∗ → T, to indicate the deterministic relationship between T ∗ and T . Similarly, F T = t ∈ T indicates the interventional regime, in which case T = t (see Figure 3(f)). Note that T ⫫ T ∗ ∣ F T = t , which is represented by the dashed edge from T ∗ to T in Figure 3(d) and by the absence of the edge between T ∗ and T in Figure 3(f).
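The kernels p(T∗, T, Y ∣ F_T) in this simplest setting can be sketched by enumeration (all probabilities below are hypothetical; the model has no confounding, so Y depends only on the applied treatment T): under F_T = ∅ we recover T = T∗, while under F_T = t we recover T = t and the correspondence Y(t) ⇔ Y ∣ F_T = t.

```python
from itertools import product

OBS = "idle"                          # stands in for the idle regime F_T = ∅
p_Tstar = {0: 0.6, 1: 0.4}            # natural value of treatment
p_Y_given_T = lambda y, t: (0.2 + 0.5 * t) if y == 1 else 1 - (0.2 + 0.5 * t)

def kernel(f_T):
    """Return p(T*, T, Y | F_T = f_T) as a dict keyed by (tstar, t, y)."""
    k = {}
    for tstar, y in product([0, 1], repeat=2):
        t = tstar if f_T == OBS else f_T   # deterministic link to the regime
        k[(tstar, t, y)] = k.get((tstar, t, y), 0.0) + \
            p_Tstar[tstar] * p_Y_given_T(y, t)
    return k

# Observational regime: T = T* with probability one.
assert all(t == tstar for (tstar, t, y), p in kernel(OBS).items() if p > 0)

# Interventional regime F_T = 1: T = 1, and p(Y = 1 | F_T = 1) = p(Y(1) = 1).
k1 = kernel(1)
p_Y1 = sum(p for (tstar, t, y), p in k1.items() if y == 1)
assert abs(p_Y1 - p_Y_given_T(1, 1)) < 1e-12
```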

Figure 3

The simplest case of a single treatment T and outcome Y in the absence of confounding. (a) DAG G representing the observed joint distribution p(T*, Y); (b) SWIG G(t) corresponding to G, representing p(T*, Y(t)); (c) Dawid’s augmented DAG representing the set of kernels p(Y, T | F_T), where F_T is a regime indicator; (d) Dawid’s augmented DAG with ITT variables, representing the kernels p(Y, T, T* | F_T), where F_T is a regime indicator; the dashed edge indicates that the edge between T* and T is absent in the interventional regime, while the red edges indicate deterministic relationships; (e) the ITT augmented graph representing the observational regime p(T*, T, Y | F_T = ∅) = p(T*, T, Y), under which T = T*; (f) the ITT augmented graph for p(T*, T, Y | F_T = t) = p(T*, t, Y | F_T = t), an intervention setting T to t, so F_T = t; and (g) the latent projection of the graph in (d) after marginalizing T. Note that in (a), (b) we use T* (rather than T) for the natural value of treatment in order to highlight the correspondence to the ITT variables in Dawid’s proposal. The graph in (g) corresponds to (a) and (b), under the correspondence t ⇔ F_T = t, Y(t) ⇔ Y | F_T = t.

For comparison, Figure 3(a) and (b), respectively, show the representations of the observed distribution p(T*, Y) and the joint distribution p(T*, Y(t)); as suggested by the graphical structures, there is a close correspondence between these approaches when ITT variables are included in the decision theoretic graph. In what follows, we show that the two theories are in fact isomorphic up to a relabeling of variables (Table 3).

Table 3

Correspondence between the potential outcome/SWIG approach and the decision theoretic approach

                                                  Potential outcome    Decision theoretic
Graph for observed data                           G                    ITT DAG, F_T = ∅
Graph representing intervention on T              G(t)                 ITT DAG, F_T = t
Observed distribution                             p(T*, Y)             p(T*, Y | F_T = ∅)
Distribution resulting from setting T = t
    directly after observing T*                   p(T*, Y(t))          p(T*, Y | F_T = t)

Here, in the potential outcome approach we use T* (rather than T) to denote the natural value of treatment so as to make the correspondence more self-evident.

Although Dawid includes ITT variables in the development here, they were absent in [16] and ultimately his goal is to remove the ITT variables, leaving the DAG shown in Figure 3(c) containing only the original variables and the treatment indicators (see bottom of [8, p. 65]). Dawid states that the augmented graphs without ITT variables are sufficient for reasoning about point interventions.

Given this, one may ask why it is necessary to introduce the ITT variables into the theory in the first place. One issue that arises is that without the ITT variables, the decision theoretic approach lacks the language to describe concepts such as the effect of treatment on the treated. In addition, the approach lacks the concepts necessary to distinguish different scenarios where there is equality between distributions in the observed and interventional worlds: those scenarios where the equality reflects agreement between an observational study and a randomized experiment due to the absence of confounding, versus those where the equality is purely “contingent” or spurious.

To illustrate this, consider the following story. Suppose that a manufacturer of dietary supplements carries out an observational study. They find that those who regularly consume the supplement (T = 1) have lower levels of “bad” cholesterol (Y) than the people who do not (T = 0). Buoyed by these results, the manufacturer hires a company to perform a randomized trial. The results of the previous study are given to the company; it is made clear that the manufacturer would like these results confirmed and that repeat business depends on the firm achieving this. In order to comply, the testing company carries out a non-blinded study and also modifies the software in the cholesterol-measuring system to ensure that the results agree with those in the observational study (see Figure 4(b)); here, H represents unobserved confounding and the edge F_T → Y indicates the compromised measurement process.[15] Since the experimental and observational distributions agree, it will hold that Y ⊥⊥ F_T | T, as implied by the decision-theoretic graph in Figure 4(a).
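The story can be mimicked in a small simulation (a hypothetical sketch; the confounder strengths and the rigged measurement model are our own invention): a confounder H drives both T and Y in the observational regime, and the rigged trial reproduces the observational conditional p(Y | T = t) exactly, so that Y ⊥⊥ F_T | T holds despite the confounding.

```python
import random

random.seed(0)

def observational():
    # Observational regime (F_T idle): a confounder H drives both T and Y;
    # here Y does not depend on T at all, so the association is spurious.
    h = random.random() < 0.5
    t = 1 if random.random() < (0.8 if h else 0.2) else 0
    y = 1 if random.random() < (0.7 if h else 0.3) else 0
    return t, y

def rigged_trial(t):
    # "Fat-hand" intervention (F_T = t): T is set externally, but the
    # compromised measurement draws Y so as to reproduce the
    # observational conditional p(Y = 1 | T = t).
    p_h = 0.5 * (0.8 if t else 0.2) / (0.5 * (0.8 if t else 0.2) + 0.5 * (0.2 if t else 0.8))
    p_y = p_h * 0.7 + (1 - p_h) * 0.3
    y = 1 if random.random() < p_y else 0
    return t, y

n = 200_000
obs = [observational() for _ in range(n)]
rigged = [rigged_trial(random.randint(0, 1)) for _ in range(n)]

def p_y1_given_t(data, t):
    ys = [y for (tt, y) in data if tt == t]
    return sum(ys) / len(ys)

# Observational and "experimental" conditionals agree, so the graph of
# Figure 4(a) cannot be refuted from these data.
for t in (0, 1):
    assert abs(p_y1_given_t(obs, t) - p_y1_given_t(rigged, t)) < 0.02
```

Without recording the natural value T*, nothing in these simulated data distinguishes this scenario from an unconfounded study.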

Figure 4

Illustration of the necessity of the ITT (aka “natural value of treatment”) variable T* in Dawid’s proposal. (a) An augmented DAG (without ITT nodes) corresponding to an observational study without confounding and a perfect intervention on T. (b) An augmented DAG (without ITT nodes) representing an observational study with confounding (H) and a mis-targeted (“fat-hand”) intervention affecting both T and Y. If the mis-targeted intervention matches the effect of confounding, then there will be equality of the observational and interventional distributions p(Y | T = t, F_T = ∅) = p(Y | T = t, F_T = t), so that the extended independence Y ⊥⊥ F_T | T will hold, and hence the causal diagram shown in (a) cannot be refuted. The inclusion of T* resolves this. (c) The DAG with ITT variables corresponding to the study without confounding; this implies Y ⊥⊥ F_T, T* | T, which is not implied by the ITT augmented DAG (d) when confounding is present.

To be clear, the critique here is not that someone who was unaware of the presence of confounding and the devious activities of the company running the trial would infer the wrong causal effect. Rather, it is that without the ITT variables, the decision-theoretic approach lacks the conceptual apparatus necessary to distinguish the situations in Figure 4(a) and (b).[16] In contrast, if the ITT variables T* are included, then no such difficulty arises: the corresponding augmented DAG, shown in Figure 4(c), now additionally requires that Y ⊥⊥ T* | F_T = t, which will fail to hold if there is unobserved confounding between T and Y. Note that this latter condition is essentially equivalent to the ignorability condition Y(t) ⊥⊥ T in the potential outcome framework; we return to this point below.
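The role of the recorded natural value can be seen in a simulation sketch (distributions again invented for illustration): in the interventional regime F_T = t we generate the natural value T* alongside the applied treatment; with a shared cause H of T* and Y the condition Y ⊥⊥ T* | F_T = t fails, while it holds in the unconfounded study.

```python
import random

random.seed(1)

def trial_arm(t, confounded):
    # Interventional regime F_T = t: the natural value T* (what the
    # subject would have chosen) is still generated, but T is set to t.
    h = random.random() < 0.5
    t_star = 1 if random.random() < (0.8 if h else 0.2) else 0
    if confounded:
        p_y = 0.7 if h else 0.3      # Y shares the cause H with T*
    else:
        p_y = 0.6 if t else 0.4      # Y depends only on the applied t
    y = 1 if random.random() < p_y else 0
    return t_star, y

def p_y1_given_tstar(data, s):
    ys = [y for (ts, y) in data if ts == s]
    return sum(ys) / len(ys)

n = 200_000
clean = [trial_arm(1, confounded=False) for _ in range(n)]
dirty = [trial_arm(1, confounded=True) for _ in range(n)]

# Without confounding, Y and T* are independent given F_T = t ...
assert abs(p_y1_given_tstar(clean, 1) - p_y1_given_tstar(clean, 0)) < 0.02
# ... but a shared cause H of T* and Y makes the condition fail.
assert abs(p_y1_given_tstar(dirty, 1) - p_y1_given_tstar(dirty, 0)) > 0.2
```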

4.2 Dawid’s defining ECI relations

Under Dawid’s formalism, the augmented graph with ITT variables, shown in Figure 3(d), defines a causal model via the following ECI relations:

(25) A: T* ⊥⊥ F_T,

(26) B: Y ⊥⊥ T*, F_T | T,

(see [8, Eq. (62), (63)]).

4.2.1 Dawid’s independence A

The first independence (25) states that whether or not there is an intervention on T has no effect on the (distribution of the) ITT value T*. Indeed, Dawid states:

Now, T* is determined prior to any (actual or hypothetical) treatment application, and behaves as a covariate […] this distribution is then the same in all regimes [8, Section 8, p. 54].

Similarly, in the potential outcome framework, it is assumed that intervention on a treatment variable does not affect variables whose values are realized prior to that intervention, including the natural value of that treatment variable, T*, so that T*(t) = T*.

However, Dawid’s reference to T* being a covariate that is determined prior to an actual or hypothetical treatment application is perhaps surprising: if the value taken by T* is determined prior to the decision regarding the regime F_T, then this would appear to imply that, in fact, the random variables in the distributions p(T* | F_T = ∅) and p(T* | F_T = t) must live on a common probability space. But in this case, it is hard to see why the random variables in the distributions p(T*, T, Y | F_T = ∅) and p(T*, T, Y | F_T = t) should not also live on a common probability space! The primary obstacle to doing so appears to be the use of Y and T to indicate what are distinct random variables (corresponding to different regimes) that are defined on the same space. This problem can be overcome by simply using (T*, T, Y) and (T*, T(t), Y(t)) to refer to the random variables under the idle and intervention regimes, respectively; following Definition 1 in [8], this would imply that T = T* (under the idle regime) and T(t) = t (under an intervention).

An analyst who adopts this notation is not obligated to impose any additional equalities relating these random variables – such as those implied by consistency – should they not wish to do so. As we did earlier in Section 3, one might choose instead to follow Dawid by imposing only distributional consistency (see also further discussion below). However, from the perspective of the potential outcome framework, this leads to an unnecessary multiplicity of random variables and more cumbersome notation. For example, in the simple case of a binary treatment, this approach requires three random variables {Y, Y(0), Y(1)} corresponding to the response, rather than just two {Y(0), Y(1)} with consistency at the level of random variables.[17] It is unclear what is gained by assuming consistency at the level of distributions rather than individuals.
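In code, consistency at the level of random variables is just the definition Y = Y(T): each unit carries only the pair (Y(0), Y(1)) and its natural treatment value, and the observed response is read off from them (a schematic sketch; the distributions are arbitrary).

```python
import random

random.seed(2)

def draw_unit():
    # One unit: both potential outcomes plus the natural value T*.
    y0 = int(random.random() < 0.4)   # Y(0)
    y1 = int(random.random() < 0.6)   # Y(1)
    t_star = random.randint(0, 1)
    return t_star, y0, y1

def observe(unit):
    # Idle regime: T = T*, and the observed response is Y = Y(T);
    # no third random variable "Y" is needed beyond {Y(0), Y(1)}.
    t_star, y0, y1 = unit
    t = t_star
    y = y1 if t == 1 else y0
    return t, y

units = [draw_unit() for _ in range(1000)]
for u in units:
    t, y = observe(u)
    t_star, y0, y1 = u
    assert t == t_star and y == (y1 if t_star == 1 else y0)
```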

4.2.2 Dawid’s independence B

The fact that T is a deterministic function of T* and F_T means that the number of non-trivial conditional independence statements in (26) is not self-evident. A casual reader might imagine that in (26) the pair (F_T, T*) might take (|𝔛_T| + 1)|𝔛_T| different values for each value of the conditioning variable T, where 𝔛_T denotes the state space of T. However, given T = t, there are only |𝔛_T| + 1 possible values for (F_T, T*):

T = t ⇒ (F_T, T*) ∈ {(∅, t)} ∪ {(t, s) : s ∈ 𝔛_T},

since either we are in the idle regime, so that F_T = ∅ and T* = T, or we are in the interventional regime, in which case F_T = t and T* may take any value. Thus, given T = t, (26) corresponds to a set of |𝔛_T| equalities:

(27) p(Y | T* = t, F_T = t, T = t) = p(Y | T* = t, F_T = ∅, T = t),

(28) p(Y | T* = t, F_T = t, T = t) = p(Y | T* = s, F_T = t, T = t), for s ≠ t.
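The counting argument can be verified by brute-force enumeration (a sketch; the three-point state space for T is an arbitrary choice, and IDLE stands for the idle regime F_T = ∅):

```python
IDLE = "idle"            # stands for the idle regime, F_T = "empty set"
states = [0, 1, 2]       # an arbitrary finite state space for T

# Enumerate the jointly possible values of (F_T, T, T*) under the
# determinism: T = T* in the idle regime, and T = F_T otherwise.
joint = set()
for t_star in states:
    joint.add((IDLE, t_star, t_star))    # idle regime: T = T*
    for f in states:
        joint.add((f, f, t_star))        # intervention: T = F_T

for t in states:
    pairs = {(f, ts) for (f, tt, ts) in joint if tt == t}
    # Given T = t, (F_T, T*) ranges over {(idle, t)} plus {(t, s) : s in states},
    # i.e. |states| + 1 values, yielding |states| equalities in (26).
    assert pairs == {(IDLE, t)} | {(t, s) for s in states}
    assert len(pairs) == len(states) + 1
```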

4.2.3 Distributional consistency in B

Equation (27) corresponds to distributional consistency, which [8, eq. (14)] defines as:

(29) p(Y | F_T = ∅, T = t) = p(Y | T = t, F_T = t).

Dawid notes that this implies:

(30) Y ⊥⊥ F_T | T, T*

(see [8, Lemma 1]). However, this formulation also somewhat obscures the actual number of constraints: if T* ≠ T = t, then F_T = t, so that the statement becomes trivial, while if T* = T = t, then F_T takes only two possible values, ∅ and t. Given this, it becomes clear that (30) may be reformulated by defining a dynamic regime g* that “intervenes” to set T to be T*. By defining a special regime indicator, denoted F_T^*, that takes only the two values ∅ or g*, we can re-express (30) as:

(31) Y, T* ⊥⊥ F_T^*.

Note that in so doing, we do not need to refer to T [18] (see Figure 5(d) for a graphical depiction).
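The special regime g* can also be made concrete in a simulation (a sketch with an invented structural mechanism): the same treatment-to-outcome mechanism operates in every regime, and under F_T^* = g* the applied treatment is “set” to the natural value T*, so the joint distribution of (Y, T*) is unchanged, as (31) asserts.

```python
import random

random.seed(3)

def unit(regime):
    # H confounds the natural value T* and the outcome Y; the realized
    # treatment T feeds into Y through the same mechanism in all regimes.
    h = int(random.random() < 0.5)
    t_star = 1 if random.random() < (0.8 if h else 0.2) else 0
    if regime == "idle":
        t = t_star              # no intervention
    elif regime == "g_star":
        t = t_star              # dynamic regime: set T to the natural value
    else:
        t = regime              # atomic intervention F_T = t
    y = 1 if random.random() < 0.2 + 0.3 * t + 0.3 * h else 0
    return t_star, y

n = 200_000
idle = [unit("idle") for _ in range(n)]
gstar = [unit("g_star") for _ in range(n)]

def pr(data, ts, y):
    return sum(1 for r in data if r == (ts, y)) / len(data)

# (31): the joint distribution of (Y, T*) is the same whether F_T^* is
# the idle regime or the dynamic regime g*.
for ts in (0, 1):
    for y in (0, 1):
        assert abs(pr(idle, ts, y) - pr(gstar, ts, y)) < 0.01
```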

Figure 5

Encoding distributional consistency via a special dynamic regime in the setting of a reformulated decision diagram (having marginalized the intervention target “T”). (a) Reformulated augmented graph 𝒢* representing the observed joint distribution p(T*, Y | F_T); (b) augmented graph 𝒢* corresponding to p(T*, Y | F_T = ∅); (c) augmented graph 𝒢* corresponding to p(T*, Y | F_T = g*), the dynamic “regime” that “intervenes” on the target, setting it to the natural value T*; (d) a graph illustrating distributional consistency (31); here, F_T^* is a special regime indicator taking only the values ∅ and g*. The graph (d) encodes the distributional consistency assumption: the distribution over Y and T* resulting from the “intervention” g* is identical to having no intervention.

Figure 6


In terms of potential outcomes, the independence (31) may be expressed as:

(32) p(Y, T* = t) = p(Y, T* = t ∣ F_T = ∅) = p(Y, T* = t ∣ F_T = g) = p(Y(t), T*(t) = t),

which corresponds to distributional consistency (see (4)).

4.2.4 Ignorability in B

Equation (28) expresses the property of ignorability, which Dawid [8, eq. (20)] expresses as:

(33) Y ⊥⊥ T* ∣ F_T, T.

However, as Dawid himself notes, given T = t, either F_T = ∅, in which case T* = t (and the independence holds trivially), or F_T = t, so that this constraint is identical to:

(34) Y ⊥⊥ T* ∣ F_T = t, for t ∈ 𝔛_T

(see Figure 3(f) and (g)). Equivalently, in terms of potential outcomes,

(35) Y(t) ⊥⊥ T*, for t ∈ 𝔛_T

(see Figure 3(b)). Again, we note that T is not required for the purpose of expressing this condition.
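The content of (35) can be checked by exact enumeration in a toy model. The following sketch is not from the article: the structural model (latent U, potential outcome Y(t) = t XOR U, and the two conditional laws for the natural value T*) is entirely hypothetical, chosen only to show that ignorability holds when T* is randomized and fails when T* shares a cause with Y(t).

```python
from fractions import Fraction as F
from itertools import product

# Toy check of (35): Y(t) _||_ T* (the natural value of treatment).
# Hypothetical structural model: latent U in {0,1}; Y(t) = t XOR U.

def joint_Yt_Tstar(p_U, p_Tstar_given_U, t):
    """Exact joint distribution of (Y(t), T*) by enumeration over U."""
    joint = {}
    for u, tstar in product((0, 1), (0, 1)):
        y_t = t ^ u  # potential outcome under the intervention F_T = t
        w = p_U[u] * p_Tstar_given_U[u][tstar]
        joint[(y_t, tstar)] = joint.get((y_t, tstar), F(0)) + w
    return joint

def is_independent(joint):
    """Exact test that a joint over two binaries factorizes."""
    py, pt = {}, {}
    for (y, t), w in joint.items():
        py[y] = py.get(y, F(0)) + w
        pt[t] = pt.get(t, F(0)) + w
    return all(w == py[y] * pt[t] for (y, t), w in joint.items())

p_U = {0: F(1, 3), 1: F(2, 3)}
# T* randomized (independent of U) vs. confounded (depends on U):
randomized = {0: {0: F(1, 2), 1: F(1, 2)}, 1: {0: F(1, 2), 1: F(1, 2)}}
confounded = {0: {0: F(3, 4), 1: F(1, 4)}, 1: {0: F(1, 4), 1: F(3, 4)}}

print(is_independent(joint_Yt_Tstar(p_U, randomized, t=1)))  # True
print(is_independent(joint_Yt_Tstar(p_U, confounded, t=1)))  # False
```

The exact arithmetic (no sampling) makes the pass/fail of (35) unambiguous in each scenario.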

4.3 Simplification

From a graphical perspective, it is perhaps natural to wish to express the invariance of the distribution of Y given T across observational and interventional regimes by examining whether the regime indicator F_T is d-separated from Y given T. However, as we have seen, it is necessary to include what Dawid calls the ITT variable (aka the natural value of treatment) T* in order to rule out cases of spurious invariance. Furthermore, T* plays a central role in certain notions, such as the effect of treatment on the treated, that are widely used in studies applying the potential outcome framework.

As shown previously, there is no need to condition on T when describing the defining independences, and in fact doing so arguably obscures the nature of the specific assumption being made. This suggests that T, rather than T* as Dawid proposes, should be marginalized from the ITT augmented graph. Note that, if we distinguish the cases F_T = ∅ and F_T = t, the resulting graphs (modulo labeling) are isomorphic to those used in the SWIG framework (compare Figure 3(a) to (e), and (b) to (f)).

We carry out this reformulation in full generality in the next section.

5 Reformulation of decision graphs

Our proposed reformulation of decision graphs follows a strategy similar to that used for SWIGs. In contrast, Dawid aims to give ECI relations that, together with the usual independence relations over the observed variables, yield the Markov property for the augmented decision graph with ITT variables. As an alternative, we begin by defining a Markov property associated with the augmented decision graph, and then, using distributional consistency, we derive the usual observed conditional independences.[19]

It should be noted that Dawid’s independences do not actually imply the full Markov property for the ITT graph because, as indicated by the presence of a dashed edge, there are context-specific independences implied by the graph.[20] However, these are not implied by the independence relations A and B. (To see this, note that the conditions A and B would also hold for a decision DAG with the same structure, but in which T was not a deterministic function of T* and F_T, in which case the context-specific independence relations would not hold.)

Note also that these extra ECI relations are not restricted solely to those involving T. Consider, for example, the front-door graph shown in Figure 7(a). Since Dawid’s augmented decision diagram, shown in Figure 7(c), includes a dashed edge from T* to T, indicating that this edge should be removed conditional on F_T = t, the diagram implies that Y will be d-separated from F_T given M when F_T ≠ ∅. However, even though it is encoded in the augmented graph, the corresponding ECI:

(36) Y ⊥⊥ F_T ∣ M, F_T ≠ ∅

does not follow from the independences A + B.

Figure 7

(a) Front-door graph 𝒢; (b) the SWIG 𝒢(t) (with ancestral labeling); (c) the augmented decision diagram 𝒢*; and (d) the augmented decision diagram given F_T = t in which the dashed edge from T* to T is removed. Note that in (d) F_T is d-separated from Y given M. However, the corresponding extended independence, Y ⊥⊥ F_T ∣ M, F_T ≠ ∅, is not implied by Dawid’s conditions A+B.

In the potential outcome framework, the constraint (36) corresponds to:

(37) p(Y(t) ∣ M(t)) = p(Y(t′) ∣ M(t′)).

This constraint is naturally encoded by the d-separation of Y(t) from the fixed variable t given M(t) on the SWIG 𝒢(t) shown in Figure 7(b) (see [7,10,12]).

As these examples suggest, in order to capture the full Markov structure of the augmented decision diagram, including those constraints corresponding to dashed edges, it is natural to use the constraints implied by the decision diagram when no regime indicators are idle, which we express in shorthand as F_A ≠ ∅; graphically, this corresponds to removing (temporarily) all of the dashed edges. We show below that the independences thus encoded imply, via distributional consistency, the Markov property for the observed data that is encoded in the original graph.

Another advantage of this approach is that we will only require the ITT variables T*; the “applied treatment,” which Dawid [8] denotes “T,” will not be required.[21]

Specifically, consider a set of variables V_1, …, V_p. Let A ⊆ {1, …, p} be the (index) set of the targets of intervention. If i ∈ A, then V_i is the corresponding ITT variable (which Dawid denotes by X_i*). Thus, the set V_1, …, V_p consists of ITT variables as well as variables that are not in A and hence not targets of intervention.[22] Thus, under the regime where every intervention target has been intervened upon, so that F_i ≠ ∅ for all i ∈ A, the variables V_1, …, V_p correspond to the random variables in the SWIG 𝒢(a).

For every intervention target i ∈ A, let g_i denote the dynamic regime that “intervenes” to set the intervention target to its natural value V_i. Let F_i be a regime indicator taking the states ∅ or g_i, in addition to the values b ∈ 𝔛_i.

Definition 13

(Distributional consistency for decision diagrams) The kernel p(V ∣ F_A) is said to obey distributional consistency if, given B_i ∈ A and C ⊆ A \ {B_i}, where C may be empty,

(38) V ⊥⊥ F_i ∣ F_C, F_C ≠ ∅,

where we use the shorthand F_C ≠ ∅ to indicate that F_j ≠ ∅ for all j ∈ C.

Note that, taking Y = V \ {B_i}, (38) is equivalent to the following equality, which corresponds exactly to (3):

(39) p(Y = y, B_i = b ∣ F_i = b, F_C = c) = p(Y = y, B_i = b ∣ F_i = g_i, F_C = c, F_C ≠ ∅)

(40) = p(Y = y, B_i = b ∣ F_i = ∅, F_C = c, F_C ≠ ∅) = p(Y = y, B_i = b ∣ F_i = ∅, F_C = c) = p(Y = y, B_i = b ∣ F_C = c).

Here, the second equality follows from (38), while the first and third equalities hold via the definitions of g_i and of the shorthand F_C ≠ ∅.

As observed by Dawid, in place of Definition 13, we could instead have defined distributional consistency, without reference to the dynamic regime g_i, by simply equating (39) and (40). We have chosen to make use of g_i in order to emphasize what we see as the tautological nature of distributional consistency, while also formulating condition (38) as a conditional independence.
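The tautological flavor of the chain (39)–(40) can be made concrete with a toy kernel. The sketch below is not from the article: the latent cause U, the law of the natural value B, and the outcome rule Y(b) = b XOR U are all hypothetical. The kernel always records the ITT (natural) value B, while Y is generated from the applied treatment, which equals B under both the idle regime ∅ and the natural-value regime g.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical structural model generating a kernel p(B, Y | F):
# U latent, B the natural (ITT) value of the target, Y(b) = b XOR U.
p_U = {0: F(1, 4), 1: F(3, 4)}
p_B_given_U = {0: {0: F(2, 3), 1: F(1, 3)}, 1: {0: F(1, 5), 1: F(4, 5)}}

def kernel(f):
    """Exact distribution of (B, Y) under regime F = f,
    where f is 0 or 1 (set treatment), 'g' (set to natural value),
    or 'idle' (the empty regime, written ∅ in the text)."""
    dist = {(b, y): F(0) for b, y in product((0, 1), (0, 1))}
    for u, b_nat in product((0, 1), (0, 1)):
        applied = b_nat if f in ('idle', 'g') else f  # treatment given
        y = applied ^ u
        dist[(b_nat, y)] += p_U[u] * p_B_given_U[u][b_nat]
    return dist

# The chain (39)-(40): conditional on the event B = b, the regimes
# F = b, F = g and F = idle induce the same joint distribution.
for b, y in product((0, 1), (0, 1)):
    assert kernel(b)[(b, y)] == kernel('g')[(b, y)] == kernel('idle')[(b, y)]
print("distributional consistency holds")
```

Note that the equality is only asserted on events where B = b; for B ≠ b the regimes F = b and F = ∅ genuinely differ, which is exactly why the conditioning on B_i = b in (39)–(40) matters.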

The following four lemmas are reformulations of Lemmas 3–6 in the decision diagram framework. Though the proofs are largely translations of those lemmas, we include them here for completeness.

Lemma 14

If p(V ∣ F_A) obeys distributional consistency, and B and C are disjoint subsets of A, where C may be empty, then:

(41) V ⊥⊥ F_B ∣ F_C, F_C ≠ ∅,

where F_B is the set {F_i : i ∈ B}.

Proof

This follows by induction on the size of B.□

Lemma 15

Let B and C be disjoint subsets of A, where C may be empty, and let Y and W be disjoint subsets of V \ B; then distributional consistency implies:

(42) Y ⊥⊥ F_B ∣ B, W, F_C, F_C ≠ ∅.

Proof

This follows by applying (extended) graphoid axioms to (41).□

Lemma 16

Let B and C be disjoint subsets of A, where C may be empty. If B ⊆ W, then under distributional consistency:

(43) W ⊥⊥ F_B ∣ F_C, F_{B∪C} ≠ ∅ ⟹ W ⊥⊥ F_B ∣ F_C, F_C ≠ ∅.

Proof

Let b ∈ 𝔛_B, c ∈ 𝔛_C, and w ∈ 𝔛_W. Given the LHS of (43), it is sufficient to prove that p(X_W ∣ F_B = b, F_C = c) = p(X_W ∣ F_B = ∅, F_C = c). Now:

p(X_W = w ∣ F_B = b, F_C = c)
= p(X_{W\B} = w_{W\B}, X_B = w_B ∣ F_B = b, F_C = c)
= p(X_{W\B} = w_{W\B}, X_B = w_B ∣ F_B = w_B, F_C = c)
= p(X_{W\B} = w_{W\B}, X_B = w_B ∣ F_B = g_B, F_C = c)
= p(X_{W\B} = w_{W\B}, X_B = w_B ∣ F_B = ∅, F_C = c)
= p(X_W = w ∣ F_B = ∅, F_C = c).

The second equality uses the premise of (43), the third is by definition of g_B, and the fourth is distributional consistency via Lemma 14.□

Lemma 17

Let B and C be disjoint subsets of A, where C may be empty. Let Y and W be disjoint sets with B ⊆ W; then under distributional consistency:

(44) Y ⊥⊥ F_B ∣ W, F_C, F_{B∪C} ≠ ∅ ⟹ Y ⊥⊥ F_B ∣ W, F_C, F_C ≠ ∅.

Proof

Similar to the proof of Lemma 16, given the premise in (44), it suffices to show that p(X_Y ∣ X_W, F_B = b, F_C = c) = p(X_Y ∣ X_W, F_B = ∅, F_C = c).

p(X_Y ∣ X_W = w, F_B = b, F_C = c)
= p(X_Y ∣ X_{W\B} = w_{W\B}, X_B = w_B, F_B = b, F_C = c)
= p(X_Y ∣ X_{W\B} = w_{W\B}, X_B = w_B, F_B = w_B, F_C = c)
= p(X_Y ∣ X_{W\B} = w_{W\B}, X_B = w_B, F_B = g_B, F_C = c)
= p(X_Y ∣ X_{W\B} = w_{W\B}, X_B = w_B, F_B = ∅, F_C = c).

As in the previous proof, the second equality uses the premise of (44), the third is by definition of g_B, and the fourth is distributional consistency via Lemma 14.□

5.1 Reformulated augmented decision diagrams

Let 𝒢 be a DAG with a topologically ordered vertex set V = {1, …, p} representing an observed distribution p(W_V).[23] Let A ⊆ V be the subset of vertices for which interventions are well defined, and let F = {F_i : i ∈ A} be the corresponding set of regime indicators. Let 𝒢* be the extended DAG with vertex set V ∪ F, representing the kernels p(W_V ∣ F_A). As before, we use pa(i) to indicate the (index) set of the variables that are the parents of W_i in the original DAG 𝒢, and let pre(i) denote {1, …, i − 1}, the predecessors of i under a total ordering consistent with 𝒢.[24]

Definition 18

The kernel p(W_V ∣ F_A) will be said to obey the augmented DAG local Markov property for the DAG 𝒢* if for all i ∈ V:

(45) W_i ⊥⊥ F_{A\pa(i)}, W_{pre(i)\(pa(i)\A)} ∣ W_{pa(i)\A}, F_{A∩pa(i)}, F_A ≠ ∅,

where F_{A∩pa(i)} is a shorthand for the set of indicators F_j with j ∈ A ∩ pa(i).

This formulation captures the Markov property necessary for the augmented diagram, including the context-specific independences that arise from interventions (which are not captured directly in Dawid’s A + B formulation).

Note that this property follows from d -separation applied to the graph in which we intervene on every vertex in A . We will show that under distributional consistency, this property implies factorization of the observed distribution with respect to the original graph.

However, it is useful first to further decompose the sets on the RHS of the independence. Specifically, we divide the regime indicators that are not parents of i into those that occur after i and those that are prior to i:

F_{A\pre(i)} together with F_{(A∩pre(i))\pa(i)}, so that F_{A\pa(i)} = (F_{A\pre(i)}, F_{(A∩pre(i))\pa(i)}).

Similarly, we divide the set of random variables that are prior to i and either in A or not parents of i into those that are not parents and those that are parents in A:

W_{pre(i)\(pa(i)\A)} = (W_{pre(i)\pa(i)}, W_{pa(i)∩A}).

Thus, independence (45) becomes:

(46) W_i ⊥⊥ F_{A\pre(i)} (time order), F_{(A∩pre(i))\pa(i)} (causal Markov prop.), W_{pre(i)\pa(i)} (assoc. Markov prop.), W_{pa(i)∩A} (ignorability) ∣ F_{A∩pa(i)} (fixed parents), W_{pa(i)\A} (random parents), F_A ≠ ∅ (intervene on all of A).

Consequently, independence (46) captures the following:

  • Later interventions have no effect on earlier distributions (time order).

  • Given intervention on all earlier targets, the specific value of an intervention does not affect the distribution of a variable given its non-intervened parents unless the intervened-on variable is itself a parent (causal Markov property).

  • Independence from earlier random variables given non-intervened parents (associational Markov property).

  • An intervention on a parent of a variable renders that variable independent of the natural value of the intervention target conditional on its other non-intervened parents (ignorability).

5.2 Example

In Table 4, we show the reformulated decision diagram Markov property corresponding to the augmented DAG 𝒢*, as shown in Figure 2(c). Note that the local property here corresponds naturally to the graph under the regime F_0 = x_0, F_1 = x_1 displayed in Figure 2(d). In particular, note that for each random vertex, the size of the conditioning set in the defining independence (ignoring the term F_{01} ≠ ∅) is equal to the number of parents that the vertex has in Figure 2(d).

Table 4

Defining properties for the reformulated decision diagram corresponding to Figure 2, under the ordering (H, X_0, Z, X_1, Y)

Reformulated decision diagram local property
H ⊥⊥ F_0, F_1 ∣ F_{01} ≠ ∅
X_0 ⊥⊥ H, F_0, F_1 ∣ F_{01} ≠ ∅
Z ⊥⊥ X_0, F_1 ∣ H, F_0, F_{01} ≠ ∅
X_1 ⊥⊥ X_0, F_0, F_1 ∣ H, Z, F_{01} ≠ ∅
Y ⊥⊥ H, X_0, X_1, F_0 ∣ Z, F_1, F_{01} ≠ ∅

Here, as elsewhere, F_{01} ≠ ∅ is a shorthand for (F_0 ≠ ∅ & F_1 ≠ ∅).
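The rows of Table 4 are generated mechanically by the set operations in (45). The sketch below is ours, not the article's: the parent sets are read off Table 4 itself (an assumption on our part about Figure 2), and the helper name local_property is hypothetical.

```python
# Mechanically generate the defining independences (45) for the graph of
# Figure 2, with assumed parent sets: pa(H) = pa(X0) = {}, pa(Z) = {H, X0},
# pa(X1) = {H, Z}, pa(Y) = {Z, X1}; intervention targets A = {X0, X1}.

order = ['H', 'X0', 'Z', 'X1', 'Y']
pa = {'H': set(), 'X0': set(), 'Z': {'H', 'X0'},
      'X1': {'H', 'Z'}, 'Y': {'Z', 'X1'}}
A = {'X0', 'X1'}

def local_property(v):
    """Return (left, right) sets of (45): v _||_ left | right, F_A != idle."""
    pre = set(order[:order.index(v)])
    # left of _||_ : indicators for non-parent targets, plus earlier
    # variables other than the non-intervened parents
    left = {f'F_{a}' for a in A - pa[v]} | (pre - (pa[v] - A))
    # right of |  : non-intervened parents, plus indicators of
    # intervened-on parents ("fixed parents")
    right = (pa[v] - A) | {f'F_{a}' for a in A & pa[v]}
    return left, right

for v in order:
    left, right = local_property(v)
    print(v, '_||_', sorted(left), '|', sorted(right), ', F_A != idle')
```

Running this reproduces exactly the five rows of Table 4, which is a useful sanity check on the decomposition in (46).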

5.3 Consequences of the local Markov property

Lemma 19

If the kernel p(W_V ∣ F_A) obeys distributional consistency and the augmented DAG local Markov property w.r.t. 𝒢*, then:

(47) p(W_i ∣ W_{pre(i)}, F_A = a)

(48) = p(W_i ∣ W_{pre(i)}, F_{pre(i)∩A} = a_{pre(i)∩A})

(49) = p(W_i ∣ W_{pre(i)}, F_{pa(i)∩A} = a_{pa(i)∩A})

(50) = p(W_i ∣ W_{pa(i)}, F_{pa(i)∩A} = a_{pa(i)∩A})

(51) = p(W_i ∣ W_{pa(i)\A}, F_{pa(i)∩A} = a_{pa(i)∩A}).

Proof

Here, (48) follows from Lemma 16 since by Definition 18,

W_{pre(i)∪{i}} ⊥⊥ F_{A\pre(i)} ∣ F_{A∩pre(i)}, F_A ≠ ∅.

Similarly, (49) follows from Lemma 17 since by the local Markov property:

W_i ⊥⊥ F_{(A∩pre(i))\pa(i)} ∣ F_{A∩pa(i)}, F_{A∩pre(i)} ≠ ∅.

Finally, (50) and (51) again follow from the local Markov property since

W_i ⊥⊥ W_{pre(i)\pa(i)}, W_{pa(i)∩A} ∣ W_{pa(i)\A}, F_A = a;

hence p(W_i ∣ W_{pre(i)} = w, F_A = a) does not depend on w_{pre(i)\(pa(i)\A)} = (w_{pre(i)\pa(i)}, w_{pa(i)∩A}).□

5.4 Markov property for the observed distribution

The following result shows that the reformulated local Markov property implies, via distributional consistency, the ordinary local Markov property for the observed distribution. This result corresponds to Theorem 10.

Theorem 20

If the kernel p(W_V ∣ F_A) obeys distributional consistency and the augmented DAG local Markov property w.r.t. 𝒢*, then p(W_V) obeys the usual local Markov property w.r.t. 𝒢.

Proof

Let w ∈ 𝔛_{pre(i)} and w_i ∈ 𝔛_i.

(52) p(W_i = w_i ∣ W_{pre(i)} = w) = p(W_i = w_i ∣ W_{pre(i)} = w, F_{pre(i)∩A} = w_{pre(i)∩A}) = p(W_i = w_i ∣ W_{pa(i)\A} = w_{pa(i)\A}, F_{pa(i)∩A} = w_{pa(i)∩A}).

Here, the first equality follows by distributional consistency. The second follows directly from the equality of (48) and (51) in Lemma 19. Since the last line depends only on w_{pa(i)}, the ordered local Markov property for the DAG holds.□

5.5 Identifiability

The next result shows that the reformulated local Markov property implies that the kernel p(V ∣ F_A) will be identified from the distribution of the observables, provided that the relevant conditional distributions are identified (from the distribution of the observables). This result corresponds to Theorem 11.

Theorem 21

Suppose the kernel p(W_V ∣ F_A) obeys distributional consistency and the augmented DAG local Markov property w.r.t. 𝒢*. Let a be an assignment to the intervention targets in A, and let v be an assignment to W_V. Then, for every i:

(53) p(W_i = v_i ∣ F_A = a, W_{pre(i)} = v_{pre(i)}) = p(W_i = v_i ∣ W_{pa(i)∩A} = a_{pa(i)∩A}, W_{pa(i)\A} = v_{pa(i)\A}).

Consequently, p(W_V ∣ F_A = a) is identified given p(W_V) and obeys the Markov property for the DAG formed from 𝒢 by removing all outgoing edges from vertices in A.

As before, we note that equality (53) corresponds to the property referred to as “modularity” in the SWIG formulation, which is also an instance of the extended g-formula of [2,14].

Proof

Let a ∈ 𝔛_A and v ∈ 𝔛_V. Now:

(54) p(W_i = v_i ∣ W_{pre(i)} = v_{pre(i)}, F_A = a)
= p(W_i = v_i ∣ W_{pa(i)∩A} = v_{pa(i)∩A}, W_{pa(i)\A} = v_{pa(i)\A}, F_{A∩pa(i)} = a_{pa(i)∩A})
= p(W_i = v_i ∣ W_{pa(i)∩A} = a_{pa(i)∩A}, W_{pa(i)\A} = v_{pa(i)\A}, F_{A∩pa(i)} = a_{pa(i)∩A})
= p(W_i = v_i ∣ W_{pa(i)∩A} = a_{pa(i)∩A}, W_{pa(i)\A} = v_{pa(i)\A}).

Here, the first equality follows from the equality of (47) and (50); the second follows from the equality of (50) and (51); the third by distributional consistency.□
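The identification in Theorem 21 can be illustrated numerically. The following sketch is ours, on a hypothetical three-variable DAG H → X → Y with H → Y and a single target A = {X}; all distributions are invented. Here modularity (53) reduces to the familiar g-formula p(Y ∣ F_X = x) = Σ_h p(h) p(y ∣ x, h), which we compute from the observed joint alone and compare with the structural ground truth.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical structural model: H -> X -> Y, H -> Y, target A = {X}.
p_H = {0: F(1, 2), 1: F(1, 2)}
p_X_given_H = {0: {0: F(3, 4), 1: F(1, 4)}, 1: {0: F(1, 3), 1: F(2, 3)}}
p_Y_given_XH = {(x, h): {0: F(1 + x + h, 5), 1: F(4 - x - h, 5)}
                for x, h in product((0, 1), (0, 1))}

def p_Y_do(x):
    """Ground truth: intervening on X preserves p(H) and p(Y | X, H)."""
    return {y: sum(p_H[h] * p_Y_given_XH[(x, h)][y] for h in (0, 1))
            for y in (0, 1)}

# The observed joint distribution p(H, X, Y) (no regime indicators):
obs = {(h, x, y): p_H[h] * p_X_given_H[h][x] * p_Y_given_XH[(x, h)][y]
       for h, x, y in product((0, 1), repeat=3)}

def g_formula(x):
    """p(Y | F_X = x) computed from `obs` alone via (53)."""
    out = {}
    for y in (0, 1):
        total = F(0)
        for h in (0, 1):
            ph = sum(w for (hh, _, _), w in obs.items() if hh == h)
            pxh = sum(w for (hh, xx, _), w in obs.items()
                      if hh == h and xx == x)
            total += ph * (obs[(h, x, y)] / pxh)  # p(h) * p(y | x, h)
        out[y] = total
    return out

assert g_formula(0) == p_Y_do(0) and g_formula(1) == p_Y_do(1)
print("g-formula matches the interventional distribution")
```

The exact rational arithmetic confirms the two distributions agree term by term, which is what identification of p(W_V ∣ F_A = a) from p(W_V) asserts.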

5.6 Distributions resulting from fewer interventions

As in the SWIG case, a similar argument applies if we consider interventions on a subset B ⊆ A. This result corresponds to Theorem 12.

Theorem 22

Suppose the kernel p(W_V ∣ F_A) obeys distributional consistency and the augmented DAG local Markov property w.r.t. 𝒢*. Let b be an assignment to the intervention targets in B ⊆ A, and let w be an assignment to W_V. Then, for every i:

(55) p(W_i = w_i ∣ F_B = b, W_{pre(i)} = w_{pre(i)}) = p(W_i = w_i ∣ W_{pa(i)∩B} = b_{pa(i)∩B}, W_{pa(i)\B} = w_{pa(i)\B}).

Consequently, p(W_V ∣ F_B = b) is identified given p(W_V) and obeys the Markov property for the augmented DAG formed from 𝒢* by removing all outgoing edges from vertices in B and removing the regime indicators F_{A\B}.

Proof

p(W_i = w_i ∣ W_{pre(i)} = w_{pre(i)}, F_B = b)
= p(W_i = w_i ∣ W_{pre(i)} = w_{pre(i)}, F_{pre(i)∩B} = b_{pre(i)∩B})
= p(W_i = w_i ∣ W_{pre(i)} = w_{pre(i)}, F_{pre(i)∩B} = b_{pre(i)∩B}, F_{pre(i)∩(A\B)} = w_{pre(i)∩(A\B)})
= p(W_i = w_i ∣ W_{pa(i)\A} = w_{pa(i)\A}, W_{pa(i)∩B} = b_{pa(i)∩B}, W_{pa(i)∩(A\B)} = w_{pa(i)∩(A\B)})
= p(W_i = w_i ∣ W_{pa(i)\B} = w_{pa(i)\B}, W_{pa(i)∩B} = b_{pa(i)∩B}).

Here, the first equality is by Lemma 16; the second is distributional consistency; the third follows from Theorem 21 applied to 𝒢*; the fourth is a simplification.□

6 The role of “fictitious” independence in Dawid’s development

Dawid in [8] uses what he terms a “fictitious” independence in his proofs that the kernels conditioned on the regime indicators F_i obey the Markov property for the augmented DAG with ITT variables. Specifically, in his proof of Lemma 4, though not in its statement, he makes the formal assumption that

(56) F_1 ⊥⊥ F_0,

and similarly, in the proof of Theorem 1, he assumes that all the regime indicators are mutually independent [8, p. 76, eq. (82)]:

(57) F_1 ⊥⊥ F_2 ⊥⊥ ⋯ ⊥⊥ F_{k−1} ⊥⊥ F_k.

Such an independence assumption does not fit into the ECI framework used by Dawid to describe the Markov property for augmented graphs. This is because, as stated by Dawid, an ECI statement A ⊥⊥ B ∣ C must satisfy: “(a) no non-stochastic variable occurs in A, and (b) all non-stochastic variables are included in B ∪ C” [8, fn. 3]; these conditions allow independences to be viewed as well-defined restrictions on p(A ∣ B, C), since all of the non-stochastic variables appear on the right of the conditioning bar. However, an independence of the form F_i ⊥⊥ F_j violates both of these conditions.

Perhaps for this reason, Dawid argues that although his proofs make use of the assumptions (56) and (57), there is no loss of generality:

So long as all our assumptions and conclusions are in the form described in footnote 3 [i.e., satisfy (a) and (b)], any proof that uses this extended understanding only internally will remain valid […] [8, p. 63]

[…] because the premisses and conclusions of the argument relate only to distributions conditioned on the regime indicators, the extra assumption of variation independence is itself inessential, and can be regarded as just another “trick.” [8, p. 63, fn.24]

We will show via an example that Dawid’s inference here is not valid: in general, the conclusion will not hold for a kernel without additional assumptions regarding the set of states taken by the non-stochastic variables. However, notwithstanding this, as we also show below, Dawid’s conclusions are still correct owing to the special structure that is present in the possible states taken by regime indicators.

6.1 Invalid implication

To illustrate the issue with the proof, we re-write Dawid’s equations so as to make the argument transparent. Dawid makes the following claim:

Claim 23

Consider a kernel q(x, y ∣ a, b), with stochastic variables X and Y and non-stochastic variables A and B. If the following ECI restrictions hold:

(58) Y ⊥⊥ A ∣ B, X,

(59) Y ⊥⊥ B ∣ A, X,

then it follows that:

(60) Y ⊥⊥ A, B ∣ X.

To relate this to Dawid’s proof of Lemma 4 in [8], A = F_0, B = F_1, X = X_0, and Y = {H, Z, X_1}. Thus, (56), (58), (59), and (60) correspond to Dawid’s equations (41), (49), (50), and (51), respectively.

In the proof of this claim, Dawid makes use of the “fictitious” independence A ⊥⊥ B, but as noted above, he argues that this “internal” assumption may be made without loss of generality. To see that this implication does not hold without additional conditions on the state spaces for A and B, suppose that the non-stochastic pair (A, B) ∈ S ≔ {−2, −1}² ∪ {1, 2}², so that (A, B) takes one of the following eight states:

(−2, −2), (−2, −1), (−1, −2), (−1, −1), (1, 1), (1, 2), (2, 1), (2, 2).

Note that, by construction, the non-stochastic variables are not variation independent; they always share the same sign. Now, let P_−(Y, X) and P_+(Y, X) be any pair of distributions over (X, Y) such that P_+(Y ∣ X) ≠ P_−(Y ∣ X), and define the kernel p(Y, X ∣ a, b) for (a, b) ∈ S as follows:

(61) p(Y, X ∣ a, b) = P_−(Y, X) if both a, b < 0; P_+(Y, X) if both a, b > 0; undefined otherwise.

By construction, if B = b < 0, then for all a such that (a, b) ∈ S, i.e., a ∈ {−2, −1}, it holds that

(62) p(Y ∣ A = a, B = b, X) = P_−(Y ∣ X),

so (58) holds when B = b < 0. The argument when B = b > 0 is symmetric, since in this case, for all a such that (a, b) ∈ S, we have p(Y ∣ A = a, B = b, X) = P_+(Y ∣ X). Hence, (58) holds for all b ∈ {−2, −1, 1, 2}. A symmetric argument replacing A with B shows that (59) also holds.

However, the conclusion (60) fails since, by construction:

(63) p(Y ∣ A = −2, B = −2, X) = P_−(Y ∣ X) ≠ P_+(Y ∣ X) = p(Y ∣ A = 2, B = 2, X).

The implication in the claim corresponds to an ECI instance of the intersection axiom (CI5) introduced by Dawid in his classic articles [17,18]. As he notes in several places in [8], this implication is well known not to hold in general.[25] Our counterexample simply serves to show that, even though there is no distribution over the non-stochastic variables, the implication will not hold if the non-stochastic variables are not variation independent.
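The counterexample above is finite and can be verified by direct computation. The sketch below is a straightforward translation of the construction: the state space shares signs across coordinates, and the two joint distributions (here invented concrete choices for the arbitrary P_− and P_+) have different conditionals of Y given X.

```python
from fractions import Fraction as F
from itertools import product

# Non-stochastic pair (A, B) ranges over S = {-2,-1}^2 u {1,2}^2.
S = [(a, b) for a, b in product((-2, -1), repeat=2)] + \
    [(a, b) for a, b in product((1, 2), repeat=2)]

# Concrete (hypothetical) choices for P_- and P_+ over (X, Y):
p_minus = {(0, 0): F(1, 2), (0, 1): F(1, 4), (1, 0): F(1, 8), (1, 1): F(1, 8)}
p_plus  = {(0, 0): F(1, 8), (0, 1): F(1, 8), (1, 0): F(1, 4), (1, 1): F(1, 2)}

def kernel(a, b):
    """The kernel (61): defined only on S, by the sign of (a, b)."""
    assert (a, b) in S, "kernel undefined outside S"
    return p_minus if a < 0 else p_plus

def cond_Y_given_X(dist):
    px = {x: dist[(x, 0)] + dist[(x, 1)] for x in (0, 1)}
    return {(x, y): dist[(x, y)] / px[x] for x, y in product((0, 1), (0, 1))}

# (58): for each fixed b, p(Y | A = a, B = b, X) does not depend on a;
# (59) holds by the symmetric argument in b.
for b in (-2, -1, 1, 2):
    rows = [cond_Y_given_X(kernel(a, b)) for a in (-2, -1, 1, 2) if (a, b) in S]
    assert all(r == rows[0] for r in rows)

# ... yet the conclusion (60) fails, as in (63):
assert cond_Y_given_X(kernel(-2, -2)) != cond_Y_given_X(kernel(2, 2))
print("(58) and (59) hold on S, but (60) fails")
```

Because every premise is checked exhaustively over S, the failure of the intersection-style conclusion is established by enumeration rather than argument.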

6.2 Validity of the conclusion for regime indicators

That the implications used in Dawid’s proofs of Lemma 4 and Theorem 1 do not hold without conditions on the joint state space for the regime indicators may at first seem to call into question Dawid’s conclusions. However, at least in causal theories making use of DAG representations and involving multiple treatments, the decisions as to whether to intervene and, if so, which value to enforce are unconstrained. Consequently, variation independence will hold, and hence the conclusion will be valid.

However, there are situations in which interventions may be constrained. For example, suppose that there are two treatment strategies for a medical condition, each involving two separate stages (A_1 and A_2). At time t = 1, the doctor must decide between strategies “1” and “2.” It is easy to imagine situations in which, if treatment was commenced at time 1, the treatment at time 2 involves “completing” the treatment that was started at time 1, for example, removing surgical stitches from the specific operation performed at time 1. In this case, the treatment options available at time 2 are constrained by the decision made at time 1.

Reflecting this, there have been causal decision theories proposed in which variables do not live in a product space (see [22]). Likewise, in the potential outcome framework, the formulation of causally interpreted structured tree graphs given by [2] also allows for this possibility.

However, even in this case, Dawid’s implication will still hold, provided that the following condition obtains.

Definition 24

Let ℱ_A ⊆ ×_{i∈A} (𝔛_i ∪ {∅}) indicate the (constrained) state space for the set of regime indicators F_A.

(64) For all f ∈ ℱ_A s.t. f ≠ (∅, …, ∅), there exists i ∈ A s.t. f_i ≠ ∅ and (f_{−i}, ∅) ∈ ℱ_A,

where f_{−i} indicates the values assigned to A \ {i} by f.

In words, this states that for any possible setting of the regime indicators in which they are not all “idle,” there exists some intervention target A_i that is intervened upon under f, which could instead have not been intervened upon, such that the resulting vector (f_{−i}, ∅) is still a valid value for F_A.

This condition may still hold in settings in which, if a later target is intervened upon, the regime under which an earlier target is set to “idle” is not well defined. For example, an intervention on A_2 setting F_2 = 1 may only be well defined if F_1 = 1, but not if F_1 = ∅. In the aforementioned treatment-completion example, this would be the case if, in the absence of an intervention on A_1, some patients would receive treatment 2 at time 1, so that the subsequent intervention F_2 = 1 would not be well defined. If the same holds for F_2 = 2, then F_1 = ∅ implies F_2 = ∅.[26] The condition (64) will always hold provided that treatment decisions follow a time order and that, regardless of the decisions that have occurred previously, it is always possible to replace the “last” intervention with the idle regime.

It is easy to see that under condition (64), for any $f \in \mathcal{F}_A$, there will exist a sequence $(f = f^0, f^1, \ldots, f^q = \emptyset)$ such that for $j = 1, \ldots, q$, $f^j \in \mathcal{F}_A$, and $f^j$ contains one more idle regime indicator than $f^{j-1}$. It then follows under this condition that:

(65) if for all $i \in A$, $W \perp\!\!\!\perp F_i \mid F_{A \setminus \{i\}}$, then $W \perp\!\!\!\perp F_A$,

where the conditional independence statements implicitly quantify over all the assignments to $F_A$ that are in $\mathcal{F}_A$, and hence valid.
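The peeling argument above is easy to mechanize. The following sketch is a hypothetical illustration of condition (64): the regime space `F_A`, the `IDLE` marker, and both helper functions are our own constructions, not part of the paper's formalism. It checks (64) for a small constrained regime space modeled on the two-stage treatment-completion example and builds the sequence $(f = f^0, \ldots, f^q)$ by idling one regime indicator at a time:

```python
# Hypothetical illustration of condition (64); names are ours, not the paper's.
IDLE = None  # plays the role of the "idle" setting, written as the empty regime

# Constrained regime space for the two-stage treatment-completion example:
# stage 2 may only "complete" the strategy begun at stage 1.
F_A = {
    (IDLE, IDLE),          # fully idle regime
    (1, IDLE), (2, IDLE),  # intervene at stage 1 only
    (1, 1), (2, 2),        # complete the strategy begun at stage 1
}

def satisfies_64(regimes):
    """Condition (64): every regime f that is not fully idle has some
    coordinate i with f_i non-idle such that idling f_i stays in F_A."""
    for f in regimes:
        if all(c is IDLE for c in f):
            continue
        if not any(
            c is not IDLE
            and tuple(IDLE if j == i else d for j, d in enumerate(f)) in regimes
            for i, c in enumerate(f)
        ):
            return False
    return True

def peeling_sequence(f, regimes):
    """Build the sequence (f = f^0, ..., f^q = fully idle) guaranteed by
    (64), idling one regime indicator per step while staying in F_A."""
    seq = [f]
    while any(c is not IDLE for c in f):
        for i, c in enumerate(f):
            g = tuple(IDLE if j == i else d for j, d in enumerate(f))
            if c is not IDLE and g in regimes:
                f = g
                seq.append(f)
                break
    return seq

print(satisfies_64(F_A))              # True
print(peeling_sequence((2, 2), F_A))  # [(2, 2), (2, None), (None, None)]
```

Because (64) is a per-regime condition and every intermediate vector remains in `F_A`, the greedy peeling always finds a coordinate to idle and terminates.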

Acknowledgments

We thank Ilya Shpitser and Philip Dawid for helpful comments and discussions.

  1. Funding information: The authors completed work on this article while visiting the American Institute for Mathematics and the Simons Institute, Berkeley. The authors were supported by ONR Grant N000141912446; Robins was also supported by NIH Grant R01 AI032475.

  2. Conflict of interest: The authors state no conflicts of interest.

Appendix

A.1 Conditions implying the SWIG local Markov property for G given that p ( V ) factors with respect to G

Here, we show that if $P_A$ obeys the SWIG local Markov property corresponding to a complete graph $\overline{G}$ and, further, the observed distribution $p(V)$ is positive and obeys the local Markov property for a subgraph $G$ of $\overline{G}$, then it follows from distributional consistency that $P_A$ also obeys the SWIG local Markov property corresponding to $G$.

Theorem 25

Suppose $P_A$ obeys distributional consistency and the SWIG ordered local Markov property for $\overline{G}$, a complete DAG.[27] If $p(V)$ is positive and obeys the Markov property for a subgraph $G$ of $\overline{G}$, then $P_A$ obeys the SWIG ordered local Markov property for $G$.

Proof

Let $v \in \mathfrak{X}_{\mathrm{pre}(i)}$, $v_i \in \mathfrak{X}_i$, and $a \in \mathfrak{X}_A$.

(A1) $p(X_i(a) = v_i \mid X_{\mathrm{pre}(i)}(a) = v_{\mathrm{pre}(i)})$

(A2) $= p(X_i = v_i \mid X_{\mathrm{pre}(i) \setminus A} = v_{\mathrm{pre}(i) \setminus A},\ X_{\mathrm{pre}(i) \cap A} = a_{\mathrm{pre}(i) \cap A})$

(A3) $= p(X_i = v_i \mid X_{\mathrm{pa}(i) \setminus A} = v_{\mathrm{pa}(i) \setminus A},\ X_{\mathrm{pa}(i) \cap A} = a_{\mathrm{pa}(i) \cap A})$.

Here, the first equality follows from (22), which holds under the local Markov property for $\overline{G}$. The second equality is due to the local Markov property for $p(V)$. Consequently, we see that (A1) depends neither on $v_{\mathrm{pre}(i) \setminus (\mathrm{pa}(i) \setminus A)}$ nor on $a_{A \setminus \mathrm{pa}(i)}$, as required by the SWIG local Markov property. Note that positivity is used here to ensure that (A2) and (A3) are equal for all assignments to the variables in the conditioning events.□
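The key step in the proof, that conditioning on all predecessors collapses to conditioning on the parents under the Markov property for $G$, can be checked numerically. The sketch below is our own illustration, assuming binary variables and the chain $X_1 \to X_2 \to X_3$, so that $\mathrm{pa}(3) = \{2\}$ while $\mathrm{pre}(3) = \{1, 2\}$; it verifies the analogue of the equality (A2) $=$ (A3):

```python
import numpy as np

rng = np.random.default_rng(0)

# A positive distribution over binary (X1, X2, X3) that factors along
# the chain G: X1 -> X2 -> X3 (so pa(3) = {2}, while pre(3) = {1, 2}).
p1 = np.array([0.4, 0.6])           # p(x1)
p2 = rng.dirichlet([2, 2], size=2)  # p(x2 | x1), one row per x1
p3 = rng.dirichlet([2, 2], size=2)  # p(x3 | x2), one row per x2

# Joint p(x1, x2, x3) = p(x1) p(x2|x1) p(x3|x2): positive by construction.
joint = p1[:, None, None] * p2[:, :, None] * p3[None, :, :]
assert np.isclose(joint.sum(), 1.0)

# p(x3 | x1, x2): condition on ALL predecessors, as in (A2).
cond_pre = joint / joint.sum(axis=2, keepdims=True)

# The step (A2) = (A3): this conditional does not depend on x1,
# i.e., it equals p(x3 | pa(3)) = p(x3 | x2) for every value of x1.
for x1 in range(2):
    assert np.allclose(cond_pre[x1], p3)
print("p(x3 | x1, x2) = p(x3 | x2) for every x1")
```

Positivity of the joint guarantees that every conditioning event in the check has nonzero probability, so the conditionals are defined for all assignments.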

This result is similar in spirit to Dawid's construction in that it provides conditions that, in conjunction with the observed distribution $p(V)$ obeying the Markov property for $G$, are sufficient to imply that $P_A$ obeys the SWIG local Markov property for $G$. The SWIG ordered local Markov property on $P_A$ for a complete graph $\overline{G}$ corresponds to the FFRCISTG of [2] in the case where $A$ represents the finest, i.e., largest, set of treatment variables for which well-defined counterfactuals exist, and there are no (population- or individual-level) exclusion restrictions.

Note that for a complete graph $\overline{G}$, for every variable $i$, $\mathrm{pre}(i) = \mathrm{pa}(i)$. Consequently, the SWIG local Markov properties (11) and (13) reduce to requiring that for every $i$:

(A4) $X_i(a) \;\perp\!\!\!\perp_d\; \underbrace{X_{\mathrm{pre}(i) \cap A}(a)}_{\text{ignorability}},\ \underbrace{x_{A \setminus \mathrm{pre}(i)}}_{\text{time order}} \;\Big|\; \underbrace{x_{A \cap \mathrm{pre}(i)}}_{\text{fixed predecessors}},\ \underbrace{X_{\mathrm{pre}(i) \setminus A}(a)}_{\text{random predecessors}}.$

Thus, we see that the SWIG local Markov property for the complete graph $\overline{G}$ solely imposes ignorability and that interventions in the future do not change (the distribution of) variables in the past.

The single-graph approach given by Definition 7 and the two-graph construction of Theorem 25 each have their own strengths and weaknesses:

  • In the single-graph approach, the model places restrictions on $P_A$; distributional consistency then implies the relevant SWIG Markov properties for all the other distributions in $P_A$, including the factual distribution $p(V)$. This approach is more concise insofar as it requires fewer conditions, and it does not require $p(V)$ to be positive.

  • In the two-graph construction, the graph $G$ specifies conditional independence restrictions on the observed distribution $p(V)$ via an ordinary Markov property, while the SWIG Markov property for the complete supergraph $\overline{G}$ imposes ignorability and a total time order on $P_A$. Under positivity for $p(V)$, distributional consistency then implies the relevant SWIG Markov properties for every distribution in $P_A$. Though it requires more conditions, this approach has the advantage that it clearly demarcates a set of additional conditions that, when added to the assumption that $p(V)$ obeys the Markov property for $G$, suffice to construct the full model on $P_A$.

The fact that the single-graph approach does not require positivity can be seen as an advantage since it does not restrict the set of observed distributions. As a consequence, the graph in the single-graph approach may include edges that indicate effects arising from interventions on A that set variables to configurations that have probability zero under the observed distribution. Even in the absence of confounding, such effects may only be detectable via randomized experiments (see [23] for further discussion).

A.2 Derivation of part of the augmented DAG local Markov property for G from p ( V )

Similar to our development in Section A.1, and also to Dawid's construction, we provide conditions on the kernel $p(W_V \mid F_A)$ that, in conjunction with a positive observed distribution $p(V)$ obeying the local Markov property for a subgraph $G$, suffice to ensure that $p(W_V \mid F_A)$ obeys the local Markov property for the corresponding augmented graph. These conditions are formulated in terms of a decision diagram corresponding to a complete DAG $\overline{G}$ that contains $G$ as a subgraph.

Theorem 26

Suppose that $p(W_V \mid F_A)$ obeys distributional consistency and the augmented DAG local Markov property with respect to $\overline{G}$, where $\overline{G}$ is a complete DAG. If $p(V)$ is positive and obeys the (ordinary) Markov property for a subgraph $G$ of $\overline{G}$, then $p(W_V \mid F_A)$ also obeys the augmented DAG local Markov property for $G$.

For a complete graph $\overline{G}$, for every variable $i$, $\mathrm{pre}(i) = \mathrm{pa}(i)$, and thus, the local Markov property for $\overline{G}$ requires that for every $i$,

(A5) $W_i \;\perp\!\!\!\perp_d\; \underbrace{W_{\mathrm{pre}(i) \cap A}}_{\text{ignorability}},\ \underbrace{F_{A \setminus \mathrm{pre}(i)}}_{\text{time order}} \;\Big|\; \underbrace{F_{A \cap \mathrm{pre}(i)}}_{\text{fixed predecessors}},\ \underbrace{W_{\mathrm{pre}(i) \setminus A}}_{\text{random predecessors}},\ \underbrace{F_A \in \mathfrak{X}_A}_{\text{intervene on all of } A}.$

Thus, similar to (A4), this imposes ignorability and that interventions in the future do not change (the distribution of) variables in the past.

Proof

Let $v \in \mathfrak{X}_{\mathrm{pre}(i)}$, $v_i \in \mathfrak{X}_i$, and $a \in \mathfrak{X}_A$.

(A6) $p(W_i = v_i \mid F_A = a,\ W_{\mathrm{pre}(i)} = v_{\mathrm{pre}(i)})$

(A7) $= p(W_i = v_i \mid W_{\mathrm{pre}(i) \cap A} = a_{\mathrm{pre}(i) \cap A},\ W_{\mathrm{pre}(i) \setminus A} = v_{\mathrm{pre}(i) \setminus A})$

(A8) $= p(W_i = v_i \mid W_{\mathrm{pa}(i) \cap A} = a_{\mathrm{pa}(i) \cap A},\ W_{\mathrm{pa}(i) \setminus A} = v_{\mathrm{pa}(i) \setminus A})$.

Here, the first equality follows from (53), which holds under the augmented local Markov property for $\overline{G}$. The second equality is due to $p(V)$ obeying the local Markov property for $G$. Consequently, we see that (A6) depends neither on $v_{\mathrm{pre}(i) \setminus (\mathrm{pa}(i) \setminus A)}$ nor on $a_{A \setminus \mathrm{pa}(i)}$, as required by the augmented graph local Markov property. Note that positivity ensures that (A7) and (A8) are equal for all assignments to the variables in the conditioning events.□
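The role positivity plays in the final step of both proofs can be seen in a two-variable toy example (our own illustration, not from the paper): when some conditioning event has probability zero under $p(V)$, the corresponding conditional is a $0/0$ expression and is simply not determined by the observed distribution.

```python
import numpy as np

# A non-positive distribution on binary (X1, X2): p(X1 = 1) = 0.
# Chain G: X1 -> X2; p(x2 | x1) is only pinned down where p(x1) > 0.
p1 = np.array([1.0, 0.0])
p2 = np.array([[0.3, 0.7],
               [0.5, 0.5]])  # the second row is never observed

joint = p1[:, None] * p2
# Conditioning on the zero-probability event X1 = 1 is 0/0: undefined.
with np.errstate(invalid="ignore"):
    cond = joint / joint.sum(axis=1, keepdims=True)
print(cond[1])  # [nan nan]: p(x2 | X1 = 1) is not determined by p(V)
```

With positivity, every such conditional is well defined, which is exactly what makes the equalities (A7) and (A8) (and (A2) and (A3)) hold for all assignments.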

References

[1] Dawid AP. Influence diagrams for causal modelling and inference. Int Stat Rev. 2002;70:161–89. 10.1111/j.1751-5823.2002.tb00354.x

[2] Robins JM. A new approach to causal inference in mortality studies with sustained exposure periods - application to control of the healthy worker survivor effect. Math Model. 1986;7:1393–512. 10.1016/0270-0255(86)90088-6

[3] Richardson TS, Robins JM. Single world intervention graphs (SWIGs): A unification of the counterfactual and graphical approaches to causality. Center for Statistics and the Social Sciences Technical Report. Seattle, Washington, USA: University of Washington; 2013. https://www.csss.washington.edu/Papers/wp128.pdf

[4] Robins JM, Richardson TS. Alternative graphical causal models and the identification of direct effects. In: Causality and psychopathology: finding the determinants of disorders and their cures. United Kingdom: Oxford University Press; 2010. 10.1093/oso/9780199754649.003.0011

[5] Imbens GW. Causality in econometrics: choice vs chance. Econometrica. 2022;90(6):2541–66. 10.3982/ECTA21204

[6] Spirtes P, Glymour C, Scheines R. Causation, prediction, and search. New York: Springer-Verlag; 1993. 10.1007/978-1-4612-2748-9

[7] Malinsky D, Shpitser I, Richardson TS. A potential outcomes calculus for identifying conditional path-specific effects. In: Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics. Proceedings of Machine Learning Research. Naha, Okinawa, Japan: PMLR; 2019.

[8] Dawid AP. Decision-theoretic foundations for statistical causality. J Causal Inference. 2021;9:39–77. 10.1515/jci-2020-0008

[9] Pearl J. Causal diagrams for empirical research. Biometrika. 1995;82(4):669–709. 10.1093/biomet/82.4.669

[10] Shpitser I, Richardson TS, Robins JM. Multivariate counterfactual systems and causal graphical models; 2021. arXiv:2008.06017. 10.1145/3501714.3501757

[11] Ghassami A, Shpitser I, Richardson TS, Robins JM. Causal models with restricted interventions; 2023. In preparation.

[12] Robins JM. Personal communication; 2018.

[13] Lauritzen SL, Dawid AP, Larsen B, Leimer HG. Independence properties of directed Markov fields. Networks. 1990;20:491–505. 10.1002/net.3230200503

[14] Robins JM, Hernán MA, Siebert U. Effects of multiple interventions. In: Ezzati M, Murray CJL, Lopez AD, Rodgers A, editors. Comparative quantification of health risks: global and regional burden of disease attributable to selected major risk factors. vol. 2. Geneva: World Health Organization; 2004. p. 2191–230.

[15] Pearl J. Causality. 2nd ed. Cambridge, UK: Cambridge University Press; 2009.

[16] Dawid AP. Causal inference without counterfactuals. J Amer Stat Assoc. 2000;95:407–48. 10.1080/01621459.2000.10474210

[17] Dawid AP. Conditional independence in statistical theory. J R Stat Soc Ser B (Methodological). 1979;41(1):1–31. 10.1111/j.2517-6161.1979.tb01052.x

[18] Dawid AP. Conditional independence for statistical operations. Ann Statist. 1980;8:598–617. 10.1214/aos/1176345011

[19] Gill R. The intersection axiom of conditional probability; 2019. https://www.slideshare.net/gill1109/the-intersection-axiom-of-conditional-probability

[20] Sullivant S. Algebraic statistics. Providence, Rhode Island, USA: American Mathematical Society; 2018.

[21] Peters J. On the intersection property of conditional independence and its application to causal discovery. J Causal Inference. 2015;3(1):97–108. 10.1515/jci-2014-0015

[22] Thwaites P, Smith JQ, Riccomagno E. Causal analysis with chain event graphs. Artif Intelligence. 2010;174(12):889–909. 10.1016/j.artint.2010.05.004

[23] Robins JM, Richardson TS, Shpitser I. An interventionist approach to mediation analysis; 2020. https://arxiv.org/abs/2008.06019

Received: 2022-02-17
Accepted: 2023-02-10
Published Online: 2023-10-25

© 2023 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
