
Decision making, symmetry and structure: Justifying causal interventions

  • David O. Johnston, Cheng Soon Ong and Robert C. Williamson
Published/Copyright: January 16, 2025

Abstract

We can use structural causal models (SCMs) to help us evaluate the consequences of actions given data. SCMs identify actions with structural interventions. A careful decision maker may wonder whether this identification is justified. We seek such a justification. We begin with decision models, which map actions to distributions over outcomes but avoid additional causal assumptions. We then examine assumptions that could justify causal interventions, with a focus on symmetry. First, we introduce conditionally independent and identical responses (CIIR), a generalisation of the IID assumption to decision models. CIIR justifies identifying actions with interventions, but is often an implausible assumption. We consider an alternative: precedent is the assumption that “what I can do has been done before, and its consequences observed,” and is generally more plausible than CIIR. We show that precedent together with independence of causal mechanisms (ICM) and an observed conditional independence can justify identifying actions with causal interventions. ICM has been proposed as an alternative foundation for causal modelling, but this work suggests that it may in fact justify the interventional interpretation of causal models.

MSC 2010: 62D20; 62A01; 68T37; 60A99

1 Introduction

Sometimes we want to make decisions supported by data. Structural causal models (SCMs) are a standard framework for addressing this kind of problem. In these models, variables of interest are identified with the nodes of a directed graph, and directed edges represent causal relationships between them. A decision maker using an SCM could bring any amount of prior information to the problem: at one extreme, the graph may just be a convenient means of representing facts about the consequences of actions that they already know to be true. At the other, a decision maker could have very little idea which graph is appropriate for their problem and may want to engage in causal discovery, where they employ general principles along with the given data to decide on a set of viable causal graphs.

Well-known difficulties of causal inference are that there are no widely accepted principles of causal discovery that will always yield a unique causal graph, and available data are typically consistent with causal graphs that admit a wide variety of interventional consequences. Even if these problems could somehow be avoided and a graph with nontrivial causal implications obtained, a decision maker must also decide how to map their available options to structural interventions on this graph. However, if the construction of the appropriate causal graph is beyond the decision maker’s prior knowledge, then the identification of options with structural interventions may also lie beyond their knowledge.

Consider, for example, an author who wants to know what genre to pick for their next writing project in order to maximise sales – science fiction or romance. Suppose they have collected a large dataset, and according to some causal discovery method, they have obtained a structural causal model (1)[1]. This model contains three variables: sales in the book’s first 12 months of life S , genre G (as judged by the bookseller) and the average sales of the author’s previous books R .

(1) [structural causal model graph relating R, G and S]

Under the operation of a perfect intervention, given a distribution P(S, G, R) estimated from the data, an intervention on genre will yield the distribution

P(S | R, do(G = x)) = P(S | R, G = x).

The author can observe their own average sales R = r. Suppose that P(S | R, G) is such that E[S | R = r, G = romance] > E[S | R = r, G = science fiction]. If the author identifies choosing to write a romance novel with the intervention do(G = romance), then this model says they will, in expectation, sell more books if they choose to write a romance novel.
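The comparison above can be read off directly from an estimate of the conditional distribution. The following is a minimal illustrative sketch in Python (ours, not from the article): it computes conditional means on synthetic stand-in data, with hypothetical variable names and a simple band around R = r standing in for exact conditioning.

    # Minimal sketch (synthetic data, hypothetical variable names): estimating
    # E[S | R ~ r, G = g] by conditioning on a narrow band of R, which under the
    # intervention reading above is also an estimate of E[S | R = r, do(G = g)].
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    n = 5000
    R = rng.normal(1000, 300, n)                      # author's average past sales
    G = rng.binomial(1, 0.5 + 0.2 * (R > 1000), n)    # 1 = romance, 0 = science fiction
    S = 0.8 * R + 400 * G + rng.normal(0, 100, n)     # first-year sales
    df = pd.DataFrame({"R": R, "G": G, "S": S})

    r = 900                                           # the author's own average sales
    band = df[np.abs(df["R"] - r) < 50]               # crude stand-in for conditioning on R = r
    est = band.groupby("G")["S"].mean()
    print("E[S | R ~ r, G = science fiction] ~", round(est.loc[0], 1))
    print("E[S | R ~ r, G = romance]         ~", round(est.loc[1], 1))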

However, the event G = romance occurs when the book is delivered to the seller and classified as romance, and there are many paths from forming the intention to write a romance novel to this event taking place. Perhaps our author is an experienced writer of science fiction but knows little about romance and wonders whether they can really write a good book in the romance genre. The fact that the causal discovery method identified graph (1) without any additional dependences – observed or otherwise – is perhaps an indication that genre-specific experience does not matter much. On the other hand, the author could also write a novel that any reasonable person would agree epitomises science fiction, but is adversarially constructed so that the bookseller’s algorithm calls it romance. It seems unreasonable to demand that the causal model tell us much about this possibility.

It would be convenient if interventions on an SCM informed us about the consequences of ordinary attempts to manipulate the intervened variable. Convenience is not, in and of itself, a good reason to believe it is true. What we actually want is a sound reason to believe it:

  • It follows from plausible assumptions that ordinary attempts to manipulate are well modelled by interventions on an SCM derived from causal discovery.

  • It is empirically observed that SCMs derived from causal discovery accurately predict the consequences of ordinary attempts to manipulate the intervened variable.

To the best of our knowledge, these questions have not been previously discussed. A related issue that has been discussed is when ambiguous interventions are well defined. For example, there are many different options that are known to affect a person’s body mass index (BMI), including diet plans, gastric surgery and limb removal [1,2]. Many authors who discussed this question agreed that there is no obvious “canonical” action to control BMI [3–5]. We observe that diet is an ordinary way to try to control BMI, gastric surgery is less common but still a reasonable option, while limb removal for the purpose of BMI control is unheard of and unreasonable. By our proposed standard, then, we might hope that a good causal model could capture the consequences of both successful dieting and gastric surgery, but need not address limb removal. The reasonable options could still have different consequences, though, which raises the question: if, somehow, a graph discovery method yielded a causal model where the effects of intervening on BMI were identified, does this mean that the consequences of both options are actually the same? If not, which options, if any, should the intervention be identified with?

One view is that decision problems with multiple means to manipulate a variable are ill-posed. Ambiguous interventions refer to the case where an intervention targets a variable that is a composite of a set of finer grained variables (in the same way, the genre of a book is a composite of the book’s textual contents, appearance, the cultural context in which it is interpreted and so forth). Previous work by Spirtes and Scheines argued that composite variables do not have clear intervention semantics [6]. Pearl appears to have advanced a similar view: “there is no way a model can predict the effect of an action unless one specifies which variables are affected by the action and how.” [7, Ch. 11]

This is a strict principle which seems to rule out common patterns of causal reasoning. Is it really necessary to model each possible strategy in great detail to conclude that the health benefits of many weight control strategies are likely to be similar, given similar success at controlling one’s weight? What’s more, just how much detail does one need to specify about which variables are affected? We cannot possibly specify everything about which variables a given action affects. Adopting a different diet will lead one to walk down different aisles at the supermarket, use different cookware to prepare food and imbue one’s kitchen with different scents – but one need not manually rule out the relevance of these and every other possible impact of shifting diet to conclude that an overweight person who adopts a high protein diet and loses weight will also reduce their risk of heart disease.

Reviewing the general issue, it is clear that SCMs can be used to predict the consequences of actions in systems we already understand – it is easy to construct toy models of familiar systems like sprinklers, rain and footpaths where the results of intervention operations match the already known consequences of actions. It is less clear how to apply them to predict the consequences of actions in systems that are less understood. On one hand, it may be possible to learn SCMs from data according to general principles of causation. On the other hand, a decision maker has a set of pragmatic actions they are considering which must be mapped to interventions in the learned SCM. It is difficult to find a method for doing this mapping that is both valid and avoids throwing away a lot of the utility of causal modelling.

We don’t know how to resolve the problem as posed, so we take a different approach. We only aim to help a decision maker predict the consequences of their pragmatic actions, and to offer general principles they can use to improve their predictions with data. Intervention operations, as defined on SCMs, have no role in this process unless they are justified by the learning principles and the data. What we show, in the end, is that sometimes they are justified.

Decision models are a formal representation of models that help a decision maker predict the consequences of their actions. They map a set of actions, assumed to be known to the decision maker, to probability distributions representing predictions of consequences. Such models have been studied previously by numerous authors [8–12]. To be useful, a decision model must come with some means of relating observed data to consequences of actions. Interventions on structural models are one such means – an intervention on a variable G will change the distribution of G while keeping the distributions of all other variables conditional on their parents identical to their previously observed values. Our purpose is to justify interventions, however, so we have to turn to alternative principles.

The first principle we consider is a generalisation of the assumption of independent and identically distributed (IID) variables. Specifically, we consider decision models with sequences of variable pairs that share independent and identical responses (IIR), where the distribution of every “output” variable conditional on the corresponding “input” variable is identical, regardless of whether it represents previously observed data or future data that will be affected by the actions of the decision maker, no matter what action is taken. Much like structural interventions, this assumption holds that the identified conditional distributions are not changed by the actions of the decision maker.

To help us think about when the IIR assumption might be justified, we relate it to a symmetry of decision models inspired by De Finetti’s representation theorem [13]. We prove that an IIR decision model with an unknown response function is equivalent to an input–output contractible (IO-contractible) decision model (Theorem 3.17). As far as justifying the IIR assumption, this is a negative result. IO-contractibility implies the interchangeability of sufficiently large quantities of previously observed data with any data arising from experimentation – a condition which is usually unreasonable. This is analogous to the common view that causal effects are typically not known to be identified in observational data.

An alternative learning principle is that of precedent. This is the assumption that – informally stated – all of the decision maker’s prospective actions have been taken before under all of their possible circumstances, and their consequences are observed in the available data. This differs from the IIR assumption in that the decision maker does not know which observations go with a particular action and particular set of circumstances. It is analogous to starting with an IIR sequence and forgetting the inputs. Because consequences will generally have a different distribution of inputs, forgetting the inputs means the observed data is no longer interchangeable with the consequences. Our key result (Theorem 4.7) combines precedent with an additional assumption: absolute continuity of conditionals (ACC). These two assumptions – precedent and ACC – together with an observation of conditional independence imply IIR holds with respect to a pair of observed variables, and so they justify treating our actions as interventions on the input variable identified by the theorem.

ACC is not a simple assumption to state: it requires that certain “higher order” probability distributions over conditional distributions remain absolutely continuous with respect to the Lebesgue measure after observing a conditional independence. We can gain an intuitive understanding of this assumption by relating it to the principle of independent causal mechanisms [14], an informal principle that says “causal mechanisms” – which are conditional distributions of effects given their parents – are in some sense independent or at least are not deterministically dependent on one another. We show that, given a particular version of the principle of independent causal mechanisms, the assumption of absolute continuity of conditionals can be justified by assuming the direction of causal relationships between certain variables. Thus, given a decision model, we offer a justification for treating actions as interventions on certain variables based on precedent (a kind of symmetry), ACC (justified by causal structure) and an observed conditional independence.

Structural causal models are usually taken to uphold the principle of independent causal mechanisms and to support the assumption that the consequences of choices can be computed via intervention operations [15]. However, logically speaking, these are separate assumptions. Our result suggests that, if we already accept the principle of independent causal mechanisms, we may be able to reduce the assumption of interventions to a symmetry (precedent) and an observed feature of the world (conditional independence).

1.1 Connections to previous work in causal inference

Our starting point is that we want models that are useful for decision makers, which motivates the formalism of “decision models.” This approach is in the tradition of the decision theoretic approach to causal inference that has been applied in slightly different ways by previous works [8–10]. Probabilistic graphical models [11], while not explicitly decision theoretic, have much in common with the decision models we study. They are sufficiently similar that methods for computing the consequences of interventions in probabilistic graphical models [12] can be adapted to decision models.

Another precursor to our work is the notion of a sequence of exchangeable observations along with “one more (possibly non-exchangeable) observation” [16]. This anticipates our effort, with the CIIR assumption, to extend the assumption of exchangeability for previously observed data to an assumption that can also apply to future data subject to a decision maker’s influence. While Lindley mentioned that this approach can be applied to questions of causation, he did not explore this deeply due to the perceived difficulty of finding a satisfactory definition of causation.

There have been a number of other works on symmetries in causal inference. Models with exchangeable potential outcomes have been used to prove several identification results [17,18]. There are similarities between exchangeable potential outcomes and independent and identical response functions. Given our focus on the relationship between our approach and the structural approach to causal modelling, a thorough treatment of similarities and differences between our approach and the potential outcome approach is beyond the scope of this article.

Conditional exchangeability is defined as the exchangeability of the non-intervened causal parents of a target variable under intervention on its remaining parents [19]. Saarela et al. suggested that this could be interpreted as a symmetry of some kinds of experiment: if, for example, patients are administered a treatment, then conditional exchangeability can be viewed as an expected invariance of results when patients in the experiment (and their concomitant treatments) are exchanged. Similar kinds of symmetry appear in several other works [10,20–23]. A key difference between all of these causal symmetries and input–output contractibility is that they are symmetries that involve altering an experiment. Input–output contractibility, which we study, is a symmetry with respect to data manipulation – under IO contractibility, certain permutations and subsets of the data have exactly the same model. Thus, IO contractibility can be defined entirely using the mathematics of decision models, without having to discuss potentially complex manipulations of hypothetical experiments.

A different kind of regularity of causal models is given by the stable unit treatment distribution assumption (SUTDA) [10] and the stable unit treatment value assumption (SUTVA) [17]. This regularity is similar to the condition of locality, a subassumption of input–output contractibility, and as with exchangeable potential outcomes, a careful examination of the similarities and differences is beyond the scope of this article.

Theorem 4.7 was inspired by causal inference by invariant prediction [24]. While both the assumptions and the conclusions drawn in that work differ from the assumptions and conclusion of Theorem 4.7, both look for variable pairs X and Y such that the distribution of Y given X doesn’t change when actions are taken. Unlike Peters et al., our result does not make use of structural interventions, and the connection to the principle of independent causal mechanisms is original to this work.

An alternative line of work shows how exchangeability-like symmetry can be used to learn causal structure [25]; in contrast, we use structure together with symmetry for the purpose of predicting the consequences of actions from data, and do not consider the problem of learning structure from data.

1.2 Outline

Section 2 outlines our mathematical framework and provides a brief reference on notation. Section 3 introduces decision models with conditionally independent and identical responses, a generalisation of conditionally independent and identically distributed variables. We then introduce and explain Theorem 3.17, a generalisation of De Finetti’s representation theorem that applies to decision models, and argue on the basis of this theorem that the assumption of conditionally independent and identical responses is often unreasonable.

Section 4 introduces the notion of precedent and then proves Theorem 4.7, which establishes that precedent, together with ACC and an observed conditional independence implies conditionally independent and identical responses for pairs of observed variables. We also show how ACC can be justified by structural assumptions, if we accept the principle of independent causal mechanisms.

2 Technical prerequisites

This section gathers some necessary technical definitions, and is included for reference. A reader who wishes to follow the arguments of the article may skip to Section 3 and refer back to this section as required.

Section 2.1 introduces the notation used in this article. Because decision models are stochastic functions rather than probability measures, we introduce in Section 2.2 some extensions of standard probabilistic concepts.

2.1 Notation

We refer readers to chapters 1, 2 and 4 of [26] for an introduction to probability theory as we use it. Here, we offer a brief overview of our notation.

We denote a measurable space (A, 𝒜). Given a collection 𝒰 of subsets of A, σ(𝒰) is the smallest σ-algebra containing 𝒰. For a set A, we write 𝒫(A) for its power set.

We write random variables X : A → 𝒳, where X is the variable, 𝒳 is the codomain and Σ_𝒳 is the σ-algebra on 𝒳. Given a probability measure P on (A, 𝒜), P_X : Σ_𝒳 → [0, 1] is the marginal distribution of X and P_{X|Y} : Σ_𝒳 × 𝒴 → [0, 1] is the distribution of X conditional on Y.

A sequence of random variables (X, Y) is itself a random variable ω ↦ (X(ω), Y(ω)). We denote by * the trivial random variable * : A → {*}, where ({*}, {∅, {*}}) is a one-element set equipped with the indiscrete or “trivial” σ-algebra. Distributions conditioned on the trivial variable are equivalent to marginal distributions: P_{X|*} = P_X.

We denote by Δ(A) the set of all probability measures on (A, 𝒜).

We use the Iverson bracket ⟦condition⟧ for the function that evaluates to 1 if condition is true and 0 if it is false. The Dirac measure δ_x ∈ Δ(𝒳) is the probability measure for which δ_x(A) = ⟦x ∈ A⟧.

A positive integer in square brackets [m] refers to the set {1, …, m}. Two numbers in square brackets [x, y] ⊂ ℝ refer to the interval, unless they are given as indices, in which case [m, n] refers to the set of integers {m, …, n}.

The set {*} is a singleton set containing * with σ-algebra {∅, {*}}. * is a trivial variable that maps some measurable set (implicitly defined by the context) to {*}.

2.2 Decision models

We are interested in modelling decision making rather than prediction. A decision maker makes a choice of one of a set of different options, and different choices lead to different outcomes. This is not the case for someone only interested in prediction, as the outcome is unaffected by the prediction “chosen.”

Choices differ from ordinary random variables. When we begin thinking about a decision problem, we do not know which choice we will make (or else the problem would be trivial). When we finish thinking about it, the choice has been made. Random variables are not like this – we are uncertain about them at the outset, and we remain uncertain when we have constructed a satisfactory probabilistic model. There may be many interesting things to say about the process of making a decision, but we will not say them here. The decision maker has a set of options or choices that they may choose, and we do not speculate about which choice will be made; there is no probability distribution associated with the set of options (for further argument that option probabilities should not contribute to decision making, see [27]).

A decision maker does require, for each of their options, a forecast of the consequences. We model this with a Markov kernel, a function that maps options to probability distributions.

Definition 2.1

(Markov kernel) Given measurable spaces ( E , ℰ ) and ( F , ℱ ), a Markov kernel or stochastic function is a map M : E × ℱ → [0, 1] such that

  • The map M(A | ·) : x ↦ M(A | x) is ℰ-measurable for all A ∈ ℱ.

  • The map M(· | x) : A ↦ M(A | x) is a probability measure on (F, ℱ) for all x ∈ E.

We use an alternative notation for the signature of a Markov kernel to stress the fact that we can consider it a measurable map from a measurable set to a set of probability distributions.

Notation 2.2

(Signature of a Markov kernel) Given measurable spaces (E, ℰ) and (F, ℱ) and a Markov kernel M : E × ℱ → [0, 1], we write M : E ⇝ F, which we read as “M maps from E to probability measures on F.”

A decision model is a generalisation of a probability space. A probability space is a measurable sample space together with a probability measure. A decision model is a measurable option set, a measurable sample space and a Markov kernel that maps options to probability measures on the sample space. We represent the Markov kernel as P, writing P_α for the probability measure it assigns to the option α; the subscript stands for the choice.

Definition 2.3

(Decision model) A decision model is a triple (P, (Ω, ℱ), (C, 𝒞)) where P : C ⇝ Ω is a Markov kernel, (Ω, ℱ) is the sample space and (C, 𝒞) is the set of options.
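As a concrete illustration, the following is a minimal Python sketch (ours, not from the article) of a decision model for the author’s problem: a function playing the role of the Markov kernel P, mapping each option α in a small option set C to a sampling distribution over a toy sample space; all names and numbers are hypothetical.

    # Minimal sketch: a decision model as a Markov kernel from options to
    # distributions over a toy sample space of (genre, sales) pairs.
    import numpy as np

    rng = np.random.default_rng(1)

    def P(option, size=1):
        """Markov kernel: each option in C yields a probability measure P_alpha on the sample space."""
        if option == "write_romance":
            genre = rng.choice(["romance", "sci-fi"], p=[0.9, 0.1], size=size)
        elif option == "write_scifi":
            genre = rng.choice(["romance", "sci-fi"], p=[0.05, 0.95], size=size)
        else:
            raise ValueError(f"unknown option {option!r}")
        sales = np.where(genre == "romance",
                         rng.normal(1200, 100, size),
                         rng.normal(1000, 100, size))
        return list(zip(genre, sales))

    # Each option induces its own distribution; there is no distribution over options.
    print(P("write_romance", size=3))
    print(P("write_scifi", size=3))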

Random variables are measurable functions on the sample space.

Definition 2.4

(Random variable) Given a decision model (P, (Ω, ℱ), (C, 𝒞)), an X-valued random variable is a measurable function X : (Ω, ℱ) → (𝒳, Σ_𝒳).

We use C to refer to the identity function on C , which we call the decision maker’s choice. We can in principle define other choice variables with domain C , but we will not use them here.

2.3 Conditional distributions and conditional independence in decision models

Decision models yield a probability distribution for each option α from the set C. Given a random variable X, we have a marginal distribution P_α^X for each option, and if we have two random variables X and Y, for each α ∈ C, we have a conditional distribution P_α^{Y|X} (at least, we always work in probability spaces where such conditional distributions exist).

We use two notions of conditional independence. The first is conditional independence for each option α ∈ C. We say X ⊥ Y | (Z, C) – read “X is independent of Y given Z and C” – if, for each α ∈ C, X ⊥_{P_α} Y | Z; in words: for every α, relative to P_α, X is independent of Y given Z.

The second notion is independence “of C.” We say X ⊥ (Y, C) | Z if for every α ∈ C, X ⊥_{P_α} Y | Z and furthermore for every α, α′ ∈ C

P_α^{X|Z} = P_{α′}^{X|Z}.

Extended conditional independence is a generalisation that subsumes both of these versions of conditional independence [28].
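To make the distinction concrete, here is a small numerical check in Python (an illustration of ours, with a hypothetical discrete model and Z taken to be trivial): the first notion asks only that each P_α factorise, while the second additionally asks that the marginal of X be the same for every option.

    # Minimal sketch: the two notions of independence from this section, for a
    # decision model with two options and binary X (rows) and Y (columns).
    import numpy as np

    P = {
        "a": np.array([[0.30, 0.30], [0.20, 0.20]]),   # joint P_a(X, Y)
        "b": np.array([[0.10, 0.40], [0.10, 0.40]]),   # joint P_b(X, Y)
    }

    def indep_each_option(P):
        """First notion: under every option, the joint factorises into its marginals."""
        return all(np.allclose(p, np.outer(p.sum(1), p.sum(0))) for p in P.values())

    def indep_of_choice(P):
        """Second notion: additionally, the marginal of X is the same for every option."""
        margs = [p.sum(1) for p in P.values()]
        return indep_each_option(P) and all(np.allclose(margs[0], m) for m in margs)

    print(indep_each_option(P))   # True: X independent of Y under both options
    print(indep_of_choice(P))     # False: P_a(X) = (0.6, 0.4) but P_b(X) = (0.5, 0.5)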

Our notion of conditional independence satisfies the standard properties as long as we insist that C always and only appears on the right-hand side of the conditioning bar:

  1. Symmetry: X ⊥ Y | (Z, C) iff Y ⊥ X | (Z, C).

  2. X ⊥ Y | (Y, C).

  3. Decomposition: X ⊥ (Z, Y) | (W, C) implies X ⊥ Z | (W, C) and X ⊥ Y | (W, C).

  4. Weak union: X ⊥ (Y, Z) | (W, C) implies X ⊥ Y | (Z, W, C).

  5. Contraction: X ⊥ Z | (W, C) and X ⊥ Y | (Z, W, C) implies X ⊥ (Y, Z) | (W, C).

If P_α^{X|Y} = P_{α′}^{X|Y} for all α, α′ ∈ C (i.e. X ⊥ C | Y), we may write P^{X|Y} to emphasise the independence from C.

In a decision model, we say that two random variables are almost surely equal if they are almost surely equal for every α ∈ C. That is, given X, Y,

X = Y almost surely ⟺ ∀α ∈ C : P_α(X ≠ Y) = 0.

We also have a notion of almost sure equality for conditional distributions. Given two conditional distributions P^{X|YH} and P′^{X|YH}, we write P^{X|YH} ≅_{YH} P′^{X|YH} if for all α we can choose versions of P_α^{X|YH} and P′_α^{X|YH} such that P_α^{X|YH}(· | y, h) = P′_α^{X|YH}(· | y, h) for P′_α^{YH}-almost all y ∈ 𝒴, h ∈ ℋ. We will generally just write ≅, leaving the variables implicit. Note that this does not imply P′^{X|YH} ≅_{YH} P^{X|YH} (as P_α^{YH} may put positive probability on some set of P′_α^{YH}-measure 0).

2.4 Directed graphs

We refer to structural causal models in several places. The theory of structural models that we use can all be found in Chapter 1 of [7]. We use the notions of an intervention and of d-separation, as well as elementary properties of directed graphical models like parents.

3 Inferring consequences when observations and consequences share identical responses

Recall the example discussed in Section 1: an author wants to choose the genre of a book they will write. There, we proposed a structural causal model that predicted that the distribution of sales conditional on genre, the author’s historical sales success and global sales trends does not change under intervention on genre. We also raised the question of how the author could know, exactly, if the action they took was an intervention – or close enough to it – to take advantage of this invariance.

Consider a slightly different notion of invariance: instead of assuming that the distribution of sales conditional on genre and the covariates does not change under an intervention on genre, assume it doesn’t change under any of the author’s available actions. In this case, the author can reason as follows: while they don’t know exactly what consequences deciding to write a romance novel (α_romance) will have, they know that under this choice they are more likely to produce a romance novel than under the decision to write a science fiction novel (α_sf). They also know that their choice will not affect their history of book sales, nor recent global trends in sales. Thus, despite their uncertainty over the details of the consequences, choosing to write a romance novel will lead to more sales in expectation (assuming, as we did, that romance novels were observed to sell better in the observed data).

If the author accepts the second assumption, then they can treat all of their actions that control the genre of their book as interventions on genre with respect to the original causal model (not necessarily perfect interventions). We are not suggesting the author should accept this assumption, just that if they did, then they could model their actions as interventions. The work in this section serves as a foundation for the more plausible justification we present in Section 4.

In the language of decision models: the author has a decision model (P, (C, 𝒞), (Ω, ℱ)) together with a sequence of random variables (S_i, G_i, R_i)_{i ∈ [m] ∪ {q}}, where the indices [m] refer to the books observed in the dataset so far, and the special index q refers to the book the author is hoping to write. The assumption we are discussing is that there is some unknown stochastic function, which we will call H, such that for all i, j

P^{S_i | G_i R_i H} ≅_{H, G_i R_i} P^{S_j | G_j R_j H}.

We call H a “response function.” We want this to represent the response of every S_i to the corresponding G_i and R_i given infinite data – this property is codified by the assumption S_i ⊥ (S_{−i}, R_{−i}, G_{−i}) | (G_i, R_i, H, C) (here, −i denotes the index set ([m] ∪ {q}) ∖ {i}).

Thus, the notion of a “fixed response function” has two sub-assumptions: first, conditional on H, the response of every S_i to (G_i, R_i) is identical regardless of i and, secondly, conditional on H again, S_i is independent of the other triples (S_j, G_j, R_j), j ≠ i. Together, these two assumptions constitute the assumption of conditionally independent and identical responses (CIIR for short).
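The following Python simulation (our own illustration; the response probabilities and sample sizes are arbitrary) shows what a decision model satisfying CIIR looks like: a single unknown response H is drawn once, and every output – whether it belongs to the historical data or to the book the author will write – responds to its input through that same H, independently of everything else.

    # Minimal sketch of CIIR: one hidden response function H, shared by the
    # observed data and by the pair whose input the decision maker controls.
    import numpy as np

    rng = np.random.default_rng(2)

    def sample_response():
        """Draw the unknown response H: here, H[d] = P(Y = 1 | D = d) for d in {0, 1}."""
        return rng.uniform(size=2)

    def run(option, n_obs=5, H=None):
        """Observed (D, Y) pairs plus one 'query' pair whose input is set by the option."""
        H = sample_response() if H is None else H
        D_obs = rng.integers(0, 2, n_obs)          # inputs appearing in the historical data
        Y_obs = rng.binomial(1, H[D_obs])          # every response goes through the same H
        D_q = 1 if option == "act_1" else 0        # the input brought about by the choice
        Y_q = rng.binomial(1, H[D_q])              # ... and so does the consequence
        return list(zip(D_obs.tolist(), Y_obs.tolist())), (D_q, int(Y_q))

    print(run("act_1"))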

Is it ever reasonable to assume CIIR? It might be for systems deliberately engineered for regularity. A switch reliably turns on a light if you flick it, and a function in a piece of code reliably returns the same result given the same input. Book sales or human health are not examples of systems like this, however. Maybe in principle many non-engineered systems exhibit regular responses, but we generally don’t know if any particular collection of variables will do so.

Instead of appealing to our prior knowledge of mechanisms as we do in the case of engineered systems, we could try to appeal to knowledge of symmetries of the problem. The inspiration for this approach comes from De Finetti’s work on Bayesian probabilistic inference. De Finetti observed that many statistical models assumed a sequence of independent and identically distributed random variables conditional on an unknown “true parameter” (which we could call conditionally independent and identically distributed or CIID). He was unsatisfied with the notion of “true parameters” and offered an alternative way to analyse these models via symmetry. If a prediction problem is not changed in any important way by permuting the measurements, then it seems reasonable to adopt a probability model that is unchanged under permutation of variables. De Finetti showed that the class of probability models with this symmetry (called exchangeability) is equivalent to the class of CIID models [13].
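For a concrete handle on the exchangeability side of De Finetti’s theorem, the following short calculation (a sketch of ours, using a Beta mixing distribution with arbitrary parameters) shows that the probability a CIID model assigns to a binary sequence depends only on the number of ones, so every permutation of the sequence receives the same probability.

    # Minimal sketch: a Beta-Bernoulli (CIID) sequence is exchangeable, since the
    # probability of a binary sequence depends only on how many ones it contains.
    from scipy.special import beta

    def seq_prob(seq, a=2.0, b=3.0):
        k, n = sum(seq), len(seq)
        # integral of theta^k (1 - theta)^(n - k) against a Beta(a, b) prior:
        # B(a + k, b + n - k) / B(a, b)
        return beta(a + k, b + n - k) / beta(a, b)

    print(seq_prob([1, 1, 0, 0]))   # equal to ...
    print(seq_prob([0, 1, 0, 1]))   # ... this permutation of the same sequence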

Here, we ask: is there an analogous indifference over permutations in decision models that yields the CIIR assumption? Formally, the answer is yes: the class of CIIR decision models is equivalent to decision models with a symmetry we call input–output contractibility (or IO contractibility). However, IO contractibility is a less intuitive and less appealing assumption than exchangeability. The main practical upshot of this section is an argument against assuming CIIR in many situations. IO contractibility implies that, after having seen infinite data, any further input–output pairs can be exchanged. Setting aside the infinite data requirement, if the author assumes CIIR for a decision model that includes both a convenient historical dataset of book sales and the sales of their own books, they accept that these two problems are identical:

  • Write a large number of books of various genres themselves, observe their sales, and predict the sales of one more book they write of a given genre.

  • Observe the sales, genres and author averages of the same number of books from the convenient historical dataset, and use this to predict the sales of one book they write of a given genre.

However, there are plenty of good reasons to think that the past sales of the author’s own books, written under similar conditions, will be much more similar to the sales of their future books than the sales of books by an arbitrary collection of third parties under unknown conditions. Even if it might turn out that conditioning on average sales of each author is enough to make the second dataset as predictive as the first, this possibility does not justify assuming it is so.


3.1 Conditionally independent and identical responses

We now turn to the formal treatment of the CIIR assumption and its equivalence to IO contractibility. First, we define sequential input–output models as a shorthand for decision models that feature a sequence of random variable pairs.

Definition 3.1

(Sequential input–output model) A decision model (P, (C, 𝒞), (Ω, ℱ)) and two sequences of variables Y ≔ (Y_i)_{i∈ℕ} and D ≔ (D_i)_{i∈ℕ} with corresponding indices is a sequential input–output model, which we specify with the shorthand (P, D, Y). By convention, we say that the D_i are inputs and the Y_i are outputs.

In general, the relationship between the decision maker’s choice and the behaviour of inputs D i can be arbitrary, but this work is mainly useful when the decision maker has some prior knowledge about how to control inputs.

Sequential input–output pairs (D_i, Y_i)_{i∈ℕ} share independent and identical responses conditional on V if, conditioning on V, every output Y_i “responds to” D_i according to the same stochastic function.

Definition 3.2

(Conditionally independent and identical responses) Given a sequential input–output model (P, D, Y) along with some random variable V, the (D_i, Y_i)_{i∈ℕ} pairs are related by independent and identical responses conditional on V if for all i, Y_i ⊥ (D_{[1,i)}, Y_{[1,i)}) | (D_i, V, C) and P^{Y_i | D_i V} ≅_{D_i V} P^{Y_j | D_j V} for all i, j ∈ ℕ.

This is a general form of the CIIR assumption that only requires the outputs Y_i to be independent of previous inputs and outputs conditional on V and D_i. If we suppose that the variable indices match the time-ordering of variables, it’s plausible that an input D_i may be chosen based on previous data (e.g. in our example problem, some prior author in the dataset might have chosen the genre of their book based on their previous observations). Thus, there may be relationships between D_j and Y_i for j > i even after conditioning on D_i and V. Here, we will add an additional assumption called weak data-independence, which means that conditional on the unknown parameter V and past inputs D_{[1,i]}, Y_i is also independent of all future inputs. Generalising our results to data-dependent inputs is an open question.

Definition 3.3

(Weakly data-independent) A sequential input–output model (P, D, Y) with independent and identical responses conditional on V is weakly data-independent if Y_i ⊥ D_{−i} | (D_i, V, C).

3.2 Symmetries of sequential conditional probabilities

Given the previously mentioned sequences D and Y, the decision maker has for each option α ∈ C a conditional probability P_α^{Y|D} (note the absence of V). If V is not conditioned on, then even under the CIIR assumption there is dependence between different elements of the sequence of pairs. In other words, it is not the case that Y_i ⊥ (Y_{−i}, D_{−i}) | D_i; intuitively, this is because observing (Y_i, D_i) will allow the decision maker to learn something about the distribution of Y_j given D_j.

We wish to express the assumption that, setting aside the fact that we learn more about these pairs as we observe more data, as far as the decision maker is concerned they are equivalent in terms of behaviour. Following the example of exchangeability, one possible way to express this is that swapping pairs makes no difference to the model – under this assumption, P_α^{Y_i Y_j | D_i D_j} is the same as P_α^{Y_j Y_i | D_j D_i}. More generally, given any permutation ρ : ℕ → ℕ, define Y_ρ ≔ (Y_{ρ(i)})_{i∈ℕ} and D_ρ similarly. Then we could propose a symmetry such that for all α, ρ

P_α^{Y|D} ≅_D P_α^{Y_ρ | D_ρ}.

This assumption is stronger than necessary. Even if the (D_i, Y_i) pairs share the same input–output behaviour given perfect knowledge, we may learn more about this behaviour from observing some D_j than another D_i, violating this symmetry. Example 3.4 shows this in more detail.

Example 3.4

Suppose there is a machine with two arms D = {0, 1}, one of which always pays out $100 and the other of which pays out nothing. A decision maker (DM) doesn’t know which is which, but the DM watches exactly two people operate the machine once each and does not observe the payouts – only the choices that each person makes. The first person in the sequence knows exactly which arm is good, and the second one has no idea. The first person will always pull the good arm, while the second person will pull the good arm 50% of the time. The response H takes values that can be summarised as “arm 0 pays out” and “arm 1 pays out” (which we’ll just refer to as {h_0, h_1}), and the DM assigns 50% probability to each possibility initially. Then for any α,

P_α^{Y_2 | D_2 D_1}(100 | 1, 0) = P_α^{Y_2 | D_2 H}(100 | 1, h_0) P_α^{H | D_2 D_1}(h_0 | 1, 0) + P_α^{Y_2 | D_2 H}(100 | 1, h_1) P_α^{H | D_2 D_1}(h_1 | 1, 0) = 0 · 1 + 1 · 0 = 0

because D_1 = 0 implies h_0, while

P_α^{Y_1 | D_1 D_2}(100 | 1, 0) = P_α^{Y_1 | D_1 H}(100 | 1, h_0) P_α^{H | D_1 D_2}(h_0 | 1, 0) + P_α^{Y_1 | D_1 H}(100 | 1, h_1) P_α^{H | D_1 D_2}(h_1 | 1, 0) = 0 · 0 + 1 · 1 = 1 ≠ P_α^{Y_2 | D_2 D_1}(100 | 1, 0)

because D_1 = 1 implies h_1.

From the point of view of the DM, the good arm always turns out to be the one that the first person picks, no matter what they pick – and only the arm the first person picks. The first and second person’s choices are not interchangeable.

This model only requires that the first person’s choice resolve the decision maker’s uncertainty about the payout function H that they will face. It is consistent with the first person somehow causing their chosen arm to pay out when they make their choice (provided this arm also continues to pay out for subsequent people), or with this person simply knowing which arm pays out and choosing it accordingly.
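The arithmetic in Example 3.4 can be checked by brute-force enumeration. The following Python sketch (ours; the encoding of arms and people is only for illustration) enumerates the DM’s joint model over the hidden response and the two observed choices and recovers the two conditional payout probabilities above.

    # Minimal sketch: Example 3.4 by enumeration. Person 1 knows the good arm and
    # always pulls it; person 2 pulls an arm uniformly at random.
    def joint():
        """Yield (h, d1, d2, probability) under the DM's model."""
        for h in (0, 1):                 # which arm pays out; prior 1/2 each
            d1 = h                       # person 1 always pulls the good arm
            for d2 in (0, 1):            # person 2 pulls either arm with probability 1/2
                yield h, d1, d2, 0.5 * 0.5

    def cond_payout(person, pulls, other_pulls):
        """P(Y_person = 100 | D_person = pulls, D_other = other_pulls)."""
        other = 2 if person == 1 else 1
        num = den = 0.0
        for h, d1, d2, p in joint():
            d = {1: d1, 2: d2}
            if d[person] == pulls and d[other] == other_pulls:
                den += p
                num += p * (d[person] == h)   # pays out iff the pulled arm is the good one
        return num / den

    print(cond_payout(person=2, pulls=1, other_pulls=0))  # 0.0: D_1 = 0 reveals h_0
    print(cond_payout(person=1, pulls=1, other_pulls=0))  # 1.0: D_1 = 1 reveals h_1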

Example 3.4 motivates the weaker symmetry we call exchange commutativity. The key difference is that exchange commutativity allows for the permutation of pairs after conditioning on some variable W. That is, a sequential input–output model (P, D, Y) is exchange commutative if there is some variable W such that the conditional P_α^{Y|WD} is symmetric to swaps of input and output pairs. Intuitively, conditioning on W “screens off” anything that may be learned from observing the inputs.

Definition 3.5

(Exchange commutativity) Given a sequential input–output model (P, D, Y) along with some W : Ω → 𝒲, we say (P, D, Y) commutes with exchange over W if for all finite permutations ρ : ℕ → ℕ and all α ∈ C

P_α^{Y|WD} = P_α^{Y_ρ | W D_ρ}.

We require an additional regularity assumption, which we call locality. We’re going to state the assumption first, then give an example (involving inflation) to illustrate why this assumption is needed. Intuitively, locality says something like “Y_i doesn’t depend on D_j for j ≠ i” (that is, the effects of each input are local) – though it’s worth bearing in mind that this intuitive interpretation is not a perfect translation of the assumption of locality (see Appendix B.6).

As Example 3.4 suggests, locality cannot be the assumption that Y i doesn’t depend on D j unconditionally; D j could, after all, offer some evidence about the state of the unknown parameter V . As with exchange commutativity, we handle this possibility by making locality the assumption that Y i doesn’t depend on any non-corresponding D j after conditioning on some auxiliary W .

Definition 3.6

(Locality) Given a sequential input–output model (P, D, Y) along with some W : Ω → 𝒲, the model is local over W if for all α ∈ C, i ∈ ℕ, Y_{[i]} ⊥ D_{[i,∞)} | (W, D_{[i]}, C).

If an input–output model is both exchange commutative and local with respect to the same W , then we say it is input–output contractible. This term is chosen because such a model is unchanged by contractions of the input and output indices – see Theorem 3.8.

Definition 3.7

(Input–output contractibility) A sequential input–output model (P, D, Y) along with some W : Ω → 𝒲 is input–output contractible (IO contractible) over W if it is both local and commutes with exchange over the same W.

Theorem 3.8

(Equality of equally sized conditionals) Given a sequential input–output model (P, D, Y) and some W, P_α^{Y|WD} is IO contractible over W if and only if for all subsequences A, B ⊂ ℕ (not necessarily finite) with |A| = |B| and every α

P_α^{Y_A | W, D_A, D_{ℕ∖A}} = P_α^{Y_B | W, D_B, D_{ℕ∖B}}.

Proof

Appendix B.1.□

Appendix B.2 explores two additional properties of these two symmetries. Example B.5 shows that neither locality nor exchange commutativity is implied by the other. Example B.6 shows that locality by itself does not rule out everything that we might intuitively describe as “interference” between pairs.

We might wonder if both locality and exchange commutativity are needed, seeing as exchange commutativity by itself looks like a generalisation of exchangeability – in fact, if we take the inputs to be trivial, then it coincides precisely with exchangeability. The reason why locality is also needed is that, for nontrivial inputs, we can construct exchange commutative models where the response function depends on a symmetric function of the full set of inputs D_i. An example of this possibility is a crude model of inflation: if you give any one person $100, they’ll be $100 richer in real terms, but if you give everyone $100 you cause inflation and the net effect on the wealth of all of the recipients now depends on their initial wealth. Giving everyone more money would cause more inflation, and giving everyone less money would result in less. That is, the impact of this action on someone’s wealth Y_i doesn’t just depend on D_i, but on the entire sequence D = (D_i)_{i∈[n]}.
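A toy version of this inflation model can be written down directly; the following Python sketch (ours, with entirely arbitrary numbers) makes the failure of locality visible: the real wealth of person 0 depends on the whole transfer sequence, not just on their own transfer.

    # Minimal sketch: a crude inflation model in which Y_i depends on the entire
    # input sequence D, so the effects of inputs are not local.
    import numpy as np

    def real_wealth(initial, transfers, money_supply=1_000_000.0):
        """Real wealth after transfers; prices scale with the total money in circulation."""
        transfers = np.asarray(transfers, dtype=float)
        price_level = (money_supply + transfers.sum()) / money_supply
        return (np.asarray(initial, dtype=float) + transfers) / price_level

    n = 10_000
    initial = np.full(n, 100.0)
    only_one = np.zeros(n)
    only_one[0] = 100.0                       # give only person 0 an extra $100
    everyone = np.full(n, 100.0)              # give everyone an extra $100

    print(real_wealth(initial, only_one)[0])  # ~200: person 0 is about $100 richer
    print(real_wealth(initial, everyone)[0])  # 100: inflation cancels the transfer here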

3.3 Representation of IO contractible models

In this section, we state Theorem 3.17: a sequential input–output model (P, D, Y) features pairs (D_i, Y_i) related by conditionally independent and identical responses if and only if it is IO contractible over some variable W.

The proof of the theorem can be found in its entirety in Appendix B.3. There we employ a string diagram notation in some steps of the proof, itself explained in Appendix A. Here, we introduce enough to explain the theorem statement.

3.4 Preliminaries

Definition 3.9

(Input count variable) Given a sequential input–output model (P, D, Y) with countable D, #_{D=j}^k is the variable

#_{D=j}^k ≔ ∑_{i=1}^{k−1} ⟦D_i = j⟧.

That is, #_{D=j}^k is equal to the number of times D_i = j over all i < k.

If we have an infinite sequence of pairs (D_i, Y_i), we can wrap the sequence Y into a table Y^D such that Y^D_{11} is equal to the value of the first Y_i such that D_i = 1, Y^D_{21} is equal to the value of the second such Y_i and so forth. We call it a “tabulated conditional” because, under the assumption of CIIRs, we can evaluate a conditional P_α^{Y|D}(· | d_1, d_2, …) by “looking up” the marginal distribution P_α^{Y^D_{1 d_1} Y^D_{2 d_2} ⋯} over the appropriate elements of Y^D.

Definition 3.10

(Tabulated conditional distribution) Given a sequential input–output model (P, D, Y) on (Ω, ℱ), define the tabulated conditional distribution Y^D : Ω → 𝒴^{ℕ×D} by

Y^D_{ij} = ∑_{k=1}^{∞} ⟦#_{D=j}^k = i − 1⟧ ⟦D_k = j⟧ Y_k.

That is, the (i, j)th coordinate of Y^D is equal to the value of Y_k for which the corresponding D_k is the ith instance of the value j in the sequence (D_1, D_2, …), or 0 if there are fewer than i instances of j in this sequence.
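The construction of the table is easy to see in code. The following Python sketch (ours, on a short made-up sequence) builds the columns of Y^D by collecting, for each input value j, the outputs at the successive positions where that value occurs.

    # Minimal sketch: building the tabulated conditional Y^D from a finite prefix
    # of the sequence (D_i, Y_i). Column j lists outputs at successive occurrences of j.
    from collections import defaultdict

    def tabulate(D, Y):
        """Return table[j] = [Y_k for the 1st, 2nd, ... occurrence of D_k == j]."""
        table = defaultdict(list)
        for d, y in zip(D, Y):
            table[d].append(y)     # the i-th entry of column j is Y^D_{ij}
        return dict(table)

    D = [1, 0, 1, 1, 0, 0, 1]
    Y = [5, 2, 7, 6, 3, 1, 9]
    print(tabulate(D, Y))          # {1: [5, 7, 6, 9], 0: [2, 3, 1]}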

The directing random measure of an infinite sequence of exchangeable variables (X_i)_{i∈ℕ} is the probability measure that maps events A in the single-variable σ-algebra to the limit of normalised partial sums of indicator functions over the set A [29].

Definition 3.11

(Directing random measure) Given a decision model (P, Ω, C) and a sequence X ≔ (X_i)_{i∈ℕ}, the directing random measure of X, written J : Ω → Δ(𝒳), is the function

J(ω) : A ↦ lim_{n→∞} (1/n) ∑_{i=1}^{n} 1_A(X_i(ω)),  A measurable in 𝒳,

where each X_i takes values in 𝒳. Note that J(ω) is only well defined in the case that the given limit exists for all α ∈ C, and we are only interested in the cases where the limit exists.

Given input and output sequences D and Y, we define the directing random conditional as the directing random measure of the tabulated conditional Y^D interpreted as a sequence of column vectors ((Y^D_{1j})_{j∈D}, (Y^D_{2j})_{j∈D}, …).

Definition 3.12

(Directing random conditional) Given a sequential input–output model (P, D, Y), we will say the directing random conditional H : Ω → Δ(𝒴^D) is the function

H : ⨉_{j∈D} A_j ↦ lim_{n→∞} (1/n) ∑_{i=1}^{n} ∏_{j∈D} 1_{A_j}(Y^D_{ij}).
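Concretely, H can be approximated by the normalised partial sums in this definition. The Python sketch below (ours; the hidden response and sample size are arbitrary) simulates a long sequence with a fixed response and recovers the column-wise response probabilities as the empirical limits.

    # Minimal sketch: approximating the directing random conditional H by the
    # normalised partial sums over the columns of the tabulated conditional Y^D.
    import numpy as np

    rng = np.random.default_rng(3)
    n = 100_000
    true_H = {0: 0.2, 1: 0.7}                      # hidden response: P(Y = 1 | D = d)
    D = rng.integers(0, 2, n)
    Y = rng.binomial(1, np.where(D == 1, true_H[1], true_H[0]))

    # Empirical means of each column of Y^D converge to H(d) as n grows.
    H_hat = {d: float(Y[D == d].mean()) for d in (0, 1)}
    print(H_hat)                                   # approximately {0: 0.2, 1: 0.7}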

A finite permutation within columns is a function that independently permutes a finite number of elements in each column of a table. A special case of such a function is a permutation of rows that swaps entire rows; this is a permutation within columns that applies the same permutation to each column.

Definition 3.13

(Permutation within columns) Given the set of indices (i, j), i ∈ ℕ, j ∈ D, a finite permutation within columns is a function η : ℕ × D → ℕ × D such that for each j ∈ D, η_j ≔ η(·, j) is a finite permutation ℕ → ℕ and η(i, j) = (η_j(i), j).

Lemma 3.15 shows that an IO contractible conditional distribution can be represented as the product of a probability distribution symmetric to permutations of rows and a “lookup function” or “switch.” A lookup function is also used in the representation of potential outcomes models [17], but we do not assume that the tabulated conditional Y^D is interpretable as a table of potential outcomes. By representing a conditional probability as an exchangeable regular probability distribution, we can apply De Finetti’s theorem, a key step in proving the main result of Theorem 3.17.

To prove Lemma 3.15, we assume that the set of input sequences in which each possible value appears infinitely often has measure 1 for every option in C. Without this assumption, we would have to accept positive probability that we run out of D_i taking some value j ∈ D, preventing us from filling out the “tabulated conditional” Y^D correctly. We call this side condition almost surely infinite (as each element of D appears an infinite number of times almost surely).

Definition 3.14

(Almost surely infinite) Given a sequential input–output model (P, D, Y) with D countable, let E ⊂ D^ℕ be the set of all sequences x such that for all j ∈ D

∑_{i=0}^{∞} ⟦x_i = j⟧ = ∞.

If P_α^D(E) = 1 for all α, then we say D is almost surely infinite.

Note that for any W and almost all w ∈ 𝒲, P_α^{D|W}(E | w) = 1.

The key property of the tabulated conditional is that we can evaluate the regular conditional P_α^{Y|WD} by “looking up” the appropriate marginal of P_α^{Y^D}.

Lemma 3.15

Suppose a sequential input–output model (P, D, Y) is given with D countable and D almost surely infinite. Then for some W, α, P_α^{Y|WD} is IO contractible if and only if

(2) P_α^{Y|WD}(⨉_{i∈ℕ} A_i | w, (d_i)_{i∈ℕ}) = P_α^{(Y^D_{i,d_i})_{i∈ℕ} | W}(⨉_{i∈ℕ} A_i | w)  for all measurable A_i ⊆ 𝒴, w ∈ 𝒲, d_i ∈ D,

and for any finite permutation within columns η : ℕ × D → ℕ × D

(3) P_α^{(Y^D_{ij})_{ℕ×D} | W} = P_α^{(Y^D_{η(i,j)})_{ℕ×D} | W}

Proof

Sketch only.

Only if: We define a random invertible function R : Ω × ℕ → ℕ × D that reorders the indices so that, for i ∈ ℕ, j ∈ D, D_{R^{−1}(i,j)} = j almost surely. We then use IO contractibility to show that the distribution under this reordering is unchanged, and the reordering gives us equation (2). Applying IO contractibility again yields equation (3).

If: We construct a conditional probability satisfying equations (2) and (3) and verify that it satisfies IO contractibility.

The full proof can be found in Appendix B.3. Note that the proof uses string diagram notation explained in Appendix A.□

Because the distribution P_α^{Y^D | W} from Lemma 3.15 is row-exchangeable, the limit in the definition of the directing random conditional H exists almost surely (see Lemma B.13). In fact, we do not need the full sequence of pairs (D, Y) to calculate H; any subsequence A ⊂ ℕ that satisfies the condition that D_A is almost surely infinite is sufficient.

Theorem 3.16

Suppose a sequential input–output model (P, D, Y) is given with D countable, D almost surely infinite and, for some W, P_α^{Y|WD} IO contractible for all α. Consider an infinite set A ⊂ ℕ, and let D_A ≔ (D_i)_{i∈A} and Y_A ≔ (Y_i)_{i∈A} be such that D_A is also almost surely infinite. Then H_A, the directing random conditional of (P, D_A, Y_A), is almost surely equal to H, the directing random conditional of (P, D, Y).

Proof

The strategy we pursue is to show that an arbitrary subsequence of (D_i, Y_i) pairs induces a random contraction of the rows of Y^D. Then we show that the contracted version of Y^D has the same distribution as the original, and consequently, the normalised partial sums converge to the same limit.

The proof is in Appendix B.3.□

3.5 Statement of the representation theorem

We are now ready to state the main result of this section, Theorem 3.17. Assuming a weakly data-independent model (P, D, Y) (Definitions 3.1, 3.3) with inputs D almost surely infinite (Definition 3.14), (P, D, Y) is IO contractible over some W if and only if the pairs (D_i, Y_i) share conditionally independent and identical responses (Definition 3.2).

Theorem 3.17

(Representation of IO contractible models) Suppose a weakly data-independent sequential input–output model (P, D, Y) with sample space (Ω, ℱ) is given with D countable and D almost surely infinite. Then the following are equivalent:

  1. There is some W such that for all α, P_α^{Y|WD} is IO contractible.

  2. For all i, Y_i ⊥ (Y_{−i}, D_{−i}, C) | (H, D_i) and for all i, j

     P^{Y_i | H D_i} ≅_{H, D_i} P^{Y_j | H D_j}.

  3. There is some L : 𝒱 × D ⇝ Y such that

     P^{Y | D V}(⨉_{i∈ℕ} A_i | d, v) = ∏_{i∈ℕ} L(A_i | d_i, v).

Furthermore, if any of these conditions hold, then the first and third also hold substituting the directing random conditional H (Definition 3.12) for W or V , respectively.

Proof

(1) ⇒ (3): We apply Lemma 3.15 followed by Lemma B.13 followed by Lemma B.15.

(3) ⇒ (2): We verify that the required conditional independences hold assuming (3).

(2) ⇒ (1): We show that, assuming (2), P_α^{Y|WD} is IO contractible over W for all α.

See Appendix B.4 for the full proof. Note that the proof uses string diagram notation explained in Appendix A.□

The presence of the W in Theorem 3.17 is a nuisance that makes it hard to evaluate the assumptions – it’s not enough to consider unconditional locality or exchange commutativity; we need to know whether they hold after conditioning on some variable. The problem is that observing some inputs (but not others) could, in principle, tell us a lot about the response H. However, this concern is defused if we observe enough (D_i, Y_i) pairs to precisely characterise the response, so a necessary condition of IO contractibility is the exchangeability of infinite subsequences of (D, Y). This is the subject of Theorem 3.18.

Theorem 3.18

A data-independent sequential input–output model (P, D, Y) with directing random conditional H and D almost surely infinite features conditionally independent and identical responses P_α^{Y_i | D_i H} only if for any sets A, B ⊂ ℕ such that D_A and D_B are also almost surely infinite, and any i, j ∈ ℕ such that i ∉ A, j ∉ B,

P_α^{Y_i | D_i Y_A D_A} = P_α^{Y_j | D_j Y_B D_B}.

If in addition each P_α^{YD} is dominated by some exchangeable Q_α^{YD}, then the reverse implication also holds.

Proof

See Appendix B.5.□

3.6 Does IO contractibility help us infer consequences?

One of the key contributions of De Finetti’s representation theorem was to provide an alternative justification for the common modelling assumption that a sequence of variables were all distributed according to a shared but unknown “true distribution.” De Finetti regarded the notion of an “unknown true distribution” as nonsensical, and through his representation theorem suggested that we could instead justify this structure by arguing that the experiment that produced the sequence of variables was, from the point of view of the analyst seeking to make predictions, invariant to reindexing the variables in the sequence.

Can IO contractibility help to justify common causal assumptions in a similar way? The answer generally seems negative: Theorem 3.18 tells us IO contractibility implies regarding “experimental” and “observational” data as interchangeable provided we have enough of each, but we normally wouldn’t consider datasets collected under meaningfully different conditions to be interchangeable.

Suppose our author has written 1,000 books of their own, and let ℕ ∖ [1000] be the infinite set of passive sales observations, while ℕ ∖ {1} is the set of all passive observations plus 999 observations of “consequences” (sales of the author’s own books). Theorem 3.18 states that, given the assumption of conditionally independent and identical responses,

P_α^{S_1 G_1 R_1 | (SGR)_{ℕ∖[1000]}} = P_α^{S_1 G_1 R_1 | (SGR)_{ℕ∖{1}}}.

This is the basis for our claim at the start of this section that, under the CIIR assumption, the author is obliged to ignore any data related to the consequences of their own actions if they are given enough passive observations to start with. If we loosely associate the observed consequences of actions with experimental data, we can note that in practice, when both experimental and observational data are available, they are not assumed to be interchangeable in this sense – in fact, the question of how well the observational data predicts experimental outputs is one of substantial interest [30–32].

To cut a long story short, CIIR is not a compelling assumption for inferring consequences from data. The question arises: what else can we do?

4 Inferring consequences when choices have precedent

Given a convenient dataset of passive observations and a desire to predict the consequences of one’s actions, it is generally unreasonable to treat (sufficiently large) sequences of passive observations as interchangeable with sequences of direct observations of the consequences of actions. A decision maker ought to give some weight to the possibility that in the long run these sequences exhibit different patterns and, equivalently, they should not accept the CIIR assumption. How can a decision maker express the idea that the consequences of their actions are in some sense like previous observations of a similar system, without making an overly strong assumption like CIIR? A common move is to invoke unobserved variables: assume that there is a consistent input–output relationship provided all of the inputs are observed, but it is not known what all of the inputs are. We make the same assumption here.

By itself, this assumption can be trivial. If we have inputs ( E i , X i ) and outputs Y i , where E i is unobserved, we can construct a model where the distribution of E i after an action is taken puts all its mass on outcomes with no support in the observed data. In this case, the relationship between the observed data and the consequences of actions may be arbitrary. Perfect interventions are one way to add structure to this assumption; under a perfect intervention on X i , the marginal distribution of X i depends on the intervention while the marginal distribution of E i is unchanged and X i and E i become independent.

We explore an alternative constraint: the distribution of ( E i , X i ) after taking action must be absolutely continuous with respect to the distribution of ( E i , X i ) in the observed data. We can view this as the assumption that IO contractibility holds for the consequences of actions together with a random subset of the observed data – but not necessarily with respect to any known subset. We can justify it like this: the observed outcomes Y arise from a hidden but repeatable process where the same inputs – which may arise from someone’s actions or from background context – lead to the same distribution of outputs regardless of anything else the decision maker might consider. Furthermore, any distribution over inputs that the decision maker can bring about with their actions is dominated by the distribution of inputs in the training data. That is, anything the decision maker can do, under whatever circumstances they might do it, has been done before. If a decision model satisfies these constraints, we say that the decision maker’s actions have precedent.

In order to derive a useful inference rule, we require another assumption we call absolute continuity of conditionals. As a technical assumption it is quite opaque, but we can gain some intuition for it via the informal principle of independent causal mechanisms. We save the details for Section 4.2, but briefly: assuming a particular form of the principle of independent causal mechanisms, the assumption of absolute continuity of conditionals can be justified by assumptions about the directions of causal relationships between certain variables. Precedent and absolute continuity of conditionals together yield Theorem 4.7, which allows us to infer that observed pairs of variables exhibit conditionally independent and identical responses from conditional independence.

The following example illustrates the general idea we discuss here.

Example 4.1

Suppose a decision maker collects data about a group of people who have variously engaged the services of dieticians, sporting coaches, general practitioners, bariatric surgeons and none of the above, with practitioner choice recorded by the variable Z i . The decision maker has also collected data on each person’s body mass index X i at the beginning of the study and followed mortality outcomes Y i over a considerable period of time. The decision maker is reviewing these passive data, and in particular is wondering whether steps they take to manage their own weight X c are likely to improve their own mortality prospects Y c .

Our decision maker presumes that each group of people Z i has, in aggregate, different strategies for pursuing weight management and different contextual reasons for doing so, though direct observations of these facts are unavailable. Because of this variation, the decision maker reasons, people in these different groups with the same levels of body mass index should see different mortality results if, conditional on body mass index, the different circumstances and management strategies actually lead to different results. Conversely, if there is no variation in results for these different groups of people, then it would appear that, at least with regard to mortality, the eventual body mass index achieved is apparently the only important feature of any management plan.

This inference might fail if, for any reason, treatment plans were selected in a way that masks the variation in their effects. For example, if all groups of people overwhelmingly choose to pursue diet changes in the end, then their results will not reveal any variation in mortality outcomes due to different treatment strategies. Alternatively, it might be the case that unobserved variables are delicately balanced just so that the conditional independence holds. For example, perhaps, holding final BMI equal, diet interventions tend to produce better health than bariatric surgery, but visitors to the surgeon are just enough healthier than visitors to dieticians that the final outcomes for both groups are identical.
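To make the informal deduction concrete, the following Python sketch (ours, not the authors’; the variable names, distributions and effect sizes are hypothetical assumptions) simulates a population in which an unobserved strategy variable E differs across practitioner groups Z and shifts the achieved body mass index X. When E has no direct effect on mortality Y, the estimated conditionals P(Y | X, Z) agree across groups; when it does, they generically differ.

import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def simulate(direct_effect_of_e_on_y):
    # Z: practitioner group, E: unobserved strategy/context, X: achieved BMI category,
    # Y: survival indicator. All hypothetical.
    z = rng.integers(0, 2, n)
    e = rng.binomial(1, 0.2 + 0.6 * z)              # groups differ in unobserved strategy
    x = rng.binomial(1, 0.7 - 0.4 * e)              # strategy shifts achieved BMI
    p_y = 0.5 + 0.3 * (1 - x) + direct_effect_of_e_on_y * e
    y = rng.binomial(1, np.clip(p_y, 0, 1))
    return z, x, y

for effect in (0.0, 0.2):
    z, x, y = simulate(effect)
    print(f"direct effect of E on Y = {effect}")
    for xv in (0, 1):
        for zv in (0, 1):
            sel = (x == xv) & (z == zv)
            print(f"  P(Y=1 | X={xv}, Z={zv}) is approximately {y[sel].mean():.3f}")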

4.1 Passive data, consequences, and precedent

To simplify the presentation, we will consider a specific kind of decision model featuring a long sequence of passive observations indexed by natural numbers. Passive observations are completely unresponsive to the decision maker’s choice. This is augmented with one more random variable representing the consequences of acting, indexed by the special character c and which is generally responsive to the decision maker’s choice. That is, we have ( X i ) i N unresponsive to the decision maker and ( X c ) responsive to the decision maker.

Definition 4.2

(Passive data and consequences) Given a decision model ( P , Ω , C ) , we will adopt a convention that variables indexed with i N are “passive data” (i.e. not under the decision maker’s control) while variables indexed with “ c ” are “consequences” (i.e. under the decision maker’s control). We write an infinite sequence of variables, exactly one of which is under the decision maker’s control, as ( X i ) i N { c } .

We will also deal with models with the following standard structure: ( X , Y , Z ) are an observed sequence of triples and E is an unobserved variable where the passive data ( Z i , E i , X i , Y i ) i N are exchangeable. We also assume that the pairs ( ( E i , X i ) , Y i ) i N { c } share conditionally independent and identical responses for all indices (including consequences). In general, we do not need a variable Z c to be defined. We’ll call this structure a latent CIIR model.

Definition 4.3

(Latent CIIR model) A latent CIIR model is a model ( P , ( E i , Z i , X i , Y i ) i N { c } ) such that the passive data ( Z i , E i , X i , Y i ) i N are exchangeable and the pairs ( ( E i , X i ) , Y i ) i N { c } also share conditionally independent and identical responses, with inputs ( E i , X i ) and outputs Y i . We say the E i s are “latent” variables, which informally means that we typically do not get to observe them.

We can take any model ( P , X N { c } ) with exchangeable observations and turn it into a latent CIIR model by setting Z i = * and E i = ( X i , Y i ) . This trivial construction typically isn’t very helpful, though. As we have mentioned, we are particularly interested in models with precedent, where “things we can do have been done before.” That is, any setting of the unobserved state E c with positive probability as a result of taking action also has positive probability in the observed data.

Theorem 4.7 establishes sufficient conditions for the informal deduction described in Example 4.1. We assume that all variables of interest are discrete, and will make use of an alternative notation for discrete conditional probabilities.

Definition 4.4

(Index notation for discrete conditional distributions) Given a joint probability distribution $H \sim P_\alpha^{XY}$ with $X$ and $Y$ discrete, let $H^{X\to Y}_{xy} \coloneqq \mu^{Y\mid X}(\{y\}\mid x)$, and let $H^{X\to Y}$ denote the map $(x, y) \mapsto H^{X\to Y}_{xy}$.

With regard to precedent, we specifically want to assume that the achievable distributions of inputs to Y have positive probability in the observed data.

Definition 4.5

(Precedent) Given a latent CIIR model ( P , ( E i , Z i , X i , Y i ) i N { c } ) with E , X , Y and Z all discrete, let H be the directing random measure of ( Z i , E i , X i , Y i ) i N .

We say that the consequences have ( X , E ) -precedent with respect to ( P , ( E i , X i , Y i , Z i ) i N { c } ) if for all α , P α satisfies:

$$P_\alpha^{E_c \mid X_c, H}(\cdot \mid x, h) \ll P_\alpha^{E_i \mid X_i, H}(\cdot \mid x, h) \quad \text{for } P_\alpha^{X_c, H}\text{-almost all } (x, h).$$

When we have precedent, we can compute the distribution of consequences $P_\alpha^{Y_c \mid X_c, H}$ by first calculating $P_\alpha^{Y_i \mid X_i, E_i, H}$ (which is unique up to $(x, e)$ pairs with 0 probability) and then re-weighting it according to $P_\alpha^{E_c \mid X_c, H}$.
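For finite spaces, the re-weighting just described is a single weighted sum. The sketch below is a minimal illustration under assumed toy numbers: p_y_given_xe plays the role of the response function recovered from passive data and p_ec_given_xc plays the role of the decision maker’s beliefs about E c given the X c their action brings about; precedent guarantees that the latter only puts mass where the former is identified.

import numpy as np

# Hypothetical response function estimated from passive data: entry [x, e, y].
p_y_given_xe = np.array([[[0.9, 0.1], [0.6, 0.4]],
                         [[0.7, 0.3], [0.2, 0.8]]])
# Hypothetical beliefs about E_c given X_c: entry [x, e].
p_ec_given_xc = np.array([[0.5, 0.5],
                          [0.1, 0.9]])

# Re-weight the response function by the distribution of E_c given X_c.
p_yc_given_xc = np.einsum('xe,xey->xy', p_ec_given_xc, p_y_given_xe)
print(p_yc_given_xc)   # each row is a distribution over Y_c for one value of X_c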

A further assumption for Theorem 4.7 is one we call absolute continuity of conditionals (ACC). The basic idea is that, after a decision maker decides that Y i is independent of Z i given X i , the decision maker’s model must be such that the distribution of $H^{Z\to EX}$ conditional on $H^{EXZ\to Y}$ is absolutely continuous with respect to the Lebesgue measure. This assumption rules out the possibility that distributions of unobserved variables are “fine-tuned” to mask variation in effects (like that described at the end of Example 4.1).

Definition 4.6

(Absolute continuity of conditionals [ACC]) Given a latent CIIR model ( P , ( E i , X i , Y i , Z i ) i N { c } ) with E , X , Y and Z all discrete, let H be the directing random measure of ( Z i , E i , X i , Y i ) i N .

We say that the options C have absolute continuity of conditionals with respect to ( P , ( E i , X i , Y i , Z i ) i N { c } ) if P satisfies either of the following:

$$P_\alpha^{H^{EZ\to X} \mid H^{EXZ\to Y}, H^{Z\to E}}(\cdot \mid h^{EXZ\to Y}, h^{Z\to E}) \ll U_{\Delta(X)^{E\times Z}} \qquad \forall \alpha,\ P_\alpha\text{-almost all } h^{EXZ\to Y}, h^{Z\to E},$$

where $U_{\Delta(X)^{E\times Z}}$ is the uniform measure on the set of functions from $E \times Z$ to discrete probability distributions over $X$.

Theorem 4.7 tells us: if we assume ( X , E ) -precedent and absolute continuity of conditionals, and if we accept Y i ⫫ Z i ∣ ( X i , C , H ) (perhaps based on data), then we can conclude that the ( X i , Y i ) pairs share conditionally independent and identical responses for all indices i ∈ N ∪ { c } . Then we can apply the reasoning we outlined at the start of Section 3.2: if the decision maker has prior knowledge of how their choices influence X c , then they can use this together with the response function observed in the passive data to work out how their actions will influence Y c .

Theorem 4.7

(Latent to observable IO contractibility) Given a latent CIIR model ( P , ( E i , X i , Y i , Z i ) i N { c } ) with E , X , Y , and Z all discrete, let H be the directing random measure of ( P , ( Z i , E i , X i , Y i ) i N ) .

Let $I \subseteq \Delta(Y)^{XZ}$ be the event $H^{Xz\to Y} = H^{Xz'\to Y}$ for all $z, z' \in Z$; i.e. the event that $Y_i$ is independent of $Z_i$ conditional on $X_i$ and $H^{XZ\to Y}$. Define $Q_\alpha \in \Delta(\Omega)$ to be the probability measure such that, for all measurable $A$,

$$Q_\alpha(A) \coloneqq P_\alpha^{\,\cdot\, \mid 1_I \circ H}(A \mid 1),$$

i.e. $Q_\alpha$ is $P_\alpha$ conditioned on $H^{XZ\to Y} \in I$, so $Y_i \perp\!\!\!\perp_Q Z_i \mid (X_i, H, C)$.

If Q satisfies ( X , E ) -precedent and absolute continuity of conditionals, then with respect to Q the pairs ( X i , Y i ) i N { c } share conditionally independent and identical responses.

Proof

We show that the assumption of conditional independence imposes a polynomial constraint on $H^{EZ\to X}$ which is nontrivial unless $Y_i \perp\!\!\!\perp (Z_i, E_i, C) \mid (X_i, H)$, and hence the solution set S for this constraint has measure 0 when this conditional independence does not hold. Therefore, the independence holds, and the conclusion of conditionally independent and identical responses follows.

The full proof is presented in Appendix C.□

4.2 Justifying the assumptions of Theorem 4.7

4.2.1 Structural justifications for ACC

We’ve offered some motivation for the assumption of precedent, but not yet for the assumption of absolute continuity of conditionals. Justifying this assumption is not straightforward: while it’s common to assume parameters are unconditionally distributed absolutely continuously with respect to the Lebesgue measure, this assumption involves conditioning on the event I (that Y i is independent of Z i conditional on X i and H ), itself a Lebesgue measure 0 event with respect to the joint distribution of all parameters. In such an event, we need to justify why we prefer the conclusion that E is independent of Y conditional on X and H to other configurations that also yield the original conditional independence.

We can make a case for preferring this interpretation by appealing to assumptions typically used to justify causal discovery. Those assumptions are, informally:

  1. In structural causal models, missing edges are common.

  2. The causal mechanisms encoded by structural models are not precisely aligned.

The first assumption, more precisely, is that conditional independences may have positive probability if they correspond to a structural causal model with a missing edge. Such an assumption can be found in, for example, decomposable scoring rules [33], and it is present in spirit in causal discovery based on the faithfulness assumption.

The second assumption is derived from the informal notion that the “causal mechanisms” encoded by structural models are not precisely aligned – this is the principle of independent causal mechanisms [14][2]. There are multiple precise interpretations of this informal principle, and a version suitable for our purposes is given in [15]. Given a graph G , for each “causal mechanism” – that is, each node X i together with all of its parents Pa G ( X i ) – we associate $|\mathrm{Pa}_G(X_i)| \times (|X_i| - 1)$ linearly independent parameters, and hold that the full set of parameters is absolutely continuous with respect to the Lebesgue measure on the product space.

A consequence of this is that if $H_1$ and $H_2$ are functions of disjoint sets of parameters associated with different causal mechanisms, then $P_\alpha^{H_1 \mid H_2}$ is almost surely absolutely continuous with respect to the pushforward of the Lebesgue measure. This is the key property we need to justify the absolute continuity of conditionals.
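The role this property plays can be seen in a small numerical experiment. In the sketch below (our illustration, assuming a binary graph Z → X → Y with an extra edge Z → Y), each mechanism’s conditional probability table is drawn independently from a Dirichlet distribution, so the joint parameter distribution is absolutely continuous; exactly “fine-tuned” conditional independences that are not implied by a missing edge then occur with probability zero, and in simulation the observed violations stay bounded away from zero.

import numpy as np

rng = np.random.default_rng(1)

def sample_mechanisms():
    # Independent, absolutely continuous priors over each causal mechanism.
    p_z = rng.dirichlet([1, 1])
    p_x_given_z = rng.dirichlet([1, 1], size=2)        # rows indexed by z
    p_y_given_xz = rng.dirichlet([1, 1], size=(2, 2))  # indexed by (x, z)
    return p_z, p_x_given_z, p_y_given_xz

def ci_violation(p_y_given_xz):
    # Sup-norm distance of Y from being independent of Z given X.
    return np.abs(p_y_given_xz[:, 0, :] - p_y_given_xz[:, 1, :]).max()

violations = [ci_violation(sample_mechanisms()[2]) for _ in range(10_000)]
print(min(violations))   # strictly positive: exact conditional independence is never hit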

Set this aside for a moment, and consider a causal discovery problem under the faithfulness assumption. Start with a particular set of hypothesised structural causal models: three observed variables, X , Y and Z , and a single unobserved variable E . Assume at the outset that there is an edge between Z and X that points in the direction of X and an edge between E and X that points in the direction of X ; other than that, the direction or presence of any other edges is unknown. We illustrate this family of graphs with solid lines for edges known to be present, dashed lines for edges that may or may not be present, arrows for known directions and undirected edges for “unknown” directions.

[Figure: the initial family of hypothesised causal graphs.]

Now, we observe that Y Z X . Assuming faithfulness, we must reduce this family of graphs to only those that have the d-separation Y Z X . This has a number of consequences:

  • Y cannot be directed towards X

  • Y cannot be directly connected to E

  • Y cannot be directly connected to Z

This leaves us with the reduced set of structural models:

[Figure: the reduced set of structural models.]

In every model in this collection, E is d-separated from Y by X which implies Y E X in every structural model (the key step of Theorem 4.7).

We claim that G 0 is exactly the set of acyclic structural models that imply the second of the two possible conditions for absolute continuity of conditionals:

$$P_\alpha^{H^{EZ\to X} \mid H^{EXZ\to Y}, H^{Z\to E}}(\cdot \mid h^{EXZ\to Y}, h^{Z\to E}) \ll U_{\Delta(X)^{E\times Z}} \qquad \forall \alpha,\ P_\alpha\text{-almost all } h^{EXZ\to Y}, h^{Z\to E}.$$

Recall that absolute continuity of conditionals is required to hold after accounting for the conditional independence. Thus, we consider the reduced set G 1 . Consider the following properties of G 1 :

  • Edge from Z to X , not X to Z

  • Edge from E to X , not X to E

  • No edge from Y to X , E or Z

This identifies $H^{EZ\to X}$ as a causal mechanism, and $H^{EXZ\to Y}$ is a function either of the causal mechanism $H^{X\to Y}$ or of $H^{Y}$ (depending on the presence of edges). Thus, both $H^{Z\to E}$ and $H^{EXZ\to Y}$ are functions of causal mechanisms disjoint from the mechanism $H^{EZ\to X}$, and according to our principle of independent causal mechanisms, $P_\alpha^{H^{EZ\to X} \mid H^{EXZ\to Y}, H^{Z\to E}}$ is almost surely absolutely continuous with respect to the Lebesgue measure.

4.2.2 Applying structural arguments for ACC

We have argued that a decision maker may infer that the conditional distribution of an output Y depends only on an input X under three conditions:

  1. The decision maker assumes the consequences of their actions have precedent in the observed data.

  2. The decision maker supposes that the observed and unobserved variables supporting the precedent assumption have certain causal structures.

  3. The decision maker observes a particular conditional independence.

There remain challenges to applying this theory to real-world decision problems. The first is shared with other theories of causal discovery from conditional independence: the result depends on the observation of exact conditional independence, but we can only ever observe approximate independence. This is a familiar problem for inferring causal relationships from conditional independence – see, for example, [34] where it is shown that spurious approximate conditional independences can be common in causal structures with sufficiently many edges.

To handle the case of approximate conditional independence, there are two assumptions we could explore strengthening in Theorem 4.7: precedent and absolute continuity of conditionals. Instead of precedent, we could consider bounds on the divergence between the distributions of ( E , X ) before and after action has been taken. For absolute continuity of conditionals, a promising direction for future work would be to consider joint distributions over causal mechanisms (potentially with respect to multiple structural models), which could enable us to quantify the probability of near-violations of ACC. Whether either or both of these extensions yields a usefully stronger result is an open question for future research.
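As a rough indication of what “observing” the conditional independence amounts to in practice, the following sketch (ours; it assumes discrete data arrays and uses an off-the-shelf chi-squared test rather than anything discussed in this paper) tests the independence of Y and Z within each stratum of X. It presumes every combination of Y and Z values appears in each stratum; small p-values in any stratum count against the approximate conditional independence.

import numpy as np
from scipy.stats import chi2_contingency

def conditional_independence_pvalues(x, y, z):
    """For each observed value of x, test independence of y and z within that stratum."""
    pvalues = {}
    y_values, z_values = np.unique(y), np.unique(z)
    for xv in np.unique(x):
        sel = (x == xv)
        table = np.array([[np.sum(sel & (y == yv) & (z == zv)) for zv in z_values]
                          for yv in y_values])
        pvalues[xv] = chi2_contingency(table)[1]
    return pvalues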

The second challenge is related to making assumptions about an unobserved variable E . We only need some variable E to exist satisfying the required assumptions. An optimist might be inclined to think that, if they can find a suitable Z , X and Y (with Z upstream of X , X dependent on Z and Y conditionally independent of Z given X ), then surely some E exists that satisfies the required structural assumption and the precedent condition. This might be too optimistic. The principle we invoked to justify ACC via causal structure was that conditional independences due to missing edges are common, but conditional independences for any other reason are rare (loosely speaking). This same principle implies that, for any particular E , it is possible that no edges to X or Z exist, corresponding to a structural model like this:

[Figure: a structural model in which E has no edges into X or Z.]

Under the standard interventional interpretation, the effect of X on Y is identified in this model. However, we are considering prediction of actions under the assumption of precedent, and in this case the consequences of actions on Y are not determined by their effect on X . A decision maker’s actions could lead to changes in the distribution of ( E , X ) and in general this would lead to a different conditional distribution for Y given X . This could be because the actions have “side effects” or – as is perhaps commonly the case – the decision maker experiences “covariate shift” in which this distribution changes regardless of their efforts.

To illustrate with an example, suppose that our author observes in their historical data that book sales S are independent of the author’s identity N given genre G , and genre and identity are dependent. Our intuition says that author identity must be the antecedent of genre and not the other way around, so the assumptions concerning observable variables are satisfied.

Can we fit a suitable E into this model? We could spend a long time discussing this question. We want something that is causally antecedent to X , supports the precedent assumption and either has no “missing edges” to X or at least allows reasonable judgements about what edges might be missing. Maybe there is a more fundamental theory that connects assumptions of “basic physics” to the notions of causal structure and precedent we discuss here, which might shed light on when such an E exists, but searching for such a theory is beyond the scope of this work.

Instead of searching for a completely satisfactory theoretical justification for the required structural assumptions, we could empirically test the principle suggested by Theorem 4.7. We could, for example, test whether the optimist’s view generally holds up in practice, however compelling its theoretical motivation may be. This would involve searching for datasets where there are variables Z , X and Y such that Y ⫫ Z ∣ X (approximately), Z is not independent of X and Z also temporally precedes X (or we have some other good reason for thinking that Z is upstream of X ). If reliable experimental data are also available, then we could test whether the distribution of Y given X matches in both the observational and experimental conditions (investigations along these lines have been conducted in several fields [30–32]).
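The comparison with experimental data could be organised along the following lines; this sketch is ours and assumes discrete records, simply estimating P(Y | X = x) separately from observational and experimental samples and reporting the total variation distance between the two estimates for each x.

import numpy as np

def conditional_table(x, y, x_values, y_values):
    # Empirical estimate of P(Y = y | X = x) for each listed value.
    table = np.zeros((len(x_values), len(y_values)))
    for i, xv in enumerate(x_values):
        sel = (x == xv)
        for j, yv in enumerate(y_values):
            table[i, j] = np.mean(y[sel] == yv) if sel.any() else np.nan
    return table

def tv_per_x(x_obs, y_obs, x_exp, y_exp):
    x_values = np.union1d(x_obs, x_exp)
    y_values = np.union1d(y_obs, y_exp)
    p_obs = conditional_table(x_obs, y_obs, x_values, y_values)
    p_exp = conditional_table(x_exp, y_exp, x_values, y_values)
    return dict(zip(x_values, 0.5 * np.abs(p_obs - p_exp).sum(axis=1)))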

5 Conclusion

There seems to be a large gap between the set of things that people routinely learn to manipulate and the set of circumstances where contemporary theories of causal inference tell us that valid inference is possible. We would like to be able to close this gap; we want to be able to build systems that learn to manipulate the world at least as well as people do, and ideally we would like to understand how they learn to manipulate it. How to do this is a wide open question; there are many plausible approaches, and it is not yet clear which will be the most fruitful.

We explored the possibility that our understanding of causal inference is missing a fundamental principle – specifically, the principle of symmetry. Ordinary statistics gets a lot of mileage from the idea that no observation is “special”; one can be exchanged for another. To a decision maker, the consequences of their actions are always special, but they are not arbitrarily special. When I am facing a decision, I am usually aware that many other people have faced similar decisions, with access to similar information and similar capabilities and it would be foolish to think that the consequences of my actions are vastly different to consequences already observed. We proposed precedent as a symmetry principle that captures the idea that, while a decision maker may be somewhat special, they are not arbitrarily special, and we show how (in combination with the principle of independent causal mechanisms) this principle offers a novel justification for valid causal inference.

We believe this approach is promising for two main reasons. First, the assumption of precedent seems to us better motivated than the assumption that actions are modelled by causal interventions (though this is admittedly a judgement call). Second, standard structural models invoke two independent “structural” assumptions: interventions and the principle of independent causal mechanisms. Our framework, on the other hand, derives intervention-like inferences from the combination of precedent and the principle of independent causal mechanisms, suggesting that there may be a more parsimonious theory for interpreting causal structure. Much work remains to flesh out the theoretical foundations, elaborate applications and empirically test this approach, but we think it opens up promising and novel lines of research.

Acknowledgements

Thanks to members of the ANU College of Engineering and Computer Science for helpful discussion and feedback, especially Elliot Catt, Tom Everitt and Sarita Rosenstock.

  1. Funding information: RW’s contribution was supported by the Deutsche Forschungsgemeinschaft under Germany’s Excellence Strategy – EXC number 2064/1 – Project number 390727645. DJ’s contribution was supported by an Australian Government Research Training Program Scholarship.

  2. Author contributions: David O. Johnston conceived the project, developed the theoretical framework, wrote the proofs and drafted the manuscript. Robert C. Williamson and Cheng Soon Ong provided substantial guidance in developing and refining the theoretical ideas. All authors discussed the results and implications and commented on the manuscript.

  3. Conflict of interest: No conflicts of interest to declare.

  4. Data availability statement: Not applicable.

Appendix A String diagrams

We use a string diagram notation to represent probabilistic functions. This is a notation created for reasoning about abstract Markov categories and is somewhat different to existing graphical languages. The main difference is that in our notation wires represent variables and boxes (which are like nodes in directed acyclic graphs) represent probabilistic functions. Standard directed acyclic graphs annotate nodes with variable names and represent probabilistic functions implicitly. The advantage of explicitly representing probabilistic functions is that we can write equations involving graphics. This is introduced in Section A.

We make use of string diagram notation for probabilistic reasoning. Graphical models are often employed in causal reasoning, and string diagrams are a kind of graphical notation for representing Markov kernels. The notation comes from the study of Markov categories, which are abstract categories that represent models of the flow of information. For our purposes, we don’t use abstract Markov categories but instead focus on the concrete category of Markov kernels on standard measurable sets.

A coherence theorem exists for string diagrams and Markov categories. Applying planar deformation or any of the commutative comonoid axioms to a string diagram yields an equivalent string diagram. The coherence theorem establishes that any proof constructed using string diagrams in this manner corresponds to a proof in any Markov category [35]. More comprehensive introductions to Markov categories can be found in [36,37].

A.1 Elements of string diagrams

Markov kernels are drawn as boxes with input and output wires, and probability measures (which are Markov kernels with the domain { * } ) are represented by triangles:

[string diagram]

Given two Markov kernels L : X Y and M : Y Z , the product L M is represented by drawing them side by side and joining their wires:

[string diagram]

Given kernels K : W Y and L : X Z , the tensor product K L : W × X Y × Z is graphically represented by drawing kernels in parallel:

[string diagram]

A space X is identified with the identity kernel id X : X Δ ( X ) . A bare wire represents the identity kernel:

[diagram: a bare wire labelled X ]

Product spaces X × Y are identified with tensor product of identity kernels id X id Y . These can be represented either by two parallel wires or by a single wire representing the identity on the product space X × Y :

[diagram: two parallel wires labelled X and Y , equal to a single wire labelled X × Y ]

A kernel L : X Δ ( Y Z ) can be written using either two parallel output wires or a single output wire, appropriately labelled:

[string diagram]

We read diagrams from left to right (this is somewhat different to [36–38] but in line with [35]), and any diagram describes a set of nested products and tensor products of Markov kernels. There is a collection of special Markov kernels for which we can replace the generic “box” of a Markov kernel with diagrammatic elements that are visually suggestive of what these kernels accomplish.
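Before introducing the special maps, it may help to have a computational picture of the operations so far. The following sketch (ours, not part of the paper’s formalism) spells out the finite case: a Markov kernel between finite sets is a row-stochastic matrix, joining wires corresponds to matrix multiplication, and drawing kernels in parallel corresponds to the Kronecker product.

import numpy as np

L = np.array([[0.9, 0.1],      # a kernel L : X -> Y, rows indexed by x, columns by y
              [0.2, 0.8]])
M = np.array([[0.5, 0.5],      # a kernel M : Y -> Z
              [0.3, 0.7]])

LM = L @ M                     # the product L M : X -> Z (joining the Y wires)
tensor = np.kron(L, M)         # the tensor product of L and M (kernels drawn in parallel)

assert np.allclose(LM.sum(axis=1), 1.0) and np.allclose(tensor.sum(axis=1), 1.0)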

A.2 Special maps

Definition A.1

(Identity map) The identity map id X : X X defined by ( id X ) ( A x ) = δ x ( A ) for all x X , A X , is represented by a bare line.

[diagram: a bare wire labelled X ]

Definition A.2

(Erase map) Given the 1-element set { * } , the erase map del X : X → { * } is defined by ( del X ) ( { * } ∣ x ) = 1 for all x ∈ X . It “discards the output.” It looks like a lit fuse:

[string diagram]

Definition A.3

(Swap map) The swap map Swap X , Y : X × Y Y × X is defined by ( Swap X , Y ) ( A × B x , y ) = δ x ( B ) δ y ( A ) for ( x , y ) X × Y , A X and B Y . It swaps two inputs and is represented by crossing wires:

[string diagram]

Definition A.4

(Copy map) The copy map Copy X : X X × X is defined by ( Copy X ) ( A × B x ) = δ x ( A ) δ x ( B ) for all x X , A , B X . It makes two identical copies of the input, and is drawn as a fork:

[string diagram]

Definition A.5

( n -fold copy map) The n -fold copy map Copy X n : X X n is given by the recursive definition

[string diagram: recursive definition of the n-fold copy map]

A.2.1 Semidirect product

Given K : X Y and L : Y × X Z , the semidirect product is graphically represented by connecting K and L and “keeping an extra copy”:

[string diagram]

The semidirect product can be used to join a marginal P X and a conditional P Y ∣ X to form a joint P X , Y .
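In the finite case this operation is just the familiar chain rule; the sketch below (ours, with assumed toy numbers) combines a marginal with a conditional to produce the joint distribution.

import numpy as np

p_x = np.array([0.3, 0.7])                   # marginal P_X
p_y_given_x = np.array([[0.9, 0.1],          # conditional P_{Y|X}, rows indexed by x
                        [0.4, 0.6]])

p_xy = p_x[:, None] * p_y_given_x            # semidirect product: P_{X,Y}(x, y) = P_X(x) P_{Y|X}(y|x)
assert np.isclose(p_xy.sum(), 1.0)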

A.2.2 Plates

In a string diagram, a plate annotated i ∈ A means the tensor product of the A elements that appear inside the plate. A wire crossing from outside a plate boundary to the inside of a plate indicates an A -fold copy map, which we indicate by placing a dot on the plate boundary. For our purposes, wires that begin inside a plate always terminate within the plate; the “output” of the diagram is the A -fold tensor product of wires within the plate.

Thus, given K i : X Y for i A ,

[string diagram]

A.3 Commutative comonoid axioms

Diagrams in Markov categories satisfy the commutative comonoid axioms.

(A1) [string diagram]

(A2) [string diagram]

as well as compatibility with the monoidal structure

[string diagram]

and the naturality of del, which means that

(A3) [string diagram]

(we do not need a deeper understanding of naturality here)

A.3.1 Markov kernels associated with functions

For any measurable function f : X → Y , we can associate a deterministic Markov kernel F f : X → Y defined by ( F f ) ( B ∣ x ) = δ f ( x ) ( B ) for all x ∈ X , B ∈ ℬ Y – that is, F f is the Markov kernel that maps a point x deterministically to f ( x ) .

A.4 Manipulating string diagrams

A morphism in a Markov category is deterministic iff it commutes with the copy map.

Definition A.6

(Copy map commutes for deterministic morphisms) For K : X Y

(A4) [string diagram]

holds iff K is deterministic.

Deterministic Markov kernels are the Markov kernels where, for any x , B , K ( B x ) { 0 , 1 } .

Planar deformations along with applications of equations (A1) through (A4) give us a set of rules for transforming one string diagram into an equivalent one.

String diagrams can always be converted into definitions involving integrals and tensor products. A number of shortcuts can help to carry out these translations efficiently.

For arbitrary K : X × Y Z , L : W Y

[string diagram]

That is, an identity map “passes its input directly to the next kernel.”

For arbitrary K : X × Y × Y Z :

[string diagram]

That is, the copy map “passes along two copies of its input” to the next kernel in the product.

For arbitrary K : X × Y Z

[string diagram]

The swap map before a kernel switches the input arguments.

For arbitrary K : X Y × Z

[string diagram]

Given K : X Y and L : Y Z :

[string diagram]

Thus, the action of the del map is to marginalise over the deleted wire. With integrals, we can write

$$(K \ltimes L)(\mathrm{id}_Y \otimes \mathrm{del}_Z)(A \times \{*\} \mid x) = \int_Y \int_Z \delta_y(A)\, \delta_*(\{*\})\, L(dz \mid y)\, K(dy \mid x) = \int_A K(dy \mid x) = K(A \mid x).$$
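A finite-dimensional check of this identity (ours, with assumed toy kernels): form the semidirect product of K with L and then discard the Z output; the result is K again.

import numpy as np

K = np.array([[0.6, 0.4],      # K : X -> Y
              [0.1, 0.9]])
L = np.array([[0.5, 0.5],      # L : Y -> Z
              [0.2, 0.8]])

# Semidirect product: joint kernel X -> Y x Z with entries K(y|x) L(z|y).
KL = K[:, :, None] * L[None, :, :]

# Composing with del_Z marginalises over z and recovers K.
assert np.allclose(KL.sum(axis=2), K)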

A.4.1 Labelling wires with variable names

The previous examples all labelled wires with spaces. Going forward, we will instead label wires with variable names. Given a decision model ( P , ( Ω , ) , ( C , C ) ) and random variables X 1 , X 2 and X 3 , we can draw the Markov kernel P α X 1 , X 2 X 3 as follows:

[string diagram]

The wire labels identify the diagram on the right as a picture of P α X 1 , X 2 X 3 ; it cannot be any arbitrary Markov kernel X 3 X 1 × X 2 . Wire labels identify variables with wires “in the obvious way,” that is, if I delete the wire labelled X 2 , then I obtain the diagram for P α X 1 X 3 .

[string diagram]

The semidirect product of conditional distributions is the “joint conditional”:

[string diagram]

B Symmetries of conditional probabilities

B.1 Equality of equally sized contractions

This is the proof of Theorem 3.8.

All finite permutations can be written as products of transpositions, so proving that a property holds for all finite transpositions is enough to show that it holds for all finite permutations. It’s useful to define a notation for transpositions.

Definition B.1

(Finite transposition) Given two equally sized sequences A , B N n with A = ( a i ) i [ n ] , B = ( b i ) i [ n ] , A B : N N is the permutation such that

[ A B ] ( a i ) = b i

that sends the i th element of A to the i th element of B and vice versa. Note that B A is the inverse of A B .

Lemma B.2 is used to extend conditional probabilities of finite sequences to infinite ones.

Lemma B.2

(Infinitely extended kernels) Given a collection of Markov kernels K i : W × X N Y i for all i N , if we have for every j > i

(A5) K j ( id Y i del Y j i ) = K i del X j i ,

then there is a unique Markov kernel K : X N Y N such that for all i , j N , j > i

K ( id Y i del Y N ) = K i del X j i .

Proof

Take any x ∈ X N and, for each n , let x n ∈ X n be the first n elements of x . By equation (A5), for any A i ∈ ℬ Y , i ∈ [ m ]

K n ( 𝖷 i [ m ] A i × Y n m x n ) = K m ( 𝖷 i [ m ] A i x m ) .

Furthermore, by the definition of the Swap map for any permutation ρ : [ n ] [ n ] ,

K n Swap ρ ( 𝖷 i [ m ] A ρ ( i ) × Y n m x n ) = K n ( 𝖷 i [ m ] A i × Y n m x n ) .

Thus, by the Kolmogorov extension theorem [26], for each x X N , there is a unique probability measure Q x Δ ( Y N ) satisfying

(A6) Q x ( 𝖷 i [ n ] A i × Y N ) = K n ( 𝖷 i [ n ] A ρ ( i ) x [ n ] ) .

Furthermore, for each { A i Y i N } , n N , note that for p > n

Q x ( 𝖷 i ∈ [ n ] A i × Y N ) ≥ Q x ( 𝖷 i ∈ [ p ] A i × Y N ) ≥ Q x ( 𝖷 i ∈ N A i ) ,

so by the Monotone convergence theorem, the sequence Q x ( 𝖷 i [ n ] A i ) converges as n to Q x ( 𝖷 i N A i ) . x Q x Z n ( 𝖷 i [ n ] A i ) is measurable for all n , { A i Y i N } by equation (A6), and so x Q x is also measurable.

Thus, x Q x is the desired Markov kernel K .□

Corollary B.3

Given ( P , Ω , C ) , W : Ω V and two pairs of sequences ( V , X ) ( V i , X i ) i N and ( Y , Z ) ( Y i , Z i ) i N with corresponding variables taking values in the same sets V = Y and X = Z , if ( P , V , X ) and ( P , Y , Z ) are both local over W and for all α

P α X [ n ] W V [ n ] = P α Z [ n ] W Y [ n ]

for all n N then for all α

P α X W V = P α Z W Y ;

Proof

Fix arbitrary α C . By assumption of locality,

P α X [ n ] W V [ n ] del V N = P α X W V ( id X n del X N ) P α Z [ n ] W Y [ n ] del V N = P α Z W Y ( id X n del X N ) ;

Hence for all n , m > n ,

P α X [ m ] W V [ m ] ( id X n del X m n ) = P α Z [ m ] W Y [ m ] ( id X n del X m n ) = P α X [ n ] W V [ n ] del V m n

and, in particular, by Lemma B.2, P α X W V and P α Z W Y are the limits of the same sequence. α was arbitrary, so this holds for all α .□

Theorem B.4

Given a sequential input–output model ( P , D , Y ) and some W , P α Y WD is IO contractible over W if and only if for all subsequences A , B N A (not necessarily finite), A = B and for every α

P α Y A WD A , N \ A = P α Y B WD B , N \ B ;

Proof

Only if: For a sequence of natural numbers Z N A , let del Z be the Markov kernel associated with the map that sends Y to Y Z ( Y i ) i Z .

If A is finite, then let n A , and by exchange commutativity,

P α Y A WD A , N \ A = P α Y A WD A [ n ] = P α Y WD A [ n ] del A = P α Y [ n ] A WD del A ;

Use the fact that ( [ n ] A ) ( B [ n ] ) = B A and apply exchange commutativity to obtain

P α Y [ n ] A WD del A = P α Y ( [ n ] A ) ( B [ n ] ) WD B [ n ] del A = P α Y WD B [ n ] del B = P α Y B WD B , N \ B

if A is infinite, then we can take finite subsequences A m that are the first m elements of A and similarly for B m . Then by previous reasoning,

P α Y A m WD A m [ m ] = P α Y [ m ] WD = P α Y B m WD B m [ m ] ,

then by Corollary B.3

P α Y A WD A [ n ] = P α Y B m WD B m [ m ] ;

Finally, by locality,

P α Y A WD A [ n ] = P α Y A WD A del D N \ A ;

If: Taking A = [ n ] for all n establishes locality, and taking A = ( ρ ( i ) ) i N for arbitrary finite permutation ρ establishes exchange commutativity.□

B.2 Examples of symmetries

These are the examples referenced in Section 3.2. Example B.5 shows that neither locality nor exchange commutativity is implied by the other.

Example B.5

We prove the claim by way of presenting counterexamples.

First, a model that exhibits exchange commutativity but not locality. Suppose D = Y = { 0 , 1 } and P α Y D : D N Y N is given by

P α Y D ( 𝖷 i N A i d N ) = i N δ lim n i N d i n ( A i )

for some sequence d N such that this limit exists. Then for any finite permutation ρ

P α Y ρ D ρ ( 𝖷 i N A i d N ) = i N δ lim n i N d ρ 1 ( i ) n ( A ρ 1 ( i ) ) = P α Y D ( 𝖷 i N A i d N ) ,

so ( P α , D , Y ) commutes with exchange, but

P α Y 1 D ( A 1 0 , 1 , 1 , 1 ) = δ 1 ( A 1 ) P α Y 1 D ( A 1 0 , 0 , 0 , 0 ) = δ 0 ( A 1 ) ,

so ( P α , D , Y ) is not local.

Next, a model that satisfies locality but does not commute with exchange. Suppose again D = Y = { 0 , 1 } and P α Y D : D N Y N is given by

P α Y D ( 𝖷 i N A i d N ) = i N δ i ( A i ) ,

then

P α Y ρ D ρ ( 𝖷 i N A i d N ) = i N δ i ( A ρ 1 ( i ) ) i N δ i ( A i ) = P α Y D ( 𝖷 i N A i d N ) ,

so ( P α , D , Y ) does not commute with exchange, but for all n ,

P α Y [ n ] D ( 𝖷 i [ n ] A i d N ) = i [ n ] δ i ( A ρ 1 ( i ) ) = P α Y [ n ] D ( 𝖷 i [ n ] A i ( 0 ) i N ) ,

so ( P α , D , Y ) is local.

Although locality seems to be an assumption that there is no interference between inputs and outputs of different indices, by itself it actually permits models with certain kinds of interference. This is shown in Example B.6.

Example B.6

Consider an experiment where I first flip a coin and record the result of this flip as the outcome Y 1 of “step 1.” Subsequently, I can either copy the outcome from step 1 to the result for “step 2” (this is the input D 1 = 0 ), or flip a second coin and use its result for step 2 (this is the input D 1 = 1 ). D 2 is an arbitrary single-valued variable. Then for all d 1 , d 2 ,

$$P^{Y_1 \mid D}(y_1 \mid d_1, d_2) = 0.5, \qquad P^{Y_2 \mid D}(y_2 \mid d_1, d_2) = 0.5.$$

Thus, the marginal distribution of both experiments in isolation is Bernoulli ( 0.5 ) no matter what choices I make, but the input D 1 affects the joint distribution of the results of both steps, which is not ruled out by locality.
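A quick simulation (ours; the coin-flip mechanics are as described in the example) makes the point concrete: both marginals stay at one half whatever choice is made, while the joint distribution of the two steps changes with D 1 .

import numpy as np

rng = np.random.default_rng(2)
n = 100_000

def run(d1):
    y1 = rng.integers(0, 2, n)
    y2 = y1.copy() if d1 == 0 else rng.integers(0, 2, n)   # copy step 1, or flip again
    return y1, y2

for d1 in (0, 1):
    y1, y2 = run(d1)
    print(f"D1={d1}: P(Y1=1) ~ {y1.mean():.2f}, P(Y2=1) ~ {y2.mean():.2f}, "
          f"P(Y1=Y2) ~ {(y1 == y2).mean():.2f}")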

B.3 Representation theorem preliminaries

This is the proof of Lemma 3.15 and Theorem 3.16 from Section 3.4. In addition, Lemmas B.13 and B.15 are presented and proved; these will later be used in the proof of Theorem 3.17.

Note that these proofs use the string diagram notation explained earlier in Appendix A. First, we will reproduce definitions of locality and exchange commutativity with equivalent statements in string diagram notation.

Definition B.7

Given a sequential input–output model ( P , D , Y ) along with some W : Ω W , for α C we say P α Y WD is local over W if for all α C , n N

[string diagram]

Definition B.8

Given a sequential input–output model ( P , D , Y ) along with some W : Ω W , we say ( P , D , Y ) commutes with exchange over W if for all finite permutations ρ : N N and all α C

P α Y WD = P α Y ρ WD ρ .

We say ( P , D , Y ) commutes with exchange if there is some W such that ( P , D , Y ) commutes with exchange over W .

The following definitions are also reproduced for the reader’s convenience.

Definition B.9

Given a sequential input–output model ( P , D , Y ) on ( Ω , ) with countable D , # j k is the variable

# j k i = 1 k 1 D i = j ;

In particular, # j k is equal to the number of times D i = j over all i < k .

Definition B.10

Given a sequential input–output model ( P , D , Y ) on ( Ω , ) , define the tabulated conditional distribution Y D : Ω Y N × D by

Y i j D = k = 1 # j k = i 1 D k = j Y k ;

That is, the ( i , j ) th coordinate of Y D ( ω ) is equal to the coordinate Y k ( ω ) for which the corresponding D k ( ω ) is the i th instance of the value j in the sequence ( D 1 ( ω ) , D 2 ( ω ) , ) , or 0 if there are fewer than i instances of j in this sequence.
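For finite prefixes the tabulation is easy to compute directly; the following sketch (ours, for illustration only) builds the array entry by entry: position ( i , j ) holds the output observed on the i th occasion that input j was used, and 0 when input j has occurred fewer than i times.

from collections import defaultdict

def tabulate(d_seq, y_seq, inputs, depth):
    occurrences = defaultdict(list)
    for d, y in zip(d_seq, y_seq):
        occurrences[d].append(y)
    return [[occurrences[j][i] if i < len(occurrences[j]) else 0 for j in inputs]
            for i in range(depth)]

print(tabulate(d_seq=[0, 1, 1, 0, 1], y_seq=[5, 7, 2, 9, 4], inputs=[0, 1], depth=3))
# [[5, 7], [9, 2], [0, 4]]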

Definition B.11

Given a sequential input–output model ( P , D , Y ) with D countable, let E ⊆ D N be the set of all sequences x such that for all j ∈ D

$$\sum_{i=1}^{\infty} \llbracket x_i = j \rrbracket = \infty.$$

If P α D ( E ) = 1 for all α , then we say D is almost surely infinite.

Lemma B.12

Suppose a sequential input–output model ( P , D , Y ) is given with D countable and D almost surely infinite. Then for some W , α , P α Y WD is IO contractible if and only if

P α Y WD ( 𝖷 i N A i w , ( d i ) i N ) = P α ( Y i d i D ) i N W ( 𝖷 i N A i w ) A i Y D , w W , d i D ,

where F lu is the Markov kernel associated with the lookup map

lu : X N × Y N × D Y ( ( x i ) N , ( y i j ) i , j N × D ) ( y i d i ) i N ,

and for any finite permutation within rows η : N × D N × D ,

(A7) P α ( Y i j D ) N × D W = P α ( Y η ( i , j ) D ) N × D W .

[string diagram]

Proof

Only if: We define a random invertible function R : Ω × N → N × D that reorders the indices so that, for i ∈ N , j ∈ D , D R − 1 ( i , j ) = j almost surely. We then use IO contractibility to show that P α Y ∣ D ( ⋅ ∣ d ) is equal to the distribution of the elements of Y D selected according to d ∈ D N .

Note that at most one of # j k = i 1 D k = j and # j l = i 1 D l = j can be greater than 0 for k l and, by assumption, j D k N # j k = i 1 D k = j = 1 almost surely (that is, for any i , j , there is some k such that D k is the i th occurrence of j ). Define R k : Ω N × D by ω arg max i N , j D # j k = i 1 D k = j ( ω ) (i.e. R k returns the ( i , j ) pair, where j is the value of D k and i is the count of j occurrences up to D k ). Let R : N N × D by k R k . R is almost surely bijective and

Y D ( Y i j D ) i N , j D = ( Y R 1 ( i , j ) ) i N , j D Y R 1 .

By construction, D R 1 ( i , j ) = j almost surely for all α ; that is, D R 1 is almost surely equal to e ( e i j ) i N , j D such that e i j = j for all i . Hence (almost surely),

(A8) P Y D W D R 1 ( A w , d ) = P Y R 1 W D R 1 ( A w , d ) = P Y R 1 W D R 1 ( A w , e ) = P Y D W ( A w )

for any d D N .

Now,

(A9) P α Y R 1 W D R 1 ( A w , d ) = R P α Y ρ W D ρ ( A d ) P α R 1 W D R 1 ( d ρ w , d ) ;

For each ρ , define ρ n : N N as the finite permutation that agrees with ρ on the first n indices and is the identity otherwise. By IO contractibility, for n N

P Y ρ n ( [ n ] ) W D ρ n ( [ n ] ) = P Y ρ ( [ n ] ) W D ρ ( [ n ] ) = P Y [ n ] W D [ n ]

By Corollary B.3, it must therefore be the case that

P Y W D = P Y ρ W D ρ ;

Then from equation (A9),

(A10) P α Y R 1 W D R 1 ( A w , d ) = R P α Y ρ W D ρ ( A d ) P α R 1 W D R 1 ( d ρ w , d ) = R P Y WD ( A w , d ) P α R 1 W D R 1 ( d ρ w , d ) = P Y WD ( A w , d )

almost surely for all i , j N . Then by equation (A8) and equation (A10)

(A11) P α Y D W ( A w ) = P α Y WD ( A w , e )

Take some d D N . From equation (A11) and IO contractibility of P Y WD ( A e ) ,

( P α Y D W id D ) F lu ( A w , d ) = P α ( Y i d i D ) i N W ( A d ) = P α ( Y i d i ) i N WD ( A w , e ) = P α ( Y i d i ) i N W ( D i d i ) N ( A w , ( e i d i ) i N ) = P α Y WD ( A w , ( e i d i ) i N ) = P α Y WD ( A w , d N ) ;

It remains to be shown that Y D is invariant to finite permutations within rows. Consider some finite permutation within columns η : N × D N × D , note that e η ( i , j ) = j and hence ( e η ( i , j ) ) i N , j D = e . Thus,

P α ( Y η ( i , j ) D ) N × D W ( A w ) = P α ( Y D ) N × D W Swap η ( A w ) = P α Y WD Swap η ( A w , e ) from Eq. (14) = P α Y η WD ( A w , e ) = P α Y WD η 1 ( A w , e ) by exchange commutativity = P α Y WD ( A w , ( e η 1 ( i , j ) ) i N , j D ) = P α Y WD ( A w , e ) = P α ( Y i j D ) N × D W ( A w ) ; from Eq. (14)

If: We construct a conditional probability according to Definition 3.10 and verify that it satisfies IO contractibility.

Suppose

where P α Y D W satisfies equation (A7).

Consider any two d , d D N such that for some S , T N with S = T = n , d S = d T . Let S T be the transposition that swaps the i th element of S with the i th element of T for all i .

P α Y S WD ( 𝖷 i [ n ] A i w , d ) = P α ( Y i d i D ) i S W ( 𝖷 i [ n ] A i w ) = P α ( Y S T ( i ) d i D ) i S W ( 𝖷 i [ n ] A i w ) = P α ( Y i d S T ( i ) D ) i T W ( 𝖷 i [ n ] A i w ) = P α ( Y i d i D ) i T W ( 𝖷 i [ n ] A i w ) = P α Y T WD ( 𝖷 i [ n ] A i w , d )

and, in particular, taking T = [ n ]

= P α Y [ n ] WD ( × i [ n ] A i w , d ) ,

but d is an arbitrary sequence such that the T elements match the S elements of d , so this holds for any other d whose T elements also match the S elements of d . That is,

P α Y S WD ( × i [ n ] A i w , d ) = ( P α Y [ n ] WD [ n ] del D N ) ( × i [ n ] A i w , d ) ,

so K is IO contractible by Theorem 3.8.□

[string diagram]

As a consequence of Lemma 3.15 along with De Finetti’s representation theorem, we can say that given ( P , D , Y ) IO contractible, conditioning on H renders the columns of Y D independent and identically distributed.

Lemma B.13

Suppose a sequential input–output model ( P , D , Y ) is given with D countable, D almost surely infinite and ( P , D , Y ) IO contractible over the same W . Then, letting H be the directing random conditional of ( P , D , Y ) (Definition 3.12) and Y i D ( Y i j D ) j D , we have for all i N , Y i D ( Y N \ { i } D , W , C ) H , A Y D and almost all ν H

(A12) [string diagram]

Proof

Fix w ∈ W , α ∈ C and consider P α , w Y D ≝ P α Y D ∣ W ( ⋅ ∣ w ) . From Lemma 3.15, we have the exchangeability of the sequence ( Y 1 D , Y 2 D , … ) with respect to ( P α , w , Ω , ℱ ) as a special case of the invariance of P α ( Y i j D ) N × D ∣ W to permutations of rows. By the column exchangeability of P α , w Y D , from Prop 1.4 of [29] (where H is the directing random measure, Definition 3.11)

Because the right-hand side does not depend on w , we can say

(A13) [string diagram]

equation (A12) follows from this independence.

Because the right-hand side of (A13) also does not depend on α we have Y D ( W , C ) H . Further application of Prop. 1.4 of [29] yields Y i D ( Y N \ { i } D , W ) ( H , C ) , and for almost all ν H ,

P α Y i D H ( A ν ) = ν ( A ) .

The right-hand side does not depend on α , which yields Y i D ( Y N \ { i } D , W , C ) H .□


Theorem B.14

Suppose a sequential input–output model ( P , D , Y ) is given with D countable, D almost surely infinite and for some W , P α Y WD is IO contractible for all α . Consider an infinite set A N , and let D A ( D i ) i A and Y A ( Y i ) i A such that D A is also almost surely infinite. Then H A , the directing random conditional of ( P , D A , Y A ) is almost surely equal to H , the directing random conditional (Definition 3.12) of ( P , D , Y ) .

Proof

The strategy we will pursue is to show that an arbitrary subsequence of ( D i , Y i ) pairs induces a random contraction of the rows of Y D . Then we show that the contracted version of Y D has the same distribution as the original, and consequently, the normalised partial sums converge to the same limit.

Define Y D , A as the tabulated conditional of ( D A , Y A ) , i.e. let # j A , k be the count restricted to A :

# j A , k i [ k 1 ] A D i = j ,

then

Y i j D , A k A # j A , k = i 1 D k = j Y k = k A # j A , k = i 1 D k = j Y R k j D ;

That is, defining Q : N N by i k A # j A , k = i 1 D k = j R k , then

(A14) Y i j D , A = Y Q ( i ) j D ,

where Q ( i ) N by the assumption that each value of D occurs infinitely often in A (otherwise Q ( i ) might be 0).

Equation (A14) is what is meant by “the subsequence ( D A , Y A ) induces a random contraction over the rows of Y D .” We will now show that Y D , A has the same distribution as Y D .

Let con q : Y N × D Y N × D be the Markov kernel associated with the function that sends ( Y i j D ) i N , j D to ( Y q ( i ) j D ) i N , j D . Then for any B Y N × D , w , q :

(A15) P α Y D , A WQ ( B w , q ) = P α Y D W con q ( B w ) = P α Y WD con q ( B w , e ) by eq. (14) = P α Y WD ( B w , e ) by Theorem 3.8 = P α Y D W ( B w ) . by eq. (14)

Finally, take H A to be the directing random measure of Y D , A . From the equality (A15), together with the fact that there is a one-to-one map between directing random measures and exchangeable distributions, we conclude that H A ≐ H .□

The following is a technical lemma that will be used in Theorem 3.17.

Lemma B.15

Suppose a sequential input–output model ( P , D , Y ) is given with D countable, D almost surely infinite, and for some W P α Y WD is IO contractible over W for all α , and for all α .

Recall that F lu is defined in Lemma 3.15. Then Y W ( H , D , C ) , where H is the directing random conditional associated with P α Y WD , and for all α .

[string diagrams]

Proof

We show that the function that maps the variables Y and D to H also maps Y D and a constant e D N to H with H H . The result then follows from disintegration along with a conditional independence given by Lemma 3.15.

Y D is a function of Y and D (see Definition 3.10) and H is a function of Y D . Say f : Y × D H is such that H = f ( Y , D ) (Definition 3.11). Because H = f ( Y , D ) , we have H ( W , C ) ( Y , D ) . Thus,

For a sequence d D N where each j D occurs infinitely often, take [ d = j ] i to be the i th coordinate of d equal to j D and # [ d = j ] i to be the position in d of [ d = j ] i . Concretely, f is given by

f ( y , d ) = 𝖷 j D A j lim n 1 n i = 1 n j D 1 A j ( y # [ d = j ] i ) f d ( y ) ,

where the limit exists. Note that for y D Y D × N , we have

f d lu ( y D , d ) = 𝖷 j D A j lim n 1 n i = 1 n j D 1 A j ( y # [ d = j ] i j D ) ;

Let g ( y D , d ) f d lu ( y D , d ) for some d D N , where each j D occurs infinitely often.

We aim to show that g ( Y D , d ) = g ( Y D , d ) almost surely for all d , d D N such that each j D occurs infinitely often.

Consider, for arbitrary j D A j Y D ,

P α ( g ( Y D , d ) ( A ) = g ( Y D , d ) ( A ) ) = H P α id Ω H ( g ( Y D , d ) ( A ) = g ( Y D , d ) ( A ) ν ) P α H ( d ν ) .

Note that

P α id Ω H ( g ( Y D , d ) ( A ) = ν ( A ) ν ) = P α Y D H lim n 1 n i = 1 n j D 1 A j ( y # [ d = j ] i , j D ) = ν ( A ) ν P α H ( d ν )

by independent permutability of the rows of Y D (Lemma 3.15), for each row, we can send the # [ d = j ] i th element to i and obtain

P α Y D H lim n 1 n i = 1 n j D 1 A j ( y # [ d = j ] i , j D ) = ν ( A ) ν P α H ( d ν ) = P α Y D H lim n 1 n i = 1 n j D 1 A j ( y i , j D ) = ν ( A ) ν = P α Y i D D H lim n 1 n i = 1 n 1 A ( y i D ) = ν ( A ) ν

but by Lemma B.13, the sequence ( Y i D ) i N are mutually independent conditional on H and for all α , P α Y i H ( A ν ) H ν ( A ) . Thus, by the law of large numbers,

P α Y D H lim n 1 n i = 1 n 1 A ( y i D ) = ν ( A ) ν = 1 ,

which implies

H P α id Ω H ( g ( Y D , d ) ( A ) = g ( Y D , d ) ( A ) ν ) P α H ( d ν ) = H P α id Ω H ( g ( Y D , d ) ( A ) = ν ( A ) g ( Y D , d ) ( A ) = ν ( A ) ν ) P α H ( d ν ) = 1 ;

Because this holds for all A ,

g ( Y D , d ) g ( Y D , d ) ;

And, as a consequence, defining

i : ( y d , d , d ) ( lu ( Y D , d ) , g ( Y D , d ) ) ;

we have

i ( y d , d , d ) i ( y d , d , d ) ,

which in turn implies the almost sure equality of the associated Markov kernels:

but we also have, by the definitions of f and g

we can therefore write P α YH WD as follows:

because H is a deterministic function of Y D we can recognise that H is independent of D given Y D . Similarly, Y is a deterministic function of Y D and D so Y is independent of H and W given D and Y D . Thus, del W F lu del H = P α Y W Y D D H and

(noting that this is a subdiagram of equation (A16)).

Putting this together:

(A16) [string diagram]

By higher order conditionals,

(A17) [string diagram]

substituting equation (A18) into (A19)

From Lemma 3.15 we also have Y D ( W , C ) H , so

and so by higher order conditionals Y W ( H , D , C ) and


B.4 Representation theorem

This is the proof of the main result from Section 3, Theorem 3.17.

Theorem B.16

Suppose a sequential input–output model ( P , D , Y ) with sample space ( Ω , ) is given with D countable and D almost surely infinite. Then the following are equivalent:

  1. There is some W such that P α Y WD is IO contractible for all α .

  2. For all i , Y i ( Y i , D i , C ) ( H , D i ) and for all i , j ,

    P Y i H D i H , D i P Y j H D j ;

  3. There is some L : H × X Y such that

[string diagrams]

Proof

As a preliminary, we will show

(A19) [string diagram]

where lus : D × Y D Y is the single-shot lookup function

( ( y i ) i D , d ) y d .

Recall that lu is the function

( ( d i ) N , ( y i j ) i , j N × D ) ( y i d i ) i N ;

By definition, for any { A i Y i N } ,

F lu ( × i N A i ( d i ) N , ( y i j ) i N × D ) = δ ( y i d i ) i N ( × i N A i ) = i N δ y i d i ( A i ) = i N F lus ( A i d i , ( y i j ) j D ) = ( i N F lus ) ( × i N A i ( d i ) N , ( y i j ) i , j N × D ) ,

which is what we wanted to show.

(1) (3): From Lemma 3.15, we have some Y D such that

and by Lemma B.13,

(A19) [string diagram]

By Lemma 3.15, for each w W ,

and so by Lemma B.15,

(A20) [string diagram]

We can substitute equations (A20) and (A19) into (A21) for

where

(3) (2): If

then by the definition of higher order conditionals, for any i N and any α C

(A21) P Y i HD i Y i D i L del Y N × X N ,

hence Y i ( Y i , D i , C ) ( H , D i )

(2) (1): Take W H . Because we assume Y i ( Y [ 1 , i ) , D [ 1 , i ) , C ) ( H , D i ) we can take L H X Y = P α Y i H X i for all i , α and

P Y i HD i Y [ 1 , i ) D [ 1 , i ) L del Y i 1 × X i 1

by taking the semidirect product of the conditionals

(where the second line follows from the fact that permuting wire labels does not actually change the kernel illustrated by the diagram).

( P , D , Y ) is therefore exchange commutative over H . Furthermore, take A N . Then

so ( P , D , Y ) is also local over H .□


B.5 Consequences of Theorem 3.17

Theorem 3.17 says that a data-independent sequential input–output model ( P , D , Y ) features conditionally independent and identical response functions P α Y i ∣ H D i for all α if and only if there is some W such that P α Y ∣ WD is IO contractible over W for all α .

A simple special case to consider is when W is single valued – that is, when P α Y D is IO contractible. As Theorem B.17 shows, this corresponds to the CIIR sequence models where the inputs D are unconditionally data-independent and independent of the hypothesis H . We can also consider the case where ( P , D , Y ) is only exchange commutative over * . This corresponds to models where the inputs D are data-independent and the hypothesis H depends on a symmetric function of the inputs D (under some side conditions).

Theorem B.17

(Data-independent IO contractibility) Suppose a sequential input–output model ( P , D , Y ) with sample space ( Ω , ) is given with D countable and, letting E D N be the set of all sequences for which each j D occurs infinitely often, P α D ( E ) = 1 for all α . Then the following are equivalent:

  1. P α Y D is IO contractible for all α .

  2. For all i , Y i ( Y i , D i , C ) ( H , D i ) , for all i , j , α

    P α Y i H D i = P α Y j H D j ,

    H D C and for all i D i D ( i , ] ( D [ 1 , i ) , C ) .

  3. There is some L : H × X Y such that for all α ,

[string diagram]

Proof

See Appendix B.5.□

In the following lemma, we use annotated conditional independence symbols P to denote conditional independence with respect to P as we want to track conditional independence with respect to different models.

Lemma B.18

(Exchangeably dominated conditionals) Given ( P , Ω , ℱ ) and variables D , Y , if for any α there is some Q α such that Q α DY is exchangeable with directing random measure H , D is almost surely infinite with respect to Q α and for any i , $Q_\alpha^{Y_i \mid D, Y_{-\{i\}}} \gg P_\alpha^{Y_i \mid D, Y_{-\{i\}}}$, then P α Y ∣ HD is IO contractible (where H is the directing random conditional for P α Y ∣ D ).

Proof

By Prop. 1.4 of [29], there is a H such that ( D i , Y i ) Q C ( D { i } Y { i } ) ( H , C ) and for all i , j

(A22) Q α Y i D i H = Q α Y j D j H ;

There is some function f : D N × Y N such that H = f ( D , Y ) , i.e.

(A23) [string diagram]

It follows from weak union that

(A24) Y i Q ( D { i } Y { i } ) ( D i , H , C ) P α Y i D i H Y { i } D { i } ( A d i , g , d , y ) P α Y i D i H ( A d i , g ) A , d i , g , d , y , α Y i P ( D { i } Y { i } ) ( D i , H , C ) ,

where equation (A24) follows from equation (A23).

Finally, from equation (A22) and equation (A24),

P α Y i D i H P α Y j D j H ;

Thus, ( P , D , Y ) features independent and identical responses conditioned on H , and by Theorem 3.17 it also has independent and identical responses conditioned on H . Finally, D almost surely infinite with respect to Q α implies D is also almost surely infinite with respect to P α , so by Theorem 3.17 P α Y HD is IO contractible.□


Theorem B.19

A data-independent sequential input–output model $(P, D, Y)$ features conditionally independent and identical response functions $P_\alpha^{Y_i \mid D_i, H}$ with $D$ almost surely infinite only if, for any sets $A, B \subseteq \mathbb{N}$ such that $D_A$ and $D_B$ are also almost surely infinite, and any $i, j \in \mathbb{N}$ such that $i \notin A$ and $j \notin B$,

$$P_\alpha^{Y_i \mid D_i, Y_A, D_A} = P_\alpha^{Y_j \mid D_j, Y_B, D_B}.$$

If in addition each $P_\alpha^{YD}$ is dominated by some $Q_\alpha$ such that $Q_\alpha^{YD}$ is exchangeable, then the reverse implication also holds.

Proof

Only if: By assumption, P_α^{Y_A∣D_A,Y_{∖A},D_{∖A}} is IO contractible. Thus, by Theorem 3.17, P_α^{Y_A∣H,D_A} is also IO contractible. We can observe that, taking any finite subset C of ℕ∖A, this can be extended to IO contractibility for P_α^{Y_{A∪C}∣H,D_{A∪C}}, and therefore we have IO contractibility for P_α^{Y∣H,D_A}. By Theorem 3.16, H is almost surely a function of both (D_A, Y_A) and (D_B, Y_B) and, furthermore, Y_i ⫫ (D_A, Y_A) ∣ (D_i, H, C), Y_j ⫫ (D_B, Y_B) ∣ (D_j, H, C). Hence there is some f : 𝒟^ℕ × 𝒴^ℕ → ℋ such that, for all E ⊆ 𝒴, d_i ∈ 𝒟, d ∈ 𝒟^ℕ, y ∈ 𝒴^ℕ,

(A25) P_α^{Y_i∣D_i,Y_A,D_A}(E ∣ d_i, y, d) = P_α^{Y_i∣D_i,H}(E ∣ d_i, f(y, d)) = P_α^{Y_j∣D_j,H}(E ∣ d_i, f(y, d)) = P_α^{Y_j∣D_j,Y_B,D_B}(E ∣ d_i, y, d),

where equation (A25) follows from Theorem 3.8.

If: By construction,

Q_α^{Y_i,D_i∣Y_{∖i},D_{∖i}} ≔ Q_α^{D_i∣Y_{∖i},D_{∖i}} P_α^{Y_i∣D_i,Y_{∖i},D_{∖i}}

is exchangeable, and by domination Q_α^{Y_i∣D_i,Y_{∖i},D_{∖i}} =_{P_α} P_α^{Y_i∣D_i,Y_{∖i},D_{∖i}}. The result follows from Lemma B.18.□

Theorem B.20

Given (P, Y, D), if P_α^{Y∣D} is exchange commutative for each α, and for each α P_α^D is absolutely continuous with respect to some exchangeable distribution Q_α^D ∈ Δ(𝒟^ℕ) with directing random measure H, and if D is almost surely infinite with respect to Q_α, then (P, Y, D) is IO contractible.

Proof

For each α, extend Q_α^D to a distribution on (D, Y) by asserting that P_α^{Y∣D} =_{Q_α} Q_α^{Y∣D}. Because Q_α^D dominates P_α^D, we have Q_α^{Y∣D} =_{P_α} P_α^{Y∣D}.

We will show that Q_α^{DY} is unchanged by finite permutations of (D_i, Y_i) pairs. For some finite permutation ρ : ℕ → ℕ:

where line (29) follows from exchange commutativity, line (30) follows from commutativity of deterministic kernels with the copy map and the fact that the swap map is deterministic, and line (31) comes from the exchangeability of Q_α^D.

Because P_α^D is dominated by Q_α^D by assumption, we have P_α^{Y∣D} =_{P_α} Q_α^{Y∣D}, which implies Q_α^{Y_i∣D,Y_{∖i}} =_{P_α} P_α^{Y_i∣D,Y_{∖i}}; from Lemma B.18, we therefore have P_α^{Y∣H,D} IO contractible over H, and from Theorem 3.17, we have Y ⫫_P C ∣ (D, H), and so P_α^{Y∣H,D} is IO contractible over H also.□
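The role of the exchangeable input law Q_α^D in Theorem B.20 can be seen in a small numerical contrast (ours and purely illustrative, not taken from the paper): the same per-coordinate response kernel, which is exchange commutative, yields a pair-exchangeable joint when the inputs are exchangeable, but not when the input marginals differ across coordinates.

```python
# Illustrative contrast (not from the paper) for the role of input exchangeability in
# Theorem B.20.  The response kernel is the same in both scenarios:
# Y_i | D_i ~ Bernoulli(q[D_i]) independently across i, so permuting inputs permutes
# outputs (exchange commutativity).  With an exchangeable (mixture) input law the joint
# over (D_i, Y_i) pairs is pair-exchangeable; with unequal, independent input marginals
# it is not.
from itertools import product

q = {0: 0.3, 1: 0.8}   # response kernel: P(Y_i = 1 | D_i = d)

def resp(d, y):
    return q[d] if y == 1 else 1 - q[d]

def joint_exchangeable_inputs(d1, y1, d2, y2):
    # D_i iid Bernoulli(theta) given theta, with theta uniform on {0.2, 0.9}: exchangeable inputs.
    total = 0.0
    for theta in (0.2, 0.9):
        pd = (theta if d1 else 1 - theta) * (theta if d2 else 1 - theta)
        total += 0.5 * pd * resp(d1, y1) * resp(d2, y2)
    return total

def joint_nonexchangeable_inputs(d1, y1, d2, y2):
    # D_1 ~ Bernoulli(0.9), D_2 ~ Bernoulli(0.1), independent: not exchangeable.
    pd = (0.9 if d1 else 0.1) * (0.1 if d2 else 0.9)
    return pd * resp(d1, y1) * resp(d2, y2)

swap_ok, swap_fails = True, False
for d1, y1, d2, y2 in product([0, 1], repeat=4):
    swap_ok &= abs(joint_exchangeable_inputs(d1, y1, d2, y2)
                   - joint_exchangeable_inputs(d2, y2, d1, y1)) < 1e-12
    swap_fails |= abs(joint_nonexchangeable_inputs(d1, y1, d2, y2)
                      - joint_nonexchangeable_inputs(d2, y2, d1, y1)) > 1e-12
print(swap_ok, swap_fails)  # True True
```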


C Precedented options

C.1 IO contractibility from diverse precedent

This is the proof of Theorem 4.7 in Section 4.

Theorem C.1

Given a latent CIIR model (P, (E_i, X_i, Y_i, Z_i)_{i∈ℕ∪{c}}) with ℰ, 𝒳, 𝒴, and 𝒵 all discrete, let H be the directing random measure of (P, (Z_i, E_i, X_i, Y_i)_{i∈ℕ}).

Let I ⊆ Δ(𝒴)^{𝒳×𝒵} be the event that H_{Xz}^{Y} = H_{Xz′}^{Y} for all z, z′ ∈ 𝒵, i.e. the event that Y_i is independent of Z_i conditional on X_i and H_{XZ}^{Y}. Define Q_α ∈ Δ(Ω) to be the probability measure such that, for all A ∈ ℱ,

Q_α(A) ≔ P_α^{⋅∣1_I(H)}(A ∣ 1),

i.e. Q_α is P_α conditioned on H_{XZ}^{Y} ∈ I, so Y_i ⫫_Q Z_i ∣ (X_i, H, C).

If Q satisfies (X, E)-precedent and absolute continuity of conditionals, then, with respect to Q, the pairs (X_i, Y_i)_{i∈ℕ∪{c}} share conditionally independent and identical responses.

Proof

We apply the absolute continuity of conditionals condition to show that Y_i ⫫^e_Q E_i ∣ (Z_i, X_i, H, C) for i ∈ ℕ. We then apply the precedent condition to extend this independence to Y_c ⫫^e_Q E_c ∣ (Z_c, X_c, H, C), completing the proof.

Note that by construction of Q_α we have Y_i ⫫^e_Q Z_i ∣ (X_i, H, C). This in turn implies that, for all α, x, y, z, z′, the following holds Q_α-almost surely:

(A27) Σ_{e∈ℰ} H_{exz}^{y} H_{ez}^{x} H_{z}^{e} / Σ_{e∈ℰ} H_{ez}^{x} H_{z}^{e} =_{Q_α} Σ_{e∈ℰ} H_{exz′}^{y} H_{ez′}^{x} H_{z′}^{e} / Σ_{e∈ℰ} H_{ez′}^{x} H_{z′}^{e}.

Equation (A26) defines a polynomial constraint on H_{E,{z,z′}}^{x} ≔ (H_{eu}^{x})_{e∈ℰ, u∈{z,z′}} for each x, z, z′. If H_{exz}^{y} = H_{e′xz}^{y} for all e, e′, then this constraint holds for every possible assignment of values to H_{E,{z,z′}}^{x}. We will show that, unless H_{exz}^{y} = H_{e′xz}^{y} for all e, e′ and z, this constraint is nontrivial for some z. Consequently, the set of solutions for H_{EZ}^{x} subject to the restriction H_{exz}^{y} ≠ H_{e′xz}^{y} has Lebesgue measure 0 for each x. We will do this by showing that, assuming H_{exz}^{y} > H_{e′xz}^{y} for some e, e′, we can find alternative realisations of H_{Ez}^{x} that lead to unequal values of the left-hand side of equation (A26) without affecting the right-hand side, as illustrated numerically below.
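The perturbation step can be mimicked numerically. In the sketch below (illustrative only; the function and variable names are ours), the mixed response is the g-weighted average appearing on either side of the constraint: if the per-environment responses differ, changing the weights while holding the normaliser fixed moves the average, whereas constant responses are unaffected.

```python
# Illustrative numerics (not from the paper) for the measure-zero argument in the proof of
# Theorem C.1: the conditional response is a g-weighted average of the per-environment
# responses.  If those responses differ across environments e, perturbing the weights
# g(x|e,z) while keeping the normaliser fixed changes the average, so the constraint
# "average at z equals average at z'" only holds on a thin set of weight assignments.
def mixed_response(resp_e, g_e, h_e):
    """Sum_e resp_e * g_e * h_e / Sum_e g_e * h_e."""
    num = sum(r * g * h for r, g, h in zip(resp_e, g_e, h_e))
    den = sum(g * h for g, h in zip(g_e, h_e))
    return num / den

resp = [0.9, 0.2]        # per-environment responses: differ across the two environments
h    = [0.5, 0.5]        # environment weights H(e|z): uniform, so the denominator is easy to hold fixed
g    = [0.6, 0.4]        # original weights g(x|e,z)
g2   = [0.5, 0.5]        # perturbed weights with the same denominator (since h is uniform)

print(mixed_response(resp, g, h))   # 0.62
print(mixed_response(resp, g2, h))  # 0.55 -- the mixed response moved

resp_const = [0.7, 0.7]  # if the per-environment responses agree, perturbing g changes nothing
print(mixed_response(resp_const, g, h), mixed_response(resp_const, g2, h))  # both 0.7
```

This is the sense in which the constraint is nontrivial: it pins the weights down to a lower-dimensional, Lebesgue-null set.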

Assuming (without loss of generality) we have e, e′ such that H_{exz}^{y} > H_{e′xz}^{y}, either H_{z}^{e} = H_{z}^{e′}, H_{z}^{e} < H_{z}^{e′} or H_{z}^{e} > H_{z}^{e′}. Consider the first case, and take g_{Ez}^{X}, g′_{Ez}^{X} such that, for some x, ε_1 < g_{ez}^{x}, g_{e′z}^{x} < 1 − ε_2, g′_{ez}^{x} = g_{ez}^{x} − ε_1, g′_{e′z}^{x} = g_{e′z}^{x} + ε_2 and g′_{uz}^{x} = g_{uz}^{x} for all other u ∈ ℰ. Choose ε_1 = ε_2. Then

g_{ez}^{x} H_{z}^{e} / Σ_{u∈ℰ} g_{uz}^{x} H_{z}^{u} > g′_{ez}^{x} H_{z}^{e} / Σ_{u∈ℰ} g′_{uz}^{x} H_{z}^{u}  and  g_{e′z}^{x} H_{z}^{e′} / Σ_{u∈ℰ} g_{uz}^{x} H_{z}^{u} < g′_{e′z}^{x} H_{z}^{e′} / Σ_{u∈ℰ} g′_{uz}^{x} H_{z}^{u},

because the inequalities hold for the numerators and the denominators are equal. But then

Σ_{u∈ℰ} H_{uxz}^{y} g_{uz}^{x} H_{z}^{u} / Σ_{u∈ℰ} g_{uz}^{x} H_{z}^{u} > Σ_{u∈ℰ} H_{uxz}^{y} g′_{uz}^{x} H_{z}^{u} / Σ_{u∈ℰ} g′_{uz}^{x} H_{z}^{u},

because on the right side a smaller term in the sum receives more weight, a larger term receives less weight and all other terms are weighted equally.

Consider H_{z}^{e} > H_{z}^{e′}. Choose ε_1 = ε_2 H_{z}^{e′} / H_{z}^{e}. Then, using the same construction as before, we have

H_{z}^{e′} g′_{e′z}^{x} / Σ_{u∈ℰ} H_{z}^{u} g′_{uz}^{x} = H_{z}^{e′} (g_{e′z}^{x} − ε_2) / (Σ_{u∈ℰ} H_{z}^{u} g_{uz}^{x} + ε_1 H_{z}^{e} − ε_2 H_{z}^{e′}) = H_{z}^{e′} (g_{e′z}^{x} − ε_2) / Σ_{u∈ℰ} H_{z}^{u} g_{uz}^{x} < H_{z}^{e′} g_{e′z}^{x} / Σ_{u∈ℰ} H_{z}^{u} g_{uz}^{x}

and

H_{z}^{e} g′_{ez}^{x} / Σ_{u∈ℰ} H_{z}^{u} g′_{uz}^{x} = H_{z}^{e} (g_{ez}^{x} + ε_1) / (Σ_{u∈ℰ} H_{z}^{u} g_{uz}^{x} + ε_1 H_{z}^{e} − ε_2 H_{z}^{e′}) = H_{z}^{e} (g_{ez}^{x} + ε_1) / Σ_{u∈ℰ} H_{z}^{u} g_{uz}^{x} > H_{z}^{e} g_{ez}^{x} / Σ_{u∈ℰ} H_{z}^{u} g_{uz}^{x},

then, by the same reasoning as before, we have

Σ_{u∈ℰ} H_{uxz}^{y} g′_{uz}^{x} H_{z}^{u} / Σ_{u∈ℰ} g′_{uz}^{x} H_{z}^{u} > Σ_{u∈ℰ} H_{uxz}^{y} g_{uz}^{x} H_{z}^{u} / Σ_{u∈ℰ} g_{uz}^{x} H_{z}^{u}.

Analogous reasoning holds for H_{z}^{e} < H_{z}^{e′}.

Suppose g_{exz}^{y} ≠ g_{e′xz}^{y} for some e, e′ and z. Then equation (A26) implies a nontrivial constraint on H_{Ez}^{x} for some z. Thus, for some e, e′, z, x and y, the set of solutions S ≔ { g_{EZ}^{X} : equation (A26) is satisfied for all x, z, given g_{exz}^{y} ≠ g_{e′xz}^{y} } has Lebesgue measure 0 [39], and so by the assumption of absolute continuity of conditionals,

Q_α^{H_{EZ}^{X} ∣ H_{EXZ}^{Y}, H_{Z}^{E}}(S ∣ g_{EXZ}^{Y}, g_{Z}^{E}) = 0.

On the other hand, by assumption, the set T ≔ { g_{Z}^{E} : equation (A26) is satisfied } has measure 1. Thus, we conclude that, with the exception of a Q_α-measure-0 set, g_{exz}^{y} = g_{e′xz}^{y}. That is, Y_i ⫫^e_Q E_i ∣ (Z_i, X_i, H, C). By contraction with Y_i ⫫^e_Q Z_i ∣ (X_i, H, C), we have Y_i ⫫^e_Q (Z_i, E_i) ∣ (X_i, H, C).

By CIIR of the ( E i , X i , Y i ) pairs we have for all i ,

Q_α^{Y_i∣X_i,E_i,H} = Q_α^{Y_c∣X_c,E_c,H}.

We invoke precedent to establish that this also holds almost surely with respect to Q_α^{E_c,X_c,H}:

Q_α^{Y_i∣X_i,E_i,H} = Q_α^{Y_c∣X_c,E_c,H},

and therefore, by Y_i ⫫^e_Q E_i ∣ (Z_i, X_i, H, C),

Q_α^{Y_i∣X_i,H} = Q_α^{Y_c∣X_c,H},

completing the proof.□

References

[1] Hernán MA, Taubman SL. Does obesity shorten life? The importance of well-defined interventions to answer causal questions. Int J Obesity. 2008 Aug;32(S3):S8–S14. https://www.nature.com/articles/ijo200882. doi:10.1038/ijo.2008.82.

[2] Hernán MA. Does water kill? A call for less casual causal inferences. Ann Epidemiol. 2016 Oct;26(10):674–80. http://www.sciencedirect.com/science/article/pii/S1047279716302800. doi:10.1016/j.annepidem.2016.08.016.

[3] Pearl J. Does obesity shorten life? Or is it the Soda? On non-manipulable causes. J Causal Inference. 2018;6(2):20182001. https://www.degruyter.com/view/j/jci.2018.6.issue-2/jci-2018-2001/jci-2018-2001.xml. doi:10.1515/jci-2018-2001.

[4] Hernán MA, Cole SR. Invited commentary: causal diagrams and measurement bias. Amer J Epidemiol. 2009 Oct;170(8):959–62. https://academic.oup.com/aje/article/170/8/959/145135. doi:10.1093/aje/kwp293.

[5] Shahar E. The association of body mass index with health outcomes: causal, inconsistent, or confounded? Amer J Epidemiol. 2009 Oct;170(8):957–58. doi:10.1093/aje/kwp292.

[6] Spirtes P, Scheines R. Causal inference of ambiguous manipulations. Philos Sci. 2004 Dec;71(5):833–45. https://www.cambridge.org/core/journals/philosophy-of-science/article/abs/causal-inference-of-ambiguous-manipulations/2A605BCFFC1A879A157966473AC2A6D2. doi:10.1086/425058.

[7] Pearl J. Causality: Models, reasoning and inference. 2nd ed. New York, NY: Cambridge University Press; 2009. doi:10.1017/CBO9780511803161.

[8] Heckerman D, Shachter R. Decision-theoretic foundations for causal reasoning. J Artif Intell Res. 1995 Dec;3:405–30. https://www.jair.org/index.php/jair/article/view/10151. doi:10.1613/jair.202.

[9] Dawid P. The decision-theoretic approach to causal inference. In: Causality. John Wiley & Sons, Ltd; 2012. p. 25–42. https://onlinelibrary.wiley.com/doi/abs/10.1002/9781119945710.ch4. doi:10.1002/9781119945710.ch4.

[10] Dawid P. Decision-theoretic foundations for statistical causality. J Causal Inference. 2021 Jan;9(1):39–77. doi:10.1515/jci-2020-0008.

[11] Lattimore F, Rohde D. Causal inference with Bayes rule. arXiv:1910.01510 [cs, stat]. 2019 Oct. http://arxiv.org/abs/1910.01510.

[12] Lattimore F, Rohde D. Replacing the do-calculus with Bayes rule. arXiv:1906.07125 [cs, stat]. 2019 Dec. http://arxiv.org/abs/1906.07125.

[13] de Finetti B. Foresight: its logical laws, its subjective sources. In: Kotz S, Johnson NL, editors. Breakthroughs in statistics: foundations and basic theory. Springer Series in Statistics. New York, NY: Springer; [1937] 1992. p. 134–74. doi:10.1007/978-1-4612-0919-5_10.

[14] Lemeire J, Janzing D. Replacing causal faithfulness with algorithmic independence of conditionals. Minds and Machines. 2013 May;23(2):227–49. https://link.springer.com/article/10.1007/s11023-012-9283-1. doi:10.1007/s11023-012-9283-1.

[15] Meek C. Strong completeness and faithfulness in Bayesian networks. In: Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence. UAI'95. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.; 1995. p. 411–8. http://dl.acm.org/citation.cfm?id=2074158.2074205.

[16] Lindley DV, Novick MR. The role of exchangeability in inference. Ann Stat. 1981;9(1):45–58. https://www.jstor.org/stable/2240868. doi:10.1214/aos/1176345331.

[17] Rubin DB. Causal inference using potential outcomes. J Amer Stat Assoc. 2005 Mar;100(469):322–31. doi:10.1198/016214504000001880.

[18] Imbens GW, Rubin DB. Causal inference for statistics, social, and biomedical sciences: an introduction. Cambridge: Cambridge University Press; 2015. https://www.cambridge.org/core/books/causal-inference-for-statistics-social-and-biomedical-sciences/71126BE90C58F1A431FE9B2DD07938AB.

[19] Saarela O, Stephens DA, Moodie EEM. The role of exchangeability in causal inference. 2020 Jun. https://arxiv.org/abs/2006.01799v3.

[20] Hernán MA, Robins JM. Estimating causal effects from epidemiological data. J Epidemiol Community Health. 2006 Jul;60(7):578–86. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2652882/. doi:10.1136/jech.2004.029496.

[21] Hernán MA. Beyond exchangeability: The other conditions for causal inference in medical research. Stat Methods Med Res. 2012 Feb;21(1):3–5. doi:10.1177/0962280211398037.

[22] Greenland S, Robins JM. Identifiability, exchangeability, and epidemiological confounding. Int J Epidemiol. 1986 Sep;15(3):413–419. doi:10.1093/ije/15.3.413.

[23] Banerjee AV, Chassang S, Snowberg E. Chapter 4 - Decision theoretic approaches to experiment design and external validity. In: Banerjee AV, Duflo E, editors. Handbook of economic field experiments. vol. 1 of Handbook of Field Experiments. North-Holland; 2017. p. 141–74. https://www.sciencedirect.com/science/article/pii/S2214658X16300071. doi:10.1016/bs.hefe.2016.08.005.

[24] Peters J, Bühlmann P, Meinshausen N. Causal inference by using invariant prediction: identification and confidence intervals. J R Stat Soc Ser B (Stat Methodol). 2016;78(5):947–1012. https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/rssb.12167. doi:10.1111/rssb.12167.

[25] Guo S, Toth V, Schölkopf B, Huszar F. Causal de Finetti: On the identification of invariant causal structure in exchangeable data. Adv Neural Inform Proces Syst. 2023 Dec;36:36463–75.

[26] Çinlar E. Probability and stochastics. New York, NY: Springer; 2011. doi:10.1007/978-0-387-87859-1.

[27] Liu Y, Price H. Ramsey and Joyce on deliberation and prediction. Synthese. 2020 Oct;197(10):4365–86. doi:10.1007/s11229-018-01926-8.

[28] Constantinou P, Dawid AP. Extended conditional independence and applications in causal inference. Ann Stat. 2017;45(6):2618–53. http://www.jstor.org/stable/26362953. doi:10.1214/16-AOS1537.

[29] Kallenberg O. The basic symmetries. In: Probabilistic symmetries and invariance principles. Probability and its Applications. New York, NY: Springer; 2005. p. 24–68. doi:10.1007/0-387-28861-9_2.

[30] Eckles D, Bakshy E. Bias and high-dimensional adjustment in observational studies of peer effects. J Amer Stat Assoc. 2021 Apr;116(534):507–17. doi:10.1080/01621459.2020.1796393.

[31] Gordon BR, Zettelmeyer F, Bhargava N, Chapsky D. A comparison of approaches to advertising measurement: evidence from big field experiments at Facebook. Rochester, NY: Social Science Research Network; 2018. ID 3033144. https://papers.ssrn.com/abstract=3033144. doi:10.2139/ssrn.3033144.

[32] Gordon BR, Moakler R, Zettelmeyer F. Close enough? A large-scale exploration of non-experimental approaches to advertising measurement. arXiv:2201.07055 [econ]. 2022 Jan. http://arxiv.org/abs/2201.07055.

[33] Chickering DM. Learning equivalence classes of Bayesian-network structures. J Machine Learn Res. 2002;2(Feb):445–98. http://www.jmlr.org/papers/v2/chickering02a.html.

[34] Uhler C, Raskutti G, Bühlmann P, Yu B. Geometry of the faithfulness assumption in causal inference. Ann Stat. 2013 Apr;41(2):436–63. http://arxiv.org/abs/1207.0547. doi:10.1214/12-AOS1080.

[35] Selinger P. A survey of graphical languages for monoidal categories. In: Coecke B, editor. New structures for physics. Lecture Notes in Physics. Berlin, Heidelberg: Springer; 2011. p. 289–355. doi:10.1007/978-3-642-12821-9_4.

[36] Fritz T. A synthetic approach to Markov kernels, conditional independence and theorems on sufficient statistics. Adv Math. 2020 Aug;370:107239. https://www.sciencedirect.com/science/article/pii/S0001870820302656. doi:10.1016/j.aim.2020.107239.

[37] Cho K, Jacobs B. Disintegration and Bayesian inversion via string diagrams. Math Struct Comput Sci. 2019 Aug;29(7):938–71. doi:10.1017/S0960129518000488.

[38] Fong B. Causal theories: a categorical perspective on Bayesian networks. arXiv:1301.6201 [math]. 2013 Jan. http://arxiv.org/abs/1301.6201.

[39] Okamoto M. Distinctness of the eigenvalues of a quadratic form in a multivariate sample. Ann Stat. 1973;1(4):763–65. https://www.jstor.org/stable/2958321. doi:10.1214/aos/1176342472.

Received: 2023-01-06
Revised: 2024-04-05
Accepted: 2024-05-01
Published Online: 2025-01-16

© 2025 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
