Foundations of causal discovery on groups of variables

Jonas Wahl; Urmi Ninad; Jakob Runge

doi:10.1515/jci-2023-0041

Article Open Access

Published/Copyright: July 12, 2024

From the journal Journal of Causal Inference, Volume 12, Issue 1

Abstract

Discovering causal relationships from observational data is a challenging task that relies on assumptions connecting statistical quantities to graphical or algebraic causal models. In this work, we focus on widely employed assumptions for causal discovery when objects of interest are (multivariate) groups of random variables rather than individual (univariate) random variables, as is the case in a variety of problems in scientific domains such as climate science or neuroscience. If the group level causal models are derived from partitioning a micro-level model into groups, we explore the relationship between micro- and group level causal discovery assumptions. We investigate the conditions under which assumptions like causal faithfulness hold or fail to hold. Our analysis encompasses graphical causal models that contain cycles and bidirected edges. We also discuss grouped time series causal graphs and variants thereof as special cases of our general theoretical framework. Thereby, we aim to provide researchers with a solid theoretical foundation for the development and application of causal discovery methods for variable groups.

Keywords: causality; causal discovery; graphical models; Markov property; faithfulness; time series

MSC 2010: 62D20

1 Introduction

Inferring causal relationships from observational data and estimating their strength is a ubiquitous task in many research domains, for which a multitude of tools [1–7] have been developed over the last decades. While the underlying assumptions on the data-generating process differ from method to method, the majority of approaches have in common that the causal objects of interest are one-dimensional random variables. However, in some applications, the relevant causal entities can be multivariate groups of variables, such as spatial regions of measurements, or collections of random variables that together describe or approximate a phenomenon of interest, such as the phase and amplitude of an oscillation. For instance, neuroscientists may be interested in causal interactions between brain regions rather than in interactions between individual neurons [8,9], while climate scientists would like to improve their understanding of spatio-temporal climate modes that extend across large regions on the globe [10–12] and interact across long distances. Similarly, economists may want to approximate the economic activity of a given country by a range of different indicators rather than a single univariate index [13].

At present, domain experts typically address such problems by employing the group mean of a variable group as a stand-in for the group as a whole or by means of more elaborate standard dimension reduction techniques such as principal component analysis (PCA). For instance, in climate science, the El Niño Southern Oscillation (ENSO) is often represented as either a regional average of sea surface temperatures or as a principal component in a PCA [14]. Unfortunately, if some of the causal processes at hand happen at a smaller scale than averages or principal components can capture, relevant causal information may be lost. As an example, the group mean of two variable groups W and Y may be conditionally dependent given the group mean of a third group Z , while the groups, considered as a whole, satisfy the conditional independence W ⊥⊥ Y ∣ Z [2,15]. Causal inference methods based on conditional independence testing such as the PC algorithm might therefore infer different causal structures depending on whether they use group means or the full variable groups as their basic causal objects. Moreover, the dominant mode of internal variability of a variable group Y as recovered by PCA may not be the causally relevant driver of its effect on another group Z , which could for instance be captured more accurately by a higher order principal component. If only the dominant component is consequently used in a causal analysis, then the causal effect of Y on Z may be diluted or disappear completely. A practical example of this, again from climate science, that deals with the effect of ENSO on the North Atlantic Oscillation (NAO) can be found in the study by Zhang et al. [16].
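The loss of information through group means can be made concrete with a small exact calculation. The following is a minimal sketch under assumptions that are not taken from the article: a hypothetical linear Gaussian toy model with groups W = {W1, W2}, Y = {Y1, Y2}, Z = {Z1, Z2}, in which only Z1 drives W1 and Y1, and all helper names (`linear`, `cov`, `group_mean`, `partial_cov`) are illustrative. For jointly Gaussian variables, a vanishing partial covariance is equivalent to conditional independence.

```python
from fractions import Fraction

def linear(**coeffs):
    """A variable written as a linear combination of independent unit-variance noise terms."""
    return {name: Fraction(c) for name, c in coeffs.items()}

def cov(a, b):
    """Covariance of two such variables: sum of products of shared coefficients."""
    return sum((a[k] * b[k] for k in a if k in b), Fraction(0))

def group_mean(*members):
    """Arithmetic mean of the members of a group, again as a linear combination."""
    keys = set().union(*members)
    return {k: sum((m.get(k, Fraction(0)) for m in members), Fraction(0)) / len(members)
            for k in keys}

def partial_cov(a, b, given):
    """Partial covariance of a and b given the variables in `given`.
    This simple formula is only valid because the conditioning variables
    used below are mutually uncorrelated (an assumption of this sketch)."""
    return cov(a, b) - sum((cov(a, c) * cov(b, c) / cov(c, c) for c in given), Fraction(0))

# Hypothetical toy model: Z1 is a common cause of W1 and Y1; everything else is noise.
Z1, Z2 = linear(z1=1), linear(z2=1)
W1, W2 = linear(z1=1, ew1=1), linear(ew2=1)
Y1, Y2 = linear(z1=1, ey1=1), linear(ey2=1)

# Group level: every micro pair (Wi, Yj) is uncorrelated given the full group Z.
group_level = [partial_cov(w, y, [Z1, Z2]) for w in (W1, W2) for y in (Y1, Y2)]

# Mean level: the group means stay correlated given the group mean of Z.
mean_level = partial_cov(group_mean(W1, W2), group_mean(Y1, Y2), [group_mean(Z1, Z2)])

print(group_level, mean_level)
```

In this toy model the mean of Z no longer determines Z1, so conditioning on it leaves the common cause partially unobserved, which is why the two levels disagree.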

A second approach to causal discovery for variable groups is to run causal discovery algorithms on the totality of all micro-variables and then deduce group level relationships from the inferred micro-graph. Such an approach will inevitably need to unravel micro-relations of little interest to the group level problem at hand. For example, one is typically not interested in causal relations between individual grid locations of satellite measurements of temperature data but between different spatial temperature fields as a whole [7]. In addition, to be sound, a micro-level causal discovery method may require strong technical assumptions on micro-relations that are again of no relevance to the between-group interactions, and it can quickly become computationally inefficient and statistically frail; see, e.g., [17] for empirical evidence of this for two variable groups. We will return to causal discovery with dimension reduction and full micro-level causal discovery in the final section of this article, Section 8, where we will discuss their strengths and weaknesses in more detail.

An alternative approach to the group level causal discovery problem is thus to consider variable groups as a whole as the basic causal entities on which to apply available causal discovery methods [18]. For instance, approaches based on conditional independence testing such as the PC-algorithm do not make any assumptions on the dimensionality of their node variables per se and can still be executed provided that their conditional independence tests are adapted to the multivariate setting [19–22]. However, such constraint-based methods rely on two fundamental assumptions, the causal Markov property and causal faithfulness, or variants thereof, which now have to be assumed directly on the group level for the methods to be sound. Thus, the following question arises: If causally interacting micro-variables are partitioned into variable groups, see Figure 1, do causal discovery assumptions on the micro-level transfer to the group level and if not, what else is required for these group level causal discovery assumptions to be valid?

Figure 1

A mixed graph over micro-variables is coarsened to a mixed graph of variable groups, see Definition 5. In graphical causal modelling, directed edges represent direct causal influences, bidirected edges represent confounding by a hidden variable, and undirected edges indicate the presence of a selection variable that has been conditioned on.

To answer this question, in this work, we provide a thorough theoretical analysis of the relationship between micro- and macro-level causal models with a view to causal discovery assumptions. We do so for causal models that exhibit cyclic as well as acyclic behaviour. Parallel questions on causal effect estimation on directed acyclic graphs over variable groups have been addressed recently by Anand et al. [23], who also present general rules of graphical calculus for acyclic graphs of groups, which we recall and adapt to our setting in Section 3. In order to discuss our main results, we now recall that the Markov property and causal faithfulness relate the graphical structure of the model, the causal graph, or more precisely its d - or σ -separations, to the observational distribution of the involved variables.

The Markov property states that two variables that can be separated graphically by a separating set S are conditionally independent given that set, or for short that d - (or σ -)separation implies conditional independence. The assumption of causal faithfulness on the other hand requires that also the converse implication is true, i.e. that conditional independence implies d - (or σ -)separation. Taken together, both properties thus state that graphical separations and conditional independencies are in exact correspondence to each other.

While the Markov property is a given in almost every causal inference method, causal faithfulness is more controversial and its validity has been discussed in various places [1,24]. As a consequence, weaker versions of causal faithfulness have been developed, most notably adjacency and orientation faithfulness [5,25]. We study under which conditions the causal Markov property, faithfulness and some of its relatives do and do not carry over from a fine grained micro-level causal graphical model to a more coarse grained macro-level graph in which the micro-level variables are partitioned into groups, see Figure 1. In order to do so, we additionally study the relationship between micro-level and coarse grained group level causal graphs on a purely graphical level, see Section 3.

As our main results, we show that the causal Markov property does transfer from the micro- to the group level (Theorems 2 and 3) relatively straightforwardly, but that this is no longer true for causal faithfulness, a fact that was already noted in empirical simulations in the study by Parviainen and Kaski [18]. We point out that, when dealing with variable groups, the faithfulness assumption is in some sense more complicated than was already known: not only does faithfulness fail to transfer to the macro-level, but it can even be violated there although its weaker relatives adjacency and orientation faithfulness [5] are both satisfied on the macro-level, see Section 5 and Figure 9. We are not aware of this type of faithfulness violation (i.e., faithfulness being violated but adjacency and orientation faithfulness holding) in other settings and call such violations non-local faithfulness violations.

Figure 2

A diagrammatic summary of the relationship between the σ -Markov property and different types of faithfulness on a micro-graph G and its graph of groups with respect to a partition P , see Definition 4 for details. A blue arrow indicates that all properties from which the arrow emerges imply the target property. An orange arrow indicates that the properties from which the arrow emerges are not sufficient to guarantee the target property.

On the other hand, we also provide two criteria that do guarantee macro-level causal faithfulness whenever the variables are sufficiently well-connected internally, either through cycles (Theorem 4) or through directed or bidirected paths (Theorem 5). This may justify the assumption of causal faithfulness in some settings, as variable groups are often chosen the way they are precisely because of their internal coherence or their strong internal interactions. Nevertheless, considered in its entirety, our discussion shows that faithfulness, already controversial in the univariate case, can be a strong assumption for causal graphs over variable groups, and practitioners are advised to proceed with care when assuming it.

We also demonstrate that graphs over variable groups need to be interpreted carefully with respect to their causal meaning as we will discuss in Section 7. In addition, we point out that the weaker notion of adjacency faithfulness does transfer from the micro-level to macro-level (Lemma 9). Therefore, when developing causal discovery tools for variable groups, proceeding in line with methods such as the conservative PC-algorithm of Ramsey et al. [5], which only rely on adjacency faithfulness, may be advisable if there are no domain-specific reasons to believe that faithfulness is a valid assumption.

We end with a discussion on causal discovery for time series, and generalize the widely employed notion of the time series summary graph [10] to the notion of time series summary graphs of groups. We show that, under a dynamical systems inspired condition that we dub causal mixing, stronger causal conclusions can be derived from grouped time series summary graphs. Thus, while causal conclusions on the time-resolved level need to be interpreted carefully, global interactions between groups of processes may be more robust with respect to the standard assumptions of causal inference. To summarize, our main contributions are as follows:

We extend the theoretical framework of Anand et al. [23] for graphical causal reasoning between variable groups to σ -separation and discuss the relationship between micro- and macro-level versions of fundamental graphical properties, such as acyclicity and acyclification (Theorem 1).
We discuss Markov properties for m - and σ -separation and show that they transfer from micro- to group level directed mixed graphs (DMGs), see Theorems 2 and 3.
We discuss different failure modes of causal faithfulness for graphs over variable groups including an example of a non-local faithfulness violation (Section 5).
We provide two criteria (Theorems 4 and 5) that ensure faithfulness on the group level after coarsening a micro-graph. We also discuss the role of adjacency faithfulness (Lemma 9) and an example addressing the applicability of Meek’s orientation rules [26] that was brought forward by Parviainen and Kaski [18] (Subsection 5.3).
We show how time series causal graphs fit into our framework (Section 6).
We elaborate on the difference between apparent and true causation in group DMGs and time series group DMGs (Section 7).
We discuss strengths and failure modes of causal discovery for variable groups through dimension reduction and full micro-level causal discovery and contrast this with an approach that proceeds directly on the group level (Section 8).

We summarize our main results on faithfulness and Markov properties in Figure 2. We hope that this work will provide a solid theoretical footing for the development and empirical validation of group level causal discovery algorithms in the future.

Figure 3

Acyclification of a cyclic mixed graph to an acyclic mixed graph.

1.1 Related work

The compatibility of averaging across variables with causal inference has been discussed by Rubenstein et al. [15], who also provide some toy examples. Arguably, the studies by Parviainen and Kaski [18] and Anand et al. [23] are closest to our work. Parviainen and Kaski [18] discussed several causal discovery methods for variable groups, introduced the notion of groupwise faithfulness and provided a first analysis of this property, including some empirical experiments with discrete micro-variables. We expand upon the theoretical analysis of Parviainen and Kaski [18] in several directions, e.g. by including cyclic structures, by addressing Markov properties, and by providing new sufficient criteria for groupwise faithfulness, new examples of faithfulness violations and results on time series. Parviainen and Kaski [18] provide an example in which groupwise faithfulness with respect to d -separation is deemed insufficient to ensure that the Meek orientation rules [26], a fundamental part of the PC-algorithm [2], still hold. However, we will point out in Section 5.3 that this is no longer true if group level cycles in the example of [18] are properly accounted for by replacing d -faithfulness with σ -faithfulness. Anand et al. [23] present a graphical calculus for d -separation over graphs of groups, which we will adapt to σ -separation below, and use this calculus to discuss causal effect estimation for directed acyclic graphs over variable groups, therein called cluster DAGs. Zscheischler et al. [27] and Wahl et al. [17] present ways of inferring cause–effect relationships when only two groups of variables are involved. Constraint-based causal discovery methods for variable groups require conditional independence testing for multivariate random vectors, which is discussed in various places [19–22]. Causal discovery for time series is treated in many works, see e.g. [7,28,29] for discussions on state-of-the-art methods.

2 Preliminaries on (directed) mixed graphs

To account for latent confounding and selection bias, many concepts of causal inference have been extended to mixed graphs [30]. Cyclic causal relationships have also been incorporated successfully into causal graphical modelling [31–33] although, for the most part, these works do not deal with undirected edges.

A mixed graph (MG) is a tuple G = ( V , ℰ , ℬ , U ) of a set of nodes V , a set of directed edges ℰ , a set of bidirected edges ℬ and a set of undirected edges U . All these sets are assumed to be countable. Directed edges will be depicted by one-sided arrows A → B or B ← A , bidirected edges by two-sided arrows A ↔ B and undirected edges by simple lines A − B . We will assume that graphs considered in this work do not admit self-edges of any type, i.e. the two endpoints of an edge are not allowed to coincide. A directed mixed graph (DMG) is a mixed graph without undirected edges, and in this case, we will always suppress the (empty) set U from the notation. Finally, a directed graph (DG) is a DMG without bidirected edges, and again we will suppress the (empty) set ℬ from the notation.

A walk π from A ∈ V to B ∈ V on a mixed graph G is a finite alternating tuple π = ( π ( 1 ) , e 1 , π ( 2 ) , e 2 , … , e m − 1 , π ( m ) ) , π ( 1 ) = A , π ( m ) = B of nodes π ( i ) ∈ V and edges e i ∈ ℰ ∪ ℬ ∪ U such that e i connects π ( i ) and π ( i + 1 ) , i.e. e i ∈ { π ( i ) → π ( i + 1 ) , π ( i ) ← π ( i + 1 ) , π ( i ) ↔ π ( i + 1 ) , π ( i ) − π ( i + 1 ) } . A path is a walk whose nodes π ( 1 ) , … , π ( m ) are pairwise different. A trivial walk (path) is a walk (path) that consists of only one node and no edges. A walk (path) is called right-directed if it is of the form π ( 1 ) → π ( 2 ) → ⋯ → π ( m ) , left-directed if it is of the form π ( 1 ) ← π ( 2 ) ← ⋯ ← π ( m ) and directed if it is left- or right-directed. A cycle on G is a directed walk π = ( π ( 1 ) , e 1 , π ( 2 ) , e 2 , … , e m − 1 , π ( m ) ) such that π ( 1 ) = π ( m ) , and a graph is said to be acyclic if it does not admit any cycles. As is common practice, directed acyclic graphs will be abbreviated as DAGs.

A subset of nodes W ⊂ V of a mixed graph is strongly connected if for any two nodes A , B ∈ W there is a directed path from A to B . In particular, there is a cycle through any two distinct nodes of a strongly connected subset. The strongly connected components of G are the maximal strongly connected subsets of V , i.e. those that cannot be enlarged without losing their strong connectivity. For any node A , the unique strongly connected component that contains A will be written as sc ( A ) . The strongly connected components of a graph G = ( V , ℰ , ℬ ) form a partition of V , i.e. V is a disjoint union of its strongly connected components.

We also use the common conventions that A ∈ V is called a parent of B ∈ V if there is a directed edge A → B , and an ancestor of B if there is a directed path from A to B . Conversely, in the first case, B is called a child of A , and in the latter case, B is called a proper descendant of A . A descendant of A is a node that is either A itself or a proper descendant of A . A collider of a walk π = ( π ( 1 ) , e 1 , π ( 2 ) , e 2 , … , e m − 1 , π ( m ) ) is an inner node π ( i ) , 1 < i < m of π such that both its adjacent edges point into π ( i ) . Any inner node of π that is not a collider on π is consequently called a non-collider of π .
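As an illustration of these notions, descendants and strongly connected components can be computed by plain reachability. The following is a minimal sketch, assuming directed edges are stored as a set of `(parent, child)` tuples; the function names and the four-node example graph are hypothetical and chosen for readability rather than efficiency.

```python
def reachable(directed, start):
    """All nodes B such that a right-directed walk start -> ... -> B exists.
    The start node is included, matching the convention that a node is a
    descendant of itself."""
    seen, stack = {start}, [start]
    while stack:
        a = stack.pop()
        for u, v in directed:
            if u == a and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def strongly_connected_components(nodes, directed):
    """Map every node A to sc(A): the nodes B that reach A and are reached by A."""
    reach = {a: reachable(directed, a) for a in nodes}
    return {a: frozenset(b for b in nodes if b in reach[a] and a in reach[b])
            for a in nodes}

# Hypothetical example: A and B form a cycle, C and D hang off it.
nodes = ["A", "B", "C", "D"]
directed = {("A", "B"), ("B", "A"), ("B", "C"), ("C", "D")}

sc = strongly_connected_components(nodes, directed)
descendants_of_B = reachable(directed, "B")
print(sc["A"], sc["C"], descendants_of_B)
```

The distinct components returned by `strongly_connected_components` partition the node set, in line with the statement above.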

For the purpose of encoding conditional independencies efficiently when modelling causal relationships of random variables graphically, different notions of graphical separation have been introduced for different types of graphs.

Definition 1

(m-separation, see [30]) Let G = ( V , ℰ , ℬ , U ) be a mixed graph and let S ⊂ V be a set of nodes. A walk π between nodes A = π ( 1 ) and B = π ( m ) is said to be m-blocked by S if one of the following holds:

its first node A or its last node B lies in S ;
there is a collider of π that does not have any descendants in S ;
S contains a non-collider of π .

If all walks (or, equivalently, all paths) between A and B are m -blocked by S , we say that A and B are m -separated by S and write A ⋈ G m B ∣ S . If A and B are not m -separated by S , we say that they are m -connected by S .
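The three blocking conditions of Definition 1 can be checked mechanically for a single walk. The sketch below is restricted to DMGs and uses a hypothetical encoding that is not from the article: a walk is a list of nodes plus a list of edge marks `"->"`, `"<-"` or `"<->"` read from left to right. It tests whether one given walk is m-blocked; deciding m-separation would additionally require checking all walks (or paths) between the two nodes.

```python
def descendants(directed, node):
    """Descendants of `node`, including the node itself."""
    seen, stack = {node}, [node]
    while stack:
        a = stack.pop()
        for u, v in directed:
            if u == a and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def is_collider(left_mark, right_mark):
    """An inner node is a collider if both adjacent edges have an arrowhead at it."""
    return left_mark in ("->", "<->") and right_mark in ("<-", "<->")

def is_m_blocked(walk_nodes, walk_marks, directed, S):
    """Check the three conditions of Definition 1 for one walk on a DMG."""
    S = set(S)
    if walk_nodes[0] in S or walk_nodes[-1] in S:          # condition 1
        return True
    for i in range(1, len(walk_nodes) - 1):
        node = walk_nodes[i]
        if is_collider(walk_marks[i - 1], walk_marks[i]):
            if not (descendants(directed, node) & S):      # condition 2
                return True
        elif node in S:                                    # condition 3
            return True
    return False

# Hypothetical DMG: chain A -> B -> C, collider A -> E <- C, and E -> D.
directed = {("A", "B"), ("B", "C"), ("A", "E"), ("C", "E"), ("E", "D")}
chain = (["A", "B", "C"], ["->", "->"])
collider = (["A", "E", "C"], ["->", "<-"])

print(is_m_blocked(*chain, directed, {"B"}))       # blocked by the non-collider B
print(is_m_blocked(*collider, directed, set()))    # blocked: collider E, nothing conditioned
print(is_m_blocked(*collider, directed, {"D"}))    # opened by D, a descendant of E
```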

If the graph G is a DAG, m -separation is known under the more familiar name d -separation. Since m -separation can be inadequate to deal with cyclic relationships (see the study by Bongers et al. [33] for a detailed explanation of why this is the case), another type of separation dubbed σ -separation was introduced by Forré and Mooij [31] and studied by Mooij and Claassen and Bongers et al. [32,33]. We have only found the definition of σ -separation for DMGs in the literature, but it is easily adapted to general mixed graphs as well. σ -separation also reduces to the more familiar notion of d-separation in the case of directed acyclic graphs.

Definition 2

( σ -separation [31]) Let G = ( V , ℰ , ℬ , U ) be a mixed graph and let S ⊂ V be a set of nodes. A walk π from A = π ( 1 ) to B = π ( m ) is said to be σ -blocked by S if one of the following holds:

its first node A or its last node B lies in S ;
there is a collider of π that does not have any descendants in S ;
S contains a non-collider π ( i ) that has a neighbor π ( j ) , j ∈ { i − 1 , i + 1 } such that
1. π ( j ) ∉ sc ( π ( i ) ) and
2. the edge of π between π ( i ) and π ( j ) is of the form π ( i ) → π ( j ) or π ( i ) − π ( j ) .

If all walks (or, equivalently, all paths) between A and B are σ -blocked by S , we say that A and B are σ -separated by S and write A ⋈ G σ B ∣ S . If A and B are not σ -separated by S , we say that they are σ -connected by S .
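The difference to m-blocking lies only in the third condition: a conditioned non-collider blocks a walk only if the walk leaves its strongly connected component through an outgoing edge. The following minimal sketch checks Definition 2 for a single walk on a DMG, using the same hypothetical walk encoding as a node list plus edge marks `"->"`, `"<-"`, `"<->"`; all names and the example graph are illustrative assumptions.

```python
def reachable(directed, start):
    """Descendants of `start`, including the node itself."""
    seen, stack = {start}, [start]
    while stack:
        a = stack.pop()
        for u, v in directed:
            if u == a and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def sc(directed, node):
    """Strongly connected component of `node`."""
    return {b for b in reachable(directed, node) if node in reachable(directed, b)}

def is_sigma_blocked(walk_nodes, walk_marks, directed, S):
    """Check the three conditions of Definition 2 for one walk on a DMG."""
    S = set(S)
    if walk_nodes[0] in S or walk_nodes[-1] in S:
        return True
    for i in range(1, len(walk_nodes) - 1):
        node, left, right = walk_nodes[i], walk_marks[i - 1], walk_marks[i]
        if left in ("->", "<->") and right in ("<-", "<->"):   # collider
            if not (reachable(directed, node) & S):
                return True
        elif node in S:                                        # conditioned non-collider
            comp = sc(directed, node)
            # The walk must leave sc(node) along an edge pointing out of node.
            out_left = left == "<-" and walk_nodes[i - 1] not in comp
            out_right = right == "->" and walk_nodes[i + 1] not in comp
            if out_left or out_right:
                return True
    return False

# Hypothetical cyclic DMG: A -> B, B <-> cycle with C (B -> C, C -> B), C -> D.
directed = {("A", "B"), ("B", "C"), ("C", "B"), ("C", "D")}
walk = (["A", "B", "C", "D"], ["->", "->", "->"])

# B is a conditioned non-collider, but its outgoing edge B -> C stays inside
# sc(B) = {B, C}, so the walk is not sigma-blocked (it would be m-blocked).
print(is_sigma_blocked(*walk, directed, {"B"}))
# C has the outgoing edge C -> D that leaves sc(C), so {C} sigma-blocks the walk.
print(is_sigma_blocked(*walk, directed, {"C"}))
```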

A convenient way of linking the usual notion of d -separation on DAGs and σ -separation is through acyclification [33].

Definition 3

(Acyclification of a MG, see [33]) Let G = ( V , ℰ , ℬ , U ) be a mixed graph. The acyclification of G is the graph G acy = ( V , ℰ acy , ℬ acy , U acy ) defined as follows:

there is a directed edge A → B ∈ ℰ acy if and only if A ∈ pa G ( sc G ( B ) ) \ sc G ( B ) ;
there is an undirected edge A − B ∈ U acy if and only if A ∉ sc G ( B ) and A − B ∈ U ;
there is a bidirected edge A ↔ B ∈ ℬ acy if and only if sc G ( A ) = sc G ( B ) or there exist A ′ ∈ sc G ( A ) , B ′ ∈ sc G ( B ) with A ′ ↔ B ′ ∈ ℬ .
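Definition 3 translates almost literally into code. The sketch below handles DMGs only (undirected edges are omitted for brevity) and assumes a hypothetical encoding with directed edges as `(parent, child)` tuples and bidirected edges as two-element `frozenset`s; since self-edges are excluded throughout, bidirected edges are only drawn between distinct nodes.

```python
def reachable(directed, start):
    """Descendants of `start`, including the node itself."""
    seen, stack = {start}, [start]
    while stack:
        a = stack.pop()
        for u, v in directed:
            if u == a and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def components(nodes, directed):
    """Map every node A to its strongly connected component sc(A)."""
    reach = {a: reachable(directed, a) for a in nodes}
    return {a: frozenset(b for b in nodes if b in reach[a] and a in reach[b])
            for a in nodes}

def acyclify(nodes, directed, bidirected):
    """Acyclification of a DMG following Definition 3 (no undirected edges)."""
    sc = components(nodes, directed)
    # A -> B iff A is a parent of some node in sc(B) and A lies outside sc(B).
    d_acy = {(a, b) for b in nodes for (a, c) in directed
             if c in sc[b] and a not in sc[b]}
    # A <-> B iff sc(A) = sc(B), or some A' in sc(A), B' in sc(B) satisfy A' <-> B'.
    b_acy = {frozenset((a, b)) for a in nodes for b in nodes if a != b and (
        sc[a] == sc[b]
        or any(frozenset((x, y)) in bidirected for x in sc[a] for y in sc[b]))}
    return d_acy, b_acy

# Hypothetical cyclic DMG: A -> B, cycle B -> C -> B, C -> D, and B <-> D.
nodes = ["A", "B", "C", "D"]
directed = {("A", "B"), ("B", "C"), ("C", "B"), ("C", "D")}
bidirected = {frozenset(("B", "D"))}

d_acy, b_acy = acyclify(nodes, directed, bidirected)
print(sorted(d_acy))
print(sorted(sorted(e) for e in b_acy))
```

In this example the cycle between B and C is replaced by a bidirected edge, the parent A of the cycle points to both of its members, and the bidirected edge to D is copied to every member of the cycle.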

The following result is a straightforward generalization of [33, Supplement, Proposition A.19]. It states that σ -separation on a mixed graph can alternatively be understood as d -separation on its acyclification (Figure 3).

Figure 4

Left: A DAG partitioned such that the resulting group DMG is cyclic [23, Figure 1(d)]. Right: A cyclic micro DMG partitioned such that the resulting group DMG is acyclic.

Proposition 1

Let G = ( V , ℰ , ℬ , U ) be a mixed graph with acyclification G acy , let A , B ∈ V and let S ⊂ V be a subset of nodes. Then

A ⋈ G σ B ∣ S ⇔ A ⋈ G acy d B ∣ S .

3 Group (D)MGs

We will now move to the setting where nodes of graphs are no longer supposed to correspond to scalar random variables but to groups of random variables. If the graphs of groups under investigation are assumed acyclic and directed, they appear in the literature under the name Group DAGs [18] or Cluster DAGs [23]. We will adopt the former terminology. Even for directed acyclic graphs, many of the following results including those of Sections 4 and 5 are new. Proofs of the results of this section are either provided immediately or have been moved to Appendix B.1.

From now on, we will reserve the bold letter X for a given countable set X = { X 1 , X 2 , … } of micro-nodes. Although all results of this section are still purely graphical, we will also sometimes freely refer to the micro-nodes as micro-variables as they will correspond to random variables later on. A partition of X is a set P of pairwise disjoint subsets of X such that ∪ Y ∈ P Y = X . Partitions will always be assumed finite, and their elements will be called variable groups and will be denoted by bold letters other than X , e.g. W , Y , Z .

Definition 4

Let X be the set of micro-nodes, and let P be a partition of X into finitely many subsets. A (D)MG of groups or group (D)MG is a (directed) mixed graph G = ( V , ℰ , ℬ , U ) whose nodes are the elements of P , i.e. V = P . If G is an acyclic directed graph, we speak of a Group DAG.

To clearly distinguish the usual setting from the group setting, we will speak of a micro (D)MG and a micro DAG if all groups are of size one. There are two natural ways of deriving a group MG: one can (a) coarsen a MG over micro-nodes to a group MG or (b) use a structural causal model over random vectors to induce a group MG directly. The former approach is the main focus of this work, while the latter will be defined and briefly discussed in Section A of the appendix.

3.1 From micro-variable graphs to graphs of groups

If we start out with a mixed graph G over (the micro nodes in) X and a partition P of X , there is a straightforward way to obtain a group MG over P by “coarsening” the graph G . The resulting graph is the quotient of G with respect to P (in the category-theoretical sense) and is therefore referred to as the quotient graph of G (with respect to P ) in graph theory [34]. In the context of causal inference, quotient graphs of (bi)directed graphs were first introduced by Anand et al. [23, Definition 1] under the name cluster DAGs.

Definition 5

(see [23]) Let G be a mixed graph over X , and let P be a partition of X . The coarse graph or quotient graph co ( G , P ) is the mixed graph with nodes Y ∈ P obtained by

drawing a directed edge Y → Z if and only if Y ≠ Z , and there is a directed edge Y → Z on G with Y ∈ Y and Z ∈ Z ;
drawing a bidirected edge Y ↔ Z if and only if Y ≠ Z , and there is a bidirected edge Y ↔ Z on G with Y ∈ Y and Z ∈ Z ;
drawing an undirected edge Y − Z if and only if Y ≠ Z , and there is an undirected edge Y − Z on G with Y ∈ Y and Z ∈ Z .

Note that we do not allow self-edges on co ( G , P ) but that multiple edges, each of a different type, are possible between two nodes of co ( G , P ) .

Clearly, in this generality, the newly defined graph co ( G , P ) need not be acyclic even if the underlying micro graph G is a DAG. On the other hand, the coarse graph can be acyclic even if the micro graph G does have cycles, see Figure 4 for illustrations of both of these statements.
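Both statements can be reproduced with a few lines of code. The following minimal sketch implements the coarsening of Definition 5 for DMGs, assuming a hypothetical encoding in which a partition is a dictionary from group names to sets of micro-nodes; the two small example graphs are illustrative and are not claimed to be the ones shown in Figure 4.

```python
def coarsen(partition, directed, bidirected=frozenset()):
    """Coarse (quotient) graph of a DMG: keep only edges between different groups."""
    group_of = {x: g for g, members in partition.items() for x in members}
    co_directed = {(group_of[a], group_of[b]) for a, b in directed
                   if group_of[a] != group_of[b]}
    co_bidirected = {frozenset(group_of[x] for x in edge) for edge in bidirected
                     if len({group_of[x] for x in edge}) == 2}
    return co_directed, co_bidirected

def is_acyclic(directed):
    """True if no node can be reached from itself along directed edges."""
    def reachable(start):
        seen, stack = {start}, [start]
        while stack:
            a = stack.pop()
            for u, v in directed:
                if u == a and v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen
    # An edge u -> v closes a cycle exactly when u is reachable from v.
    return all(u not in reachable(v) for u, v in directed)

partition = {"W": {"W1", "W2"}, "Y": {"Y1", "Y2"}}

# A micro DAG whose coarse graph is cyclic: W1 -> Y1 and Y2 -> W2.
dag = {("W1", "Y1"), ("Y2", "W2")}
co_dag, _ = coarsen(partition, dag)

# A cyclic micro graph whose coarse graph is acyclic: the cycle lives inside Y.
cyclic = {("W1", "Y1"), ("Y1", "Y2"), ("Y2", "Y1")}
co_cyclic, _ = coarsen(partition, cyclic)

print(is_acyclic(dag), is_acyclic(co_dag))
print(is_acyclic(cyclic), is_acyclic(co_cyclic))
```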

Figure 5

The micro path π (in red) from W 2 to Z 1 is coarsened to the path co ( π ) in the group DMG co ( G , P ) , P = { W , Y , Z } . The three P -segments of π are W 2 ↔ W 1 , Y 1 ← Y 3 and Z 1 .

To discuss separation on directed graphs of groups, it is useful to introduce walk (path) segments and coarse paths.

Definition 6

Let G be a micro MG over X , and let P be a partition of X . Moreover, let π = ( π ( 1 ) , e 1 , π ( 2 ) , … , e m − 1 , π ( m ) ) be a walk on G . A subwalk π ( i , j ) = ( π ( i ) , e i , … , e j − 1 , π ( j ) ) , i ≤ j of π is called a P -segment of π if there exists a group Y ∈ P such that π ( l ) ∈ Y for all i ≤ l ≤ j and π ( i − 1 ) , π ( j + 1 ) ∉ Y . If i = 1 or j = m , we only require the respective one-sided condition.

We can thus represent any walk π = ( π ( 1 ) , e 1 , π ( 2 ) , … , e m − 1 , π ( m ) ) on a mixed graph as a sequence ( π ( i 0 , i 1 ) , e i 1 , π ( i 1 , i 2 ) , e i 2 , … , π ( i s − 1 , i s ) ) , i 0 = 1 , i s = m where π ( i l , i l + 1 ) are the P -segments of π and e i l are edges that connect nodes that belong to different groups of P . We call this representation the P -segment representation of π , see Figure 5.

Figure 6

Illustration of acyclification and coarsening. In general, these operations do not commute with each other.

Definition 7

Let G be a micro MG over X , let P be a partition of X and let Y , Z ∈ P . Consider a walk π from Y ∈ Y to Z ∈ Z on G with P -segment representation ( π ( i 0 , i 1 ) , e i 1 , π ( i 1 , i 2 ) , e i 2 , … , π ( i s − 1 , i s ) ) , i 0 = 1 , i s = m . The coarse walk (path) co ( π ) = ( co ( π ) ( 1 ) , e ˜ 1 , … , e ˜ u − 1 , co ( π ) ( u ) ) of π is the walk on co ( G , P ) defined as follows:

co ( π ) ( l ) is the unique W ∈ P containing the nodes of the P -segment π ( i l − 1 , i l ) ;
e ˜ l connects co ( π ) ( l ) and co ( π ) ( l + 1 ) and is of the same type (directed, bidirected, undirected) as e i l .

Remark

If π in Definition 7 is a path, then co ( π ) need not be a path. For instance, if π is of the form π = W 1 → Y 1 → W 2 → Y 2 and the micro nodes are grouped as W = { W 1 , W 2 } , Y = { Y 1 , Y 2 } , then co ( π ) = W → Y → W → Y is no longer a path. On the other hand, a micro-walk that is not a path can coarsen to a macro-path if micro-node repetitions only happen within P -segments. Note also that if π is a directed walk, then co ( π ) is directed as well, see [23, Supplement, Proposition 2].
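The P-segment representation and the coarse walk of Definitions 6 and 7 can be computed in one pass over a walk. The sketch below assumes a hypothetical encoding of a walk as a node list plus a list of edge marks (`"->"`, `"<-"`, `"<->"`) read from left to right, and a dictionary `group_of` from micro-nodes to group names. The first example is the path from the remark above; the second is an illustrative walk whose inter-group edges are assumptions made for this sketch.

```python
def p_segments(walk_nodes, group_of):
    """Split a walk into maximal runs of consecutive nodes from the same group."""
    segments = [[walk_nodes[0]]]
    for x in walk_nodes[1:]:
        if group_of[x] == group_of[segments[-1][-1]]:
            segments[-1].append(x)
        else:
            segments.append([x])
    return segments

def coarse_walk(walk_nodes, walk_marks, group_of):
    """Coarse walk: one group per P-segment, keeping only edges between segments."""
    co_nodes, co_marks = [group_of[walk_nodes[0]]], []
    for i, mark in enumerate(walk_marks):
        next_group = group_of[walk_nodes[i + 1]]
        if next_group != co_nodes[-1]:      # this edge leaves the current segment
            co_marks.append(mark)
            co_nodes.append(next_group)
    return co_nodes, co_marks

def is_path(nodes):
    """A walk is a path if no node is repeated."""
    return len(set(nodes)) == len(nodes)

# Example from the remark: W1 -> Y1 -> W2 -> Y2 with W = {W1, W2}, Y = {Y1, Y2}.
group_of = {"W1": "W", "W2": "W", "Y1": "Y", "Y2": "Y"}
pi = ["W1", "Y1", "W2", "Y2"]
co_pi, co_pi_marks = coarse_walk(pi, ["->", "->", "->"], group_of)
print(is_path(pi), co_pi, is_path(co_pi))

# Hypothetical walk W2 <-> W1 -> Y1 <- Y3 -> Z1 over groups W, Y, Z.
group_of2 = {"W1": "W", "W2": "W", "Y1": "Y", "Y3": "Y", "Z1": "Z"}
rho = ["W2", "W1", "Y1", "Y3", "Z1"]
segments = p_segments(rho, group_of2)
co_rho, co_rho_marks = coarse_walk(rho, ["<->", "->", "<-", "->"], group_of2)
print(segments, co_rho, co_rho_marks)
```

In the second example the micro walk has a collider at Y1, whereas the group Y is a non-collider on the coarse walk W → Y → Z, which hints at why blocking statements do not transfer between the two levels in both directions.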

Lemma 1

Let G be a micro MG over X , and let P be a partition of X .

(i) If co ( G , P ) is acyclic, then for any strongly connected component W of G , there is Y ∈ P such that W ⊂ Y .
(ii) The converse of (i) is not true.
(iii) If the elements of P are exactly the strongly connected components of G , then co ( G , P ) is acyclic.^[1]

Definition 8

We will call a partition P of X

acyclic with respect to the micro MG G if the coarse graph co ( G , P ) is acyclic.
maximally acyclic if P is the partition of G into its strongly connected components.

In particular, acyclicity of a partition P entails unidirectionality, that is, all directed edges between micro nodes Y ∈ Y and Z ∈ Z , Y ≠ Z on the micro graph G must point in the same direction, e.g. from the elements of Y to the elements of Z .

It was pointed out by Anand et al. [23] that coarsening micro DAGs to group DAGs induces an equivalence relation on the set of DAGs over X , and this observation carries through when the acyclicity assumption on the micro DAGs is dropped.

Definition 9

Given a partition P , we will call two micro MGs G and G ′ P -equivalent if their coarse graphs with respect to P are the same, i.e. if co ( G , P ) = co ( G ′ , P ) .

The two operations of acyclification in the sense of Definition 3 and coarsening in the sense of Definition 5 do not commute in general, see Figure 6. However, if the partition for coarsening is acyclic with respect to the micro MG, then acyclification of the micro MG has no effect on coarsening.

Figure 7

Two simple examples of d -faithfulness violations. In the first figure, faithfulness is violated due to the internal disconnectedness of Y . In the second figure, conditioning on Y will open the macro path from W to Z but closes the micro path, see Figure 6(i) in the study by Anand et al. [23, Supplement].

Theorem 1

Let G be a mixed graph and let P be a partition of its nodes. If P is acyclic with respect to G , then

co ( G acy , P ) = co ( G , P ) .
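The statement can be checked on a toy graph. The sketch below is an illustration under an assumption: it uses the acyclification of Mooij and Claassen [32], in which i → j is an edge of the acyclified graph whenever i is a parent of some node in the strongly connected component of j and lies outside that component. Definition 3 may differ in details. Only directed edges are tracked; the bidirected edges that this acyclification adds inside strongly connected components are omitted. All names are hypothetical.

```python
def reach(edges, v):
    """Nodes reachable from v along directed edges (including v itself)."""
    seen, stack = {v}, [v]
    while stack:
        u = stack.pop()
        for a, b in edges:
            if a == u and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen


def sc(nodes, edges, v):
    """Strongly connected component of v."""
    return {u for u in nodes if u in reach(edges, v) and v in reach(edges, u)}


def acyclify_directed(nodes, edges):
    """Directed part of the assumed acyclification."""
    out = set()
    for j in nodes:
        comp = sc(nodes, edges, j)
        out |= {(a, j) for a, b in edges if b in comp and a not in comp}
    return out


def coarsen(edges, group_of):
    return {(group_of[a], group_of[b]) for a, b in edges if group_of[a] != group_of[b]}


nodes = ["C", "A1", "A2", "B"]
edges = {("C", "A1"), ("A1", "A2"), ("A2", "A1"), ("A2", "B")}
acy = acyclify_directed(nodes, edges)

# Acyclic partition: the cycle A1 <-> A2 sits inside one group.
acyclic_part = {"C": "C", "A1": "A", "A2": "A", "B": "B"}
print(coarsen(acy, acyclic_part) == coarsen(edges, acyclic_part))

# Singleton partition: not acyclic with respect to the graph, and the operations disagree.
singletons = {v: v for v in nodes}
print(coarsen(acy, singletons) == coarsen(edges, singletons))
```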

Lemma 2

Let G be a mixed graph and let P be a partition of its nodes. If there exist Y ∈ Y and Z ∈ Z such that sc G ( Y ) = sc G ( Z ) , then sc co ( G , P ) ( Y ) = sc co ( G , P ) ( Z ) .

Proof

This result directly follows from the following fact: if there is a directed path from Y to Z (respectively Z to Y ), then the induced coarse path is a directed path from Y to Z (respectively from Z to Y ).□

The following result clarifies the relationship between σ -separation on the micro- and the group level. It generalizes [23, Theorem 1] to σ -separation in DMGs and also demonstrates that said theorem does not generalize to arbitrary mixed graphs in which undirected edges are present.

Lemma 3

Let G = ( V , ℰ , ℬ ) be a DMG, and let P be a partition of its nodes. Consider a micro walk π on G and denote its induced coarse walk on co ( G , P ) by co ( π ) .

(i) If co ( π ) is σ -blocked by a set S ⊂ P of nodes of co ( G , P ) , then π is σ -blocked by T = ⋃ W ∈ S W .
(ii) The converse of (i) is not true.
(iii) If S ⊂ P is a set of nodes of co ( G , P ) that σ -separates Y , Z in G ˜ , then T = ⋃ W ∈ S W σ -separates any pair of micro nodes Y ∈ Y , Z ∈ Z in G .
(iv) (i) and (iii) are no longer true in arbitrary mixed graphs.

We also record the analogue of Lemma 3 for m -separation for the sake of completeness.

Lemma 4

Lemma 3 remains true if σ -separation is replaced by m-separation.

The proofs of Lemma 3 (i)–(iii) and of Lemma 4 only require straightforward adjustments of the proof of [23, Theorem 1] to σ -separation ( m -separation), and to the fact that we need to deal with walks instead of paths. We include these proofs in Appendix B.1 for the convenience of the reader.

4 Markov properties for group (D)MGs

In this section, we will quickly recap the different types of Markov properties that relate m -separation, respectively σ -separation, to conditional independence statements for scalar node variables. Then we will discuss the transferral of Markov properties from micro graphs to graphs of groups under coarsening. The results of this section are thus no longer purely graphical: micro nodes will always correspond to univariate random variables, while nodes of group MGs will always correspond to groups of variables, i.e. random vectors.

If G is a mixed graph over a set of node variables X with joint distribution P X , then we recall that the pair ( G , P X ) is said to have the m -Markov property (or to be m -Markovian) if every valid m -separation statement on G implies the corresponding conditional independence statement, i.e. for A , B ∈ X and S ⊂ X

A ⋈ G m B ∣ S ⇒ A ⊥ ⊥ B ∣ S .

If the converse implication also holds, i.e.

A ⊥ ⊥ B ∣ S ⇒ A ⋈ G m B ∣ S ,

then ( G , P X ) is said to be m -faithful. Similar properties can also be defined for σ - instead of m -separation: ( G , P X ) is said to have the σ -Markov property (or to be σ -Markovian) if for A , B ∈ X and S ⊂ X

A ⋈ G σ B ∣ S ⇒ A ⊥ ⊥ B ∣ S .

and is σ -faithful if the converse implication also holds, i.e.

A ⊥ ⊥ B ∣ S ⇒ A ⋈ G σ B ∣ S .

To introduce analogous properties for mixed graphs of groups, the first observation is that there are now two possible notions of conditional independence that can be considered: pairwise conditional independence and mutual conditional independence. For convenience, we will assume that all distributions have positive densities.

Definition 10

(Mutual and pairwise independence) Two groups of random variables Y = { Y 1 , Y 2 , … } and Z = { Z 1 , Z 2 , … } are called

mutually conditionally independent given a third group W = { W 1 , W 2 , … } (written Y ⊥ ⊥ Z ∣ W ) if their joint conditional density almost surely factorizes as p ( y , z ∣ w ) = p ( y ∣ w ) p ( z ∣ w ) ;
pairwise (conditionally) independent given a third group W = { W 1 , W 2 , … } (written Y ⊥ ⊥ p w Z ∣ W ) if for all Y ∈ Y and all Z ∈ Z , we have Y ⊥ ⊥ Z ∣ W .

The following well-known characterization illustrates the difference between pairwise and mutual independence: for mutual independence to hold, pairwise independence is not enough, and conditional independencies with entries of Y and Z in the conditioning set are required as well. For a proof of the following result, see [35, Section 4].

Lemma 5

Consider groups of random variables Y , Z , and W and let Z ′ ⊂ Z be a non-empty subset. The following are equivalent:

Y and Z are mutually conditionally independent given W .
We have Y ⊥ ⊥ Z ′ ∣ W and Y ⊥ ⊥ Z \ Z ′ ∣ Z ′ , W .

Lemma 6

Consider disjoint groups of random variables Y , Z , and W and assume that Y and Z are finite and non-empty. If for any Y ∈ Y , Z ∈ Z and any subset ℳ ⊂ Y ∪ Z \ { Y , Z } , we have Y ⊥ ⊥ Z ∣ W , ℳ , then Y and Z are mutually conditionally independent given W .
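The gap between pairwise and mutual independence in Definition 10, and the decomposition in Lemma 5, can be seen in a small discrete example with an empty conditioning group W. The sketch below uses a hypothetical noisy XOR distribution (an assumption for illustration, chosen so that all probabilities are positive): Z 1 and Z 2 are fair coins and Y 1 equals Z 1 XOR Z 2 , flipped with probability 1/10.

```python
from fractions import Fraction
from itertools import product

# Joint pmf over (Y1, Z1, Z2), stored with exact rational arithmetic.
eps = Fraction(1, 10)
pmf = {}
for y1, z1, z2 in product((0, 1), repeat=3):
    flipped = y1 != (z1 ^ z2)
    pmf[(y1, z1, z2)] = Fraction(1, 4) * (eps if flipped else 1 - eps)


def marginal(idx):
    """Marginal pmf of the coordinates listed in idx."""
    out = {}
    for x, p in pmf.items():
        key = tuple(x[i] for i in idx)
        out[key] = out.get(key, 0) + p
    return out


def independent(a, b, given=()):
    """Exact check of p(a, b, c) p(c) == p(a, c) p(b, c) for all values."""
    pabc, pc = marginal(a + b + given), marginal(given)
    pac, pbc = marginal(a + given), marginal(b + given)
    for x in product((0, 1), repeat=len(a + b + given)):
        xa, xb, xc = x[:len(a)], x[len(a):len(a) + len(b)], x[len(a) + len(b):]
        if pabc[x] * pc[xc] != pac[xa + xc] * pbc[xb + xc]:
            return False
    return True


Y1, Z1, Z2 = 0, 1, 2
pairwise = independent((Y1,), (Z1,)) and independent((Y1,), (Z2,))
mutual = independent((Y1,), (Z1, Z2))
# Lemma 5 with Z' = {Z1}: the second condition fails, so mutual independence fails.
lemma5_second = independent((Y1,), (Z2,), given=(Z1,))
print(pairwise, mutual, lemma5_second)
```

Here the groups Y = { Y 1 } and Z = { Z 1 , Z 2 } are pairwise independent but not mutually independent, in line with Lemma 5.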

The situation is more convenient in graphical models in which the σ -Markov property and σ -faithfulness hold on the micro-level. In this case, mutual and pairwise conditional independence turn out to be the same in the sense of the following lemma.

Lemma 7

Let G be a micro DMG over the micro-variables X and suppose that the pair ( G , P X ) is σ -Markovian and σ -faithful. Let P be a partition of X with coarse graph co ( G , P ) and let S ⊂ P . Then two variable groups Y , Z ∈ P \ S are mutually conditionally independent given T ≔ ⋃ W ∈ S W if and only if they are pairwise conditionally independent given T .

4.1 σ -Markov properties

For group MGs, we can now introduce the following Markov properties with respect to σ -separation.

Definition 11

( σ -Markov properties) Let X be a set of scalar random variables with joint distribution P X and let P be a partition of X . Let G ′ be a mixed graph with node set P . We say that ( G ′ , P X ) has the

σ -Markov property (or is σ -Markovian) if for Y , Z ∈ P and S ⊂ P , we have
Y ⋈ G ′ σ Z ∣ S ⇒ Y ⊥ ⊥ Z ∣ S .
weak σ -Markov property (or is weakly σ -Markovian) if for Y , Z ∈ P and S ⊂ P , we have
Y ⋈ G ′ σ Z ∣ S ⇒ Y ⊥ ⊥ p w Z ∣ S .

Remark

In the previous definition, S is a set of sets and therefore, to be precise, we should have written ⋃ W ∈ S W instead of S in the independence statements. However, whenever the context is clear, we prefer to use S instead to keep the notation simpler.

σ -Markovianity transfers nicely from the micro to the macro-level. See Appendix B.2 for the proof of the following theorem.

Theorem 2

Let G be a micro DMG over the micro-variables X and suppose that the pair ( G , P X ) is σ -Markovian. Let P be a partition of X into finite sets, with coarse graph co ( G , P ) . Then ( co ( G , P ) , P X ) is σ -Markovian and consequently weakly σ -Markovian.

Remark

It is worthwhile to remark here that while being sufficient, the σ -Markov property of ( G , P X ) is certainly not necessary for the σ -Markov property of ( co ( G , P ) , P X ) as the latter does not care about non-Markovianity strictly within variable groups. For instance, if Y = { Y 1 , Y 2 } , Z = { Z 1 } are two variable groups with only one micro-edge Y 2 → Z 1 , then the coarse graph Y → Z is σ -Markovian with respect to any distribution as there are no σ -separations. In particular, it is σ -Markovian with respect to distributions in which Y 1 and Y 2 are not independent, that is for distributions that are not σ -Markovian on the micro-graph G .

4.2 m -Markov properties

For good measure, we provide analogues of Definition 11 and Theorem 2 for m -separation.

Definition 12

( m -Markov properties) Let X be a set of scalar random variables with joint distribution P X and let P be a partition of X . Let G ′ be a mixed graph with node set P . We say that ( G ′ , P X ) has the

m -Markov property (or is m -Markovian) if for Y , Z ∈ P and S ⊂ P \ { Y , Z } , we have
Y ⋈ G ′ m Z ∣ S ⇒ Y ⊥ ⊥ Z ∣ S .
weak m -Markov property (or is weakly m -Markovian) if for Y , Z ∈ P and S ⊂ P , we have
Y ⋈ G ′ m Z ∣ S ⇒ Y ⊥ ⊥ p w Z ∣ S .

The analogue of Theorem 2 is as follows. Again, see Appendix B.2 for a proof.

Theorem 3

Let G be a micro DMG over X and suppose that the pair ( G , P X ) is m -Markovian. Let P be a partition of X into finite sets, with coarse graph co ( G , P ) . Then ( co ( G , P ) , P X ) is m-Markovian and in particular weakly m-Markovian.

5 Types of faithfulness for group (D)MGs

In this section, we discuss how different notions of faithfulness on scalar mixed graphs relate to faithfulness on a coarsened graph. As we will see, faithfulness is often not preserved under coarsening. However, we will provide sufficient criteria for faithfulness to hold in both the cyclic and the acyclic setting. We discuss when the strong assumptions that are needed to guarantee faithfulness on the macro-level might be realistic and continue with a discussion on weaker notions of faithfulness. Proofs of the results of this section are provided in Appendix B.3.

As mentioned by Parviainen and Kaski [18], when coarsening a scalar DAG G to a group DAG co ( G , P ) by means of a partition P , d -faithfulness, i.e. faithfulness with respect to d -separation, need not be preserved. Since DAGs are special cases of mixed graphs, and m -separation, respectively σ -separation, collapses to d -separation on DAGs, this conclusion does not change when either of these separations is considered instead. Figure 7 shows simple examples of σ / m -faithfulness violations for ( co ( G , P ) , P X ) that occur even if σ / m -Markovianity and σ / m -faithfulness of the pair ( G , P X ) are assumed. This observation seriously challenges the most naive approach to causal discovery between groups of variables, namely, running the standard PC-algorithm with multivariate conditional independence tests or any adaptation thereof that relies on the causal faithfulness condition. We also observe that, conversely, σ / m -faithfulness of ( co ( G , P ) , P X ) need not imply σ / m -faithfulness of ( G , P X ) . This is because any σ / m -faithfulness violation for ( G , P X ) that is confined within a variable group will not affect σ / m -faithfulness of ( co ( G , P ) , P X ) . As a concrete example, if ( G , P X ) is not σ / m -faithful and P collects all variables in one group, then ( co ( G , P ) , P X ) is always σ / m -faithful for the trivial reason that only one node is present.

Figure 8

The two boundaries of the edge e that is marked in red in the macro graph.

5.1 Faithfulness criteria for coarse graphs

In this subsection, we will work towards two σ -faithfulness criteria for group DMGs that are obtained from coarsening a micro-DMG. We will start with the following simple characterization of σ -faithfulness.

Lemma 8

Let G be a scalar DMG over the micro-variables X and suppose that the pair ( G , P X ) is σ -Markovian and σ -faithful. Let P be a partition with coarse graph co ( G , P ) . Then ( co ( G , P ) , P X ) is σ -faithful if and only if the following holds: whenever Y and Z are σ -connected by a set S ⊂ P , then there exist Y ∈ Y and Z ∈ Z that are σ -connected by T = ⋃ W ∈ S W .
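The criterion of Lemma 8 can be tested by brute force on small examples. The sketch below restricts to the case where both the micro graph and the coarse graph are DAGs, so that σ -separation reduces to d -separation, and tests d -separation via the moralized ancestral graph. The two example graphs are hypothetical reconstructions in the spirit of Figure 7, not copies of it, and all function names are assumptions.

```python
from itertools import combinations


def ancestors(edges, nodes):
    """The given nodes together with all their ancestors."""
    result, stack = set(nodes), list(nodes)
    while stack:
        v = stack.pop()
        for a, b in edges:
            if b == v and a not in result:
                result.add(a)
                stack.append(a)
    return result


def d_separated(edges, a, b, cond):
    """d-separation in a DAG via the moralized ancestral graph."""
    keep = ancestors(edges, {a, b} | set(cond))
    sub = [(u, v) for u, v in edges if u in keep and v in keep]
    und = {frozenset(e) for e in sub}
    for v in keep:  # marry the parents of every node
        parents = [u for u, w in sub if w == v]
        und |= {frozenset(p) for p in combinations(parents, 2)}
    seen, stack = {a}, [a]
    while stack:  # search for a path that avoids the conditioning set
        v = stack.pop()
        for e in und:
            if v in e:
                (w,) = e - {v}
                if w not in cond and w not in seen:
                    seen.add(w)
                    stack.append(w)
    return b not in seen


def violations(edges, partition):
    """Triples (Y, Z, S): connected on the coarse DAG, all micro pairs separated."""
    group_of = {x: g for g, members in partition.items() for x in members}
    coarse = {(group_of[u], group_of[v]) for u, v in edges if group_of[u] != group_of[v]}
    names, found = sorted(partition), []
    for y, z in combinations(names, 2):
        others = [g for g in names if g not in (y, z)]
        for r in range(len(others) + 1):
            for s in combinations(others, r):
                if d_separated(coarse, y, z, set(s)):
                    continue
                t = {x for g in s for x in partition[g]}
                if all(d_separated(edges, p, q, t) for p in partition[y] for q in partition[z]):
                    found.append((y, z, s))
    return found


partition = {"W": ["W"], "Y": ["Y1", "Y2"], "Z": ["Z"]}
# Internally disconnected group Y: W -> Y1 and Y2 -> Z.
v1 = violations([("W", "Y1"), ("Y2", "Z")], partition)
# Conditioning on Y opens the macro collider but blocks the micro path.
v2 = violations([("W", "Y1"), ("Y2", "Y1"), ("Z", "Y2")], partition)
print(v1, v2)
```

Under micro-level σ -Markovianity and σ -faithfulness, each reported triple corresponds to an independence between groups that the coarse graph does not encode.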

Corollary 1

Let G be a scalar DMG over the micro-variables X and suppose that the pair ( G , P X ) is σ -Markovian and σ -faithful. Let P be a partition with coarse graph co ( G , P ) . Assume that for any path Π on co ( G , P ) , there exists a path π on G such that

co ( π ) = Π and
whenever Π is σ -unblocked by a set S ⊂ P , then π is σ -unblocked by T = ⋃ W ∈ S W .

Then ( co ( G , P ) , P X ) is σ -faithful.

Proof

This is a direct consequence of Lemma 8.□

In Theorem 4, we will now derive a simple sufficient condition that guarantees σ -faithfulness on a coarsened graph. In a nutshell, it shows that σ -faithfulness does hold if variable groups are sufficiently connected internally. Before formulating Theorem 4, we need to introduce some additional definitions (see Figure 8 for a graphical illustration).

Figure 9

An example of a non-local σ -faithfulness violation (resp. d -faithfulness violation as there are no cycles). If the joint distribution is σ -Markovian and σ -faithful to the micro graph, the group DAG does not contain any orientation faithfulness violations. At the same time, all micro paths between the groups V and Z are σ -blocked, while the only macro path is σ -open.

Definition 13

Let G be a mixed graph with edge sets ℰ , ℬ , U , and let P be a partition of its nodes. Moreover, let e be an edge on co ( G , P ) .

If e = Z → Y is right-directed, define the set of e -micro edges as follows:
mic ( e ) ≔ { e ∈ ℰ ; e = Z → Y with Z ∈ Z , Y ∈ Y } .
If e = Z ← Y is left-directed, define the set of e -micro edges as follows:
mic ( e ) ≔ { e ∈ ℰ ; e = Z ← Y with Z ∈ Z , Y ∈ Y } .
If e = Z ↔ Y is bidirected, define the set of e -micro edges as follows:
mic ( e ) ≔ { e ∈ ℬ ; e = Z ↔ Y with Z ∈ Z , Y ∈ Y } .
If e = Z − Y is undirected, define the set of e -micro edges as follows:
mic ( e ) ≔ { e ∈ U ; e = Z − Y with Z ∈ Z , Y ∈ Y } .

Given an arbitrary edge e = ( Z , Y ) ∈ ℰ ∪ ℬ ∪ U , the e -boundary of Z is then the projection of mic ( e ) to its source node, i.e.

bd e ( Z ) ≔ { Z ∈ Z ; there is Y ∈ Y such that ( Z , Y ) ∈ mic ( e ) } ⊂ Z .

Similarly, the e -boundary of Y is defined as follows:

bd e ( Y ) ≔ { Y ∈ Y ; there is Z ∈ Z such that ( Z , Y ) ∈ mic ( e ) } ⊂ Y .
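As an illustration of Definition 13, the following sketch computes mic ( e ) and both boundaries for a right-directed coarse edge e = Z → Y . It handles directed micro edges only, and the graph and the function names are hypothetical.

```python
def micro_edges(edges, group_of, src, dst):
    """mic(e) for the coarse edge src -> dst: micro edges from src into dst."""
    return {(a, b) for a, b in edges if group_of[a] == src and group_of[b] == dst}


def boundaries(edges, group_of, src, dst):
    """Return (bd_e(src), bd_e(dst)), the endpoints of the e-micro edges."""
    mic = micro_edges(edges, group_of, src, dst)
    return {a for a, _ in mic}, {b for _, b in mic}


group_of = {"Z1": "Z", "Z2": "Z", "Z3": "Z", "Y1": "Y", "Y2": "Y"}
edges = [("Z1", "Y1"), ("Z2", "Y1"), ("Z3", "Z1"), ("Y1", "Y2")]

mic = micro_edges(edges, group_of, "Z", "Y")
bd_Z, bd_Y = boundaries(edges, group_of, "Z", "Y")
print(sorted(mic), sorted(bd_Z), sorted(bd_Y))
```

In this example Z 3 and Y 2 lie in the interior of their groups: they take part in no micro edge between Z and Y and therefore belong to neither boundary.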

Theorem 4

(Faithfulness criterion 1) Let G be a DMG over the micro-variables X with distribution P X , and let P be a partition of its nodes. Assume the following:

(i) The pair ( G , P X ) is σ -Markovian and σ -faithful.
(ii) For any strongly connected component W of G , there exists W ∈ P , with W ⊂ W .
(iii) For any adjacent pair of edges e = ( W , Y ) , e ′ = ( Y , Z ) , and any Y ∈ bd e ( Y ) , there exists Y ′ ∈ bd e ′ ( Y ) such that sc G ( Y ) = sc G ( Y ′ ) .

Then, ( co ( G , P ) , P X ) is σ -faithful and co ( G , P ) is acyclic.

Corollary 2

Let G be a DMG over the micro-variables X with distribution P X , and let P be the partition into the strongly connected components of G . If the pair ( G , P X ) is σ -Markovian and σ -faithful, then ( co ( G , P ) , P X ) is σ -faithful.

Corollary 2 is no longer true if σ -separation is replaced by m -separation. The graph on the right of Figure 4 provides a counterexample, as every micro-path between groups W and Z is m -blocked by Y but σ -unblocked by Y . This example serves as another illustration that the notions of separation entail different consequences, see the study by Bongers et al. [33] for more.

The previous results show that if cyclic relationships are present internal to the variable groups, this can be an advantage for causal discovery rather than a disadvantage. Assuming a variable group to be well connected internally to achieve σ -faithfulness on the group level is to some degree at odds with assuming acyclicity on the micrograph, as acyclicity disallows paths to be present if they induce a cycle. However, if one zooms in on the proof of Theorem 4, it becomes clear that condition (iii) can be replaced by weaker sufficient conditions that still guarantee σ -faithfulness, even if the micrograph is acyclic. These conditions need to be formulated separately for (almost) mediators, confounders and colliders and are therefore more technical to formulate. Here, by an almost mediator, we mean a motif of the form A ↔ B → C (right-directed almost mediator) or A ↔ B ← C (left-directed almost mediator). For (almost) mediators, condition (iii) can be replaced by

(iii-a) For any adjacent pair of edges e = W → Y , e ′ = Y → Z (or e = W ↔ Y , e ′ = Y → Z ), and any Y ∈ bd e ( Y ) there exists Y ′ ∈ bd e ′ ( Y ) and a right-directed (possibly trivial) path Y → ⋯ → ⋯ → Y ′ that does not leave Y .
(iii-b) For any adjacent pair of edges e = W ← Y , e ′ = Y ← Z (or e = W ← Y , e ′ = Y ↔ Z ), and any Y ′ ∈ bd e ′ ( Y ) there exists Y ∈ bd e ( Y ) and a left-directed (possibly trivial) path Y ← ⋯ ← ⋯ ← Y ′ that does not leave Y .

For confounders the corresponding condition becomes

(iii-c) For any adjacent pair of edges e = W ← Y , e ′ = Y → Z , and any Y ∈ bd e ( Y ) there exists Y ′ ∈ bd e ′ ( Y ) and a confounding path Y ← ⋯ ← Y ″ → ⋯ → ⋯ → Y ′ that does not leave Y .

Finding an appropriate condition for colliders is a bit less straightforward, as faithfulness violations may arise by conditioning on a collider Y , e.g. W → Y ← Z , in such a way that while a micro-collider inside Y is unblocked, a non-collider in Y is blocked again, see e.g. the second example in Figure 7. In Theorem 4, this was avoided by enforcing these non-colliders to only point to neighbours in the same strongly connected component and by using condition (iii) in the definition of σ -separation, Definition 2. The following condition, although strong, will do the job.

(iii-d) For any adjacent pair of colliding edges e = ( W , Y ) , e ′ = ( Y , Z ) , we have bd e ( Y ) ∩ bd e ′ ( Y ) ≠ ∅ , i.e. there exist colliding micro edges ( W , Y ) , ( Y , Z ) with ( W , Y ) ∈ mic ( e ) and ( Y , Z ) ∈ mic ( e ′ ) .

Thus, we have the following σ -faithfulness criterion that is more meaningful when a micro DMG is acyclic, i.e. an ADMG. Note that in this case, condition (ii) of the following theorem is trivially satisfied. In addition, perhaps surprisingly, it does not enforce the coarse graph co ( G , P ) to be acyclic as did Theorem 4.

Theorem 5

(Faithfulness criterion 2) Let G be a DMG over the micro-variables X with distribution P X , and let P be a partition of its nodes. Assume the following:

(i) The pair ( G , P X ) is σ -Markovian and σ -faithful.
(ii) For any strongly connected component W of G , there exists W ∈ P , with W ⊂ W .
(iii) Any adjacent pair of edges e = ( W , Y ) , e ′ = ( Y , Z ) satisfies condition (iii-a), (iii-b), (iii-c), or (iii-d), depending on whether it is a right-directed (almost) mediator, a left-directed (almost) mediator, a confounder, or a collider, respectively.

Then, ( co ( G , P ) , P X ) is σ -faithful.

The discussion in this section also shows the importance of choosing variable groups carefully if one wants to guarantee σ -faithfulness, which may be a non-trivial task in real-world applications. Parviainen and Kaski [18] tested empirically how often group level faithfulness would be violated in Erdős-Rényi random DAGs with groups of small sizes. They found that such violations were likely to appear in sparse graphs but unlikely to appear in dense random graphs. This matches the theoretical results of this section that internally well-connected groups help to ensure group level faithfulness.

5.2 Adjacency and orientation faithfulness

We will therefore consider the two weaker notions of adjacency faithfulness and orientation faithfulness. The former is at the base of the conservative PC-algorithm [5] and does transfer from the micro-variable to the group level.

Definition 14

(Adjacency faithfulness) A pair ( G , P ) of a mixed graph G and a distribution P over its node variables is adjacency faithful if any two nodes X , Y that are independent given some conditioning set S are not adjacent, i.e. they do not share an edge.

Note that adjacency faithfulness only makes reference to the skeleton of the graph G and not to any specific type of separation.

Lemma 9

Let G be a mixed graph over the variables X with distribution P X , and let P be a partition that induces the coarse graph co ( G , P ) . If the pair ( G , P X ) is adjacency faithful on G , then the pair ( co ( G , P ) , P X ) is adjacency faithful as well.

Proof

Suppose that Y ⊥ ⊥ Z ∣ S for some S ⊂ P \ { Y , Z } . Because mutual conditional independence implies pairwise conditional independence, it follows by adjacency faithfulness on G that Y and Z do not share an edge for all Y ∈ Y , Z ∈ Z . By definition of co ( G , P ) , Y and Z do not share an edge.□

Remark

Lemma 9 does not use the full strength of adjacency faithfulness on the micro-level: in fact, it suffices to assume that X and Y that belong to different variable groups do not share an edge if they are conditionally independent given a conditioning set S . In other words, adjacency faithfulness violations within a group do not matter for adjacency faithfulness on the macro-level.

Combining Theorem 2 with Lemma 9, we see that if ( G , P X ) is a σ -Markovian and adjacency faithful pair of a DMG G and a distribution of micro-variables P X , then for a given partition P , the pair ( co ( G , P ) , P X ) is strongly σ -Markovian and adjacency faithful as well. If the graph co ( G , P ) is moreover a DAG, these are exactly the assumptions that the conservative PC algorithm of [5] requires to be sound. To our knowledge, soundness of conservative PC has not been discussed beyond the acyclic case, but we believe it to hold as well. This is because soundness of the PC algorithm is not affected by allowing cycles and working with σ -separation as demonstrated by Mooij and Claassen [32]. Recall that the conservative PC algorithm takes the observational distribution as an input and outputs a so-called e-pattern, see the study by Ramsey et al. [5] for an exact definition.

Corollary 3

Let ( G , P X ) be a σ -Markovian and σ -faithful pair of a DMG G and a distribution of micro-variables P X . Let P be a partition of X such that co ( G , P ) is a DAG. Then the conservative PC algorithm with vector-valued (oracle) conditional independence tests is sound for co ( G , P ) in that it outputs an e-pattern that represents co ( G , P ) .

In an e-pattern, specific violations of faithfulness, namely, violations of orientation faithfulness can be singled out and are marked by a ⁎. To recap the definition of orientation faithfulness for DAGs, we recall that a triple of nodes ( X , Y , Z ) in a DAG is called unshielded if there is an edge between X and Y and an edge between Y and Z but none between X and Z .

Definition 15

(Orientation faithfulness) Let G be a DAG over a set of variables X with distribution P X . The pair ( G , P X ) is called orientation faithful if for any unshielded triple ( X , Y , Z ) the following holds.

If ( X , Y , Z ) is a collider, then X and Z are dependent given any subset of X \ { X , Z } that contains Y ;
If ( X , Y , Z ) is a non-collider, then X and Z are dependent given any subset of X \ { X , Z } that does not contain Y .

For DMGs with potential cycles, orientation faithfulness is trickier to define, as the absence of an edge between two nodes X , Y no longer means that they can be σ -separated. To deal with this, we will instead introduce the following notion of local faithfulness for DMGs, which agrees with orientation faithfulness if the graph is a DAG.

Definition 16

(Local faithfulness) Let G be a DMG over a set of variables X with distribution P X . A local faithfulness violation is a short path ( X , e 1 , Y , e 2 , Z ) such that there exists a set S ⊂ X \ { X , Y , Z } with X ⊥ ⊥ Z ∣ S and X ⊥ ⊥ Z ∣ S , Y .

The pair ( G , P X ) is called locally faithful if there are no local faithfulness violations.

Lemma 10

[5] If G is a DAG, a pair ( G , P X ) is locally faithful if it is orientation faithful.

Examples of faithfulness violations in the literature are typically either violations of adjacency or orientation faithfulness. Figure 9 shows that if the nodes correspond to variable groups, there are faithfulness violations that are non-local. In other words, both orientation and adjacency faithfulness are satisfied, yet ( σ - or d -)faithfulness is violated. In particular, such non-local violations would not be marked in the output of the conservative PC algorithm.

5.3 Faithfulness and Meek’s orientation rules revisited

Constraint-based algorithms for causal discovery such as the PC-algorithm [2] infer the directionality of arrows in a DAG by first identifying v -structures and then applying Meek’s orientation rules^[2] [26]. In this subsection, only the first of these rules will be relevant. It states that an edge X − Y is to be oriented as X → Y if there is an edge Z → X such that Z and Y are non-adjacent. Parviainen and Kaski [18] discuss the validity of Meek’s orientation rules for group DAGs using the example depicted in Figure 10. Translated to our terminology, their example consists of a micro-variable DAG G , a partition P = { V , W , Y , Z } of the micro-variables and a group DAG G ′ with nodes V , W , Y , Z such that

the micro-level pair ( G , P ) is causally Markovian and d -faithful, where P is the micro-variable distribution;
the macro-level pair ( G ′ , P ) is causally Markovian and d -faithful;
G ′ ≠ co ( G , P ) and in particular Y ← V in co ( G , P ) and Y → V in G ′ .
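The first Meek rule as stated above can be sketched as follows. This is a minimal illustration of that single rule on a partially directed graph, not an implementation of the PC-algorithm, and the function name and graph encoding are hypothetical.

```python
def meek_rule_1(directed, undirected):
    """Orient X - Y as X -> Y whenever Z -> X exists and Z, Y are non-adjacent.

    `directed` holds ordered pairs (tail, head); `undirected` holds unordered pairs.
    The rule is applied repeatedly until no further edge can be oriented.
    """
    directed = set(directed)
    undirected = {frozenset(e) for e in undirected}

    def adjacent(a, b):
        return (a, b) in directed or (b, a) in directed or frozenset((a, b)) in undirected

    changed = True
    while changed:
        changed = False
        for e in list(undirected):
            first, second = tuple(e)
            for x, y in ((first, second), (second, first)):
                applies = any(
                    h == x and z != y and not adjacent(z, y) for z, h in directed
                )
                if e in undirected and applies:
                    undirected.discard(e)
                    directed.add((x, y))
                    changed = True
    return directed, undirected


# Unshielded case: Z -> X, X - Y, and Z, Y non-adjacent, so X - Y becomes X -> Y.
d1, u1 = meek_rule_1({("Z", "X")}, [("X", "Y")])
# Shielded case: Z and Y are adjacent, so the rule does not apply.
d2, u2 = meek_rule_1({("Z", "X")}, [("X", "Y"), ("Z", "Y")])
print(d1, u1, d2, u2)
```

The rule only inspects adjacencies and existing arrowheads, so on a graph of groups its conclusion is only as reliable as the assumed ground truth graph, which is the point of the discussion that follows.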

Figure 10

Left: The micro DAG G presented in the study by Parviainen and Kaski [18]. Middle: The macro DAG presented in the study by Parviainen and Kaski [18]. Right: The coarse DMG co ( G , P ) with respect to the indicated partition.

As the mentioned orientation rule implies the orientation Y → V of G ′ instead of the correct orientation Y ← V in the ground truth group DMG co ( G , P ) , Parviainen and Kaski [18] argue that Meek’s orientation rules are no longer valid for group DAGs even if d -faithfulness on the group level does hold. However, we argue that faithfulness should refer to the cyclic ground truth graph co ( G , P ) , and the pair ( co ( G , P ) , P ) does violate σ -faithfulness: the groups W and Z are not σ -separated in co ( G , P ) but are independent. In fact, by [32, Corollary 1] of Mooij and Claassen, which does not make assumptions on the dimensionality of the node variables, the PC-algorithm (and thus the Meek rules for DAGs) is sound if the ground truth graph of groups is directed and acyclic, and if this DAG and the joint distribution of the variables are assumed d -faithful^[3] to each other. To summarize, in the example given in the study by Parviainen and Kaski [18], the Meek rules lead to a wrong orientation because the graph of groups is incorrectly assumed to be acyclic.

6 Grouped time series graphs

When using graphical models to model causation for time-evolving processes, there are several common modeling choices that are discussed in the literature that can all be adapted to the group setting. The arguably most common notion is that of a (stationary) time series DMG (ts-DMG for short) G = ( V , ℰ , ℬ ) in which the processes are unrolled in time and discretized. That is, the processes are modelled as univariate infinite time series X i = ( X i ( t ) ) t ∈ Z , i ∈ I = { 1 , … , n } and the nodes of the ts-DMG correspond to the indices ( i , t ) ∈ V = I × Z . As usual, we freely identify an index ( i , t ) with a variable X i ( t ) as long as there is no danger of confusion. In other words, there is a node in the causal graph for every time instance of every process. In addition, directed edges are not allowed to point into the past, i.e. X i ( s ) → X j ( t ) implies s ≤ t . Finally, the stationarity assumption means that the presence of edges only depends on the time lag between nodes and not the actual time instances. More precisely, if there is a directed or bidirected edge ( X i ( s ) , X j ( t ) ) , then there is an edge ( X i ( s + u ) , X j ( t + u ) ) of the same type for any u ∈ Z . A coarser representation of causal interactions between time series is that of a time series summary DMG or process DMG G sum = ( V sum , ℰ sum , ℬ sum ) in which a node corresponds to a process X i as a whole, i.e. V sum = I . Such graphs thus express whether processes causally influence each other but hold no information on the time lag of the interaction. Depending on the convention, self-edges ( X i , X i ) are allowed or not allowed, and we stick to the latter (no self-edges) in this work. While some causal discovery methods [10,36,37] aim to infer the time unrolled ts-DMG, others such as Granger causality [38] infer the process graph.
Clearly, any ts-DMG can be projected to a process DMG by ignoring the time component and adding a (bi)directed edge ( X i , X j ) , i ≠ j if and only if there is a (bi)directed edge ( X i ( s ) , X j ( t ) ) for some s , t ∈ Z . Note that this is nothing but a special instance of our coarsening operation in the case where micrographs have infinitely many nodes, see Figure 11.

Figure 11

The summary graph viewed as a coarsened group DMG of the unrolled time series DMG.

Lemma 11

If G = ( V , ℰ , ℬ ) , V = I × Z is a time series DMG, then its summary DMG is co ( G , P ) for the partition P = { { X i } × Z } i ∈ I ≅ I .

Of course, there is no formal reason to disallow more general partitions of V = I × Z . For instance, when Q is a partition of the set of processes { X 1 , … , X n } ≅ I , we can define the grouped ts-DMG of G as co ( G , Q ′ ) where Q ′ = { Y × { t } ; Y ∈ Q , t ∈ Z } is the contemporaneous partition of Q , see Figure 12. We can coarsen the grouped ts-DMG further to obtain the grouped summary DMG or grouped process DMG

co ( G , Q ′ ) sum = co ( G , Q ″ )

where Q ″ = { Y × Z ; Y ∈ Q } ≅ Q is the full process partition of Q , see Figure 13.
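Because of stationarity, both coarsenings can be computed from a finite description of the ts-DMG. The sketch below assumes that the directed edges are encoded as lag templates ( i , j , lag ), meaning X i ( t − lag ) → X j ( t ) for all t with lag ≥ 0. Bidirected edges are omitted, and the encoding and function names are hypothetical.

```python
def grouped_ts_templates(templates, group_of):
    """Lag templates (Y, Z, lag) of the grouped ts-DMG co(G, Q').

    A lag-0 edge inside one group connects two micro nodes of the same
    coarse node Y x {t}, so it disappears under coarsening.
    """
    return {
        (group_of[i], group_of[j], lag)
        for i, j, lag in templates
        if not (group_of[i] == group_of[j] and lag == 0)
    }


def grouped_summary(templates, group_of):
    """Directed edges of the grouped summary DMG co(G, Q'')."""
    return {
        (group_of[i], group_of[j])
        for i, j, lag in templates
        if group_of[i] != group_of[j]
    }


templates = [
    ("X1", "X1", 1),  # autodependency of X1
    ("X1", "X2", 0),  # contemporaneous edge inside group A
    ("X2", "X3", 1),  # lagged edge from group A to group B
    ("X3", "X4", 2),  # lagged edge inside group B
    ("X4", "X4", 1),  # autodependency of X4
]
Q = {"X1": "A", "X2": "A", "X3": "B", "X4": "B"}

ts_grouped = grouped_ts_templates(templates, Q)
summary_grouped = grouped_summary(templates, Q)
# Singleton groups recover the ordinary summary DMG of Lemma 11.
summary_plain = grouped_summary(templates, {x: x for x in Q})
print(sorted(ts_grouped), sorted(summary_grouped), sorted(summary_plain))
```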

Figure 12

Left: A partition P of an unrolled time series DMG into contemporaneous groups. Right: The grouped ts-DMG co ( G , P ) with respect to the partition P .

Figure 13

Left: A process grouping of a ts DMG. Right: The corresponding grouped summary DMG.

6.1 Faithfulness in grouped time series graphs

Given that grouped ts-DMGs and grouped summary DMGs are special cases of coarsened graphs, the criteria of Theorems 4 and 5 are still sufficient to ensure σ -faithfulness.

Corollary 4

Let G be a time series DMG and let Q be a partition of the set of processes { X 1 , … , X n } with contemporaneous partition Q ′ and full process partition Q ″ . Moreover, let P X be the joint distribution of { X i ( t ) } i ∈ I , t ∈ Z . If the assumptions of Theorems 4 or 5 are satisfied with respect to Q ′ (respectively Q ″ ), then the pair ( co ( G , Q ′ ) , P X ) (respectively ( co ( G , Q ″ ) , P X ) ) is σ -faithful.

At the same time, if these criteria are not assumed to hold, violations of σ -faithfulness are still easily constructed even if there are no contemporaneous edges and all micro-processes are autocorrelated, see Figure 14 for a faithfulness violation on the grouped summary DMG. In addition, in micro-level ts-DMGs, cycles can only appear in the contemporaneous part of the graph as directed edges cannot point backwards in time. Cycles will thus only be included in the grouped time series DMG if the time resolution of the analyzed data is not fine enough to resolve all feedback loops. If the time resolution is believed to be fine enough, all cycles are resolved, which renders Theorem 4 useless in the ts-domain.

Figure 14

A d -faithfulness violation on the grouped summary DMG. Conditioning on the process W blocks all micro paths between the processes Y and Z .

7 Interpretation of causation in group (D)MGs

Many of the examples presented in this work, see Figures 7 and 9, show that group DMGs have to be carefully interpreted when associating a causal meaning to paths in the graph; a point that has already been made by Parviainen and Kaski [18]. They formulate a notion of potential and actual causation in terms of interventions that can be mirrored in our graphical language.

Definition 17

(Apparent and true causes) Let G be a DMG over a set of micro-variables X and let P be a partition of X inducing the group DMG co ( G , P ) . We say that Y ∈ P is an apparent cause of Z ∈ P if there exists a directed path Y → ⋯ → Z on co ( G , P ) . Y is called a true cause of Z if there is a directed path Y → ⋯ → Z on G for some Y ∈ Y and Z ∈ Z .

In other words, directed paths on group DMGs may not be regarded as truly causal in general, as corresponding micro-paths might be absent. In particular, intervening on an apparent cause Y of Z might not change the distribution of the effect group Z . We record the following result for good measure.

Lemma 12

Let G be a DMG over a set of micro-variables X and let P be a partition of X inducing the group DMG co ( G , P ) .

If Y → Z is a directed edge, then Y is a true cause of Z .
If the conditions (ii) and (iii) of Theorem 4 are satisfied, then any apparent cause of a group Z ∈ P is a true cause of Z .
If the conditions (ii) and (iii-a) of Theorem 5 are satisfied, then any apparent cause of a group Z ∈ P is a true cause of Z .

Proof

The first claim of the lemma follows directly from the definition of co ( G , P ) . The second and third claims follow directly from the proofs of Lemmas 4 and 5 where, for a given directed path Π = Y → ⋯ → Z on co ( G , P ) , we constructed a connecting directed micro-path π = Y → ⋯ → Z on G for some Y ∈ Y and Z ∈ Z such that co ( π ) = Π .□
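The distinction drawn in Definition 17 can be checked mechanically on small graphs. The following is a minimal sketch, not part of the original analysis: the helper names (`reachable`, `coarsen`, `is_apparent_cause`, `is_true_cause`) and the three-group example are hypothetical, only directed edges are modelled, and edges within a group are assumed to be dropped when coarsening.

```python
from collections import deque

def reachable(edges, sources):
    """Nodes reachable from `sources` along at least one directed edge."""
    succ = {}
    for a, b in edges:
        succ.setdefault(a, set()).add(b)
    seen, queue = set(), deque(sources)
    while queue:
        node = queue.popleft()
        for nxt in succ.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def coarsen(micro_edges, group_of):
    """Group edge Y -> Z whenever some member of Y points to some member of Z.
    Assumption: edges inside a group are discarded."""
    return {(group_of[a], group_of[b])
            for a, b in micro_edges if group_of[a] != group_of[b]}

def is_apparent_cause(micro_edges, group_of, Y, Z):
    """Directed path Y -> ... -> Z on the coarsened graph."""
    return Z in reachable(coarsen(micro_edges, group_of), [Y])

def is_true_cause(micro_edges, group_of, Y, Z):
    """Directed micro-path from some member of Y to some member of Z."""
    members_Y = [v for v, g in group_of.items() if g == Y]
    members_Z = {v for v, g in group_of.items() if g == Z}
    return bool(reachable(micro_edges, members_Y) & members_Z)

# Hypothetical example: the middle group W is internally disconnected.
micro_edges = [("Y1", "W1"), ("W2", "Z1")]
group_of = {"Y1": "Y", "W1": "W", "W2": "W", "Z1": "Z"}
print(is_apparent_cause(micro_edges, group_of, "Y", "Z"))
print(is_true_cause(micro_edges, group_of, "Y", "Z"))
```

In this toy example the coarse graph contains the path Y → W → Z, yet no micro-path connects Y1 to Z1, because the two members of W do not interact. This is the kind of situation that internal connectivity conditions on the groups are meant to exclude.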

7.1 Causation in grouped time series graphs

We now turn to the question whether any apparent cause in a grouped ts-DMG or a grouped summary DMG is a true cause. For grouped ts-DMGs, the answer is no for the same reason as for usual group DMGs. On the level of the grouped summary graph, however, apparent causation implying true causation may be more realistic, at least if the grouped processes are believed to be causally mixing, a notion inspired by the common assumption of mixing in dynamical systems.

Definition 18

Consider a ts-DMG G = ( V , ℰ , ℬ ) over micro processes X 1 , … , X n , X i = ( X i ( t ) ) t ∈ Z . Let Q be a partition of { X 1 , … , X n } and consider the induced grouped ts-DMG co ( G , Q ′ ) where Q ′ = { Y × { t } ; Y ∈ Q , t ∈ Z } .

Then, the pair ( G , Q ) is called causally mixing if for any Y ∈ Q and any pair of micro-processes X i , X k ∈ Y the following holds:

for any s ∈ Z , there exists t > s and a directed path X i ( s ) → X i 1 ( s + 1 ) → X i 2 ( s + 2 ) → ⋯ → X i m ( t − 1 ) → X k ( t ) such that X i α ∈ Y for all α = 1 , … , m .

Causal mixing means that after a sufficient amount of time has passed, causal information has fully spread throughout any process group. We will see now that causal mixing ensures that, at least at the level of the grouped summary graph, directed causal paths can be understood in the usual sense as any apparent cause is a true cause. However, causal mixing does not ensure σ -faithfulness on the grouped summary DMG as the example in Figure 14 demonstrates.
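For a stationary graph, the condition of Definition 18 can be tested on the lag-one edges alone. The sketch below rests on two assumptions that are not stated in the text: the definition is read literally, so that every step of the path advances time by exactly one unit, and pairs with X i = X k are included, so that each process must also reach a later copy of itself. The function name `is_causally_mixing` and the edge encoding are hypothetical.

```python
def is_causally_mixing(lag1_edges, groups):
    """Check causal mixing under the literal lag-one reading of Definition 18.

    lag1_edges: set of pairs (i, k) meaning X_i(t) -> X_k(t + 1) for all t
                (stationarity lets one pair stand for every time step).
    groups:     iterable of sets of process names (the partition Q).
    """
    for group in groups:
        # Lag-one successors that stay inside the group.
        succ = {i: {k for (a, k) in lag1_edges if a == i and k in group}
                for i in group}
        for i in group:
            # Processes reachable from i by a walk of length >= 1 in the group.
            seen, stack = set(), list(succ[i])
            while stack:
                k = stack.pop()
                if k not in seen:
                    seen.add(k)
                    stack.extend(succ[k])
            # Assumption: pairs with i == k count, so i must reach itself too.
            if seen != set(group):
                return False
    return True

# Hypothetical two-process group with mutual lag-one coupling.
print(is_causally_mixing({("A", "B"), ("B", "A")}, [{"A", "B"}]))
# Here B never influences A, so information does not spread through the group.
print(is_causally_mixing({("A", "A"), ("A", "B")}, [{"A", "B"}]))
```

Edges that leave the group are ignored on purpose, since the definition requires the connecting path to stay inside the group.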

Lemma 13

Consider a stationary ts-DMG G = ( V , ℰ , ℬ ) over micro processes X 1 , … , X n , X i = ( X i ( t ) ) t ∈ Z . Let Q be a partition of { X 1 , … , X n } and consider the induced grouped summary DMG G ˜ ≔ co ( G , Q ″ ) , where Q ″ = { Y × Z ; Y ∈ Q } . If ( G , Q ) is causally mixing, then every apparent cause in G ˜ is a true cause in G ˜ .

Proof

For this proof, recall that we can identify elements of Q and Q ″ through the map Y ↦ Y × Z . Consider two process groups Y , Z ∈ Q . Moreover, let Π × Z ≔ ( Π ( 1 ) × Z , e 1 , … , e r − 1 , Π ( r ) × Z ) be a directed path from Π ( 1 ) × Z = Y × Z to Π ( r ) × Z = Z × Z in the group summary DMG G ˜ = co ( G , Q ″ ) . We need to show that there exists a micro-path π in G from Y ( s ) to Z ( t ) , s ≤ t , for some micro-processes Y ∈ Y and Z ∈ Z . We construct π inductively as follows. First choose a directed micro-edge e 1 = Y ( s ) → W ( s 1 ) ∈ mic ( e 1 ) for some s ≤ s 1 . Then, consider Π ( i ) , 1 < i < r and assume that a directed micro-path π i that ends in W ( s i − 1 ) ∈ bd e i − 1 ( Π ( i ) × Z ) has already been constructed. Choose a micro-process W ′ ∈ Π ( i ) such that W ′ ( t ′ ) ∈ bd e i ( Π ( i ) × Z ) for some t ′ ∈ Z . By causal mixing, there is a directed path ξ i from W ( s i − 1 ) to W ′ ( t i ) for some t i > s i − 1 that does not leave Π ( i ) × Z . Stationarity of G and W ′ ( t ′ ) ∈ bd e i ( Π ( i ) × Z ) imply that also W ′ ( t i ) ∈ bd e i ( Π ( i ) × Z ) so we can find a micro-edge e i + 1 ∈ mic ( e i + 1 ) whose source node is W ′ ( t i ) . After concatenating π i + 1 = π i ∘ ξ i ∘ e i + 1 , we have obtained the micro-path π i + 1 to Π ( i + 1 ) × Z and we continue inductively until we reach Z × Z .□

8 Further discussions and outlook

In this section, we will zoom out from the technical results of the previous sections and turn towards a high-level discussion on variable groupings and dimension reduction.

8.1 Choosing variable groups

In this work, we have operated under the standing assumption that the partition P of all micro-variables into variable groups is fixed. We have then studied the transferal of causal discovery assumptions from the micro- to the group level given this fixed partition P . While in many problems, practitioners may have clear ideas on which micro-variables should be grouped together or not, in others there might be more than one plausible choice of partition. When the goal is to make this choice in such a way that faithfulness is a realistic assumption on the group level, Theorems 4 and 5 at least provide a heuristic: there should be sufficient causal interactions internal to the variable groups. In particular, grouping together micro-variables that seem to be causally unrelated appears to be problematic. This seems to be in line with our intuition. After all, why would one group together variables that seem unrelated in the first place? Beyond these heuristic considerations, learning pairs ( P , G ( P ) ) of a partition P and a graph G ( P ) over its constituents from data under appropriate optimality constraints may be an interesting, although challenging, problem for future research.

8.2 Dimension reduction and causal discovery

As alluded to in Section 1, in observation-based analyses of causal interactions, the common alternative to working with variable groups in their entirety is to reduce them to a single univariate variable, or, if they evolve dynamically, to a single index time series. While some form of dimensionality reduction is unavoidable in high-dimensional settings, the goal of this paragraph is to point out the pitfalls of applying a causal discovery method to dimensionally-reduced proxies, at least if dimension reduction is applied naively. In the subsequent paragraph, we will carry out a similar analysis for a second naive approach, namely, using all available micro-variables as the input of a constraint-based causal discovery method. We contrast this with constraint-based group level discovery, that is, the application of a constraint-based method such as the PC algorithm to groups of random variables, in which only multivariate conditional independence tests between groups as a whole are employed.

8.2.1 Applying causal discovery to dimensionally-reduced variables

The most common dimension reduction approach to causal discovery on variable groups X 1 , … , X r proceeds as follows:

1. Reduce each group X i to a univariate random variable, for instance by setting it to the group mean m ( X i ) or to the first principal component in a PCA on X i .
2. Apply a causal discovery algorithm to the resulting r univariate variables.

This procedure is appealing to domain researchers for several reasons. First, dimension reduction techniques can be carried out quickly, they counter the curse of dimensionality, and the resulting quantities can often be interpreted easily. Moreover, as per the law of large numbers, averaging can help to reduce observational noise, at least if noise terms of different members of a given variable group are believed to be weakly correlated. For instance, if every member X i j of group X i is believed to be produced by a common driver and purely observational noise, i.e. X i j = X ˆ i + η i , j and the noise terms η i , j have mean zero and are weakly or un-correlated across the j index, then in the large group limit, the group mean m ( X i ) will recover X ˆ i . Thus, if the causal dynamics are modelled by structural equations on the X ˆ i such as X ˆ i ≔ f i ( pa ( X ˆ i ) , η X ˆ i ) with pa ( X ˆ i ) ⊂ { X ˆ 1 , … , X ˆ r } and the groups arise as X i = ( X ˆ i , … , X ˆ i ) T + η i with zero-mean noise vectors that are mutually independent across the i index and whose components are weakly or un-correlated, then the group mean will be an appropriate choice of aggregation technique to recover the causal dynamics.

On the other hand, if different parts of a given cause group Y have opposing causal effects on a target group Z that roughly cancel each other, the effect of the group mean of Y on the group mean of Z may be zero, and neither the causal effect nor the dependence Y ⊥̸ ⊥ Z can be recovered from the averaged data. An often invoked real-world example of this are the opposite-sign effects of two different types of blood cholesterol, low-density lipoprotein (LDL) and high-density lipoprotein (HDL), on heart disease [15]. Consequently, research on the effect of total blood cholesterol (LDL + HDL) on heart disease has come to contradictory conclusions.
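Both behaviours of the group mean can be illustrated with a small simulation. The sketch below is only a toy illustration under assumed Gaussian models: the function names, the noise scales, the sample sizes and the cancelling mechanism Z := Y1 − Y2 + noise are hypothetical choices, not taken from the text.

```python
import math
import random

def corr(x, y):
    """Sample Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mean_vs_driver_mse(group_size, n_samples, rng):
    """Mean squared error between the group mean and the common driver
    when every member equals driver + independent unit-variance noise."""
    total = 0.0
    for _ in range(n_samples):
        driver = rng.gauss(0, 1)
        members = [driver + rng.gauss(0, 1) for _ in range(group_size)]
        total += (sum(members) / group_size - driver) ** 2
    return total / n_samples

def cancellation(n_samples, rng):
    """Group Y = {Y1, Y2} acts on Z with opposite signs: Z := Y1 - Y2 + noise.
    Returns corr(m(Y), Z) and corr(Y1, Z)."""
    y1 = [rng.gauss(0, 1) for _ in range(n_samples)]
    y2 = [rng.gauss(0, 1) for _ in range(n_samples)]
    z = [a - b + rng.gauss(0, 0.5) for a, b in zip(y1, y2)]
    mean_y = [(a + b) / 2 for a, b in zip(y1, y2)]
    return corr(mean_y, z), corr(y1, z)

rng = random.Random(0)
mse_small = mean_vs_driver_mse(5, 2000, rng)
mse_large = mean_vs_driver_mse(100, 2000, rng)
corr_mean, corr_member = cancellation(20000, rng)
print(mse_small, mse_large)    # theory: 1/5 and 1/100
print(corr_mean, corr_member)  # theory: 0 and 2/3
```

In the first experiment the error of the group mean shrinks with the group size, as the law of large numbers suggests. In the second, the group mean of Y is uncorrelated with Z although Z depends on both members of Y, so an analysis based on the averaged data alone would miss the dependence.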

In a similar vein, conditioning on the mean value m ( W ) of a variable group W may not suffice to recover a conditional independence Y ⊥ ⊥ Z ∣ W . For instance, consider a structural causal model

W 1 ≔ η W 1 ,
W 2 ≔ η W 2 ,
Y ≔ W 1 + 2 W 2 + η Y ,
Z ≔ W 1 + 2 W 2 + η Z ,

with variable partition Y = { Y } , Z = { Z } , W = { W 1 , W 2 } , and with independent noise terms η W 1 , η W 2 , η Y , η Z . Then we have Y ⊥ ⊥ Z ∣ W but m ( Y ) ⊥̸ ⊥ m ( Z ) ∣ m ( W ) , where again m ( ⋅ ) denotes the group mean. The latter relation becomes apparent when rewriting Y = 2 m ( W ) + W 2 + η Y and Z = 2 m ( W ) + W 2 + η Z , so that after conditioning on m ( W ) , Y and Z still share the common random component W 2 which is not fully determined by m ( W ) . Thus, causal discovery approaches that invoke conditional independence tests on aggregated quantities may come to wrong conclusions. However, this example also illustrates that the primary reason for such faulty inferences is that dimension reduction and inference were conducted independently of each other. In fact, in the example above, there is an aggregation of W that does preserve the independence Y ⊥ ⊥ Z ∣ W : if m ′ ( W ) = W 1 + 2 W 2 , then Y ⊥ ⊥ Z ∣ m ′ ( W ) . Research on how variable aggregation and inference can be combined in such a way that they inform each other is still relatively scarce, and we refer to previous studies [15,39,40] for interesting ideas and further discussions.
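The example can be reproduced numerically. The sketch below assumes standard Gaussian noise terms, so that vanishing partial correlation coincides with conditional independence; this distributional choice, the sample size and the helper names `corr` and `partial_corr` are assumptions made for illustration only.

```python
import math
import random

def corr(x, y):
    """Sample Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, conds):
    """Partial correlation of x and y given the variables in the list `conds`,
    computed with the standard recursion on the conditioning set."""
    if not conds:
        return corr(x, y)
    *rest, c = conds
    rxy = partial_corr(x, y, rest)
    rxc = partial_corr(x, c, rest)
    ryc = partial_corr(y, c, rest)
    return (rxy - rxc * ryc) / math.sqrt((1 - rxc ** 2) * (1 - ryc ** 2))

rng = random.Random(0)
n = 20000
w1 = [rng.gauss(0, 1) for _ in range(n)]
w2 = [rng.gauss(0, 1) for _ in range(n)]
y = [a + 2 * b + rng.gauss(0, 1) for a, b in zip(w1, w2)]
z = [a + 2 * b + rng.gauss(0, 1) for a, b in zip(w1, w2)]
m = [(a + b) / 2 for a, b in zip(w1, w2)]      # group mean m(W)
m_prime = [a + 2 * b for a, b in zip(w1, w2)]  # weighted aggregate m'(W)

pc_full = partial_corr(y, z, [w1, w2])     # conditioning on the whole group W
pc_mean = partial_corr(y, z, [m])          # conditioning on m(W) only
pc_weighted = partial_corr(y, z, [m_prime])
print(pc_full, pc_mean, pc_weighted)       # theory: 0, 1/3, 0
```

With unit-variance noise, the population partial correlation of Y and Z given m ( W ) equals 1/3, whereas it vanishes given either the full group W or the aggregate m ′ ( W ) .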

8.2.2 Micro-level causal discovery

A second straightforward approach to causal discovery on variable groups X 1 , … , X r roughly works as follows:

Apply a given causal discovery method to the totality of all micro-variables. This will output a graph over all micro-variables containing edges of different types.
Then coarsen this micro-graph as in Definition 4, that is, draw an edge of a specific type between groups Y and Z if there exists an edge of this type between two members Y ∈ Y and Z ∈ Z of these groups.
Alternatively, if only one edge is to be allowed between groups, decide on the type of this edge by a majority rule, e.g. draw a directed edge Y → Z if the majority of edges between members Y ∈ Y and Z ∈ Z are directed as Y → Z .
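The two coarsening rules above can be sketched as follows. This is a hypothetical illustration rather than a reference implementation: the edge encoding with the type labels "->", "<->" and "--", the function names, and the use of a plurality vote (with ties broken arbitrarily) in place of a strict majority are all assumptions.

```python
from collections import Counter

def _group_edge(a, b, kind, group_of):
    """Map a micro-edge to a group edge, or None if it stays inside a group."""
    ga, gb = group_of[a], group_of[b]
    if ga == gb:
        return None
    if kind == "->":
        return (ga, gb, kind)      # direction matters
    lo, hi = sorted((ga, gb))      # symmetric edge types: canonical order
    return (lo, hi, kind)

def coarsen_all(micro_edges, group_of):
    """Keep every edge type that occurs between members of two groups."""
    edges = (_group_edge(a, b, k, group_of) for a, b, k in micro_edges)
    return {e for e in edges if e is not None}

def coarsen_majority(micro_edges, group_of):
    """Keep one edge per pair of groups: the most frequent micro-edge type.
    Assumption: plurality vote, ties resolved arbitrarily."""
    votes = {}
    for a, b, k in micro_edges:
        e = _group_edge(a, b, k, group_of)
        if e is not None:
            votes.setdefault(frozenset(e[:2]), Counter())[e] += 1
    return {counter.most_common(1)[0][0] for counter in votes.values()}

# Hypothetical output of a micro-level discovery run on two groups.
micro_edges = [("Y1", "Z1", "->"), ("Y2", "Z1", "->"),
               ("Z2", "Y1", "->"), ("Y2", "Z2", "<->")]
group_of = {"Y1": "Y", "Y2": "Y", "Z1": "Z", "Z2": "Z"}
print(coarsen_all(micro_edges, group_of))
print(coarsen_majority(micro_edges, group_of))
```

The first rule can return several edges between the same pair of groups, here Y → Z, Z → Y and Y ↔ Z, while the vote keeps only Y → Z.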

As constraint-based causal discovery algorithms such as PC typically come with soundness and completeness guarantees under method-specific assumptions [1,41], in theory, the micro-graph (and therefore the macro-graph derived from it) can be inferred to an optimal degree, that is, up to a certain type of equivalence. Still, in practice, there are some obvious drawbacks of such an approach. First, as the number of micro-variables within groups can be very high, the computational effort can be massive, while much of the inferred micro-level information, namely, all interactions internal to variable groups, is of little relevance to the actual task of inferring the interactions between variable groups. This issue is particularly problematic if the variable groups happen to be very dense, i.e. if there are many micro-edges within groups. This is because this case falls firmly into the computational worst case scenario for constraint-based causal inference in which computing time grows exponentially with the number of variables [41]. At the same time, one can argue that typically variable groups are chosen the way they are exactly because their members are highly correlated or have strong causal interactions. From a statistical perspective, running many conditional independence tests on the micro-level that are irrelevant to the actual inference task tends to be detrimental to the method’s success, see Wahl et al. [17] for some toy experiments with two variable groups and continuous data. In addition, the well-known finite sample guarantees of Kalisch and Bühlmann [41] for the PC algorithm again rely on sparsity conditions that may not be met on the micro-graph if the variable groups are very dense, while they might be met on the coarse group DMG.

On the other hand, full micro-variable causal discovery can sometimes orient edges between groups that a group level approach cannot orient, see Figure 15. This can be both a blessing and a curse: While additional orientations are a plus whenever they are correct, a wrong statistical test result of an independence test that only involves micro-variables within the same group can lead to a wrongly oriented edge between variable groups, see Figure 15. Therefore, group level causal discovery can be considered more conservative than full micro-level causal discovery in the sense that it might provide fewer orientations while being more robust to testing errors. Finally, if the causal discovery algorithm at hand assumes the absence of hidden confounders, it will suffer if hidden confounding is actually present in the data. Hence, if hidden confounders only affect micro-variables within the same group, then micro-level causal discovery will be challenged while group level causal discovery will only be affected by confounders between different groups, see again the discussion in Section 4. Nevertheless, in the case of discrete data, conditional independence tests are particularly challenged by large conditioning sets as every state of the conditioning variables has to be considered separately. In this case, the empirical experiments conducted by Parviainen and Kaski [18] suggest that the micro-level causal discovery approach which employs more tests but has smaller conditioning sets than the group level approach outperforms the latter.

Figure 15

Left: Running the PC-algorithm with perfect independence tests on the micro-variables will infer the full micro-structure and will therefore also be able to orient the group level edge Y → Z . Group level PC will not be able to infer this orientation. Right: If, due to a wrong statistical test result or due to a faithfulness violation, the micro-level PC-algorithm mistakenly judges Y 1 ⊥ ⊥ Y 3 , it has found a separating set for Y 1 and Y 3 that does not contain Y 2 and will thus orient the unshielded triple Y 1 − Y 2 − Y 3 as a collider Y 1 → Y 2 ← Y 3 . If the remaining tests return the true (in)dependencies Y 1 ⊥ ⊥ Z ∣ Y 2 , Y 3 ⊥ ⊥ Z ∣ Y 2 , then PC’s orientation rules will imply the edge orientation Y 2 → Z . Hence, the PC-algorithm will again infer the micro-structure on the left and the wrong group level orientation Y → Z . Note that the wrong test only involves micro-variables that belong to group Y . Group level PC will never run this wrong test and will not orient the edge Y − Z , neither correctly nor wrongly.

We summarize strengths and pitfalls of dimension reduction causal discovery, micro-level causal discovery as well as group level causal discovery in Table 1.

Table 1

Strengths and weaknesses of the three fundamental approaches to causal discovery for variable groups: causal discovery after dimension reduction, micro-level causal discovery, and group level causal discovery. Approaches that integrate dimension reduction and inference, while perhaps retaining reduced variable groups of smaller size, might be a fruitful middle ground.

	Dimension reduction + CD	Micro-level CD	Group level CD
Strengths	Computationally most efficient approach; noise removal.	Good for small groups; empirically superior to group level CD on discrete data.	Fewer CI tests than micro-level CD; robust to within-group confounding and other violations.
Weaknesses	May change conditional independencies and causal conclusions fundamentally.	Computationally inefficient; vulnerable to within-group assumption violations.	Assumptions and interpretation of output must be evaluated carefully; multivariate CI testing less developed; computationally less efficient than dimension reduction + CD.

9 Summary

In this work, we have provided a thorough discussion of assumptions for causal discovery on groups of random variables. In particular, we have shown that causal faithfulness is easily violated in generic settings so that faithfulness-based causal discovery methods need to be applied with care. On the other hand, we have presented two criteria (Theorems 4 and 5) on the internal connectivity of variable groups that do guarantee σ -faithfulness. It will be important to develop and evaluate more elaborate group level causal discovery techniques and to compare them to the baseline methods presented in Section 8 empirically, in particular for continuous data. On the theoretical side, it would be worthwhile to study the compatibility of statistical dimension reduction and causal modelling in greater detail, for instance following the ideas laid out in [15,39,40].

Acknowledgements

The authors thank Sofia Faltenbacher for designing the layout used for many figures in this work, as well as the reviewers for their valuable suggestions.

Funding information: J.W., U.N., and J.R. received funding from the European Research Council (ERC) Starting Grant CausalEarth under the European Union’s Horizon 2020 research and innovation program (Grant Agreement No. 948112).
Author contributions: All authors have accepted responsibility for the entire content of this manuscript and consented to its submission to the journal, reviewed all the results and approved its final version. J.W. wrote the manuscript, and formalized and proved the mathematical results of this work. J.R. proposed the initial idea to investigate causal discovery on variable groups. All three authors discussed and advanced the main ideas of this work on several occasions. U.N. and J.R. proofread the manuscript and suggested valuable improvements.
Conflict of interest: The authors declare no conflict of interest.

Appendix A Group DMGs from group-valued SCMs

In this appendix, we will briefly discuss another way of obtaining a group DMG that is distinct from coarsening a graph of micro-variables, namely, by defining a model directly through structural equations. For a discussion of counterfactual distributions in vector-valued SCMs, see [23, Supplement, Theorem 7].

Definition A1

(Vector-valued SCMs) A vector-valued structural causal model (vSCM) M = ( S , P E ) over a partition P of a set of random variables X into random vectors X 1 , … , X r is a collection of structural assignments

X i ≔ f i ( pa ( X i ) , E i )

with pa ( X i ) ⊂ { X 1 , … , X r } \ { X i } and multivariate noise vectors E 1 , … , E r with dim ( X i ) = dim ( E i ) that have joint distribution P E . The causal graph G ( M ) of M is the DMG with nodes X 1 , … , X r where a directed edge X j → X i is drawn if X j ∈ pa ( X i ) and a bidirected edge X i ↔ X j is drawn if E i ⊥̸ ⊥ E j .^[4]

While group DMGs derived by coarsening micro-variable graphs assume a causal structure on the level of the micro-variables that is then “forgotten” after coarsening, in a vector-valued SCM any causal meaning in the form of a graph is only defined on the group level. The internal relationships among the entries of a vector X i that are not due to external influences are modelled only by the distribution P E i and are thus of a probabilistic nature. This seems reasonable for many practical applications where the micro-variables may not be considered causal entities (for instance, imagine X i to be a field of surface temperature measurements in some spatial region). On the other hand, vector-valued SCMs make it hard to derive faithfulness results from properties of the micro-variables, as no notion of faithfulness is purely distributional. Instead, faithfulness can only be postulated as an assumption on the group level directly.

B Proofs

B.1 Proofs of the Results in Section 3

Proof of Lemma 1

Assume first that co ( G , P ) is acyclic and let W be a strongly connected component. If there were W 1 , W 2 ∈ W that belonged to different groups of the partition P , say W 1 ∈ Y and W 2 ∈ Z , then on G we could find directed paths π 1 from W 1 to W 2 and π 2 from W 2 to W 1 . Then the induced coarse path co ( π 1 ) would constitute a directed path from Y to Z and the induced coarse path co ( π 2 ) would constitute a directed path from Z to Y . Concatenating both paths, we would obtain a cycle, which contradicts our assumption.
The converse is already wrong for coarsenings of micro DAGs in which the strongly connected components correspond to the nodes of the graph, see e.g. Figure 4.
Let P be the partition of G into strongly connected components and let π ˜ be a directed path from Y to Z . Then, we argue first that for any two nodes Y ∈ Y , Z ∈ Z , there is a directed micro path π from Y to Z on G . Indeed, if π ˜ just consists of an edge Y → Z , then there must be Y ′ ∈ Y and Z ′ ∈ Z that are connected by a micro edge Y ′ → Z ′ . By the definition of strongly connected components, there must also be directed paths from Y to Y ′ and from Z ′ to Z , so we have found the desired micro path. If π ˜ has more than one edge, we can proceed similarly by noting that for any motif W → Y → Z there are micro edges W → Y , Y ′ → Z with W ∈ W , Y , Y ′ ∈ Y , Z ∈ Z and either Y = Y ′ or there is a directed path from Y to Y ′ as Y is strongly connected. Concatenating all edges and paths found this way, we obtain the desired micro path. Finally, we conclude by observing that any cycle on co ( G , P ) must thus induce a cycle on the micro MG G . Indeed, a cycle on co ( G , P ) could be decomposed into directed paths π ˜ 1 and π ˜ 2 , one from say Y to Z and one from Z to Y , to which we then apply the argument above.□
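The last part of the argument suggests that coarsening a cyclic graph along its strongly connected components leaves no directed cycles between groups. The following sketch checks this on a small example with directed edges only; the function names and the example graph are hypothetical, and the strongly connected components are found by simple mutual reachability rather than by an optimized algorithm.

```python
def descendants(succ, start):
    """Nodes reachable from `start` by a directed path, including `start`."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in succ.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def scc_partition(nodes, edges):
    """Map each node to its strongly connected component (a frozenset)."""
    succ = {}
    for a, b in edges:
        succ.setdefault(a, set()).add(b)
    reach = {v: descendants(succ, v) for v in nodes}
    return {v: frozenset(u for u in nodes if u in reach[v] and v in reach[u])
            for v in nodes}

def coarsen_by(component_of, edges):
    """Directed edges between distinct components."""
    return {(component_of[a], component_of[b])
            for a, b in edges if component_of[a] != component_of[b]}

def is_acyclic(edges):
    """Kahn's algorithm: repeatedly remove nodes without incoming edges."""
    nodes = {v for e in edges for v in e}
    indegree = {v: 0 for v in nodes}
    for _, b in edges:
        indegree[b] += 1
    queue = [v for v in nodes if indegree[v] == 0]
    removed = 0
    while queue:
        v = queue.pop()
        removed += 1
        for a, b in edges:
            if a == v:
                indegree[b] -= 1
                if indegree[b] == 0:
                    queue.append(b)
    return removed == len(nodes)

# Hypothetical micro-graph with two feedback loops, A <-> B and C <-> D.
nodes = ["A", "B", "C", "D"]
edges = {("A", "B"), ("B", "A"), ("B", "C"), ("C", "D"), ("D", "C")}
components = scc_partition(nodes, edges)
coarse = coarsen_by(components, edges)
print(is_acyclic(edges), is_acyclic(coarse))
```

Here the micro-graph contains cycles, while the coarsened graph consists of the single edge { A , B } → { C , D } .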

Proof of Theorem 1

Write G = ( V , ℰ , ℬ , U ) and G acy = ( V ˆ , ℰ ˆ , ℬ ˆ , U ˆ ) and P = { X 1 , … , X r } . We have to show that co ( G acy , P ) and co ( G , P ) have the same directed, bidirected and undirected edges.

First, let X i → X j be a directed edge on co ( G , P ) so that there must exist a directed edge A → B ∈ ℰ with A ∈ X i and B ∈ X j . Hence A ∈ pa G ( B ) ⊂ pa G ( sc G ( B ) ) and we also see that A ∉ sc G ( B ) by part (a) of Lemma 1 as P was assumed acyclic with respect to G . So by definition of acyclification, we get A → B ∈ ℰ ˆ and thus the edge X i → X j is present on co ( G acy , P ) . On the other hand, if X i → X j is a directed edge on co ( G acy , P ) , then there must be an edge A → B ∈ ℰ ˆ with A ∈ X i and B ∈ X j . Therefore, A ∈ pa G ( sc G ( B ) ) \ sc G ( B ) , so there must be a node C ∈ sc G ( B ) and an edge A → C ∈ ℰ . By Lemma 1 (a), we obtain sc G ( B ) ⊂ X j so that there must be an edge X i → X j on co ( G , P ) .

We now turn to bidirected edges. If X i ↔ X j is a bidirected edge on co ( G , P ) , then there exists a bidirected edge A ↔ B ∈ ℬ with A ∈ X i and B ∈ X j . By definition of acyclification, we also have A ↔ B ∈ ℬ ˆ , so X i ↔ X j is a bidirected edge on co ( G acy , P ) as well. Conversely, assume that X i ↔ X j is a bidirected edge on co ( G acy , P ) , so that there exists a bidirected edge A ↔ B ∈ ℬ ˆ with A ∈ X i and B ∈ X j . By acyclicity and Lemma 1, A and B must lie in different strongly connected components of G . Therefore, there must be A ′ ∈ sc G ( A ) ⊂ X i and B ′ ∈ sc G ( B ) ⊂ X j connected by a bidirected edge A ′ ↔ B ′ ∈ ℬ . We conclude that X i ↔ X j must be a bidirected edge on co ( G , P ) .

Finally, we discuss undirected edges. Assume first that X i − X j is an undirected edge on co ( G , P ) , so that there must exist an undirected edge A − B ∈ U with A ∈ X i and B ∈ X j . Since P was assumed acyclic, we see that A ∉ sc G ( B ) by part (a) of Lemma 1, so that there must be an undirected edge A − B ∈ U ˆ . Thus, X i − X j must be an undirected edge of co ( G acy , P ) . Conversely, if X i − X j is assumed to be an undirected edge of co ( G acy , P ) , there must be an undirected edge A − B ∈ U ˆ with A ∈ X i , B ∈ X j . By definition of acyclification, we must have A − B ∈ U and thus, X i − X j must be an undirected edge of co ( G , P ) . This finishes the proof.□

Proof of Lemma 3

We will only discuss the case where co ( π ) (and thus π ) is a non-trivial walk.

If co ( π ) is σ -blocked by S , then there are three options.
1. If the first (or last) node of co ( π ) is in S , then T must contain the first (or last) node of π and thus σ -blocks π .
2. There is a collider W on co ( π ) with S ∩ des ( W ) = ∅ . We argue that in this case W must contain a collider W of the micro walk π . Indeed, if π passes through only one node of W , this follows directly. If π passes through more than one node, π must enter W at a micro node π ( i ) with an edge pointing to π ( i ) (either bidirected or directed) and leave W at a micro node π ( j ) , j > i , again with an edge pointing to π ( j ) (either bidirected or directed). Thus at some point in the path segment π ( i , j ) the directionality of the arrows must oppose each other, that is to say that path segment must contain a collider, say π ( l ) . Any descendant D of π ( l ) must lie in W itself or in a proper descendant of W , say D ∈ D as the directed path π ( l ) → ⋯ → D induces a coarse path W → ⋯ → D . As both W and its proper descendants do not lie in S , des ( π ( l ) ) ∩ T = ∅ .
3. There is a non-collider co ( π ) ( k ) on co ( π ) that is contained in S and an edge co ( π ) ( k ) → co ( π ) ( l ) , l ∈ { k − 1 , k + 1 } with sc co ( G , P ) ( co ( π ) ( k ) ) ≠ sc co ( G , P ) ( co ( π ) ( l ) ) . Therefore, on π , there must be an edge π ( i ) → π ( j ) , j ∈ { i − 1 , i + 1 } with π ( i ) ∈ co ( π ) ( k ) and π ( j ) ∈ co ( π ) ( l ) . Since π ( i ) has an outgoing edge, it is a non-collider, and by the contraposition of Lemma 2, sc G ( π ( i ) ) ≠ sc G ( π ( j ) ) . Therefore T , which contains π ( i ) , σ -blocks π .
We use the following counterexample to show that the converse of (i) is not true. Let G be given by W → Y 1 → Y 2 ← Y 3 → Z partitioned as W = { W } , Y = { Y 1 , Y 2 , Y 3 } , Z = { Z } . Then the path from W to Z is closed since it contains the collider Y 2 while the coarse path W → Y → Z is open.
If π is an arbitrary walk between Y ∈ Y and Z ∈ Z , then co ( π ) is a walk between Y and Z and thus σ -blocked by S . Hence, by assertion (i), π is σ -blocked by T .
Let G be given by W → Y 1 − Y 2 − Y 3 ← Z partitioned as W = { W } , Y = { Y 1 , Y 2 , Y 3 } , Z = { Z } . Then the path W → Y ← Z is σ -blocked while the micro path from W to Z is σ -open as it does not contain any colliders.□

Proof of Lemma 4

For part (i), it suffices to note that the only difference between the two types of separation lies in their definition for non-colliders, so only part (i)(3) of the proof of Lemma 3 slightly differs. When a coarse walk co ( π ) has a non-collider, say co ( π ) ( k ) , then π must have a non-collider π ( j ) ∈ co ( π ) ( k ) as well. So if co ( π ) ( k ) ∈ S , then π ( j ) ∈ T and π is m -blocked. Part (iii) follows directly from (i), and the counterexamples of parts (ii) and (iv) do not involve cycles and are equally valid for m -separation.□

B.2 Proofs of the Results in Section 4

Proof of Lemma 6

We prove Lemma 6 by induction over n = ∣ Y ∪ Z ∣ . For n = 2 , we must have Y = { Y } , Z = { Z } and the result follows by choosing the subset ℳ = ∅ . Now assume that the result has been shown for some arbitrary but fixed n ≥ 2 and let ∣ Y ∪ Z ∣ = n + 1 . W.l.o.g. we can assume that ∣ Z ∣ > 1 . Let Y ∈ Y and Z ∈ Z be arbitrary. Choosing ℳ = Z \ { Z } , by assumption we have Y ⊥ ⊥ Z ∣ W , Z \ { Z } . According to Lemma 5, we are done if we can show that also Y ⊥ ⊥ Z \ { Z } ∣ W . To prove this, we observe first that ∣ Y ∪ Z \ { Z } ∣ = n . Moreover, for any Y ∈ Y , Z ′ ∈ Z \ { Z } and ℳ ′ ⊂ Y ∪ Z \ { Z } \ { Y , Z ′ } , we have Y ⊥ ⊥ Z ′ ∣ W , ℳ ′ . Thus, the induction hypothesis implies that Y ⊥ ⊥ Z \ { Z } ∣ W and in particular Y ⊥ ⊥ Z \ { Z } ∣ W as desired.□

Proof of Lemma 7

We only have to prove that conditional pairwise independence implies conditional mutual independence. So let Y , Z be pairwise independent given T . Iterating part (ii) of Lemma 5, mutual independence follows if we can show that for any pair Y ∈ Y , Z ∈ Z and any subset ℳ ⊂ Y ∪ Z , we also have Y ⊥ ⊥ Z ∣ ℳ ∪ T . By σ -faithfulness of ( G , P X ) , any pair Y ∈ Y , Z ∈ Z is σ -separated by T , that is, all micro-paths leading from the group Y to the group Z are σ -blocked by T . If we can show that all micro-paths are still σ -blocked by ℳ ∪ T , the result follows by the σ -Markov property of ( G , P X ) . Since all such micro-paths are σ -blocked by T , we only need to make sure that none of these paths is opened again by adding ℳ to the separating set. Suppose there was such a path π starting at π ( 1 ) ∈ Y and ending at π ( r ) ∈ Z , that is σ -blocked by T but σ -unblocked by T ∪ ℳ . Let π ( k ) be the last node of π in Y and let π ( l ) , k < l be the first node of π in Z after k . Since T ⊂ X \ { Y , Z } , the subpath π ′ of π starting at π ′ ( 1 ) = π ( k ) and ending at π ′ ( s ) = π ( l ) must still be σ -blocked by T . On the other hand, since ℳ ∪ T σ -unblocks π , it must σ -unblock π ′ . Thus, there must be at least one collider on π ′ that has a descendant in ℳ . Let π ′ ( i ) be the last such collider on π ′ with descendant D ∈ ℳ . If π ′ ( i ) = D , then π ′ ( i ) ∈ Y ∪ Z contradicting the fact that π ′ does not have inner nodes in Y ∪ Z . Therefore, D must be a proper descendant of π ′ ( i ) . As D ∈ ℳ , we have in particular D ∈ Y ∪ Z , and we will assume w.l.o.g. D ∈ Y . Let π ″ = π ′ ( i ) → ⋯ → D be the descending micro-path and assume that π ″ ( j ) is the first node that belongs to ℳ . But then, the concatenation of π ″ ( j ) ← ⋯ ← π ′ ( i ) and π ′ ( i , s ) leads from Y to Z and is σ -unblocked by T contradicting our assumption that all such micro-paths must be σ -blocked by T .□

Proof of Theorem 2

We assume that P X does not have the strong σ -Markov property and show that this leads to a contradiction. Since ( co ( G , P ) , P X ) does not have the σ -Markov property, there must be vectors Y , Z that are σ -separated by a set of vectors S but not mutually conditionally independent given S . By Lemma 3, for every pair Y ∈ Y , Z ∈ Z , all paths on G between Y and Z are σ -blocked by T . On the other hand, using Lemma 6, we see that there must be Y ′ ∈ Y , Z ′ ∈ Z and a subset ℳ ⊂ Y ∪ Z such that

Y ′ ⊥̸ ⊥ Z ′ ∣ ℳ , T .

By the σ -Markov property on the micro DMG G this means that there must be a path π between Y ′ and Z ′ on G that is not σ -blocked by ℳ , T , but is σ -blocked by T . Therefore, Y ∉ S and Z ∉ S as otherwise we would have Y ′ ∈ T or Z ′ ∈ T . Moreover, T cannot contain any non-collider π ( l ) of π pointing to a neighbor π ( l ± 1 ) in a different strongly connected component, π must have at least one collider and any collider on π must have a descendant in ℳ . Our goal is now to construct a path π ˜ on co ( G , P ) that is not σ -blocked by S , resulting in a contradiction. Consider first the coarsened path co ( π ) of π . S cannot contain any non-colliders of co ( π ) pointing to a neighbor in a different strongly connected component of co ( G , P ) by Lemma 2. If co ( π ) does not have any colliders, it must be σ -unblocked by S and we are done. Therefore, assume that co ( π ) does contain colliders. If all such colliders had a descendant in S , again the path would be σ -unblocked by S and the desired contradiction would be obtained. Thus, we can assume that at least one collider on co ( π ) does not have any descendants in S . Any collider C of co ( π ) must contain a micro-collider C of the micro-path π and by the considerations above there must be a directed micro-path C → ⋯ → M for some M ∈ ℳ ⊂ Y ∪ Z . Coarsening this micro-path we see that for any collider C of co ( π ) there must be a directed macro-path from C to Y or to Z . Writing out co ( π ) = ( co ( π ) ( 1 ) , … , co ( π ) ( r ) ) where co ( π ) ( 1 ) = Y and co ( π ) ( r ) = Z , we define the sets

U = { k ∣ co ( π ) ( k ) ∈ col ( co ( π ) ) , S ∩ des ( co ( π ) ( k ) ) = ∅ , and Z ∈ des ( co ( π ) ( k ) ) }

and

U ′ = { k ∣ co ( π ) ( k ) ∈ col ( co ( π ) ) , S ∩ des ( co ( π ) ( k ) ) = ∅ , and Y ∈ des ( co ( π ) ( k ) ) } .

By the aforementioned considerations, at least one of these sets must be non-empty. If U is non-empty, let k ′ be its minimum, so that co ( π ) ( k ′ ) is the collider closest to Y . Thus, the subpath co ( π ) ( 1 ) , … , co ( π ) ( k ′ ) must be right-directed, i.e. co ( π ) ( 1 ) → ⋯ → co ( π ) ( k ′ ) . Since k ′ ∈ U , we can join it with a directed path from co ( π ) ( k ′ ) to Z yielding a path Y → ⋯ → Z . This path cannot be σ -blocked by S as all of its nodes are either non-colliders of co ( π ) or descendants of co ( π ) ( k ′ ) . Thus, we have found the desired path. If U ′ is non-empty, the argument is analogous with k ′ ′ ≔ max U ′ instead of k ′ and left-directed instead of right-directed paths.□

Proof of Theorem 3

Up to a few subtleties, the proof is similar to that of Theorem 2. We assume that ( co ( G , P ) , P X ) does not have the m -Markov property and show that this leads to a contradiction. Since ( co ( G , P ) , P X ) does not have the m -Markov property, there must be vectors Y , Z that are m -separated by a set of vectors S but not mutually conditionally independent given S . Hence, by Lemma 4, for every pair Y ∈ Y , Z ∈ Z , all paths on G between Y and Z are m -blocked by T . On the other hand, using Lemma 6, we see that there must be Y ′ ∈ Y , Z ′ ∈ Z and a subset ℳ ⊂ Y ∪ Z such that

Y ′ ⊥̸ ⊥ Z ′ ∣ ℳ , T .

By the m -Markov property on G this means that there must be a path π between Y ′ and Z ′ on G that is not m -blocked by ℳ , T , but is m -blocked by T . Therefore, T cannot contain any non-colliders of π , π must have at least one collider and any collider on π must have a descendant in ℳ . Our goal is now to construct a path π ˜ on co ( G , P ) that is not m -blocked by S , resulting in a contradiction. Consider first the coarsened path co ( π ) of π . S cannot contain any non-colliders of co ( π ) as otherwise T would contain a non-collider of π . So if co ( π ) does not have any colliders, it must be m -open given S and we are done. So assume that co ( π ) does contain colliders. If all such colliders had a descendant in S , again the path would be m -opened by S and the desired contradiction would be obtained. Thus, assume that at least one collider on co ( π ) does not have any descendants in S . Any collider C of co ( π ) must contain a micro-collider C of the micro-path π and by the considerations above there must be a directed micro-path C → ⋯ → M for some M ∈ ℳ ⊂ Y ∪ Z . Coarsening this micro-path we see that for any collider C of co ( π ) there must be a directed macro-path from C to Y or to Z . Writing out co ( π ) = ( co ( π ) ( 1 ) , … , co ( π ) ( r ) ) where co ( π ) ( 1 ) = Y and co ( π ) ( r ) = Z , we define the sets

U = { k ∣ co ( π ) ( k ) ∈ col ( co ( π ) ) , S ∩ des ( co ( π ) ( k ) ) = ∅ , and Z ∈ des ( co ( π ) ( k ) ) }

and

U ′ = { k ∣ co ( π ) ( k ) ∈ col ( co ( π ) ) , S ∩ des ( co ( π ) ( k ) ) = ∅ , and Y ∈ des ( co ( π ) ( k ) ) } .

By the aforementioned considerations, at least one of these sets must be non-empty. If U is non-empty, let k ′ be its minimum, so that co ( π ) ( k ′ ) is the collider closest to Y . Thus, the subpath co ( π ) ( 1 ) , … , co ( π ) ( k ′ ) must be right-directed, i.e. co ( π ) ( 1 ) → ⋯ → co ( π ) ( k ′ ) . Since k ′ ∈ U , we can join it with a directed path from co ( π ) ( k ′ ) to Z yielding a path Y → ⋯ → Z . This path cannot be m -blocked by S as all of its nodes are either non-colliders of co ( π ) or descendants of co ( π ) ( k ′ ) . Thus, we have found the desired path. If U ′ is non-empty, the argument is analogous with k ′ ′ ≔ max U ′ instead of k ′ and left-directed instead of right-directed paths.□
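The graphical ingredient of this argument, namely that separation of groups on the coarse graph forces separation of their members on the micro graph, can be checked on small examples. The following Python sketch is an illustration only and not part of the paper. It assumes a micro DAG without bidirected edges, where m-separation reduces to d-separation, and it assumes that co ( G , P ) contains an edge A → B whenever some micro edge runs from a node of A to a node of B. The helper names `ancestors`, `d_separated` and `coarsen` are hypothetical, and d-separation is tested with the standard moralization criterion rather than by enumerating paths.

```python
from itertools import combinations


def ancestors(parents, nodes):
    """Return `nodes` together with all of their ancestors."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        v = stack.pop()
        for p in parents.get(v, ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def d_separated(edges, xs, ys, zs):
    """True if xs and ys are d-separated by zs in the DAG given by `edges`."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    parents = {}
    for a, b in edges:
        parents.setdefault(b, set()).add(a)
    keep = ancestors(parents, xs | ys | zs)
    # Moralize the ancestral subgraph: link children to parents, marry co-parents.
    nbrs = {v: set() for v in keep}
    for child in keep:
        ps = parents.get(child, set())
        for p in ps:
            nbrs[p].add(child)
            nbrs[child].add(p)
        for p, q in combinations(ps, 2):
            nbrs[p].add(q)
            nbrs[q].add(p)
    # Look for an undirected path from xs to ys that avoids zs.
    seen, stack = set(xs - zs), list(xs - zs)
    while stack:
        v = stack.pop()
        if v in ys:
            return False
        for w in nbrs[v] - zs:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return True


def coarsen(edges, partition):
    """Assumed coarsening rule: A -> B if some micro edge leads from A to B."""
    group_of = {v: name for name, members in partition.items() for v in members}
    return {(group_of[a], group_of[b]) for a, b in edges if group_of[a] != group_of[b]}


# Toy example (hypothetical): two parallel chains Y_i -> S_i -> Z_i.
micro_edges = [("Y1", "S1"), ("S1", "Z1"), ("Y2", "S2"), ("S2", "Z2")]
partition = {"Y": {"Y1", "Y2"}, "S": {"S1", "S2"}, "Z": {"Z1", "Z2"}}
macro_edges = coarsen(micro_edges, partition)
macro_sep = d_separated(macro_edges, {"Y"}, {"Z"}, {"S"})
micro_sep = d_separated(micro_edges, partition["Y"], partition["Z"], partition["S"])
print(macro_sep, micro_sep)  # prints: True True
```

In this example the groups Y and Z are separated by S on the coarse graph, and the union of their members is separated by the union of the members of S on the micro graph, as the argument requires.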

B.3 Proofs of the Results in Section 5

Proof of Lemma 8

Assume first that ( co ( G , P ) , P X ) is σ -faithful and let Y and Z be σ -connected by a set S ⊂ P . Therefore, by assumption, Y ⊥̸ ⊥ Z ∣ T . By Lemma 7 and our assumptions, conditional pairwise independence of groups implies conditional mutual independence. Applying the logical contraposition, we thus obtain Y ⊥̸ ⊥ p w Z ∣ T . Thus, there must be Y ∈ Y and Z ∈ Z such that Y ⊥̸ ⊥ Z ∣ T . By σ -Markovianity of ( G , P X ) , Y and Z must be σ -connected by T .

Conversely, assume that whenever Y and Z are σ -connected by a set S ⊂ P , there exist Y ∈ Y and Z ∈ Z that are σ -connected by T = ⋃ W ∈ S W . By σ -faithfulness of ( G , P X ) it follows that Y ⊥̸ ⊥ Z ∣ T for these members and thus also Y ⊥̸ ⊥ Z ∣ T for the groups. Thus, ( co ( G , P ) , P X ) is σ -faithful.□

Proof of Theorem 4

Let W , Y ∈ P and assume that Π = ( Π ( 1 ) , e 1 , Π ( 2 ) , e 2 , … , e n − 1 , Π ( n ) ) is a path on co ( G , P ) with Π ( 1 ) = W , Π ( n ) = Y that is σ -unblocked by S ⊂ P . Let T = ⋃ Z ∈ S Z . According to Corollary 1, we need to construct a path π that coarsens to Π and that is σ -unblocked by T . Consider the edge e 1 and choose a micro-edge e 1 ∈ mic ( e 1 ) whose nodes we will immediately denote by π ( 1 ) ∈ Π ( 1 ) and π ( 2 ) ∈ Π ( 2 ) , i.e. e 1 = ( π ( 1 ) , π ( 2 ) ) . For i = 2 , … , n − 1 , we proceed inductively as follows. Assume that we have already defined a path ( π ( 1 ) , e 1 , … , e s − 1 , π ( s ) ) such that π ( s − 1 ) ∈ Π ( i − 1 ) , π ( s ) ∈ Π ( i ) and e s − 1 ∈ mic ( e i − 1 ) . By assumption (iii) of the theorem, we can find a node Y ∈ bd e i ( Π ( i ) ) ⊂ Π ( i ) such that sc ( π ( s ) ) = sc ( Y ) . If e i − 1 is left-directed and e i is left-directed or bidirected, choose a left-directed path ξ i = ( ξ i ( 1 ) = π ( s ) , e ˜ 1 , … , e ˜ m , ξ i ( m ) = Y ) ; in all other cases, choose a right-directed path ξ i = ( ξ i ( 1 ) = π ( s ) , e ˜ 1 , … , e ˜ m , ξ i ( m ) = Y ) . Note that by condition (ii), all nodes of ξ i must remain in Π ( i ) . Concatenate ξ i with π , i.e. set e s − 1 + j = e ˜ j , π ( s − 1 + j ) = ξ i ( j ) , j = 1 , … , m . Finally, since Y ∈ bd e i ( Π ( i ) ) , we can choose an edge e s + m ∈ mic ( e i ) connecting Y = π ( s + m ) to some π ( s + m + 1 ) ∈ Π ( i + 1 ) . If i + 1 = n , we have finished the construction of our micro-path π and by construction co ( π ) = Π . We need to show now that T σ -unblocks π . There are different cases to check.

Assume that π ( 1 ) ∈ T (respectively π ( n ) ∈ T ). In this case, we must have Π ( 1 ) ∈ S (or Π ( n ) ∈ S ) which would σ -block Π , contrary to our assumption. Thus, π ( 1 ) , π ( n ) ∉ T .
Assume that π ( k ) is a collider on π for some 1 < k < len ( π ) and that π ( k ) ∈ Π ( i ) .
1. The first case to discuss here is ∣ Π ( i ) ∣ = 1 , i.e. Π ( i ) = { π ( k ) } . In this case, either Π ( i ) ∈ S in which case π ( k ) ∈ T or Π ( i ) must have a proper descendant S ∈ S . Then similar to the aforementioned construction of π , using (ii) and (iii), we can also construct a descending path π ( k ) → ⋯ → S for some S ∈ S . Thus, π ( k ) has a descendant in T and again the collider π ( k ) is σ -unblocked.
2. The second case is ∣ Π ( i ) ∣ > 1 . Because of our choice of the internal path ξ i as directed, π ( k ) ∈ bd e i or π ( k ) ∈ bd e i − 1 . We will only discuss the first case π ( k ) ∈ bd e i as the second one is completely analogous. If π ( k ) ∈ bd e i , then π ( k + 1 ) ∈ Π ( i + 1 ) and the edge e k = ( π ( k ) , π ( k + 1 ) ) must be left- or bidirected as π ( k ) is a collider. Again because of the way we chose ξ i , the unique edge on π in mic ( e i − 1 ) must be right- or bidirected. Thus, both e i − 1 and e i must have an arrowhead towards Π ( i ) , i.e. Π ( i ) is a collider on Π . Thus, Π ( i ) must have a descendant in S . Suppose first that this descendant is Π ( i ) itself. Then the collider π ( k ) is σ -unblocked as it is contained in T . The other nodes on π that are part of Π ( i ) are non-colliders but do not point to neighbors on π that are part of a different strongly connected component. Thus, by condition (3) in the definition of σ -separation (Definition 2), including them in T does not σ -block π . Next, suppose that the descendant of Π ( i ) in S is a proper descendant. Once again, by using (ii) and (iii), we can construct a descending path π ( k ) → ⋯ → S for some S ∈ S so that the collider π ( k ) is unblocked.
Assume that π ( k ) is a non-collider on π for some 1 < k < len ( π ) and that π ( k ) ∈ Π ( i ) . As S σ -unblocks Π , we must be in one of the following situations. Either (I) Π ( i ) ∉ S or (II) Π ( i ) ∈ S but if e i − 1 = Π ( i − 1 ) ← Π ( i ) or e i = Π ( i ) → Π ( i + 1 ) then sc ( Π ( i − 1 ) ) = sc ( Π ( i ) ) , respectively sc ( Π ( i ) ) = sc ( Π ( i + 1 ) ) .
1. In case (I), π ( k ) ∉ T ; thus the non-collider π ( k ) is σ -unblocked by T .
2. In case (II), where Π ( i ) ∈ S , suppose first that e i = Π ( i ) → Π ( i + 1 ) has a tail at Π ( i ) . As stated above, the fact that S σ -unblocks Π means that we must have sc ( Π ( i + 1 ) ) = sc ( Π ( i ) ) . If π ( k + 1 ) is also an element of Π ( i ) , then by construction of π , it has a right-directed edge π ( k ) → π ( k + 1 ) and sc ( π ( k ) ) = sc ( π ( k + 1 ) ) . Thus, even though π ( k ) ∈ T , it is still σ -unblocked by T . Thus, we can assume that π ( k + 1 ) ∈ Π ( i + 1 ) which means in particular that π ( k ) ∈ bd e i ( Π ( i ) ) and π ( k + 1 ) ∈ bd e i ( Π ( i + 1 ) ) . As sc ( Π ( i + 1 ) ) = sc ( Π ( i ) ) , there exists a directed path Γ on co ( G , P ) starting at Π ( i + 1 ) and ending at Π ( i ) . As with the construction of π , because of the boundary connection condition (iii), we can once again construct a micro-path γ from π ( k + 1 ) to π ( k ) so that sc ( π ( k ) ) = sc ( π ( k + 1 ) ) . So, once again even though π ( k ) ∈ T , it is still σ -unblocked by T . The final case is that e i does not have a tail at Π ( i ) which means that e i − 1 = Π ( i − 1 ) ← Π ( i ) must have one. The argument that sc ( π ( k ) ) = sc ( π ( k − 1 ) ) is then completely parallel to the discussion for tailed e i , taking into account that the path segment ξ i of π that is internal to Π ( i ) is left-directed (or trivial) by construction. Therefore, also in this case, even though π ( k ) ∈ T , it is still σ -unblocked by T .

We have shown above that every collider of π has a descendant in T and that every non-collider is either not part of T or points exclusively to neighbors in the same strongly connected component. In summary, T σ -unblocks π , so σ -faithfulness is proven.
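The three conditions collected in this summary can be turned into a small check of whether a given path is σ-open. The Python sketch below is an illustration under stated assumptions and not the authors' code: it follows the usual formulation of σ-separation (an endpoint in the conditioning set blocks, a collider without a descendant in the conditioning set blocks, and a non-collider in the conditioning set blocks only if it points to a path neighbor outside its own strongly connected component). The function names `reachable` and `sigma_open` and the encoding of edge marks as the strings `'->'`, `'<-'` and `'<->'` are hypothetical choices.

```python
def reachable(succ, start):
    """All nodes reachable from `start` along directed edges, including `start`."""
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for w in succ.get(v, ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def sigma_open(path, marks, directed, cond):
    """
    path:     list of nodes v_1, ..., v_n
    marks:    marks[k] is the edge between path[k] and path[k + 1],
              one of '->', '<-', '<->' (read from left to right)
    directed: set of all directed edges (a, b) of the graph
    cond:     conditioning set
    """
    cond = set(cond)
    succ = {}
    for a, b in directed:
        succ.setdefault(a, set()).add(b)
    desc = {v: reachable(succ, v) for v in path}

    def same_sc(v, w):
        # v and w lie in the same strongly connected component.
        return w in desc[v] and v in desc[w]

    if path[0] in cond or path[-1] in cond:
        return False
    for k in range(1, len(path) - 1):
        v, left, right = path[k], marks[k - 1], marks[k]
        head_left = left in ("->", "<->")    # arrowhead at v on the left edge
        head_right = right in ("<-", "<->")  # arrowhead at v on the right edge
        if head_left and head_right:
            # Collider: needs a descendant (possibly itself) in cond.
            if not (desc[v] & cond):
                return False
        elif v in cond:
            # Non-collider in cond: blocks only when pointing out of its component.
            if left == "<-" and not same_sc(v, path[k - 1]):
                return False
            if right == "->" and not same_sc(v, path[k + 1]):
                return False
    return True


# x -> v -> y with an additional edge y -> v, so v and y share a component.
cyclic = {("x", "v"), ("v", "y"), ("y", "v")}
acyclic = {("x", "v"), ("v", "y")}
open_in_cycle = sigma_open(["x", "v", "y"], ["->", "->"], cyclic, {"v"})
open_in_dag = sigma_open(["x", "v", "y"], ["->", "->"], acyclic, {"v"})
print(open_in_cycle, open_in_dag)  # prints: True False
```

The example shows the feature exploited in the case distinction above: conditioning on a non-collider does not σ-block a path as long as the non-collider only points to neighbors within its own strongly connected component.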

To show that co ( G , P ) is acyclic, assume that there exists a right-directed cycle Π with Π ( 1 ) = Π ( len ( Π ) ) . W.l.o.g., we can assume that Π is irreducible. Let P ′ be the partition of G into strongly connected components and ℋ = co ( G , P ′ ) , which is always acyclic. Then condition (iii) implies that Π induces a right-directed walk Γ 0 on ℋ with no repeating middle vertices such that Γ 0 ( 1 ) , Γ 0 ( len ( Γ 0 ) ) ⊂ Π ( 1 ) . If Γ 0 ( 1 ) = Γ 0 ( len ( Γ 0 ) ) , we have found a cycle in ℋ and thus a contradiction. If not, we can again use condition (iii) to construct a right-directed walk Γ ′ with Γ ′ ( 1 ) = Γ 0 ( len ( Γ 0 ) ) and Γ ′ ( len ( Γ ′ ) ) ⊂ Π ( 1 ) . Concatenating Γ 0 and Γ ′ to Γ 1 = Γ 0 ∘ Γ ′ , we have found two walks now that start in the same strongly connected component Γ 0 ( 1 ) and end in P ′ ∩ 2^{Π ( 1 )} , the set of strongly connected components contained in Π ( 1 ) . Again if Γ 1 ( len ( Γ 1 ) ) = Γ ( 1 ) , we are done, otherwise we continue to construct walks Γ k on ℋ in this manner. Since the set P ′ ∩ 2^{Π ( 1 )} is finite, at some point, the condition Γ k ( len ( Γ k ) ) = Γ ( k ) must be met and we arrive at a contradiction.□
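The fact used here, that coarsening by strongly connected components always yields an acyclic graph while coarsening by an arbitrary partition need not, can be illustrated with a short Python sketch. It is not taken from the paper: the coarsening rule (a macro edge A → B whenever some micro edge leads from A to B) is an assumption about the definition of co ( G , P ), bidirected edges are ignored, and the function names are hypothetical. Strongly connected components are computed with Kosaraju's algorithm and acyclicity is tested with Kahn's algorithm.

```python
def strongly_connected_components(nodes, edges):
    """Kosaraju's algorithm; maps every node to a representative of its component."""
    succ = {v: [] for v in nodes}
    pred = {v: [] for v in nodes}
    for a, b in edges:
        succ[a].append(b)
        pred[b].append(a)
    order, seen = [], set()
    for root in nodes:  # first pass: record nodes by finishing time
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(succ[root]))]
        while stack:
            v, it = stack[-1]
            nxt = next((w for w in it if w not in seen), None)
            if nxt is None:
                order.append(v)
                stack.pop()
            else:
                seen.add(nxt)
                stack.append((nxt, iter(succ[nxt])))
    comp = {}
    for root in reversed(order):  # second pass: explore the reversed graph
        if root in comp:
            continue
        comp[root] = root
        stack = [root]
        while stack:
            v = stack.pop()
            for w in pred[v]:
                if w not in comp:
                    comp[w] = root
                    stack.append(w)
    return comp


def coarsen(edges, group_of):
    """Assumed coarsening rule: A -> B if some micro edge leads from A to B."""
    return {(group_of[a], group_of[b]) for a, b in edges if group_of[a] != group_of[b]}


def is_acyclic(nodes, edges):
    """Kahn's algorithm: a directed graph is acyclic iff all nodes can be removed."""
    indeg = {v: 0 for v in nodes}
    for _, b in edges:
        indeg[b] += 1
    queue = [v for v in indeg if indeg[v] == 0]
    removed = 0
    while queue:
        v = queue.pop()
        removed += 1
        for a, b in edges:
            if a == v:
                indeg[b] -= 1
                if indeg[b] == 0:
                    queue.append(b)
    return removed == len(indeg)


# Micro graph with one cycle a <-> b, followed by b -> c -> d.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "d")]

sc_of = strongly_connected_components(nodes, edges)
condensation = coarsen(edges, sc_of)
sc_acyclic = is_acyclic(set(sc_of.values()), condensation)

other = {"a": "G1", "d": "G1", "b": "G2", "c": "G2"}  # arbitrary partition
other_acyclic = is_acyclic(set(other.values()), coarsen(edges, other))
print(sc_acyclic, other_acyclic)  # prints: True False
```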

Proof of Theorem 5

Let W , Y ∈ P and assume that Π = ( Π ( 1 ) , e 1 , Π ( 2 ) , e 2 , … , e n − 1 , Π ( n ) ) is a path on co ( G , P ) with Π ( 1 ) = W , Π ( n ) = Y that is σ -unblocked by S ⊂ P . Let T = ⋃ Z ∈ S Z . According to Corollary 1, we need to construct a path π that coarsens to Π and that is σ -unblocked by T . We proceed as in the proof of Lemma 4 and for non-colliders ( e i , Π ( i + 1 ) , e i + 1 ) , the construction of π completely parallels that proof. For a collider ( e i , Π ( i + 1 ) , e i + 1 ) , we simply can choose the micro-path ξ i in the argument to be a micro-collider ( ξ i ( 1 ) , e 1 i , ξ i ( 2 ) , e 2 i , ξ i ( 3 ) ) with ξ i ( 1 ) ∈ Π ( i − 1 ) , ξ i ( 2 ) ∈ Π ( i ) , ξ i ( 3 ) ∈ Π ( i + 1 ) . Showing that T unblocks π again completely parallels the proof of Lemma 4, noting that if π ( k ) is a non-collider on π it must lie in a group Π ( i ) that is a non-collider of Π . If π ( k ) is a collider on π , then π ( k ) ∈ Π ( i ) and Π ( i ) must be a collider on Π . Since S σ -unblocks Π ( i ) , we must either have Π ( i ) ∈ S or Π ( i ) must have a proper descendant in S , say W . In the first case, it follows that π ( k ) ∈ T . In the second case, consider the descending path Γ = ( Π ( i ) , e ˜ 1 , … , e ˜ s , W ) . We can again construct a descending micro-path γ starting at some γ ( 1 ) ∈ bd e ˜ 1 ( Π ( i ) ) to some W ∈ W . Since the pair e i , e ˜ 1 is an (almost) mediator, we can use condition (iii-a) to find a directed micro-path from π ( k ) to γ ( 1 ) and hence to W . Thus, π ( k ) has a descendant in T as desired.□

References

[1] Pearl J. Causality: Models, Reasoning and Inference. 2nd ed. USA: Cambridge University Press; 2009. doi:10.1017/CBO9780511803161.

[2] Spirtes P, Glymour C, Scheines R. Causation, prediction, and search. vol. 81 of Lecture Notes in Statistics. New York, NY: Springer; 1993. http://link.springer.com/10.1007/978-1-4612-2748-9. doi:10.1007/978-1-4612-2748-9.

[3] Spirtes P. An anytime algorithm for causal inference. In: International Workshop on Artificial Intelligence and Statistics. PMLR; 2001. p. 278–85. https://proceedings.mlr.press/r3/spirtes01a.html.

[4] Peters J, Janzing D, Schölkopf B. Elements of causal inference - foundations and learning algorithms. Adaptive Computation and Machine Learning Series. Cambridge, MA, USA: The MIT Press; 2017.

[5] Ramsey J, Spirtes P, Zhang J. Adjacency-faithfulness and conservative causal inference. In: Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence. UAI’06. Arlington, Virginia, USA: AUAI Press; 2006. p. 401–8.

[6] Shimizu S, Hoyer PO, Hyvärinen A, Kerminen A. A linear non-Gaussian acyclic model for causal discovery. J Mach Learn Res. 2006 Dec;7:2003–30.

[7] Runge J, Bathiany S, Bollt E, Camps-Valls G, Coumou D, Deyle E, et al. Inferring causation from time series in earth system sciences. Nature Commun. 2019 Jun;10(1):2553. https://www.nature.com/articles/s41467-019-10105-3. doi:10.1038/s41467-019-10105-3.

[8] Semedo JD, Gokcen E, Machens CK, Kohn A, Yu BM. Statistical methods for dissecting interactions between brain areas. Current Opinion Neurobiol. 2020 Dec;65:59–69. https://www.sciencedirect.com/science/article/pii/S0959438820301367. doi:10.1016/j.conb.2020.09.009.

[9] Perich MG, Rajan K. Rethinking brain-wide interactions through multi-region “Network of Networks” models. Current Opinion Neurobiol. 2020 Dec;65:146–51. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7822595/. doi:10.1016/j.conb.2020.11.003.

[10] Runge J, Nowack P, Kretschmer M, Flaxman S, Sejdinovic D. Detecting and quantifying causal associations in large nonlinear time series datasets. Sci Adv. 2019;5(11):eaau4996. https://www.science.org/doi/abs/10.1126/sciadv.aau4996. doi:10.1126/sciadv.aau4996.

[11] Runge J, Petoukhov V, Donges JF, Hlinka J, Jajcay N, Vejmelka M, et al. Identifying causal gateways and mediators in complex spatio-temporal systems. Nature Commun. 2015;6(1):1–10. doi:10.1038/ncomms9502.

[12] Wang C. Three-ocean interactions and climate variability: a review and perspective. Climate Dynamics. 2019 Oct;53(7):5119–36. doi:10.1007/s00382-019-04930-x.

[13] Costanza R, Kubiszewski I, Giovannini E, Lovins H, McGlade J, Pickett KE, et al. Development: time to leave GDP behind. Nature. 2014 Jan;505(7483):283–5. https://www.nature.com/articles/505283a. doi:10.1038/505283a.

[14] Timmermann A, An SI, Kug JS, Jin FF, Cai W, Capotondi A, et al. El Niño–Southern oscillation complexity. Nature. 2018 Jul;559(7715):535–45. https://www.nature.com/articles/s41586-018-0252-6. doi:10.1038/s41586-018-0252-6.

[15] Rubenstein PK, Weichwald S, Bongers S, Mooij JM, Janzing D, Grosse-Wentrup M, et al. Causal consistency of structural equation models. In: Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence (UAI); 2017. p. ID 11. http://auai.org/uai2017/proceedings/papers/11.pdf.

[16] Zhang W, Wang Z, Stuecker MF, Turner AG, Jin FF, Geng X. Impact of ENSO longitudinal position on teleconnections to the NAO. Climate Dynamics. 2019 Jan;52(1):257–74. doi:10.1007/s00382-018-4135-1.

[17] Wahl J, Ninad U, Runge J. Vector causal inference between two groups of variables. Proc AAAI Conference Artif Intelligence. 2023 Jun;37(10):12305–12. https://ojs.aaai.org/index.php/AAAI/article/view/26450. doi:10.1609/aaai.v37i10.26450.

[18] Parviainen P, Kaski S. Learning structures of Bayesian networks for variable groups. Int J Approx Reasoning. 2017;88:110–27. https://www.sciencedirect.com/science/article/pii/S0888613X17303134. doi:10.1016/j.ijar.2017.05.006.

[19] Shah RD, Peters J. The hardness of conditional independence testing and the generalised covariance measure. Ann Stat. 2020;48(3):1514–38. doi:10.1214/19-AOS1857.

[20] Josse J, Holmes SP. Measuring multivariate association and beyond. Stat Surveys. 2016;10:132–67. doi:10.1214/16-SS116.

[21] Chatterjee S. A survey of some recent developments in measures of association. 2022. https://arxiv.org/abs/2211.04702.

[22] Hochsprung T, Wahl J, Gerhardus A, Ninad U, Runge J. Increasing effect sizes of pairwise conditional independence tests between random vectors. In: Evans RJ, Shpitser I, editors. Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence. vol. 216 of Proceedings of Machine Learning Research. PMLR; 2023. p. 879–89. https://proceedings.mlr.press/v216/hochsprung23a.html.

[23] Anand TV, Ribeiro AH, Tian J, Bareinboim E. Causal effect identification in cluster DAGs. Proceedings of the AAAI Conference on Artificial Intelligence. 2023 Jun;37(10):12172–9. https://ojs.aaai.org/index.php/AAAI/article/view/26435. doi:10.1609/aaai.v37i10.26435.

[24] Weinberger N. Faithfulness, coordination and causal coincidences. Erkenntnis. 2018 Apr;83(2):113–33. doi:10.1007/s10670-017-9882-6.

[25] Marx A, Gretton A, Mooij JM. A weaker faithfulness assumption based on triple interactions. In: Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence. vol. 161 of Proceedings of Machine Learning Research. PMLR; 2021. p. 451–60. https://proceedings.mlr.press/v161/marx21a.html.

[26] Meek C. Causal inference and causal explanation with background knowledge. In: Proceedings of the Eleventh conference on Uncertainty in Artificial Intelligence. UAI’95. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.; 1995. p. 403–10. https://dl.acm.org/doi/10.5555/2074158.2074204.

[27] Zscheischler J, Janzing D, Zhang K. Testing whether linear equations are causal: a free probability theory approach. Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, UAI 2011. 2012 Feb:839–48. https://dl.acm.org/doi/abs/10.5555/3020548.3020645.

[28] Runge J, Gerhardus A, Varando G, Eyring V, Camps-Valls G. Causal inference for time series. Nature Reviews Earth Environ. 2023;4:487–505. https://www.nature.com/articles/s43017-023-00431-y. doi:10.1038/s43017-023-00431-y.

[29] Glymour C, Zhang K, Spirtes P. Review of causal discovery methods based on graphical models. Frontiers Genetics. 2019;10:524. https://www.frontiersin.org/articles/10.3389/fgene.2019.00524. doi:10.3389/fgene.2019.00524.

[30] Zhang J. Causal reasoning with ancestral graphs. J Machine Learn Res. 2008;9(47):1437–74. http://jmlr.org/papers/v9/zhang08a.html.

[31] Forré P, Mooij JM. Markov properties for graphical models with cycles and latent variables. 2017. https://arxiv.org/abs/1710.08775.

[32] Mooij JM, Claassen T. Constraint-based causal discovery using partial ancestral graphs in the presence of cycles. In: Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI). PMLR; 2020. p. 1159–68. https://proceedings.mlr.press/v124/m-mooij20a.html.

[33] Bongers S, Forré P, Peters J, Mooij JM. Foundations of structural causal models with cycles and latent variables. Ann Stat. 2021;49(5):2885–915. doi:10.1214/21-AOS2064.

[34] McConnell RM, De Montgolfier F. Linear-time modular decomposition of directed graphs. Discrete Appl Math. 2005;145(2):198–209. doi:10.1016/j.dam.2004.02.017.

[35] Dawid AP. Conditional independence in statistical theory. J R Stat Soc Ser B (Methodological). 1979;41(1):1–31. http://www.jstor.org/stable/2984718. doi:10.1111/j.2517-6161.1979.tb01052.x.

[36] Runge J. Discovering contemporaneous and lagged causal relations in autocorrelated nonlinear time series datasets. In: Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI). vol. 124 of Proceedings of Machine Learning Research. PMLR; 2020. p. 1388–97. https://proceedings.mlr.press/v124/runge20a.html.

[37] Gerhardus A, Runge J. High-recall causal discovery for autocorrelated time series with latent confounders. In: Advances in neural information processing systems. vol. 33. Curran Associates, Inc.; 2020. p. 12615–25. doi:10.5194/egusphere-egu21-8259.

[38] Granger CWJ. Investigating causal relations by econometric models and cross-spectral methods. Econometrica. 1969;37(3):424–38. https://www.jstor.org/stable/1912791. doi:10.2307/1912791.

[39] Chalupka K, Eberhardt F, Perona P. Multi-level cause-effect systems. In: Proceedings of the 19th International Conference on Artificial Intelligence and Statistics. vol. 51 of Proceedings of Machine Learning Research. Cadiz, Spain: PMLR; 2016. p. 361–9. https://proceedings.mlr.press/v51/chalupka16.html.

[40] Chalupka K, Eberhardt F, Perona P. Causal feature learning: an overview. Behaviormetrika. 2017;44(1):137–64. doi:10.1007/s41237-016-0008-2.

[41] Kalisch M, Bühlmann P. Estimating high-dimensional directed acyclic graphs with the PC-algorithm. J Machine Learn Res. 2007;8(22):613–36. http://jmlr.org/papers/v8/kalisch07a.html.

Received: 2023-06-12

Revised: 2023-11-21

Accepted: 2024-05-10

Published Online: 2024-07-12

This work is licensed under the Creative Commons Attribution 4.0 International License.

https://doi.org/10.1515/jci-2023-0041
