
Confidence in causal inference under structure uncertainty in linear causal models with equal variances

  • David Strieder and Mathias Drton
Published/Copyright: December 19, 2023

Abstract

Inferring the effect of interventions within complex systems is a fundamental problem of statistics. A widely studied approach uses structural causal models that postulate noisy functional relations among a set of interacting variables. The underlying causal structure is then naturally represented by a directed graph whose edges indicate direct causal dependencies. In a recent line of work, additional assumptions on the causal models have been shown to render this causal graph identifiable from observational data alone. One example is the assumption of linear causal relations with equal error variances that we will take up in this work. When the graph structure is known, classical methods may be used for calculating estimates and confidence intervals for causal-effects. However, in many applications, expert knowledge that provides an a priori valid causal structure is not available. Lacking alternatives, a commonly used two-step approach first learns a graph and then treats the graph as known in inference. This, however, yields confidence intervals that are overly optimistic and fail to account for the data-driven model choice. We argue that to draw reliable conclusions, it is necessary to incorporate the remaining uncertainty about the underlying causal structure in confidence statements about causal-effects. To address this issue, we present a framework based on test inversion that allows us to give confidence regions for total causal-effects that capture both sources of uncertainty: causal structure and numerical size of non-zero effects.

MSC 2010: 62D20; 62H22

1 Introduction

Questions about causal relations are at the heart of many research problems. Researchers across various fields, ranging from economics to biology and medicine, are interested in distinguishing causes from effects to predict the outcome of interventions in complex systems. While controlled experiments provide an established method for inferring causal relations, in many applied settings, interventional studies are not feasible for ethical or cost considerations. Thus, inferring causal relationships based solely on observational data is an important task that the field of causal discovery addresses.

In the last few decades, causal discovery has gained popularity, with much fundamental research being done on the topics of identification and estimation [1,2]. Identifiability results clarify when it is theoretically possible to go beyond mere correlations. More specifically, they clarify under which circumstances assumptions lead to unique identification of interventional distributions from a given joint probability distribution. Moreover, numerous structure learning algorithms have been proposed that infer the underlying causal structure based on observed data. Knowledge of the underlying causal structure then allows researchers to reason about the effect of interventions in complex systems, and given a causal structure, standard statistical methods may be applied to estimate the involved causal-effects and provide an uncertainty assessment for the numerical size of the effects. However, in the typical applied setting, the exact causal structure is (at least partially) unknown. In order to draw reliable conclusions and make calibrated confidence statements, the remaining uncertainty about the underlying causal structure needs to be incorporated in causal inference.

Without prior expert knowledge about the underlying causal structure, a commonly employed “naïve” approach would involve splitting the task. In the first step, a causal learning algorithm estimates the underlying causal structure. In the second step, within the inferred model, confidence intervals for the causal parameters can be calculated using classical statistical inference methods. However, the resulting intervals are not valid if the causal learning algorithm infers the wrong causal structure. Such a two-step method conditions away the uncertainty in the causal structure arising from the data-driven model choice, and the classical “double-dipping” problem occurs since both causal structure and the effect are estimated from one data set. Thus, the approach is overly optimistic in its conclusion about the strength and existence of causal-effects, and its confidence regions do not achieve the desired coverage probability, especially under high uncertainty with respect to the underlying structure.

Despite the popularity and growing interest in causal discovery, we are unaware of prior attempts to rigorously account for uncertainty in the causal structure when providing confidence statements about causal-effects. Here, we use the term “confidence” in the technical sense for a set with a given desired frequentist coverage probability. A completely different approach for uncertainty quantification that we do not consider here is to form Bayesian credible sets. In fact, a number of authors have used Bayesian approaches for uncertainty quantification in causal discovery, but primarily with an emphasis on the uncertainty in the graphical structure as opposed to causal-effects (see [3–5] for three selected examples).

In this article, we want to shed light on the problem by highlighting the challenges of the task of handling structure uncertainty and by proposing a solution for constructing confidence intervals for total causal-effects that capture both types of uncertainty: the uncertainty regarding the causal structure resulting from a data-driven model selection and the uncertainty about the numerical size of the effect. Our solution is feasible for problems of small to moderate scale – e.g., our simulation experiments will consider problems with up to 12 variables. However, already at this scale, rigorously accounting for structure uncertainty means to account for more than 10^26 possible structures, reflecting the fast superexponential growth with the number of nodes in the causal graph.

In Section 2 of this article, we review the considered framework of structural causal models [6,7]. More specifically, in order to ensure identifiability from observational data alone, we consider a restricted class of causal models, namely, linear structural equation models (LSEMs) with Gaussian errors and equal variances [8], and review the definition of a total causal-effect, which is the target of interest in this study. We then generalize and improve the ideas introduced for bivariate problems by Strieder et al. [9] and present a general framework for constructing confidence intervals for total causal-effects in multivariate data (Sections 3 and 4). Furthermore, we present computational shortcuts and analyze the performance of the introduced framework in numerical experiments (Section 5). In Section 6, we discuss extensions of the framework to partially quantify the uncertainty over causal structures.

2 Background

This section reviews the LSEMs and the total causal-effect, which is the target of interest in this study.

2.1 LSEMs

Structural causal models constitute a commonly used tool to study causal relationships and represent noisy causal relations among a set of interacting variables. Each variable is modeled as a function of a subset of other variables, its causes, and a stochastic error term. The causal perspective emerges from viewing those relations as making assignments rather than representing mathematical equations. The left-hand side, the effect, is assigned the value specified on the right-hand side, the causes. Externally varying the values on the right-hand side results in a change on the left-hand side, but not vice versa. This reflects the asymmetry inherent in cause–effect relations.

We assume access to observational data in the form of a sample of independent copies of a random vector X = (X_1, …, X_d); without loss of generality, we assume the random vector to have zero mean. Without any further assumptions on the structural causal model, e.g., the involved functions, identification of the underlying causal structure is only possible up to a Markov equivalence class. Therefore, to ensure unique identifiability, following a line of research initiated by Peters and Bühlmann [8], we focus on linear relations and normally distributed errors with equal variances by assuming that X solves the equation system:

(1) X_j = ∑_{i ≠ j} β_{j,i} X_i + ε_j,  j = 1, …, d,

where B ≔ [β_{j,i}]_{j,i=1}^d are unknown parameters that represent the direct causal-effects between the variables, and ε_j are independent, normally distributed error terms:

ε_j ∼_{i.i.d.} N(0, σ²),  j = 1, …, d,

with common, unknown variance σ² > 0. Each specific LSEM restricts a subset of the direct effects β_{j,i} to be zero. Thus, LSEMs are naturally represented by a (minimal) directed graph G(B) with vertex set V = {1, …, d}. In the associated directed graph, the edge set equals the support of B; thus, a missing edge i → j indicates that β_{j,i} = 0. As in related work, we assume the underlying directed graph to be acyclic (a DAG), which entails that B is permutation similar to a triangular matrix and System (1) admits the unique solution X = (I_d − B)^{−1} ε, with I_d denoting the d × d identity matrix. Incorporating the equal variance assumption, the covariance matrix of X is seen to be

E[XX^T] = σ² (I_d − B)^{−1} (I_d − B)^{−T}.
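To make the model concrete, the following short R sketch (ours, not part of the original article, using a hypothetical edge matrix B) simulates data from an equal-variance LSEM and checks the displayed covariance formula empirically.

# Illustrative sketch: simulate n samples from X = (I_d - B)^{-1} eps and
# compare the empirical covariance with sigma^2 (I_d - B)^{-1} (I_d - B)^{-T}.
set.seed(1)
d <- 3; n <- 5000; sigma2 <- 1
B <- matrix(0, d, d)
B[2, 1] <- 0.8    # edge 1 -> 2 with weight beta_{2,1}
B[3, 2] <- -0.5   # edge 2 -> 3 with weight beta_{3,2}
eps <- matrix(rnorm(n * d, sd = sqrt(sigma2)), n, d)
X <- eps %*% t(solve(diag(d) - B))                              # rows solve X = BX + eps
Sigma <- sigma2 * solve(diag(d) - B) %*% t(solve(diag(d) - B))  # model covariance
round(cov(X) - Sigma, 2)                                        # approximately zero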

In the sequel, we will use the following graphical concepts. Each DAG admits a (not necessarily unique) topological ordering of its vertices, i.e., a permutation π of {1, …, d} such that the existence of a directed path from node π(i) to node π(j) implies i < j. We write i <_G j if node i precedes node j in such a causal ordering of the graph. If the DAG contains an edge from node i to node j, then node i is a parent of node j, and we denote the set of all parents of node j by p(j). It has been shown that under the aforementioned modeling assumptions with equal error variances, i.e., homoscedasticity across all interacting variables, the causal ordering is identifiable and corresponds to an ordering of conditional variances [10,11]. Hence, causal-effects can also be uniquely identified.

2.2 Causal-effects

Informally put, we are interested in estimating how much the value of X j changes if we externally intervene in the system and set the value of X i to x i . Such an intervention represents a change in the true data-generating process and, thus, a modification in our probabilistic model. In the structural equation framework, the effect of such an external intervention can be expressed by replacing the i th equation with X i = x i and is denoted as do ( X i = x i ) .

In this work, we seek to construct confidence intervals for the total causal-effect C(i → j) in LSEMs. This effect is formally defined as the unit change in the expectation of X_j with respect to an intervention in X_i, i.e.,

C(i → j) ≔ d/dx_i E[X_j | do(X_i = x_i)].

Remark 2.1

Note that the total causal-effect in an LSEM (B, σ²) can be intuitively expressed based on the respective underlying DAG G(B). The effect C(i → j) equals the sum of all products of edge weights along all directed paths from i to j. Furthermore, the effect is zero if there does not exist a directed path from i to j. This follows from the path properties of the matrix (I_d − B)^{−1} that computes the solution X of the structural equations from the independent errors.

Example 2.2

Consider the following LSEM and the corresponding (minimal) DAG:

The total causal-effect is given by the sum of all products of edge weights along all directed paths. For example, we obtain C(1 → 2) = β_{2,1} + β_{4,1} β_{2,4}, whereas C(2 → 1) = 0.
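As a quick numerical check, the R sketch below uses hypothetical edge weights (our own choice, not taken from the article) for the two paths of Example 2.2, the direct edge 1 → 2 and the path 1 → 4 → 2; the (2,1) entry of (I_d − B)^{−1} reproduces β_{2,1} + β_{4,1}β_{2,4}, while the (1,2) entry is zero because there is no directed path from 2 to 1.

# Hypothetical weights consistent with the two paths of Example 2.2.
B <- matrix(0, 4, 4)
B[2, 1] <- 0.3   # beta_{2,1}
B[4, 1] <- 0.6   # beta_{4,1}
B[2, 4] <- 0.5   # beta_{2,4}
solve(diag(4) - B)[2, 1]   # C(1 -> 2) = 0.3 + 0.6 * 0.5 = 0.6
solve(diag(4) - B)[1, 2]   # C(2 -> 1) = 0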

Our aim is to construct a confidence interval for the total causal-effect C(i → j) when the underlying causal structure is not known and has to be learned. We stress that, under the identifiability assumption of an underlying Gaussian LSEM with equal error variances, this is a well-defined problem. In fact, every admissible distribution can be represented by a single (minimal) DAG. This DAG then uniquely defines the total causal-effect.

A “naïve” two-step approach that first learns a causal structure and then calculates confidence intervals in the inferred model does not incorporate the uncertainty regarding the underlying causal structure. To respect this uncertainty in the calculations of a valid confidence interval, it is necessary to take all possible structures, i.e., graphs, into account. As a result, the target quantity has a composite structure, and thus, obtaining finite sample distributions of estimators is difficult.

In many situations where the sampling distribution of estimators is unobtainable, bootstrapping and other resampling methods can offer an appealing framework to construct confidence intervals. At first sight, resampling may appear to provide a simple solution to the problem under discussion. This solution relies on computing for each resampled data set a causal-effect estimate. To obtain this estimate, one may compose a consistent model selection method that learns the causal graph with a consistent causal-effect estimator in the learned model. Note that this composition is well defined due to the identifiability of LSEMs with equal error variances. However, as a statistic, this composition lacks smoothness, and thus, bootstrapping procedures may fail severely [12,13]. We emphasize that Drton and Williams [13] demonstrate that even in low dimensions, the asymptotic behavior of confidence intervals and bootstrap tests is difficult to predict for complex composite settings as encountered here. The subtleties stem from the fact that the set of covariance matrices associated with at least one possible DAG form a union of smooth manifolds. This union contains singularities at the intersections of the manifolds, which is highlighted in the bivariate case by Strieder et al. [9]. They demonstrated that singular points arise at the intersections of the two models. Consequently, bootstrapping cannot be used to capture model uncertainty correctly. This failure of the bootstrapping methods can also be observed in our simulations in Section 5.3.

In conclusion, there is a need to develop new frameworks for constructing confidence intervals that circumvent the problems that originate from the non-smooth nature of the target of interest.

3 Confidence intervals via test inversion

In the remainder of this article, we develop a framework that circumvents the problems posed by the non-smooth nature of the total causal-effect and provides valid confidence intervals for causal inference under structure uncertainty. In this section, we introduce our main ansatz, the test inversion approach, and derive the concrete hypothesis that subsequently needs to be tested.

3.1 Inversion of tests

Our main idea is to leverage the classical duality between statistical hypothesis tests and confidence regions in the following way. If we perform valid hypothesis tests for all attainable causal-effects, then all values that we cannot reject form a confidence region for the total causal-effect. More specifically, let α ∈ (0, 1) be a fixed significance level. We perform (asymptotically) valid level-α tests for the hypothesis of a fixed total causal-effect ψ for all attainable values of the causal-effect, i.e., for all ψ ∈ ℝ, we test

H_0: C(1 → 2) = ψ,

with a test whose acceptance region we denote by A(ψ). Then, an (asymptotically) valid (1 − α)-confidence region for the total causal-effect C(1 → 2) given the data X is

C(X) ≔ {ψ ∈ ℝ : X ∈ A(ψ)}.

A proof of this classical result can be found, e.g., in [14, Theorem 9.2.2].
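In generic form, the test inversion can be sketched as a grid scan: every candidate effect whose level-α test accepts is kept. In the R sketch below, accepts() is a placeholder for any valid test of a fixed effect; it is a schematic illustration only, not the concrete procedure developed later in this article.

# Schematic sketch of test inversion over a grid of candidate effects.
# `accepts(psi, data)` stands for any valid level-alpha test (placeholder).
invert_tests <- function(data, accepts, grid) {
  keep <- vapply(grid, function(psi) accepts(psi, data), logical(1))
  grid[keep]     # all non-rejected values form the confidence set
}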

With the aforementioned perspective, our problem is shifted to developing suitable hypothesis tests for all attainable causal-effects. Due to the lack of concrete knowledge about the model, all possible causal graphs must be respected. Thus, we have to construct hypothesis tests in a composite setting that remains challenging. In the next section, we explicitly describe the hypotheses that need to be tested to construct confidence regions for the total causal-effect.

3.2 Hypothesis of a fixed causal-effect

Recalling the assumptions of an underlying LSEM with homoscedastic errors across all interacting variables, we consider the following task. We assume that the given data consist of random vectors X^{(1)}, …, X^{(n)} drawn independently from a centered multivariate normal distribution N(0, Σ_0). The distribution’s density p(x | Σ_0) factorizes according to an unknown DAG while satisfying the assumption of equal error variances. To account for the uncertainty in the causal structure, we need to test at level α whether any multivariate centered normal distribution that factorizes according to any DAG under the equal variance assumption allows for a total causal-effect of size ψ, and repeat that test for all values ψ ∈ ℝ. In light of the normality assumption, it is natural to parameterize our statistical model in terms of the covariance matrix Σ. The additional assumption of homoscedasticity across all interacting variables restricts the set of possible covariance matrices further, as we detail now.

Given a fixed DAG G , under the assumption of equal error variances, all possible normal distributions with a density that factorizes according to G are defined by the family:

(2) 𝒩(G) ≔ {N(0, Σ) : ∃ B ∈ ℝ^G, σ² > 0, with Σ = σ² (I_d − B)^{−1} (I_d − B)^{−T}},

where ℝ^G = {B ∈ ℝ^{d×d} : β_{j,i} = 0 if i ∉ p(j)}. Considering and combining all possible DAG models, our overall model ℳ corresponds to the following union of sets of covariance matrices:

(3) ℳ ≔ ⋃_{G ∈ 𝒢(d)} {Σ ∈ PD(d) : N(0, Σ) ∈ 𝒩(G)},

where 𝒢(d) is the set of all DAGs with d nodes.

Remark 3.1

If G* is a supergraph of G, then the set 𝒩(G) is a subset of 𝒩(G*). Hence, in the definition of ℳ, we only need to consider the union over all complete DAGs with d nodes, i.e., all possible orderings.

The following proposition shows that we can identify the model ℳ, which is a subset of all positive definite matrices satisfying the assumption of equal error variances, solely based on polynomial constraints in the entries of the covariance matrix. Here, and in the remainder of this article, we write Σ_{j,i|p(i)} for the conditional covariance matrix, i.e.,

Σ_{j,i|p(i)} ≔ Σ_{j,i} − Σ_{j,p(i)} (Σ_{p(i),p(i)})^{−1} Σ_{p(i),i}.

Proposition 3.2

For a covariance matrix Σ ∈ PD(d), the following statements are equivalent:

  1. Σ ∈ ℳ,

  2. There exists a complete DAG G and σ² > 0 such that for all k = 1, …, d

(4) σ² = Σ_{k,k|p(k)} = Σ_{k,k} − Σ_{k,p(k)} (Σ_{p(k),p(k)})^{−1} Σ_{p(k),k}.

Proof

“⇒” Let Σ ∈ ℳ. Then, there exists a DAG G such that N(0, Σ) ∈ 𝒩(G). Considering Remark 3.1, we can assume without loss of generality that G is complete. Straightforward calculations with the multivariate normal density then imply that the variances of each node conditioned on its parents are equal to σ², or see a combinatorial argument as in [15, Equation (7.4)]. Thus, for all k = 1, …, d,

σ² = Var(X_k | X_{p(k)}) = Σ_{k,k} − Σ_{k,p(k)} (Σ_{p(k),p(k)})^{−1} Σ_{p(k),k}.

“⇐” Since G is complete, the multivariate normal distribution N(0, Σ) factorizes according to the graph G. This implies the existence of a matrix B ∈ ℝ^G and a positive definite diagonal matrix Ω ∈ ℝ^{d×d} with Σ = (I_d − B)^{−1} Ω (I_d − B)^{−T}. Equation (4) implies that Ω_{k,k} = Σ_{k,k|p(k)} is equal for all k = 1, …, d, and the claim follows.□
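The characterization in Proposition 3.2 is easy to check numerically. The R sketch below (illustrative only, with a hypothetical B) computes the conditional variances Σ_{k,k|p(k)} along a causal ordering of a complete DAG; for a covariance generated by an equal-variance LSEM they all coincide with σ².

# Conditional variance Sigma_{k,k | pa} of node k given a parent set pa.
cond_var <- function(Sigma, k, pa) {
  if (length(pa) == 0) return(Sigma[k, k])
  drop(Sigma[k, k] - Sigma[k, pa, drop = FALSE] %*%
         solve(Sigma[pa, pa, drop = FALSE]) %*% Sigma[pa, k, drop = FALSE])
}
B <- matrix(0, 3, 3); B[2, 1] <- 0.8; B[3, 2] <- -0.5; sigma2 <- 2
Sigma <- sigma2 * solve(diag(3) - B) %*% t(solve(diag(3) - B))
sapply(1:3, function(k) cond_var(Sigma, k, seq_len(k - 1)))   # all equal to 2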

Remark 3.3

Note the identifiability result under homoscedasticity across all interacting variables, e.g., Theorem 1 in [10]. Each Σ ∈ ℳ corresponds to a unique pair (B, σ²) and, thus, a unique minimal DAG G(B). As explained in Remark 3.1, the union in the definition (3) of ℳ is not necessarily disjoint, and analogously, the (complete) graph satisfying Condition 2 in Proposition 3.2 is not necessarily unique. Nevertheless, since B is unique, all graphs satisfying Condition 2 in Proposition 3.2 share the same edge set with non-zero edge coefficients and correspond to all valid causal orderings of the minimal DAG. Thus, every “additional” parent of node i in a complete DAG G from Condition 2, compared to the unique minimal DAG G(B), is conditionally independent of node i given p_{G(B)}(i). This follows by d-separation, since every (undirected) path between node i and an additional parent either traverses the parents p_{G(B)}(i) in the minimal DAG or contains a collider c with i <_{G(B)} c, and thus this collider has no descendant in p_{G(B)}(i).

Our testing problem is the following classical statistical problem. We are concerned with a parametric model ℳ, parameterized by the covariance matrix, which is an intricate union of subsets of the cone of positive definite matrices given by polynomial constraints. Within that statistical model, we need to test for all attainable values of the total causal-effect ψ. In the following proposition, we show that the parameter of interest is a function of the covariance matrix. Thus, the considered statistical testing problem can be fully parameterized in terms of the covariance matrix.

Proposition 3.4

Let Σ ∈ ℳ. Then, the total causal-effect C(i → j) is given by:

C(i → j) = Σ_{j,i|p(i)} (Σ_{i,i|p(i)})^{−1},

with parent set p ( i ) based on any corresponding DAG G from Proposition 3.2.

Proof

The case j <_G i is trivially true, since in this case, the total causal-effect C(i → j) is zero and, considering G is complete, the parent set p(i) contains the node j, which yields Σ_{j,i|p(i)} = 0.

Assume i <_G j. Since the set p(i) satisfies the back-door criterion with respect to i and j, we have

C(i → j) ≔ d/dx_i E[X_j | do(X_i = x_i)] = d/dx_i E[E[X_j | X_i = x_i, X_{p(i)}]].

It follows that the total causal-effect C(i → j) is the regression coefficient of X_i when regressing X_j on (X_i, X_{p(i)}), which is given by Σ_{j,i|p(i)} (Σ_{i,i|p(i)})^{−1}.

We emphasize that this expression does not depend on the specific choice of the DAG G . All graphs satisfying Condition 2 in Proposition 3.2 share the same edge set with non-zero edge weights, and thus, the parent set p ( i ) always satisfies the back-door criterion in the corresponding unique minimal DAG (see Remark 3.3).□
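Proposition 3.4 can likewise be verified directly on a covariance matrix. The sketch below (our illustration, again with a hypothetical B) computes C(i → j) = Σ_{j,i|p(i)}(Σ_{i,i|p(i)})^{−1} with the parents p(i) = {1, …, i − 1} of a complete DAG in causal order and compares it with the path-based value (I_d − B)^{−1}_{j,i}.

# Total causal-effect from the covariance matrix via the adjustment formula.
effect_from_cov <- function(Sigma, i, j) {
  pa <- seq_len(i - 1)     # parents of i in the complete DAG with causal order 1, ..., d
  if (length(pa) == 0) return(Sigma[j, i] / Sigma[i, i])
  S_ji <- Sigma[j, i] - Sigma[j, pa] %*% solve(Sigma[pa, pa]) %*% Sigma[pa, i]
  S_ii <- Sigma[i, i] - Sigma[i, pa] %*% solve(Sigma[pa, pa]) %*% Sigma[pa, i]
  drop(S_ji / S_ii)
}
B <- matrix(0, 3, 3); B[2, 1] <- 0.8; B[3, 2] <- -0.5
Sigma <- solve(diag(3) - B) %*% t(solve(diag(3) - B))
c(effect_from_cov(Sigma, 1, 3), solve(diag(3) - B)[3, 1])   # both equal -0.4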

Using Proposition 3.4, we can define the hypothesis space ℳ_ψ, given by all covariance matrices Σ that allow for a total causal-effect of size ψ, via

ℳ_ψ ≔ {Σ ∈ PD(d) : ∃ complete DAG G and σ² > 0 with ψσ² = Σ_{j,i|p(i)} and σ² = Σ_{k,k|p(k)} ∀ k = 1, …, d}.

Put together, we have to solve the statistical testing problem

(5) H_0^{(ψ)}: Σ ∈ ℳ_ψ against H_1: Σ ∈ ℳ \ ℳ_ψ,

for all attainable ψ ∈ ℝ. Then, we invert the tests to construct valid confidence intervals for the total causal-effect. Similar to the union ℳ, we can write the hypothesis space as a union of single hypotheses over all (complete) DAGs, i.e., H_0^{(ψ)} ≔ ⋃_{G ∈ 𝒢(d)} H_0^{(ψ)}(G) with single hypotheses H_0^{(ψ)}(G): Σ ∈ ℳ_ψ(G), where

(6) ℳ_ψ(G) ≔ {Σ ∈ PD(d) : ∃ σ² > 0 with ψσ² = Σ_{j,i|p(i)} and σ² = Σ_{k,k|p(k)} ∀ k = 1, …, d}.

Remark 3.5

It is clear that for ψ ≠ 0, we only need to consider the union over all graphs with i <_G j. In contrast, for a zero effect, we have to consider all graphs, since the effects of multiple directed paths might “cancel” each other. Furthermore, for graphs with j <_G i, the constraint given by the size zero effect is not informative, since j ∈ p(i) and, consequently, Σ_{j,i|p(i)} = 0. Therefore, the corresponding hypothesis space has higher dimension.

Example 3.6

Consider the following LSEM and the corresponding (minimal) DAG:

There exist two directed paths from node 1 to node 2 in the corresponding (minimal) DAG, and thus, 1 <_G 2 in all valid causal orderings. However, the effects of the two paths “cancel” each other, such that C(1 → 2) = 0.25 − 0.5² = 0.
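The cancellation in Example 3.6 can be reproduced with hypothetical weights (ours, chosen to match the stated arithmetic): a direct edge 1 → 2 with weight 0.25 and a path 1 → 3 → 2 whose edge weights multiply to −0.25.

# Hypothetical weights illustrating a zero total effect despite 1 <_G 2.
B <- matrix(0, 3, 3)
B[2, 1] <- 0.25; B[3, 1] <- 0.5; B[2, 3] <- -0.5
solve(diag(3) - B)[2, 1]   # C(1 -> 2) = 0.25 + 0.5 * (-0.5) = 0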

4 Testing for total causal-effects

For constructing confidence intervals for the total causal-effect, we invert tests of Hypothesis (5). In this section, we present two concrete tests based on likelihood ratio theory: (i) a classical constrained likelihood ratio test [16] and (ii) an approach based on the recently proposed split likelihood ratio test [17].

4.1 Constrained likelihood ratio test

The first option we consider is to form a classical likelihood ratio statistic for the testing Problem (5). Writing ℓ_n for the Gaussian log-likelihood function based on the sample X^{(1)}, …, X^{(n)}, the likelihood ratio statistic for (5) is

(7) λ̌_n(ψ) ≔ 2 (sup_{Σ ∈ ℳ} ℓ_n(Σ) − sup_{Σ ∈ ℳ_ψ} ℓ_n(Σ)).

In regular settings, the statistic may be compared to a chi-square quantile for an asymptotic test. Here, however, asymptotic distribution theory is difficult because both the null hypothesis and the alternative are intricate unions and do not satisfy the usual regularity assumptions. In the simplified bivariate setting, Strieder et al. [9] avoid the difficulties posed by the existence of singularities by testing modified problems in two different approaches. In the first approach, the hypothesis and alternative are relaxed to test an ordering of (conditional) variances. However, it is not feasible to optimize this modified testing problem in problems beyond the bivariate case. Furthermore, the mixture weights of the arising limit distribution, which is a mixture of chi-square distributions, are difficult to determine in higher dimensions. The second approach relaxes the alternative to an unconstrained Gaussian model. In this modification, the statistic (7) is changed to a larger statistic for which the optimization over ℳ is replaced by an optimization over the positive definite cone PD(d). This approach, however, may cause problems in the test inversion approach, as the union of tested null hypotheses no longer coincides with the alternative. In particular, small confidence regions may arise due to model misspecification.

In this article, we will derive our confidence region for the total causal-effect from tests that reject for too large values of the original statistic λ̌_n(ψ) from (7). However, we will consider relaxing the alternative and use the theory of intersection union tests to obtain a simple yet effective upper bound on the distribution of λ̌_n(ψ).

To obtain this upper bound, first note that the hypothesis we need to test is an intricate union of single hypotheses given by ⋃_{G ∈ 𝒢(d)} ℳ_ψ(G). While we will later show that every single hypothesis defines a smooth submanifold, their union is not necessarily a smooth submanifold again. Nevertheless, the original test statistic (7) can then be written as λ̌_n(ψ) = min_{G ∈ 𝒢(d)} λ̌_n^{(ψ)}(G), with

(8) λ̌_n^{(ψ)}(G) ≔ 2 (sup_{Σ ∈ ℳ} ℓ_n(Σ) − sup_{Σ ∈ ℳ_ψ(G)} ℓ_n(Σ)).

Thus, the original test statistic can be upper-bounded by every single test statistic λ̌_n^{(ψ)}(G), and a valid level-α test of the union of single hypotheses can be constructed by testing every single hypothesis. We then reject the union null hypothesis if we reject all single hypotheses. A formal proof of this classical theory of intersection union tests can be found, e.g., in [14, Theorem 8.3.23].

In the second step, we relax the alternative to an unconstrained Gaussian model to obtain the following larger test statistic:

λ_n^{(ψ)}(G) ≔ 2 (sup_{Σ ∈ PD(d)} ℓ_n(Σ) − sup_{Σ ∈ ℳ_ψ(G)} ℓ_n(Σ)).

Relaxing the alternative to an optimization over the entire cone of positive definite matrices PD(d) potentially increases the achieved likelihood and consequently upper-bounds each single test statistic λ̌_n^{(ψ)}(G).
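For the relaxed term sup_{Σ ∈ PD(d)} ℓ_n(Σ), no numerical optimization is needed: the unconstrained Gaussian maximum likelihood is attained at the empirical covariance Σ̂. The short R sketch below (ours, assuming centered data) evaluates this saturated log-likelihood.

# Saturated Gaussian log-likelihood: the sup over PD(d) is attained at Sigma_hat,
# giving -(n/2) * (log det(2 * pi * Sigma_hat) + d).
saturated_loglik <- function(X) {
  n <- nrow(X); d <- ncol(X)
  Shat <- crossprod(X) / n     # empirical covariance (data assumed centered)
  as.numeric(-n / 2 * (determinant(2 * pi * Shat, logarithm = TRUE)$modulus + d))
}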

Combining both approximations yields an upper bound of the original test statistic λ̌_n(ψ) and, thus, a simple way to construct asymptotically conservative hypothesis tests. In the following, we state our main result by turning them into conservative confidence intervals for the total causal-effect C(i → j) via test inversion.

Theorem 4.1

Let α ∈ (0, 1). Then, an asymptotic (1 − α)-confidence set for the causal-effect C(i → j) is given by:

C = {ψ ∈ ℝ : λ̌_n^{(ψ)}(i <_G j) ≤ χ²_{d,1−α}} ∪ {0 : λ̌_n^{(0)}(j <_G i) ≤ χ²_{d−1,1−α}},

where λ̌_n^{(ψ)}(i <_G j) ≔ min_{G ∈ 𝒢(d) : i <_G j} λ̌_n^{(ψ)}(G) and λ̌_n^{(0)}(j <_G i) ≔ min_{G ∈ 𝒢(d) : j <_G i} λ̌_n^{(0)}(G).

Remark 4.2

Note that we only need to consider complete DAGs, which drastically reduces the number of single hypothesis tests that need to be evaluated and thus the computational burden. Furthermore, we need to respect graphs with j <_G i only for zero-sized effects. Therefore, our proposed confidence regions may contain a non-zero interval and an isolated zero. In contrast to the bivariate case, however, zero effects can also correspond to orderings with i <_G j due to cancellation (see also Remark 3.5).

Proof

Let ψ ∈ ℝ and Σ ∈ ℳ_ψ. Since ℳ_ψ = ⋃_{G ∈ 𝒢(d)} ℳ_ψ(G), there exists a complete graph G such that Σ ∈ ℳ_ψ(G). If i <_G j, it follows from the introduced upper bound and Lemma 4.3 that

P_Σ(λ̌_n(ψ) > χ²_{d,1−α}) ≤ P_Σ(λ̌_n^{(ψ)}(G) > χ²_{d,1−α}) ≤ P_Σ(λ_n^{(ψ)}(G) > χ²_{d,1−α}) → α.

We obtain an analogous result for j <_G i with Lemma 4.4. Using the test inversion approach yields the claim.□

In the following, we derive the asymptotic distribution of the upper bound λ_n^{(ψ)}(G) of our test statistic under every single hypothesis ℳ_ψ(G). As previously indicated, the cases i <_G j and j <_G i define hypothesis spaces of different dimensions; thus, we consider both instances separately.

First, we consider the case i <_G j and assume without loss of generality that (1, 2, …, d) is the causal ordering. In this case, the single hypothesis can be written as ℳ_ψ(G) = {Σ ∈ PD(d) : f_ψ(Σ | G) = 0}, with

f_ψ(Σ | G) ≔ (Σ_{j,i|p(i)} − ψΣ_{1,1}, Σ_{2,2|p(2)} − Σ_{1,1}, …, Σ_{d,d|p(d)} − Σ_{1,1})^T.

The relaxed model space of all positive definite matrices PD(d) can be identified with an open subset of ℝ^{(d²+d)/2}. The hypothesis space ℳ_ψ(G) then defines a ((d²−d)/2)-dimensional submanifold of ℝ^{(d²+d)/2}, since the Jacobian of f_ψ has full rank d. To see that, note its (non-zero) triangular structure in the derivatives with respect to (Σ_{k,k})_{k=2,…,d} in all rows except the first row. The first row, however, has a non-zero derivative with respect to Σ_{j,i} and zero derivatives with respect to (Σ_{k,k})_{k=j,…,d}.

Using the fact that this single hypothesis defines a smooth submanifold, we determine the limit distribution of the upper bound λ_n^{(ψ)}(G).

Lemma 4.3

Let G ∈ 𝒢(d) be a DAG with i <_G j and ψ ∈ ℝ. Under the hypothesis H_0^{(ψ)}(G), the likelihood ratio test statistic λ_n^{(ψ)}(G) satisfies

λ_n^{(ψ)}(G) →_D χ²_d.

Proof

Since the hypothesis space ℳ_ψ(G) defines a ((d²−d)/2)-dimensional submanifold of ℝ^{(d²+d)/2}, the tangent cone at every Σ ∈ ℳ_ψ(G) is a ((d²−d)/2)-dimensional linear subspace of ℝ^{(d²+d)/2}, and the claim follows. For details on this classical result, we refer to the study of Drton [18].□

In the case j <_G i, the causal-effect from i to j can only be zero. Therefore, we solely need to test the single hypothesis H_0^{(0)}(G). We assume again without loss of generality that (1, …, d) is the underlying causal order. The polynomial constraint in Hypothesis (6) corresponding to the fixed-size effect is not informative, as the node j is included in p(i), and thus, Σ_{j,i|p(i)} = 0. Therefore, the hypothesis H_0^{(0)}(G) is solely defined by d − 1 polynomial constraints corresponding to the assumption of equal error variances, i.e., in this case, we have ℳ_0(G) = {Σ ∈ PD(d) : f_0(Σ | G) = 0}, where

f_0(Σ | G) ≔ (Σ_{2,2|p(2)} − Σ_{1,1}, …, Σ_{d,d|p(d)} − Σ_{1,1})^T.

Thus, similar to the previous case, we obtain a chi-square distribution as the limit distribution, however with fewer degrees of freedom.

Lemma 4.4

Let G ∈ 𝒢(d) be a DAG with j <_G i. Under Hypothesis H_0^{(0)}(G), the likelihood ratio test statistic λ_n^{(0)}(G) satisfies

λ_n^{(0)}(G) →_D χ²_{d−1}.

Proof

Analogously to Lemma 4.3. Note that the Jacobian has full rank d − 1 due to its triangular structure.□

With Theorem 4.1, we defined an asymptotically valid confidence interval for the total causal-effect that is able to capture both types of uncertainty: the uncertainty about the causal structure and the uncertainty about the numerical size of the effect. Besides the theoretical guarantees for the asymptotic coverage of the proposed confidence set, we demonstrate in the simulation section that, even in finite samples, the confidence set achieves the desired coverage probability. We emphasize again that our new approach uses the original test statistic (7) with an upper bound on its distribution. Thus, we avoid the issues that may arise under model misspecification in the (bivariate) method by Strieder et al. [9] due to contrasting a hypothesis restricted to equal error variances with a general alternative.

In the following, we propose an alternative framework for constructing hypothesis tests for (5) based on the theory of universal inference [17]. While this framework leads to more conservative intervals, it comes with a finite sample guarantee.

4.2 Split likelihood ratio test

Likelihood ratio tests provide a powerful tool for constructing hypothesis tests that are calibrated via asymptotic distribution theory. However, in this context, the needed distribution-theoretic insights are difficult to obtain and we adopted stochastic upper bounds. An interesting alternative is to conduct the needed tests in the framework of universal inference that was introduced by Wasserman et al. [17]. Their method uses a modification of the likelihood ratio statistic, called split likelihood ratio, based on a data-splitting approach. Due to the independence of the split data, one can apply a universal critical value, instead of relying on asymptotic distribution theory. Type I error control is then guaranteed by an application of Markov’s inequality. In the following, we use their methodology for constructing conservative confidence regions for total causal-effects with finite sample guarantee.

To construct a split likelihood ratio test for the test Problem (5), we proceed as follows. First, we split the data into two subsets D_0 = {X^{(1)}, …, X^{(k)}} and D_1 = {X^{(k+1)}, …, X^{(n)}} and define the log-likelihood functions:

ℓ_j(Σ) ≔ ∑_{X^{(i)} ∈ D_j} log p(X^{(i)} | Σ),  j = 0, 1,

based on D_0 and D_1, respectively. Let Σ̃_1 ∈ argmax_{Σ ∈ ℳ} ℓ_1(Σ) be the maximum-likelihood estimator of Σ restricted to LSEMs with equal error variances based on D_1. Then, the split likelihood ratio test statistic is defined as:

λ̃_n(ψ) ≔ 2 (ℓ_0(Σ̃_1) − sup_{Σ ∈ ℳ_ψ} ℓ_0(Σ)).

The insight of Wasserman, Ramdas, and Balakrishnan [17] is that for any fixed Σ* ∈ PD(d), it holds that

E_Σ[ ∏_{X^{(i)} ∈ D_0} p(X^{(i)} | Σ*) / p(X^{(i)} | Σ) ] = 1,

and, thus, Markov’s inequality implies that for any α ∈ (0, 1), under H_0^{(ψ)}: Σ ∈ ℳ_ψ,

(9) P_Σ(λ̃_n(ψ) > −2 log α) ≤ α E_Σ[ ∏_{X^{(i)} ∈ D_0} p(X^{(i)} | Σ̃_1) / sup_{Σ̄ ∈ ℳ_ψ} ∏_{X^{(i)} ∈ D_0} p(X^{(i)} | Σ̄) ] = α E_Σ[ E_Σ[ ∏_{X^{(i)} ∈ D_0} p(X^{(i)} | Σ̃_1) / sup_{Σ̄ ∈ ℳ_ψ} ∏_{X^{(i)} ∈ D_0} p(X^{(i)} | Σ̄) | D_1 ] ] ≤ α.

Thus, the decision rule

Reject H_0^{(ψ)} if λ̃_n(ψ) > −2 log α

constitutes a valid level-α test. This test holds its level in finite samples and without any regularity conditions. Thus, we can define the following confidence region for the total causal-effect based on the split likelihood ratio test.

Theorem 4.5

Let α ∈ (0, 1). Then, a (1 − α)-confidence set for the causal-effect C(i → j) is given by:

C = {ψ ∈ ℝ : λ̃_n^{(ψ)}(i <_G j) ≤ −2 log(α)} ∪ {0 : λ̃_n^{(0)}(j <_G i) ≤ −2 log(α)},

where λ̃_n^{(ψ)}(i <_G j) ≔ min_{G ∈ 𝒢(d) : i <_G j} λ̃_n^{(ψ)}(G) and λ̃_n^{(0)}(j <_G i) ≔ min_{G ∈ 𝒢(d) : j <_G i} λ̃_n^{(0)}(G) and

λ̃_n^{(ψ)}(G) ≔ 2 (ℓ_0(Σ̃_1) − sup_{Σ ∈ ℳ_ψ(G)} ℓ_0(Σ)).

Remark 4.6

Similar to the asymptotic confidence set given by Theorem 4.1, we need to consider all graphs when testing for zero effects. Thus, the confidence region may consist of a non-zero interval and a disconnected zero that reflects the remaining uncertainty about the existence of an effect.

Proof

This result follows directly from the test inversion approach and (9). Note that for non-zero effects, we only need to consider graphs with i <_G j.□
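A schematic implementation of the split likelihood ratio test for one candidate value ψ could look as follows. The helpers equal_var_mle(), restricted_loglik(), and loglik() are placeholders (not part of the article's code) for the equal-variance MLE on D_1, the maximization of the D_0-likelihood over ℳ_ψ, and the evaluation of the D_0-likelihood at a fixed covariance matrix.

# Sketch of the universal-inference decision rule with placeholder helpers.
split_lrt_reject <- function(X, psi, alpha, equal_var_mle, restricted_loglik, loglik) {
  k  <- floor(nrow(X) / 2)                       # an even split; other ratios are possible
  D0 <- X[seq_len(k), , drop = FALSE]
  D1 <- X[-seq_len(k), , drop = FALSE]
  Sigma1 <- equal_var_mle(D1)                    # equal-variance MLE based on D1
  stat <- 2 * (loglik(D0, Sigma1) - restricted_loglik(D0, psi))
  stat > -2 * log(alpha)                         # universal critical value
}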

5 Computational details and numerical experiments

In this section, we present computational details as well as shortcuts in our proposed algorithm that significantly improve the computation time. We also present a simulation study in which we analyze and compare the coverage probabilities and performance of the different proposed testing procedures.

5.1 Algorithm and computational shortcuts

We focus on the algorithm for calculating the likelihood ratio-based confidence regions introduced in Theorem 4.1, denoted with LRT. By adjusting the critical values in the algorithm and incorporating the data-splitting procedure, the split likelihood ratio-based confidence regions can be computed analogously. As the number of possible underlying causal structures, i.e., the number of DAGs, grows superexponentially with the number of nodes, we need to carefully use combinatorial shortcuts in order to be able to compute our proposed confidence intervals for the total causal-effect.

Our algorithm, presented as Algorithm 1, consists of two subroutines. The first routine, possible.order (Algorithm 2), immediately rejects all implausible causal orderings and thus reduces the set of possible causal orderings in which we subsequently test for all attainable effects with the second routine, test.effect (Algorithm 3). Carefully reducing the space of plausible causal orderings drastically decreases the computation time and allows us to compute our proposed confidence intervals, which take uncertainty over all possible DAGs into account, in a reasonable time for a moderate (but far from trivial) number of involved variables. Precise computation times can be found in Section 5.3, which presents the results of a simulation study.

Algorithm 1 Returns LRT confidence region for C(i → j)
Input: data, level α, stepsize s
(orderings, likelihood, effect) ← possible.order(data, α)            ▹ all plausible orderings
L1 ← likelihood[1]                                                    ▹ maximum likelihood alternative
startvalues ← unique(effect)
while startvalues is not empty do                                     ▹ start with effects from unrestricted MLEs
  (leftbound, rightbound) ← min(startvalues)
  (left, right) ← TRUE
  while left do
    leftbound ← leftbound − s
    left ← test.effect(data, leftbound, orderings, L1, α)             ▹ test for size leftbound effect
  end while
  while right do
    rightbound ← rightbound + s
    right ← test.effect(data, rightbound, orderings, L1, α)
  end while
  append (leftbound + s/2, rightbound − s/2) to intervals             ▹ possibly multiple intervals
  remove all values smaller than rightbound from startvalues
end while
zero ← test.effect(data, 0, orderings, L1, α)                         ▹ test for size zero effect
return (zero, intervals)

In order to reduce the space of plausible graphical structures as much as possible, we note again that we do not need to consider all possible DAGs but merely complete ones, represented by all possible causal orderings. This immensely decreases the worst-case number of single hypothesis tests we need to perform to reject an effect of size ψ, namely to d factorial. Our subroutine possible.order (Algorithm 2) further reduces the number of plausible causal orderings as follows.

First, we quickly approximate the maximum likelihood under the alternative with the maximum likelihood L̂1 obtained under the ordering produced by the method in the study by Chen et al. [10] that sorts conditional variances. Subsequently, we calculate the maximum likelihood of all possible orderings without restrictions on the causal-effect and immediately reject implausible orderings using the (largest) critical value χ²_{d,1−α}. Note that we can search through the space of orderings recursively, starting from the first node, and immediately reject (partial) orderings if the corresponding (partial) maximum likelihood already exceeds the threshold. Furthermore, for the subsequent testing procedure, the specific ordering of the parent set p(i) is not relevant. Thus, for all orderings that only differ in the specific ordering of p(i), we subsequently just test with the parent order that achieves the maximum likelihood. Additionally, we only need to test for a zero effect in the maximum-likelihood ordering out of all orderings with j <_G i. Finally, we include an attainable effect in our confidence set if it cannot be rejected for at least one of the possible causal orderings using the subroutine test.effect (Algorithm 3). Therefore, we do not need to continue testing for a specific value of the effect if we find an ordering such that we cannot reject the value. Thus, first sorting the causal orderings by their maximum likelihood further improves the computation time.

Algorithm 2 possible.order() returns plausible causal orderings
Input: data, level α
crit.val ← qchisq(1 − α, d)                                           ▹ 1 − α quantile of the χ²_d distribution
Σ̂ ← cov(data)
for 1:d do
  append which.min(Σ̂_{k,k|var.order}) to var.order                    ▹ ordering of cond. variances
end for
L̂1 ← get.likelihood(var.order)                                        ▹ (unrestricted) maximum likelihood
for 1:d do
  for order in orderings do
    for v in setdiff(1:d, order) do
      part.order ← c(order, v)
      likelihood ← get.likelihood(part.order)                         ▹ (partial) maximum likelihood
      if 2(L̂1 − likelihood) ≤ crit.val then
        append part.order to orderings
        if v equals i then
          effect ← Σ̂_{i,j|order} / Σ̂_{i,i|order}                      ▹ causal-effect of maximum likelihood
        end if
      end if
    end for
    remove order from orderings
  end for
  choose order with max likelihood out of all orders in orderings that do not contain i and are defined on the same set of nodes    ▹ order only relevant after i
end for
choose order with max likelihood out of all orders in orderings with j <_G i                                                        ▹ keep only max likelihood for zero effect
sort orderings for decreasing likelihood
return (orderings, likelihood, effect)
Algorithm 3 test.effect() tests causal-effect C(i → j) = ψ
Input: data, causal-effect ψ, orderings, likelihood alternative L1, level α
for order in orderings do
  if j <_G i then
    crit.val ← qchisq(1 − α, d − 1)
    if ψ equals zero then
      likelihood ← get.likelihood(order)                              ▹ (unrestricted) maximum likelihood
    else
      likelihood ← reject
    end if
  else
    crit.val ← qchisq(1 − α, d)
    likelihood ← get.likelihood(order, ψ)                             ▹ (restricted) maximum likelihood
  end if
  if 2(L1 − likelihood) ≤ crit.val then
    return TRUE
  end if
end for
return FALSE

We start testing within the plausible causal orderings with the minimal causal-effect that corresponds to the maximum-likelihood estimates. Then, we step-wise decrease the value until we reject in all orderings. The last value we did not reject is a lower bound for our confidence region. We repeat the procedure with step-wise increasing values until we reject in all orderings, and the last value we did not reject is an upper bound for this part of our confidence region. If there remain causal-effects from the maximum-likelihood estimates that exceed this upper bound, we repeat this procedure. Thus, using this strategy, we might obtain a confidence region that consists of multiple intervals. Finally, we test whether we have to include zero in our confidence interval, i.e., whether there remains uncertainty about the existence of an effect.

5.2 Maximum-likelihood estimation

Implementing the proposed algorithm involves the routine get.likelihood, which maximizes the centered Gaussian (log-)likelihood function ℓ_n(Σ) for a given causal ordering. The function is given by:

(2/n) ℓ_n(Σ) = −log det(2πΣ) − tr(Σ^{−1} Σ̂),

where Σ̂ = (1/n) ∑_{l=1}^n X^{(l)} (X^{(l)})^T is the empirical covariance. Due to the underlying assumption of homoscedasticity across all interacting variables, the set of possible distributions can be uniquely identified via (2). Therefore, equivalently, we maximize

(2/n) ℓ_n(B, σ²) = −d log(2πσ²) − (1/σ²) tr((I_d − B)^T (I_d − B) Σ̂),

using the edge matrix B ∈ ℝ^G and the common variance σ² > 0. The zero entries of the matrix B are implied by the given causal ordering. Furthermore, due to the acyclic structure, B is permutation similar to a lower triangular matrix. Straightforward calculations yield that the maximum of ℓ_n(B, σ²) over the equal variance σ² > 0 is attained at σ̂² = tr((I_d − B)^T (I_d − B) Σ̂)/d and is given by:

(10) sup_{σ² > 0} ℓ_n(B, σ²) = −(nd/2) log((2π/d) tr((I_d − B)^T (I_d − B) Σ̂)) − nd/2.

Maximizing equation (10) over B ∈ ℝ^G is then equivalent to minimizing

(11) tr((I_d − B)^T (I_d − B) Σ̂) = (1/n) ∑_{k=1}^d ∑_{l=1}^n (X_k^{(l)} − ∑_{p ∈ p(k)} β_{k,p} X_p^{(l)})².

Thus, in the unrestricted case without constraints for a fixed-size causal-effect, the problem is equivalent to solving d separate linear least-squares problems with minimizer:

(12) β̂_{k,p(k)}^T = (Σ̂_{p(k),p(k)})^{−1} Σ̂_{p(k),k},  for k = 1, …, d,

and corresponding minimum

(1/n) ∑_{k=1}^d ∑_{l=1}^n (X_k^{(l)} − ∑_{p ∈ p(k)} β̂_{k,p} X_p^{(l)})² = ∑_{k=1}^d Σ̂_{k,k|p(k)}.

Furthermore, in this case, we can calculate the maximum likelihood of any partial ordering that starts from the source node, since it collapses into subproblems. Thus, we can reject all orderings that start with a given partial ordering if the corresponding partial maximum likelihood already exceeds the threshold.
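The unrestricted get.likelihood routine can therefore be sketched in a few lines of R: each node is regressed on its predecessors in the ordering as in (12), and the summed conditional variances are plugged into the profile likelihood (10). This is an illustrative reimplementation under the assumption of centered data and a full ordering, not the authors' code.

# Unrestricted profile log-likelihood for a given causal ordering.
get_likelihood <- function(X, order) {
  n <- nrow(X); d <- ncol(X)
  Shat <- crossprod(X) / n                                # empirical covariance (data centered)
  rss <- 0
  for (pos in seq_along(order)) {
    k  <- order[pos]
    pa <- order[seq_len(pos - 1)]
    if (length(pa) == 0) {
      rss <- rss + Shat[k, k]
    } else {
      beta <- solve(Shat[pa, pa], Shat[pa, k])            # least-squares coefficients, Eq. (12)
      rss  <- rss + Shat[k, k] - sum(Shat[k, pa] * beta)  # conditional variance Shat_{k,k|p(k)}
    }
  }
  -n * d / 2 * log(2 * pi / d * rss) - n * d / 2          # profile likelihood, Eq. (10)
}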

In addition, we need to calculate the maximum-likelihood estimates under each single hypothesis H_0^{(ψ)}(G). For j <_G i, we only test Hypothesis H_0^{(0)}(G). However, in that case, the constraint of a size zero causal-effect C(i → j) = 0 is not restrictive, and thus the minimizer is similarly given by (12). Therefore, we assume in the following that i <_G j. Analogously to the previous calculations, maximizing the Gaussian likelihood for a given causal ordering and a fixed total causal-effect can be solved by minimizing the linear least-squares Problem (11). However, the constraint of a fixed total causal-effect C(i → j) = ψ links a subset of the regression parameters, and we cannot solve all d linear least-squares problems separately.

Recalling the path properties of the total causal-effect (see Remark 2.1), the assumption C(i → j) = (I_d − B)^{−1}_{j,i} = ψ restricts all β_{k,p(k)} for k ∈ (p(j) ∪ {j}) \ (p(i) ∪ {i}). This follows by observing that

(I_d − B)^{−1}_{j,i} = ∑_{k=1}^{d−1} (B^k)_{j,i}

and further inductively

(B^k)_{j,i} = ∑_{p ∈ p(j) \ p(i)} β_{j,p} (B^{k−1})_{p,i}.

For all other k ∉ (p(j) ∪ {j}) \ (p(i) ∪ {i}), the corresponding part of the optimization Problem (11) can be solved separately with the (unrestricted) minimizer given by (12). However, it remains to minimize the following non-trivial joint least-squares problem:

∑_{k ∈ (p(j) ∪ {j}) \ (p(i) ∪ {i})} ∑_{l=1}^n (X_k^{(l)} − ∑_{p ∈ p(k)} β_{k,p} X_p^{(l)})²,  where  β_{j,i} = ψ − ∑_{m=2}^{d−1} (B^m)_{j,i}.

Note that we can reduce the constraint to the submatrix B_s, which is formed by selecting all rows and columns corresponding to the set (p(j) ∪ {j}) \ p(i), since (B^m)_{j,i} = (B_s^m)_{j,i}. Then, the constraint yields β_{j,i} as a function of the remaining β_{k,p(k)} for k ∈ (p(j) ∪ {j}) \ (p(i) ∪ {i}), and we solve the remaining least-squares problem numerically with quasi-Newton methods.

5.3 Simulations

In this section, we present the results of a simulation study that compares the coverage frequencies and the widths of the proposed confidence regions as well as their computing times. Our simulation experiments are designed as follows. We generate synthetic data sets based on LSEMs with Gaussian errors and equal variances associated with randomly selected directed acyclic graphs in a three-step procedure. First, we randomly select a permutation of d nodes, representing the causal ordering of (X_1, …, X_d). Second, we generate edge weights for all edges in the complete graph that corresponds to the chosen causal ordering according to a normal distribution N(β, 0.1). In that way, β reflects the average strength of the direct effects in the graph, and thus, smaller values indicate higher uncertainty about the underlying causal structure. In the next step, we prune the complete graph, i.e., we include an edge with a probability of 0.9 for dense and 0.5 for sparse graphs. Finally, we generate n samples according to the LSEM, represented by the chosen graph, with standard normal distributed errors. For a range of edge weights β, sample sizes n, and dimensions d, we generated multiple independent data sets and determined the confidence sets for the total causal-effect of X_1 on X_2 at level α = 0.05 using the different proposed methods (LRT given by Theorem 4.1 and SLRT given by Theorem 4.5). We repeated this procedure twice, for the case of a true non-zero effect and the case of no effect. We chose the tuning parameter of the SLRT interval, the splitting ratio, to be optimal in the sense of Strieder and Drton [19].
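Read as pseudocode, the described three-step data-generating scheme might be implemented as follows. This is our reading, not the authors' code; in particular, whether 0.1 in N(β, 0.1) denotes the variance or the standard deviation is our assumption (the sketch treats it as the variance).

# Sketch of the simulation design: random causal ordering, N(beta, 0.1) edge
# weights on the complete DAG, pruning with a given edge probability, and
# standard normal errors. Hypothetical helper, not the authors' code.
simulate_lsem <- function(n, d, beta, edge_prob) {
  ord <- sample(d)                              # random causal ordering
  B <- matrix(0, d, d)
  for (a in seq_len(d - 1)) {
    for (b in (a + 1):d) {
      if (runif(1) < edge_prob) {
        B[ord[b], ord[a]] <- rnorm(1, mean = beta, sd = sqrt(0.1))
      }
    }
  }
  eps <- matrix(rnorm(n * d), n, d)             # standard normal errors
  list(B = B, X = eps %*% t(solve(diag(d) - B)))
}
# Example call for a sparse six-node setting as in Table 1:
# sim <- simulate_lsem(n = 500, d = 6, beta = 0.5, edge_prob = 0.5)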

We report the observed empirical coverage probabilities for six-dimensional graphs in Table 1. All proposed methods achieve the desired coverage frequency of 0.95 and seem to be conservative. This is expected for the split likelihood ratio method since it is based solely on Markov’s inequality. However, it is also not surprising that the LRT method is conservative, considering the use of intersection union testing and the fact that we set critical values based on the relaxed alternative. For comparison, we also included classical bootstrapping results in our simulation study. The bootstrapping method uses GDS, an established causal discovery algorithm for LSEMs with equal variance, proposed by Peters and Bühlmann [8]. For each resampled data set, the GDS method uses a greedy search algorithm to maximize the likelihood and estimate the total causal-effect. The results validate the theoretical findings in the studies by Strieder et al. [9] and Drton and Williams [13]. Bootstrapping methods do not correctly account for uncertainty in the causal structure and do not achieve the required coverage frequency, as seen in the bootstrap rows of Table 1 for a true effect with small edge weights.

Table 1

Empirical coverage of 95% confidence intervals for the total causal-effect of X 1 on X 2 in randomly generated 6-dim. DAGs (1,000 replications)

Method      n        True effect                                No true effect
                     Sparse               Dense                 Sparse               Dense
            β:       0.05   0.1    0.5    0.05   0.1    0.5     0.05   0.1    0.5    0.05   0.1    0.5
LRT         100      1.00   1.00   0.99   0.99   1.00   1.00    1.00   1.00   1.00   1.00   1.00   1.00
            500      0.99   0.99   0.99   0.99   0.99   1.00    1.00   1.00   1.00   1.00   1.00   1.00
            1,000    0.98   0.98   1.00   0.99   1.00   0.99    1.00   1.00   1.00   1.00   1.00   1.00
SLRT        100      1.00   1.00   1.00   1.00   1.00   1.00    1.00   1.00   1.00   1.00   1.00   1.00
            500      0.99   1.00   1.00   1.00   1.00   1.00    1.00   1.00   1.00   1.00   1.00   1.00
            1,000    0.99   0.99   1.00   0.99   1.00   1.00    1.00   1.00   1.00   1.00   1.00   1.00
Bootstrap   100      0.70   0.75   0.97   0.75   0.80   0.97    1.00   1.00   1.00   1.00   1.00   1.00
            500      0.72   0.77   0.97   0.78   0.85   0.98    1.00   1.00   1.00   1.00   0.99   1.00
            1,000    0.76   0.80   0.95   0.79   0.88   0.98    0.99   0.99   1.00   1.00   1.00   1.00

In addition to the coverage frequencies, we investigated how conclusive the confidence regions are, i.e., how informative the intervals are for practitioners in deciding about the existence and size of the causal-effects, while still providing a valid uncertainty assessment. To this end, we show in Figure 1 the average width of the non-zero part of the confidence sets. We report the results for dimension d = 6, where the data were generated according to random DAGs with expected edge weight β = 0.5, against the sample size. In contrast to the bootstrapping method, our proposed LRT method constitutes a valid confidence region. Nevertheless, even in the considered setting with edge weight β = 0.5, where all variants achieved the desired coverage, our method significantly outperforms bootstrapping in the dense setting, while performing similarly in the sparse setting. In contrast, the split likelihood ratio tests yield wider confidence intervals. We also note that the confidence intervals are generally wider in the dense setting, i.e., the remaining uncertainty about the numerical size of the effects is higher for dense DAGs. Our intuition here is that in dense DAGs, more directed paths contribute to the total causal-effect; thus, more edge weights with remaining uncertainty are involved. Figure 1 shows that our proposed methods successfully help to locate the numerical size of existing causal-effects while correctly quantifying remaining uncertainty.

Figure 1: Mean width of 95% confidence intervals for the total causal-effect of X_1 on X_2 in randomly generated 6-dim. DAGs (1,000 replications).

In addition to locating the numerical size of existing effects, an informative confidence interval should help in deciding whether there exists remaining uncertainty about the existence of an effect. Therefore, to investigate whether the proposed methods pick up on the causal direction of the effect between two variables, we plotted the proportions of times zero is included in the confidence sets when there is a true non-zero effect. Figure 2 shows the results for dimension d = 6 with expected edge weight β = 0.5 against the sample size and for the sample size n = 500 against the expected edge weight. The LRT method eliminates the remaining uncertainty about the direction of the causal-effect most quickly with increasing sample size, significantly outperforming the bootstrapping method and the split likelihood ratio tests. As intuitively expected, for higher average edge weights, there is less uncertainty about the underlying causal structure, and thus, the methods more often exclude the possibility of no causal-effect.

Figure 2: Percentages of times zero contained in 95% confidence intervals in randomly generated 6-dim. DAGs with non-zero effect (1,000 replications): (Left) against sample size and (Right) against expected edge weights.

Furthermore, one observes that, in contrast to the bootstrapping method, the confidence regions of our proposed methods contain zero less often in the dense setting than in the sparse setting. This stems from the fact that there exists less uncertainty about the causal ordering for dense DAGs, and thus, the remaining uncertainty about the direction of the involved effects is lower than in the sparse setting. While there might be less uncertainty about the numerical size of the effect, and thus narrower confidence regions, in the sparse setting, the uncertainty about the causal structure is higher.

We obtain similar results for larger causal structures with 12 involved variables. Figure 3 shows the percentages of times our proposed LRT-confidence regions contain zero as well as the mean width of the non-zero part. The data were generated with an expected edge weight β = 0.5 and increasing sample sizes. In the case of no true effect, the confidence intervals always correctly contain zero. The average width of the non-zero part is close to zero, even for low sample sizes. If there is a non-zero total causal-effect, the remaining uncertainty about the direction of the effect is already eliminated in samples of size around 2,000. We highlight again that the remaining uncertainty about the causal structure (reflected by the percentage of times zero is contained in the confidence interval) is higher in the sparse setting. In contrast, the remaining uncertainty about the numerical size of the effect, reflected by the average width of the non-zero part of the confidence intervals, is lower.

Figure 3: Mean width and percentages of times zero contained in 95% confidence intervals for the total causal-effect of X_1 on X_2 in randomly generated 12-dim. DAGs (100 replications).

In summary, the proposed framework for constructing confidence intervals for total causal-effects successfully detects the direction and the numerical size of causal-effects and correctly quantifies the remaining uncertainty in causal structure and effect size.

Figure 4 compares the average computation times over 100 randomly drawn DAGs. As the number of possible causal structures grows superexponentially with the dimension, the computation times for our proposed confidence intervals naturally increase at a similar rate, since the intervals take uncertainty over all possible structures into account. However, Figure 4 shows that it is feasible to compute this complete uncertainty over all causal structures for moderate dimensions with lower computation times than the bootstrapping method. We plotted the computation times for an expected edge weight β = 0.5 and sample size n = 1,000 against the dimension (up to d = 12). In contrast to the bootstrapping procedure based on the greedy search algorithm GDS, the computation times for our proposed confidence intervals are lower in the dense setting than in the sparse setting. This is not unexpected, as dense DAGs generally come with less uncertainty about the causal structure and thus lead to fewer plausible causal orderings being passed on to the testing procedure. Similarly, if there is no true effect, computation times are lower since we need to test within fewer plausible causal orderings. In general, the computation times are mainly determined by the reduction of the set of plausible orderings with Algorithm 2 and by the amount of structure uncertainty within the data.

Figure 4: Mean computation times of 95% confidence intervals for the causal-effect in randomly generated DAGs in seconds (100 replications): (Left) against dimension and (Right) against sample size.

Figure 4 also includes the average computation times for increasing sample sizes with expected edge weight β = 0.5 and dimension d = 6. In contrast to the bootstrapping procedure, increasing the sample size reduces the computation times for our proposed confidence intervals: more available data reduce the uncertainty about the underlying structure, which in turn reduces the computation time.

Our framework is based on the assumption of an underlying Gaussian LSEM with equal error variances. We emphasize that if the true data-generating mechanism follows a Gaussian LSEM with differing error variances, the causal structure and, consequently, the true causal-effects are generally not identifiable. Therefore, our task of constructing a confidence interval for this causal-effect is not well defined. However, in this setting, our method targets the causal-effect in an approximation of the data-generating mechanism under the equal error variances assumption. With practical applications in mind, we investigated this behavior and, thus, the sensitivity of our method toward deviations from equal error variances. We generated data according to the same procedure as described earlier (for d = 6, β = 0.1, and n = 500), but with error variances sampled uniformly from [1 − 0.5v, 1 + 0.5v]. In that way, v quantifies the degree of deviation from homoscedasticity across the interacting variables, ranging from 0 to 1.8. The empirical coverage probabilities in Figure 5 show that our framework is rather robust to small deviations from equal error variances. For small deviations, the targeted causal-effect in the approximating equal error variance model is close to the true causal-effect, and thus, our confidence intervals nevertheless cover the (unidentifiable) true effect.
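To make the robustness experiment concrete, the following minimal Python sketch generates one such data set. The error variances are drawn uniformly from [1 − 0.5v, 1 + 0.5v] as described above; the particular random-DAG and edge-weight scheme, as well as the function name, are illustrative assumptions and not the exact simulation design used in this article.

```python
import numpy as np

def simulate_unequal_variance_lsem(d=6, n=500, beta=0.1, v=0.5, p_edge=0.5, seed=None):
    """Simulate one data set from a linear SEM whose error variances are drawn
    uniformly from [1 - 0.5*v, 1 + 0.5*v] (v = 0 recovers equal variances).

    NOTE: the random-DAG and edge-weight scheme below is an illustrative
    assumption, not the exact simulation design of the article.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(d)              # random causal ordering
    B = np.zeros((d, d))                    # B[j, i] = weight of edge i -> j
    for j in range(d):
        for i in range(j):
            if rng.random() < p_edge:
                # Edge weight with expected absolute size beta (assumption).
                B[order[j], order[i]] = rng.choice([-1.0, 1.0]) * rng.exponential(beta)
    # Error variances sampled uniformly from [1 - 0.5v, 1 + 0.5v].
    sigma2 = rng.uniform(1 - 0.5 * v, 1 + 0.5 * v, size=d)
    eps = rng.normal(0.0, np.sqrt(sigma2), size=(n, d))
    # X solves X = X B^T + eps, i.e. X = eps (I - B)^{-T}.
    X = eps @ np.linalg.inv(np.eye(d) - B).T
    return X, B, sigma2
```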

Figure 5: Empirical coverage of 95% confidence intervals for the total causal-effect of X_1 on X_2 in randomly generated 6-dim. DAGs under departure from equal error variances (1,000 replications).

In conclusion, the likelihood ratio method performs best in our simulation studies and yields the most informative confidence intervals. Nonetheless, while the split likelihood ratio test is more conservative, we emphasize that it is the only method with theoretical finite-sample guarantees.

5.4 Real data example

To illustrate the importance of correctly accounting for uncertainty in the causal structure, we consider the frequently studied Sachs protein data [20]. This collection of 14 data sets comprises expression levels for 11 proteins and phospholipids in human T-cells obtained under different experimental conditions. In applications involving variables from similar domains and measurements obtained by similar techniques, the assumption of equal error variances offers a simple way to conduct exploratory analyses of cause-effect relations, also in light of the robustness of our method toward small deviations from equal variance that was seen in our simulations. In our data analysis, we compared our LRT method against naive bootstrapping with the GDS method [8] as well as bootstrapping with the LiNGAM method [21]. The LiNGAM method explicitly assumes non-Gaussian errors to ensure identifiability. We considered the first of the 14 data sets, with a sample size of 853, and focused on a subset of 10 proteins (excluding PKA) that have measurements on a similar scale. The protein PKA is a highly variable source node in the ground-truth causal DAG reported in the study by Sachs et al. [20]. To account for its influence, we first performed linear regressions of each protein on PKA and continued our analysis using the residuals.
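To illustrate this preprocessing step, the sketch below regresses each remaining protein on PKA and keeps the residuals. The function name, the column label "PKA", and the commented data-loading step are placeholders; only the residualization logic reflects the procedure described above.

```python
import numpy as np
import pandas as pd

def residualize_on_pka(df: pd.DataFrame, pka_col: str = "PKA") -> pd.DataFrame:
    """Replace every protein column by its residual from a simple linear
    regression (with intercept) on PKA, then drop the PKA column.

    The column name "PKA" is a placeholder for however the protein is
    labeled in the data set at hand.
    """
    pka = df[pka_col].to_numpy()
    design = np.column_stack([np.ones(len(pka)), pka])     # intercept + PKA
    residuals = {}
    for col in df.columns.drop(pka_col):
        y = df[col].to_numpy()
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)  # OLS fit
        residuals[col] = y - design @ coef
    return pd.DataFrame(residuals, index=df.index)

# Usage (hypothetical file name):
# data = pd.read_csv("sachs_dataset1.csv")
# data_resid = residualize_on_pka(data)
```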

Figure 6 shows the calculated 95% confidence intervals for six exemplary causal-effects. The regression coefficient of X_i when regressing X_j on (X_i, X_pa(i)), where the conventionally accepted ground-truth structure defines the parent set pa(i), is taken as the true target effect C(i → j) and marked in red. In total, our proposed confidence intervals cover 84 of the 90 pairwise causal-effects among the 10 involved variables. In contrast, naive bootstrapping with GDS and with LiNGAM covers only 66 and 72, respectively. Our confidence intervals indicate that, even within our strict modeling assumptions, there is substantial structure uncertainty. Out of the 90 calculated intervals, 67 contain zero as well as non-zero effects, thus reflecting uncertainty about the direction and existence of effects. In comparison, the naive methods with GDS and with LiNGAM contain both directions in only 35 and 55 cases, respectively. Thus, while the bootstrapping methods yield narrower intervals on average, they do not correctly account for structure uncertainty and tend to miss smaller target effects.
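The red reference values can, for instance, be computed as follows, where the parent set pa(i) must be supplied from the conventionally accepted ground-truth DAG; the function is a minimal illustration of this regression, not code from the article.

```python
import numpy as np

def true_total_effect(X, i, j, parents_of_i):
    """Ground-truth target effect C(i -> j): the coefficient of X_i when
    regressing X_j on X_i together with the parents of X_i, with the parent
    set taken from the accepted ground-truth DAG. X is an (n, d) array.
    """
    cols = [i] + list(parents_of_i)
    design = np.column_stack([np.ones(X.shape[0]), X[:, cols]])
    coef, *_ = np.linalg.lstsq(design, X[:, j], rcond=None)
    return coef[1]  # coefficient of X_i (first column after the intercept)
```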

Figure 6: 95% confidence intervals for the total causal-effect between proteins from the real-world expression data.

6 Discussion

In this article, we raise the problem of rigorously quantifying uncertainty in causal inference when the causal structure is unknown. We demonstrate the intricacies of this task and propose a precise solution for constructing confidence regions for total causal-effects that capture both types of uncertainty: the numerical size of effects and the causal structure. Using a test inversion approach with intersection-union tests and exploiting combinatorial shortcuts, our framework rigorously quantifies the remaining uncertainty in the concrete setting of LSEMs with homoscedastic Gaussian errors across all interacting variables.
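In generic form, the test-inversion construction can be sketched as follows: a candidate effect value c is retained in the confidence region if the joint hypothesis of causal ordering and effect size is accepted for at least one plausible ordering. In the minimal Python sketch below, lrt_statistic is a hypothetical placeholder for the constrained likelihood ratio statistic developed in this article, and the chi-square calibration is an assumption made only for illustration.

```python
import numpy as np
from scipy.stats import chi2

def confidence_region_by_test_inversion(data, orderings, lrt_statistic,
                                        grid=np.linspace(-3, 3, 601), alpha=0.05):
    """Minimal sketch of the test-inversion idea (not the article's exact
    implementation): keep every candidate effect value c that is accepted
    under at least one plausible causal ordering.

    `lrt_statistic(data, ordering, c)` is a placeholder for the constrained
    likelihood ratio statistic; `orderings` is the (possibly pre-reduced)
    set of plausible causal orderings.
    """
    crit = chi2.ppf(1 - alpha, df=1)  # assumed calibration for illustration
    accepted = []
    for c in grid:
        # Intersection-union logic: the hypothesis "effect = c" is rejected
        # only if it is rejected under every plausible ordering.
        if min(lrt_statistic(data, pi, c) for pi in orderings) <= crit:
            accepted.append(c)
    return np.array(accepted)
```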

Our proposal of leveraging test inversions of joint tests for causal structure and size of effect to obtain confidence regions for causal-effect estimates under structure uncertainty is general in nature and, given suitable statistical tests, could be extended to other modeling assumptions (see Strieder et al. [9] for heuristic results in the example of bivariate LSEMs with non-Gaussian errors). Moreover, our framework could be adjusted to target other (causal) quantities by reformulating the corresponding constraint.

Assuming LSEMs with equal error variances ensures identifiability. Thus, a confidence region corresponds to a set of covariance matrices, each of which uniquely entails a corresponding total causal-effect. Without equal error variances (or, more generally, without identifiability), this unique correspondence no longer holds. While our computational framework of focusing on causal orders and complete graphs might be adapted to consider a set of plausible covariance matrices, every covariance matrix would then entail a set of possible causal-effects that always contains zero. Thus, the corresponding confidence regions would not be conclusive about the direction of effects. In contrast, we demonstrated that, under the assumption of equal error variances, our proposed confidence regions not only correctly quantify the remaining uncertainty in causal structure and numerical effect size, but also provide conclusive information that can help practitioners decide about the existence and numerical size of effects.

The computational shortcuts we adopt allow us to quantify structure uncertainty over all possible DAGs among a moderate number of nodes. For higher-dimensional systems, it is not feasible to consider all possible DAGs, given that there already exist more DAGs on 22 nodes (about 10^87) than the estimated number of atoms in the universe. However, our method can still serve practitioners as a valuable tool for analyzing higher-dimensional causal systems by partially quantifying the uncertainty in the causal structure.
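The quoted order of magnitude can be checked with Robinson's recursion for the number of labeled DAGs; the short snippet below is included only to illustrate the count.

```python
from math import comb

def num_labeled_dags(n: int) -> int:
    """Number of labeled DAGs on n nodes via Robinson's recursion:
    a(n) = sum_{k=1}^{n} (-1)^(k+1) * C(n, k) * 2^(k*(n-k)) * a(n-k), a(0) = 1.
    """
    a = [1]  # a(0)
    for m in range(1, n + 1):
        a.append(sum((-1) ** (k + 1) * comb(m, k) * 2 ** (k * (m - k)) * a[m - k]
                     for k in range(1, m + 1)))
    return a[n]

# num_labeled_dags(22) is on the order of 10^87, consistent with the text above.
```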

In the status quo scenario, practitioners either confer with experts who provide the underlying causal structure or use structure learning algorithms to estimate the DAG from available data. Starting from this inferred model, classical statistical methods can then be used to calculate confidence intervals for the involved causal parameters. These confidence intervals quantify uncertainty in the numerical size of the causal-effect conditional on the inferred model. However, they do not incorporate uncertainty about the underlying causal structure and the (data-driven) model choice. While in higher-dimensional systems it is not computationally tractable to quantify structure uncertainty by considering all DAGs, our framework can still be used as a compromise in between. Rather than relying solely on the full structure provided by experts, our method can be used to consider a set of plausible causal structures. This way, practitioners can incorporate experts' varying degrees of belief in different parts of the structure by fixing parts with high confidence while still accounting for the remaining uncertainty over the rest. Along similar lines, instead of focusing on a single output of causal learning algorithms, practitioners can apply our method to a set of (most likely) causal structures and thereby incorporate some degree of structure uncertainty into causal-effect estimates.

In summary, we consider our study of total causal-effects in linear causal models with equal error variances as the beginning of a new line of research on confidence sets under structure uncertainty. We anticipate fruitful generalizations in many directions, including not only other model settings, such as the linear non-Gaussian case or additive noise models, but also causal parameters other than total effects. From this perspective, while this article develops our ideas in a very concrete setting, it sets the scene for various follow-up works providing rigorously justified confidence regions in causal-effect estimation when the causal structure itself is learned from data.

  1. Funding information: This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement No. 83818). Furthermore, this work has been funded by the German Federal Ministry of Education and Research and the Bavarian State Ministry for Science and the Arts. The authors of this work take full responsibility for its content.

  2. Conflict of interest: The authors state no conflict of interest.

References

[1] Pearl J. Causality: models, reasoning, and inference. 2nd ed. Cambridge: Cambridge University Press; 2009. doi:10.1017/CBO9780511803161

[2] Spirtes P, Glymour C, Scheines R. Causation, prediction, and search. Adaptive Computation and Machine Learning. Cambridge, MA: MIT Press; 2000. doi:10.7551/mitpress/1754.001.0001

[3] Hoyer PO, Hyttinen A. Bayesian discovery of linear acyclic causal models. In: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. UAI’09. Arlington, VA: AUAI Press; 2009. p. 240–8.

[4] Claassen T, Heskes T. A Bayesian approach to constraint based causal inference. In: Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence. UAI’12. Arlington, VA: AUAI Press; 2012. p. 207–16.

[5] Cao X, Khare K, Ghosh M. Posterior graph selection and estimation consistency for high-dimensional Bayesian DAG models. Ann Statist. 2019;47(1):319–48. doi:10.1214/18-AOS1689

[6] Peters J, Janzing D, Schölkopf B. Elements of causal inference: foundations and learning algorithms. Adaptive Computation and Machine Learning. Cambridge, MA: MIT Press; 2017.

[7] Maathuis M, Drton M, Lauritzen S, Wainwright M, editors. Handbook of graphical models. Chapman & Hall/CRC Handbooks of Modern Statistical Methods. Boca Raton, FL: CRC Press; 2019. doi:10.1201/9780429463976

[8] Peters J, Bühlmann P. Identifiability of Gaussian structural equation models with equal error variances. Biometrika. 2014;101(1):219–28. doi:10.1093/biomet/ast043

[9] Strieder D, Freidling T, Haffner S, Drton M. Confidence in causal discovery with linear causal models. In: Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence. UAI’21. Vol. 161. PMLR; 2021. p. 1217–26.

[10] Chen W, Drton M, Wang YS. On causal discovery with an equal-variance assumption. Biometrika. 2019;106(4):973–80. doi:10.1093/biomet/asz049

[11] Ghoshal A, Honorio J. Learning linear structural equation models in polynomial time and sample complexity. In: Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics. Vol. 84. PMLR; 2018. p. 1466–75.

[12] Andrews DWK, Guggenberger P. Asymptotic size and a problem with subsampling and with the m out of n bootstrap. Econ Theory. 2010;26(2):426–68. doi:10.1017/S0266466609100051

[13] Drton M, Williams B. Quantifying the failure of bootstrap likelihood ratio tests. Biometrika. 2011;98(4):919–34. doi:10.1093/biomet/asr033

[14] Casella G, Berger RL. Statistical inference. The Wadsworth & Brooks/Cole Statistics/Probability Series. Pacific Grove, CA: Wadsworth & Brooks/Cole; 1990.

[15] Drton M. Algebraic problems in structural equation modeling. In: The 50th Anniversary of Gröbner Bases. Vol. 77 of Adv. Stud. Pure Math. Tokyo: Math. Soc. Japan; 2018. p. 35–86. doi:10.2969/aspm/07710035

[16] Silvapulle MJ, Sen PK. Constrained statistical inference. Wiley Series in Probability and Statistics. Hoboken, NJ: Wiley-Interscience; 2005.

[17] Wasserman L, Ramdas A, Balakrishnan S. Universal inference. Proc Natl Acad Sci USA. 2020;117(29):16880–90. doi:10.1073/pnas.1922664117

[18] Drton M. Likelihood ratio tests and singularities. Ann Statist. 2009;37(2):979–1012. doi:10.1214/07-AOS571

[19] Strieder D, Drton M. On the choice of the splitting ratio for the split likelihood ratio test. Electron J Stat. 2022;16(2):6631–50. doi:10.1214/22-EJS2099

[20] Sachs K, Perez O, Pe’er D, Lauffenburger DA, Nolan GP. Causal protein-signaling networks derived from multiparameter single-cell data. Science. 2005;308(5721):523–9. doi:10.1126/science.1105809

[21] Shimizu S, Hoyer PO, Hyvärinen A, Kerminen A. A linear non-Gaussian acyclic model for causal discovery. J Mach Learn Res. 2006;7:2003–30.

Received: 2023-05-16
Revised: 2023-09-09
Accepted: 2023-09-21
Published Online: 2023-12-19

© 2023 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
