
On the pitfalls of Gaussian likelihood scoring for causal discovery

  • Christoph Schultheiss and Peter Bühlmann
Published/Copyright: May 11, 2023

Abstract

We consider likelihood score-based methods for causal discovery in structural causal models. In particular, we focus on Gaussian scoring and analyze the effect of model misspecification in terms of non-Gaussian error distribution. We present a surprising negative result for Gaussian likelihood scoring in combination with nonparametric regression methods.

MSC 2010: 62D20

1 Introduction

We consider the problem of finding the causal structure of a set of random variables $X_1, \ldots, X_p$. We assume that the data can be represented by a structural equation model whose structure is given by a directed acyclic graph (DAG), say, $\mathcal{G}^0$, with nodes $1, \ldots, p$, and denote by $\mathrm{PA}(j)$ the parents of $j$. Without further assumptions on the structural causal model, one can only find $\mathcal{G}^0$ up to its Markov equivalence class. This can be achieved, for example, with the PC algorithm [1]. Its generality comes at the price of requiring conditional independence tests, which constitute a statistically hard problem.

Here, we focus on the so-called additive noise model (ANM)

(1) $X_j \leftarrow f_j(X_{\mathrm{PA}(j)}) + \varepsilon_j, \quad j \in \{1, \ldots, p\},$

where the $\varepsilon_j$ are mutually independent, centered random variables. This is a popular modeling assumption that allows for better identifiability guarantees [2,3].

For an arbitrary DAG $\mathcal{G}$, let $\mathrm{PA}_{\mathcal{G}}(j)$ be the nodes from which a directed edge toward $j$ starts. Define

$\varepsilon_j^{\mathcal{G}} = X_j - E[X_j \mid X_{\mathrm{PA}_{\mathcal{G}}(j)}] \quad \forall j \in \{1, \ldots, p\}.$

Obviously, $\varepsilon_j^{\mathcal{G}^0} = \varepsilon_j$. Under not overly restrictive assumptions, it holds that $\varepsilon_1^{\mathcal{G}}, \ldots, \varepsilon_p^{\mathcal{G}}$ are mutually independent only if $\mathcal{G} \supseteq \mathcal{G}^0$ [3]. Thus, an obvious approach to find $\mathcal{G}^0$ is to loop over all possible graphs and test for independence of the residuals. Of course, this becomes infeasible when the dimensionality $p$ grows.

A more reasonable algorithm based on greedy search is presented in the study by Peters et al. [3]. They also introduce regression with subsequent independence test (RESIT), which iteratively detects sink nodes. Finding the true DAG is guaranteed, assuming perfect regressors and independence tests. It involves $O(p^2)$ nonparametric independence tests, which might be computationally involved or lack power when the sample size is small.
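To make the RESIT idea concrete, the following is a minimal sketch of the sink-detection loop. The regression method (a random forest) and the dependence measure (a simple biased HSIC statistic with Gaussian kernels) are illustrative choices and not prescribed by [3].

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def hsic(x, y, sigma=1.0):
    """Biased HSIC statistic with Gaussian kernels; larger values mean more dependence."""
    x, y = x.reshape(-1, 1), y.reshape(-1, 1)
    n = x.shape[0]
    kx = np.exp(-(x - x.T) ** 2 / (2 * sigma ** 2))
    ky = np.exp(-(y - y.T) ** 2 / (2 * sigma ** 2))
    h = np.eye(n) - np.ones((n, n)) / n
    return np.trace(kx @ h @ ky @ h) / n ** 2

def resit_order(X):
    """Return a causal order (sources first) by repeatedly peeling off the best sink."""
    remaining = list(range(X.shape[1]))
    order = []
    while len(remaining) > 1:
        dep = {}
        for j in remaining:
            parents = [k for k in remaining if k != j]
            reg = RandomForestRegressor(n_estimators=200, min_samples_leaf=20, random_state=0)
            reg.fit(X[:, parents], X[:, j])
            resid = X[:, j] - reg.predict(X[:, parents])
            # dependence between the residual and each potential parent; take the worst case
            dep[j] = max(hsic(resid, X[:, k]) for k in parents)
        sink = min(dep, key=dep.get)   # residuals look most independent -> treat as sink
        order.insert(0, sink)
        remaining.remove(sink)
    return [remaining[0]] + order
```

Note that the HSIC statistic above is quadratic in the sample size, which already hints at the computational burden of the $O(p^2)$ independence tests.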

Instead of performing independence tests, one can compare the likelihood scores of different graphs:

$\ell(\mathcal{G}) = E\left[\log \prod_{j=1}^p p_j^{\mathcal{G}}(\varepsilon_j^{\mathcal{G}})\right] = \sum_{j=1}^p E[\log(p_j^{\mathcal{G}}(\varepsilon_j^{\mathcal{G}}))],$

where $p_j^{\mathcal{G}}$ denotes the density of $\varepsilon_j^{\mathcal{G}}$. Only for independent $\varepsilon_j^{\mathcal{G}}$ does their multivariate density factorize as suggested. Therefore, the true DAG maximizes this quantity due to the properties of the Kullback-Leibler divergence. In practice, this comes with the additional difficulty of estimating the densities $p_j^{\mathcal{G}}(\cdot)$, and one needs to add some penalization to prefer simpler graphs and avoid selecting complete graphs [4].

If one additionally assumes that the $\varepsilon_j$ are marginally normally distributed $\mathcal{N}(0, \sigma_j^2)$, one can instead consider the Gaussian likelihood

$\ell_N(\mathcal{G}) = \sum_{j=1}^p \left[-\log(\sigma_j^{\mathcal{G}}) - \frac{1}{2} - \frac{1}{2}\log(2\pi)\right] = -\sum_{j=1}^p \log(\sigma_j^{\mathcal{G}}) + C,$

where $(\sigma_j^{\mathcal{G}})^2 = E[(\varepsilon_j^{\mathcal{G}})^2]$ [5]. It holds that $\ell(\mathcal{G}) \geq \ell_N(\mathcal{G})$ and $\ell_N(\mathcal{G}^0) = \ell(\mathcal{G}^0)$. Thus, the problem simplifies to finding the graph that leads to the lowest sum of log-variances.
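As a small illustration of Gaussian scoring with flexible regression, the sketch below scores a candidate DAG by the negative sum of log residual standard deviations. The k-nearest-neighbor regressor and the two-variable example are illustrative choices, not the method of [5]; in practice one would add sample splitting and a sparsity penalty.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def gaussian_score(X, parents_of):
    """X: (n, p) data array; parents_of: dict mapping node -> list of parent columns."""
    score = 0.0
    for j in range(X.shape[1]):
        pa = parents_of.get(j, [])
        if pa:
            reg = KNeighborsRegressor(n_neighbors=100).fit(X[:, pa], X[:, j])
            resid = X[:, j] - reg.predict(X[:, pa])
        else:
            resid = X[:, j] - X[:, j].mean()   # source node: only centered
        score -= np.log(resid.std())           # -log(sigma_j^G), up to the constant C
    return score

# Two-variable example: the graph with the higher score has the smaller
# product of residual variances.
rng = np.random.default_rng(0)
x1 = rng.uniform(-1, 1, 20_000)
x2 = 2.0 * x1 + rng.uniform(-1, 1, 20_000)
X = np.column_stack([x1, x2])
print(gaussian_score(X, {1: [0]}))   # forward model X1 -> X2
print(gaussian_score(X, {0: [1]}))   # backward model X2 -> X1
```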

Under such a normality assumption for the $\varepsilon_j$, and with the $f_j(\cdot)$ in (1) being additive in their arguments, Bühlmann et al. [5] present a causal discovery method that is consistent for high-dimensional data. To justify this approach for a broader class of error distributions, define

(2) $\Delta := \min_{\mathcal{G} \not\supseteq \mathcal{G}^0} \sum_{j=1}^p \left[\log(\sigma_j^{\mathcal{G}}) - \log(\sigma_j^{\mathcal{G}^0})\right].$

Then, one has to assume that

(A1) $\Delta > 0$.

That is, the lowest possible expected negative Gaussian log-likelihood over all graphs $\mathcal{G}$ that are not supersets of the true $\mathcal{G}^0$ must be strictly larger than the expected negative Gaussian log-likelihood of the true graph $\mathcal{G}^0$. An argument for that assumption is that a true causal model should be easier to fit in some sense and should thus also attain a lower error variance. Using simple “non-pathological” examples, we demonstrate how this can easily be a fallacy when the true error distribution is misspecified. Thus, we advocate being very careful when using Gaussian scoring with flexible nonparametric regression functions in causal discovery. The main part of this article considers theoretical population properties. We discuss some data applications in Section 5.1.

2 Data-generating linear model

We begin with data-generating linear models in which

$f_j(X_{\mathrm{PA}(j)}) = \sum_{k \in \mathrm{PA}(j)} \beta_{jk} X_k.$

For these, we obtain the explicit result in Theorem 1. The intuition for this result carries over to a range of nonlinear ANMs (1), especially when the causal effects are close to linear. We present corresponding examples in Section 3.

If all $\varepsilon_j$ in a data-generating linear model are Gaussian, i.e., $X_1, \ldots, X_p$ are jointly Gaussian, it is known that any causal order could induce the given multivariate normal distribution and obtain an optimal Gaussian score. Assuming that the distribution is faithful with respect to the true DAG $\mathcal{G}^0$, the set of sparsest DAGs obtaining the optimal Gaussian score corresponds to the Markov equivalence class of $\mathcal{G}^0$ [6]. Thus, one can obtain this Markov equivalence class by preferring sparser DAGs in case of equal scores. In general, the single true DAG cannot be determined even if the full multivariate distribution is known. On the contrary, the linear non-Gaussian acyclic model introduced by Shimizu et al. [7] is known to be identifiable. In such linear non-Gaussian models, algorithms designed for linear Gaussian models, e.g., the PC algorithm using partial correlation to assess conditional independence, do not use all the available information, but typically still provide the same guarantees since they depend on the covariance structure only. The covariance matrix of data generated by a linear causal model does not change when replacing a Gaussian $\varepsilon_j$ by an additive error of the same variance but otherwise arbitrary distribution. Thus, assuming the faithfulness condition for the data-generating distribution, Gaussian scoring for linear causal models in the infinite sample limit leads to the true underlying Markov equivalence class even under misspecification of the error distribution.

If the data-generating model is not known to be linear, such that nonparametric regression methods (or the conditional mean as their population version) are applied, this generalization no longer holds, as laid out in the following theorem. Let $\pi$ be a permutation of $\{1, \ldots, p\}$ and $\mathcal{G}^{\pi}$ be the full DAG according to $\pi$, i.e., $\pi(k) \in \mathrm{PA}_{\mathcal{G}^{\pi}}(\pi(j))$ if and only if $k < j$.

Theorem 1

Let $X_1, \ldots, X_p$ come from a linear model:

(3) $X_j \leftarrow \sum_{k \in \mathrm{PA}(j)} \beta_{jk} X_k + \varepsilon_j, \quad \text{with } E[\varepsilon_j] = 0, \; E[\varepsilon_j^2] < \infty \; \forall j \in \{1, \ldots, p\},$

with mutually independent $\varepsilon_j$. Then, for every possible permutation $\pi$,

$\sum_{j=1}^p \left[\log(\sigma_j^{\mathcal{G}^{\pi}}) - \log(\sigma_j^{\mathcal{G}^0})\right] \leq 0.$

That is, for every causal order, the corresponding full graph scores at least as well as the true causal graph.

Furthermore,

$\sum_{j=1}^p \left[\log(\sigma_j^{\mathcal{G}^{\pi}}) - \log(\sigma_j^{\mathcal{G}^0})\right] < 0 \quad \text{if } \exists j \in \{1, \ldots, p\}: \; E[X_j \mid X_{\mathrm{PA}_{\mathcal{G}^{\pi}}(j)}] \neq X_{\mathrm{PA}_{\mathcal{G}^{\pi}}(j)}^{\top} \beta_{\pi,j},$ where $\beta_{\pi,j} = E[X_{\mathrm{PA}_{\mathcal{G}^{\pi}}(j)} X_{\mathrm{PA}_{\mathcal{G}^{\pi}}(j)}^{\top}]^{-1} E[X_j X_{\mathrm{PA}_{\mathcal{G}^{\pi}}(j)}]$ is the least squares parameter.

That is, for every causal order, the corresponding full graph scores strictly better than the true causal graph if at least one conditional expectation is nonlinear in the parental variables.

Apart from some pathological cases, the last condition holds for permutations that are not compatible with the true DAG unless all the $\varepsilon_j$ are Gaussian. Thus, the gap condition (A1) does not hold for this whole set of distributions, and, without Gaussianity, a wrong model would not only score equivalently but would even be preferred. This is in stark contrast to results around Gaussian scoring when fitting linear models.

2.1 Illustrative examples

For illustrative purposes, we restrict ourselves to the two-variable case with $X_2 \leftarrow \beta X_1 + \varepsilon_2$. We have the true DAG $\mathcal{G}^0 = \{X_1 \to X_2\}$ and define $\mathcal{G} = \{X_1 \leftarrow X_2\}$. If $\beta X_1 \stackrel{\mathcal{D}}{=} \varepsilon_2$, it holds that

$X_2 = E[X_2 \mid X_2] = E[\beta X_1 + \varepsilon_2 \mid X_2] = 2\beta E[X_1 \mid X_2], \quad \text{implying that } E[X_1 \mid X_2] = \frac{1}{2\beta} X_2,$

such that $\Delta = 0$; see the definition in (2). In the two-variable linear model, $\exp(\Delta)^2$ equals the ratio of the attainable mean squared errors for the backward direction between the best-fitting unrestricted model and the best-fitting linear model.

First, we consider the analytically tractable case where $X_1 \stackrel{\mathcal{D}}{=} \varepsilon_2 \sim \mathrm{Unif}[-1,1]$. Then, for $\beta = \pm 1$, $E[X_1 \mid X_2] = \pm X_2/2$ and $\Delta = 0$. For every other nonzero and bounded $\beta$, $\Delta < 0$, so a causal discovery method based on Gaussian scoring would, assuming a correct regression function estimator, wrongly claim $X_2 \to X_1$ in the large sample limit. We make things more explicit.

Proposition 1

Let $X_2 = \beta X_1 + \varepsilon_2$ with $X_1 \stackrel{\mathcal{D}}{=} \varepsilon_2 \sim \mathrm{Unif}[-1,1]$. Define $\gamma = \max\{|\beta|, 1/|\beta|\}$. Then, the ratio of the variances is given as follows:

$\exp(\Delta)^2 = \prod_{j=1}^2 (\sigma_j^{\mathcal{G}})^2 \Big/ \prod_{j=1}^2 (\sigma_j^{\mathcal{G}^0})^2 = \frac{(\gamma^2 + 1)(2\gamma - 1)}{2\gamma^3} =: r(\gamma).$

As argued above, $r(1) = 1$. For $\beta \to 0$, the ratio between the variance products approaches 1 as $X_1$ and $X_2$ become independent, and hence, both models perform equally well. For $\beta \to \infty$, the ratio approaches 1 as $X_2$ becomes a deterministic linear map of $X_1$, and hence, the linear model is invertible. The ratio $r(\gamma)$ is minimized for $\gamma = 3$, with $r(3) \approx 0.93$; it is strictly decreasing for $\gamma \in [1, 3)$ and strictly increasing for $\gamma \in (3, \infty)$. This is shown in Figure 1(a).
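A quick Monte Carlo sanity check of Proposition 1 could look as follows; the k-nearest-neighbor estimate of the backward conditional mean is an illustrative choice.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
n, beta = 200_000, 3.0
x1 = rng.uniform(-1, 1, n)
x2 = beta * x1 + rng.uniform(-1, 1, n)

# backward residual variance E[Var(X1 | X2)] via a k-NN estimate of E[X1 | X2]
knn = KNeighborsRegressor(n_neighbors=200).fit(x2.reshape(-1, 1), x1)
sigma1_sq = np.var(x1 - knn.predict(x2.reshape(-1, 1)))

ratio_hat = sigma1_sq * np.var(x2) / (np.var(x1) * (1.0 / 3.0))   # empirical exp(Delta)^2
gamma = max(beta, 1.0 / beta)
r_gamma = (gamma**2 + 1) * (2 * gamma - 1) / (2 * gamma**3)       # closed form from Prop. 1
print(ratio_hat, r_gamma)   # both should be close to 50/54, roughly 0.93, for beta = 3
```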

Figure 1: Two-variable linear model: effect of changing $\beta$. (a) $X_1 \stackrel{\mathcal{D}}{=} \varepsilon_2 \sim \mathrm{Unif}[-1,1]$. (b) $X_1 \sim \mathcal{N}(0,1)$, $\varepsilon_2 \sim \mathrm{Unif}[-1,1]$. (c) $X_1 \sim \mathcal{N}(0,1)$, $\varepsilon_2 + 1 \sim \chi_1^2$.

Next, we consider a similar example but with $X_1 \sim \mathcal{N}(0,1)$. Analytic expressions for $E[X_1 \mid X_2]$ and $\mathrm{Var}(X_1 \mid X_2)$ can be found in terms of the Gaussian cumulative distribution function. For $(\sigma_1^{\mathcal{G}})^2 = E[\mathrm{Var}(X_1 \mid X_2)]$, we invoke numerical integration. The obtained $\exp(\Delta)^2$ is shown in Figure 1(b). As elaborated before, it approaches 1 for $\beta \to 0$ and $\beta \to \infty$. In between, it is strictly less than 1 since $E[X_1 \mid X_2]$ is not linear in $X_2$. The minimum of $\exp(\Delta)^2 \approx 0.92$ is obtained for $\beta \approx 0.21$.
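The numerical integration can be reproduced along the following lines; the truncated-Gaussian representation of $X_1$ given $X_2$ is derived in Appendix B.1, while the specific integration bounds and the tolerance guard below are ad hoc choices.

```python
import numpy as np
from scipy.stats import norm, truncnorm
from scipy.integrate import quad

def exp_delta_sq(beta):
    """exp(Delta)^2 for X1 ~ N(0,1), eps2 ~ Unif[-1,1], X2 = beta*X1 + eps2 (beta > 0)."""
    sd_x2 = np.sqrt(beta**2 + 1.0 / 3.0)

    def integrand(x2):
        lo, hi = (x2 - 1.0) / beta, (x2 + 1.0) / beta
        dens_x2 = 0.5 * (norm.cdf(hi) - norm.cdf(lo))   # density of X2 at x2
        if dens_x2 < 1e-12:
            return 0.0                                   # negligible mass, avoid tail issues
        return truncnorm(lo, hi).var() * dens_x2         # Var(X1 | X2 = x2), weighted

    sigma1_sq, _ = quad(integrand, -8 * sd_x2, 8 * sd_x2, limit=200)
    # exp(Delta)^2 = sigma1_G^2 * Var(X2) / (Var(X1) * Var(eps2)), with Var(X1) = 1
    return sigma1_sq * (beta**2 + 1.0 / 3.0) / (1.0 / 3.0)

print(exp_delta_sq(0.21))   # close to the reported minimum of about 0.92
```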

Finally, we use an asymmetric error distribution instead, namely, $\varepsilon_2 \sim \chi_1^2 - 1$. Figure 1(c) shows that this leads to more extreme values of $\Delta < 0$. Notably, $E[X_1 \mid X_2]$ is not monotone in this case, so the best linear fit is not a good approximation.

3 Beyond a data-generating linear model

If all $\varepsilon_j$ are Gaussian, we know that $\Delta = 0$ for linear conditional expectations in (1), but $\Delta > 0$ otherwise. Thus, involving nonlinearities enables the identifiability of the model.

For non-Gaussian $\varepsilon_j$ in a linear model, $\Delta < 0$ holds apart from some special cases; see Theorem 1. However, the intuition is that nonlinearities could be beneficial for identifiability. As the lower bound for $\Delta$ is negative and not 0, presumably a higher degree of nonlinearity might be necessary to achieve $\Delta > 0$. We analyze this with the following simple model:

(4) $X_2 \leftarrow \beta \, \frac{\mathrm{Sign}(X_1)\,|X_1|^{\nu}}{\sqrt{E[|X_1|^{2\nu}]}} + \varepsilon_2, \quad \nu > 0.$

The normalization ensures that the variance of $X_2$ depends only on $\beta$. For $\nu = 1$, we obtain a linear model.

Consider the case $X_1 \sim \mathcal{N}(0,1)$ and $\varepsilon_2 \sim \mathrm{Unif}[-1,1]$. Analytic expressions for $E[X_1 \mid X_2]$ and $\mathrm{Var}(X_1 \mid X_2)$ can be found in terms of the Gaussian cumulative distribution function and the $\Gamma$-function. For $(\sigma_1^{\mathcal{G}})^2 = E[\mathrm{Var}(X_1 \mid X_2)]$, we invoke numerical integration. The obtained $\exp(\Delta)^2$ is shown in Figure 2(a) for different values of $\beta$. It confirms the intuition that sufficiently strong nonlinearity leads to $\Delta > 0$ even for non-Gaussian errors. For $\beta = 1$ and $\beta = 2$, the model becomes identifiable for most $\nu \neq 1$. For $\beta = 0.5$, stronger nonlinearities are necessary. Furthermore, the minimum $\Delta$ is not obtained for $\nu = 1$ but at $\nu \approx 1.32$. Thus, not every nonlinear model is better identifiable than the corresponding linear model.
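A simulation-based check of these curves is straightforward; the sketch below uses a k-nearest-neighbor regression for the anti-causal conditional mean, which is an illustrative choice.

```python
import numpy as np
from scipy.special import gamma as gamma_fn
from sklearn.neighbors import KNeighborsRegressor

def exp_delta_sq(beta, nu, n=200_000, seed=0):
    """Monte Carlo estimate of exp(Delta)^2 for model (4) with X1 ~ N(0,1), eps2 ~ Unif[-1,1]."""
    rng = np.random.default_rng(seed)
    x1 = rng.standard_normal(n)
    eps2 = rng.uniform(-1, 1, n)
    v_nu = 2.0**nu * gamma_fn(nu + 0.5) / np.sqrt(np.pi)      # E[|X1|^(2*nu)]
    x2 = beta * np.sign(x1) * np.abs(x1)**nu / np.sqrt(v_nu) + eps2
    knn = KNeighborsRegressor(n_neighbors=200).fit(x2.reshape(-1, 1), x1)
    sigma1_sq = np.var(x1 - knn.predict(x2.reshape(-1, 1)))   # backward residual variance
    return sigma1_sq * np.var(x2) / (np.var(x1) * np.var(eps2))

# values above 1 indicate that Gaussian scoring prefers the causal direction
print(exp_delta_sq(beta=2.0, nu=1.0), exp_delta_sq(beta=2.0, nu=3.0))
```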

As in the linear case, we consider the effect of an asymmetric error distribution, namely, a scaled and centered chi-squared distribution. We show this in Figure 2(b). The factor $\sqrt{6}$ ensures that the different lines correspond to the same signal-to-noise ratios. As expected, higher degrees of nonlinearity are necessary to obtain a positive $\Delta$.

Figure 2: Two-variable nonlinear model (4): effect of changing $\nu$ for $\beta = 0.5$ (solid blue curve), $\beta = 1$ (dashed red curve), and $\beta = 2$ (dotted black curve). (a) $X_1 \sim \mathcal{N}(0,1)$, $\varepsilon_2 \sim \mathrm{Unif}[-1,1]$. (b) $X_1 \sim \mathcal{N}(0,1)$, $\sqrt{6}\,\varepsilon_2 + 1 \sim \chi_1^2$. (c) $X_1 \stackrel{\mathcal{D}}{=} \sqrt{3}\,\varepsilon_2 \sim \mathcal{N}(0,1)$.

We show the behavior for a correctly specified model in Figure 2(c). For $\nu = 1$, $\Delta = 0$, as implied by the unidentifiability result; otherwise, $\Delta > 0$. For $\nu \neq 1$, higher signal-to-noise ratios lead to a more distinct $\Delta > 0$.

3.1 Monotonicity of $f_2(\cdot)$

The nonlinearities discussed here are designed to be slight deviations from the linear model and are thus chosen to be strictly monotone. Notably, for non-monotone functions, the intuition that the anti-causal model is harder to fit is more applicable. In particular, if $X_1$ is centered and symmetric and $f_2(\cdot)$ is an even function, it holds that $E[X_1 \mid X_2] \equiv 0$. Then,

(5) $(\sigma_1^{\mathcal{G}})^2 = \mathrm{Var}(X_1) \quad \text{and} \quad \exp(\Delta)^2 = \frac{\mathrm{Var}(X_1)\,\mathrm{Var}(X_2)}{\mathrm{Var}(X_1)\,\mathrm{Var}(\varepsilon_2)} > 1.$

Thus, the gap condition (A1) is satisfied regardless of the distribution of $\varepsilon_2$ as long as $X_1$ and $X_2$ have finite variance.
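As a brief numerical illustration of (5), consider the even mechanism $f_2(x) = x^2$ combined with an asymmetric noise distribution; both choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.uniform(-1, 1, 500_000)            # centered, symmetric cause
eps2 = rng.exponential(1.0, 500_000) - 1.0  # asymmetric, clearly non-Gaussian noise
x2 = x1**2 + eps2                           # even mechanism f_2(x) = x^2

# exp(Delta)^2 according to (5): E[X1 | X2] is identically zero here
ratio = np.var(x1) * np.var(x2) / (np.var(x1) * np.var(eps2))
print(ratio)   # > 1, so the gap condition (A1) holds in this setup
```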

4 Heteroskedastic noise model

A simple extension of model (1) that has recently gained some attention is the heteroskedastic noise model, also referred to as the location-scale noise model,

$X_j \leftarrow f_j(X_{\mathrm{PA}(j)}) + g_j(X_{\mathrm{PA}(j)})\,\varepsilon_j \quad \text{with } E[\varepsilon_j] = 0, \; E[\varepsilon_j^2] = 1 \; \forall j \in \{1, \ldots, p\},$

with some nonnegative functions $g_j(\cdot)$ [8–10]. It comes with similar identifiability guarantees as the ANM when testing for mutual independence between the $\varepsilon_j$. Accordingly, let

$f_j^{\mathcal{G}}(X_{\mathrm{PA}_{\mathcal{G}}(j)}) = E[X_j \mid X_{\mathrm{PA}_{\mathcal{G}}(j)}], \quad g_j^{\mathcal{G}}(X_{\mathrm{PA}_{\mathcal{G}}(j)})^2 = E[(X_j - f_j^{\mathcal{G}}(X_{\mathrm{PA}_{\mathcal{G}}(j)}))^2 \mid X_{\mathrm{PA}_{\mathcal{G}}(j)}], \quad \text{and} \quad \varepsilon_j^{\mathcal{G}} = \frac{X_j - f_j^{\mathcal{G}}(X_{\mathrm{PA}_{\mathcal{G}}(j)})}{g_j^{\mathcal{G}}(X_{\mathrm{PA}_{\mathcal{G}}(j)})}$

be the conditional means, conditional variances, and residuals according to any potentially wrong DAG G .

Then, one obtains

$\ell(\mathcal{G}) = \sum_{j=1}^p E\left[\log\left(\frac{p_j^{\mathcal{G}}(\varepsilon_j^{\mathcal{G}})}{g_j^{\mathcal{G}}(X_{\mathrm{PA}_{\mathcal{G}}(j)})}\right)\right] \geq \ell_N(\mathcal{G}) = -\sum_{j=1}^p E[\log(g_j^{\mathcal{G}}(X_{\mathrm{PA}_{\mathcal{G}}(j)}))] + C = -\sum_{j=1}^p \frac{1}{2} E[\log(g_j^{\mathcal{G}}(X_{\mathrm{PA}_{\mathcal{G}}(j)})^2)] + C \geq -\sum_{j=1}^p \frac{1}{2} \log(E[g_j^{\mathcal{G}}(X_{\mathrm{PA}_{\mathcal{G}}(j)})^2]) + C = -\sum_{j=1}^p \log(\sigma_j^{\mathcal{G}}) + C.$

Thus, when fitting heteroskedastic models, the score can only increase compared to the homoskedastic fit. This can further increase the difficulty of finding the correct direction under non-Gaussian noise. Even if the true forward model is homoskedastic, i.e., $g_j(\cdot) \equiv \sigma_j$, the backward model is typically heteroskedastic and profits from this new score. For example, in the setup of Figure 1(a), $\Delta$ would be negative even for $\beta = 1$. If one allows fitting heteroskedastic models, a result analogous to Theorem 1 exists: a negative gap is obtained unless, for the wrong causal order, all conditional expectations are linear and all conditional variances are constant.
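The following sketch contrasts the homoskedastic and heteroskedastic Gaussian scores for the backward direction in the setup of Figure 1(a) with $\beta = 1$; the k-nearest-neighbor estimates of the conditional mean and conditional variance are illustrative choices.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def homo_score(x_parent, x_child, k=200):
    """-log(sigma) of the residual after a k-NN fit of the conditional mean (up to C)."""
    xp = x_parent.reshape(-1, 1)
    f_hat = KNeighborsRegressor(n_neighbors=k).fit(xp, x_child)
    return -np.log(np.std(x_child - f_hat.predict(xp)))

def hetero_score(x_parent, x_child, k=200):
    """-E[log g], with the conditional variance g^2 estimated from squared residuals (up to C)."""
    xp = x_parent.reshape(-1, 1)
    f_hat = KNeighborsRegressor(n_neighbors=k).fit(xp, x_child)
    resid_sq = (x_child - f_hat.predict(xp)) ** 2
    g_sq_hat = KNeighborsRegressor(n_neighbors=k).fit(xp, resid_sq).predict(xp)
    return -0.5 * np.mean(np.log(np.maximum(g_sq_hat, 1e-12)))

# setup of Figure 1(a) with beta = 1, scored in the backward (anti-causal) direction
rng = np.random.default_rng(3)
x1 = rng.uniform(-1, 1, 200_000)
x2 = x1 + rng.uniform(-1, 1, 200_000)
print(homo_score(x2, x1), hetero_score(x2, x1))   # the heteroskedastic fit scores higher
```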

In Figure 3, we revisit the examples from Figures 1(a) and 2(c) and see how allowing for a heteroskedastic fit makes the problem harder. For the sake of comparison, we look at $\exp(\Delta)^2$, although it does not have the same simple interpretation in the heteroskedastic case.

Figure 3: Two-variable additive model: fitting homoskedastic models (solid blue curve) versus heteroskedastic models (dashed red curve). (a) Model (3) with $X_1 \stackrel{\mathcal{D}}{=} \varepsilon_2 \sim \mathrm{Unif}[-1,1]$. (b) Model (4) with $X_1 \stackrel{\mathcal{D}}{=} \sqrt{3}\,\varepsilon_2 \sim \mathcal{N}(0,1)$ and $\beta = 2$.

In terms of the location-scale noise model, the data-generating model shown in Figure 3(a) is unidentifiable, as $X_1 \mid X_2$ is uniformly distributed, i.e., its distribution is independent of $X_2$ apart from location and scale. This does not contradict the identifiability results, as they are derived for random variables with full support on $\mathbb{R}$.

5 Discussion

5.1 Data applications

For an extensive comparison between methods relying on Gaussian scoring and nonparametric independence tests in ANMs or heteroskedastic noise models, we refer to the study by Immer et al. [10], where several fitting methods are considered, combined with both approaches, and evaluated on a variety of benchmark cause and effect pairs. Those pairs include both real and artificial data. For some of the considered data sources, using independence tests clearly improved the success rate for inferring the causal direction as compared to using the Gaussian score.

Let us consider two specific examples from the Tübingen data by Mooij et al. [11]. Details on the data can be found in Section D.11 of their article. Both examples have temperature as the effect variable, while the cause is the day of the year or the intensity of solar radiation, respectively. The corresponding scatter plots and the contour lines of the density estimates are shown in Figure 4. It is evident that in neither case is the cause variable normally distributed: the days are perfectly uniformly distributed, while solar radiation is right-skewed. Therefore, the assumptions for Gaussian scoring to infer the true causal direction are not fulfilled. For the first dataset, we restricted the numerical analysis to the time frame 1st April to 30th September (31st March to 29th September in leap years) to circumvent the issue that the data are circular. This is indicated by the black dotted lines.

Figure 4: Scatter plot and contour lines of the density estimate for two selected pairs from the Tübingen data.

To evaluate the Gaussian scores, we estimate the conditional expectation for either direction with a smoothing spline. In the first case, the causal effect is non-monotone, and the conditional mean in the anti-causal direction is not very informative for predicting the day of the year. Therefore, we obtain the correct causal direction with Gaussian scoring even though the assumptions are not fulfilled. We obtain the data estimate $\exp(\Delta)^2 = 1.48$.

The effect of solar radiation on temperature appears to be monotone, which makes the conditional expectation in the anti-causal direction more informative. In addition, the conditional expectation in the causal direction seems not far from linear. This indeed makes the Gaussian scoring algorithm prefer the wrong direction. The estimate is $\exp(\Delta)^2 = 0.91$. Similarly, it fails for the first dataset when considering the first 183 days of the year instead ($\exp(\Delta)^2 = 0.99$).
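A sketch of this two-direction comparison is given below; the smoothing-spline implementation, the file name, and the column order are assumptions for illustration, not the exact pipeline behind the reported numbers.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def exp_delta_sq_hat(x, y):
    """x assumed cause, y assumed effect; values > 1 favor the direction x -> y."""
    def resid_var(a, b):
        order = np.argsort(a)
        a_sorted, b_sorted = a[order].astype(float), b[order].astype(float)
        a_sorted += 1e-9 * np.arange(len(a_sorted))   # break ties: spline needs increasing x
        spline = UnivariateSpline(a_sorted, b_sorted, s=len(a_sorted))
        return np.var(b - spline(a))
    forward = resid_var(x, y) * np.var(x)    # residual variance times marginal variance, x -> y
    backward = resid_var(y, x) * np.var(y)   # same for the backward direction, y -> x
    return backward / forward

data = np.loadtxt("pair0077.txt")            # hypothetical local copy of one Tuebingen pair
print(exp_delta_sq_hat(data[:, 0], data[:, 1]))
```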

With RESIT relying on independence testing, we see for both datasets that the hypothesis of residuals being independent of the predictor is rejected in either direction. This indicates that the ANM in (1) is not rich enough to explain the data. However, applying Algorithm 1 from the study by Peters et al. [3], which minimizes the estimated dependence between predictor and residuals, finds the true causal direction for both data pairs.

5.2 Conclusion

We discuss causal discovery in structural causal models using Gaussian likelihood scoring and analyze the effect of model misspecification.

In the case where the data-generating distribution comes from a linear structural equation model and linear regression functions are used for estimation, the following holds. When the true error distribution is Gaussian, one can only identify the Markov equivalence class of the underlying data-generating DAG. The same holds true when the error distribution is non-Gaussian, but one wrongly relies on a Gaussian error distribution for estimation.

Thus, popular algorithms like the greedy equivalence search [12] for Gaussian models or the PC algorithm [1] assessing partial correlation are potentially conservative and only infer the Markov equivalence class when the error distributions are non-Gaussian, as they do not exploit the maximal amount of information. But they are safe to use within the domain of data-generating linear structural equation models. We prove here that this fact does not necessarily hold true when invoking nonparametric regression estimation. In particular, if the true causal model is linear or just “slightly nonlinear,” one would systematically obtain the wrong causal direction under error misspecification. As optimizing Gaussian scores is the same as optimizing $\ell_2$-loss, regressors that are more flexible than necessary for the causal model support anti-causal decisions. The intuition carries over when allowing for the flexibility of heteroskedastic error terms. If the true causal model has homoskedastic additive errors, fitting heteroskedastic models will increase the range of setups where misspecified Gaussian scoring chooses anti-causal relationships.

To overcome these issues, one could rely on general nonparametric independence tests, either between the different residuals or between residuals and predictors. Of course, this comes at a higher computational cost and potentially lower sample efficiency in cases where Gaussian scoring works, including in the presence of non-monotonic causal effects.

  1. Funding information: The project leading to this application has received funding from the European Research Council under the European Union’s Horizon 2020 research and innovation program (grant agreement no. 786461).

  2. Author contributions: All authors have accepted responsibility for the entire content of this manuscript and approved its submission. The authors applied the SDC approach for the sequence of authors.

  3. Conflict of interest: The authors state that there is no conflict of interest.

  4. Data availability statement: The datasets analyzed during this study are available in the database with cause-effect pairs, https://webdav.tuebingen.mpg.de/cause-effect/. We consider pairs 42 and 77.

Appendix A Proofs

A.1 Proof of Theorem 1

It is well known that for jointly Gaussian variables $X_1, \ldots, X_p$, every possible causal order can induce the multivariate distribution with a suitable linear model. As the sum of log-variances determines the Kullback-Leibler divergence in the Gaussian case, this sum must be equal for all these linear models that induce the same multivariate distribution.

For every possible multivariate distribution with an existing and bounded moment matrix $\Sigma_X = E[X X^{\top}]$, which the assumed model has, the linear least squares parameters and the corresponding residual variances using arbitrary sets of regressor covariates are completely determined by the moment matrix. Therefore, the residual variances correspond to those of multivariate Gaussian data. Accordingly, one can obtain the same sum of log-variances as for the true model using the best linear predictors for every possible permutation. This proves the non-strict inequality in the theorem.

If for some variable $j$ the conditional expectation given its parents in the DAG $\mathcal{G}^{\pi}$ is not a linear function, the linear model cannot be optimal in terms of residual variance. Therefore, $\sigma_j^{\mathcal{G}^{\pi}}$ in an unrestricted model is lower than the residual standard error of the best-fitting linear model, and the score is further improved. Hence, the inequality is strict as soon as there exists at least one such variable.

A.2 Proof of Proposition 1

The variances of $X_1$ and $X_2$, as well as that of $\varepsilon_2$, follow directly from the properties of the uniform distribution:

$(\sigma_1^{\mathcal{G}^0})^2 = \mathrm{Var}(X_1) = \frac{1}{3}, \quad (\sigma_2^{\mathcal{G}^0})^2 = \mathrm{Var}(\varepsilon_2) = \frac{1}{3}, \quad \text{and} \quad (\sigma_2^{\mathcal{G}})^2 = \mathrm{Var}(X_2) = \beta^2 \mathrm{Var}(X_1) + \mathrm{Var}(\varepsilon_2) = \frac{1}{3}(\beta^2 + 1).$

The last term requires some more work. Due to symmetry, we can assume without loss of generality that $\beta > 0$. For the densities, we obtain

$f_{X_1}(x_1) = \frac{1}{2}\,\mathbf{1}\{|x_1| \leq 1\}, \quad f_{X_2 \mid X_1}(x_2, x_1) = \frac{1}{2}\,\mathbf{1}\{|x_2 - \beta x_1| \leq 1\}, \quad \text{and} \quad f_{X_2}(x_2) = \int \frac{1}{4}\,\mathbf{1}\{|x_1| \leq 1\}\,\mathbf{1}\{|x_2 - \beta x_1| \leq 1\} \, \mathrm{d}x_1 = \frac{1}{4}\left[\min\left(1, \frac{x_2+1}{\beta}\right) - \max\left(-1, \frac{x_2-1}{\beta}\right)\right] =: \frac{1}{4}(a - b).$

For notational simplicity, we define random variables A and B with realizations a and b . We obtain the moments

$E[X_1 \mid X_2] = \int x_1 f_{X_1 \mid X_2}(x_1, X_2) \, \mathrm{d}x_1 = \frac{\int \frac{x_1}{4}\,\mathbf{1}\{|x_1| \leq 1\}\,\mathbf{1}\{|X_2 - \beta x_1| \leq 1\} \, \mathrm{d}x_1}{\int \frac{1}{4}\,\mathbf{1}\{|x_1| \leq 1\}\,\mathbf{1}\{|X_2 - \beta x_1| \leq 1\} \, \mathrm{d}x_1} = \frac{A^2 - B^2}{2(A - B)} = \frac{1}{2}(A + B),$
$E[X_1^2 \mid X_2] = \int x_1^2 f_{X_1 \mid X_2}(x_1, X_2) \, \mathrm{d}x_1 = \frac{\int \frac{x_1^2}{4}\,\mathbf{1}\{|x_1| \leq 1\}\,\mathbf{1}\{|X_2 - \beta x_1| \leq 1\} \, \mathrm{d}x_1}{\int \frac{1}{4}\,\mathbf{1}\{|x_1| \leq 1\}\,\mathbf{1}\{|X_2 - \beta x_1| \leq 1\} \, \mathrm{d}x_1} = \frac{A^3 - B^3}{3(A - B)} = \frac{1}{3}(A^2 + AB + B^2),$
$\mathrm{Var}(X_1 \mid X_2) = E[X_1^2 \mid X_2] - E[X_1 \mid X_2]^2 = \frac{1}{12}(A - B)^2.$

Finally, we are interested in

$(\sigma_1^{\mathcal{G}})^2 = E[\mathrm{Var}(X_1 \mid X_2)] = E\left[\frac{1}{12}(A - B)^2\right] = \int_{-1-\beta}^{1+\beta} \frac{1}{12}(a - b)^2 f_{X_2}(x_2) \, \mathrm{d}x_2 = \int_{-1-\beta}^{1+\beta} \frac{1}{48}(a - b)^3 \, \mathrm{d}x_2 = 2\int_{0}^{1+\beta} \frac{1}{48}(a - b)^3 \, \mathrm{d}x_2.$

Assume first $\beta \geq 1$. Then, $b = (x_2 - 1)/\beta$ in the area of integration. For $x_2 \geq \beta - 1$, it holds that $a = 1$.

$(\sigma_1^{\mathcal{G}})^2 = \int_{0}^{1+\beta} \frac{1}{24}(a - b)^3 \, \mathrm{d}x_2 = \int_{0}^{\beta-1} \frac{1}{24}\left(\frac{2}{\beta}\right)^3 \mathrm{d}x_2 + \int_{\beta-1}^{1+\beta} \frac{1}{24}\left(\frac{\beta + 1 - x_2}{\beta}\right)^3 \mathrm{d}x_2 = \frac{1}{24\beta^3}\left[8(\beta - 1) + \int_{0}^{2} u^3 \, \mathrm{d}u\right] = \frac{1}{24\beta^3}(8\beta - 4) = \frac{1}{6\beta^3}(2\beta - 1),$

where we applied the change of variable $u = \beta + 1 - x_2$ to simplify the integration. Inserting the residual variances with $\gamma = \beta$, the proposition's statement follows.

Alternatively, if $\beta < 1$, then $a = 1$ in the interval of integration. For $x_2 < 1 - \beta$, it holds that $b = -1$.

$(\sigma_1^{\mathcal{G}})^2 = \int_{0}^{1+\beta} \frac{1}{24}(a - b)^3 \, \mathrm{d}x_2 = \int_{0}^{1-\beta} \frac{1}{24} \cdot 2^3 \, \mathrm{d}x_2 + \int_{1-\beta}^{1+\beta} \frac{1}{24}\left(\frac{\beta + 1 - x_2}{\beta}\right)^3 \mathrm{d}x_2 = \frac{1}{24}\left[8(1 - \beta) + \frac{1}{\beta^3}\int_{0}^{2\beta} u^3 \, \mathrm{d}u\right] = \frac{1}{24}(8 - 4\beta) = \frac{1}{6}(2 - \beta),$

where we applied the change of variable $u = \beta + 1 - x_2$ to simplify the integration. Inserting all the residual variances with $\gamma = 1/\beta$, the proposition's statement follows.

B Derivations for the figures

Assume model (4), which has model (3) as a special case for $\nu = 1$. With $X_1 \sim \mathcal{N}(0,1)$, the normalization is

$V(\nu) := E[|X_1|^{2\nu}] = \frac{2^{\nu}\,\Gamma\!\left(\nu + \frac{1}{2}\right)}{\sqrt{\pi}}.$

As before,

$(\sigma_1^{\mathcal{G}^0})^2 = \mathrm{Var}(X_1), \quad (\sigma_2^{\mathcal{G}^0})^2 = \mathrm{Var}(\varepsilon_2), \quad \text{and} \quad (\sigma_2^{\mathcal{G}})^2 = \mathrm{Var}(X_2) = \beta^2 + \mathrm{Var}(\varepsilon_2).$

B.1 Gaussian and uniform

For $\varepsilon_2 \sim \mathrm{Unif}[-1,1]$, $X_1$ is constrained to lie between

$\mathrm{Sign}\!\left(\frac{(X_2 - 1)\sqrt{V(\nu)}}{\beta}\right)\left|\frac{(X_2 - 1)\sqrt{V(\nu)}}{\beta}\right|^{1/\nu} \quad \text{and} \quad \mathrm{Sign}\!\left(\frac{(X_2 + 1)\sqrt{V(\nu)}}{\beta}\right)\left|\frac{(X_2 + 1)\sqrt{V(\nu)}}{\beta}\right|^{1/\nu},$

i.e., given X 2 , it is a truncated standard Gaussian. The conditional mean and variance follow from standard theory about truncated Gaussian random variables. For simplicity, call the upper bound A and the lower bound B . The density of X 2 is then given as follows:

$f_{X_2}(x_2) = \int_{b}^{a} \frac{1}{2}\,\phi(x_1) \, \mathrm{d}x_1 = \frac{1}{2}\left(\Phi(a) - \Phi(b)\right).$

Finally,

$(\sigma_1^{\mathcal{G}})^2 = E[\mathrm{Var}(X_1 \mid X_2)]$

is obtained by numerically integrating over (a sufficient part of) the real line.

B.2 Two Gaussian random variables, or Gaussian and $\chi_1^2$

Except for V ( ν ) , all quantities are obtained by brute force numerical integration.

B.3 Two uniform random variables with heteroskedastic fitting

We can mainly follow the derivation in Appendix A.2. Instead of $(\sigma_1^{\mathcal{G}})^2$, we need $\exp(E[\log(\mathrm{Var}(X_1 \mid X_2))])$, which is obtained by numerical integration.

References

[1] Spirtes P, Glymour CN, Scheines R, Heckerman D. Causation, prediction, and search. Cambridge, MA, USA: MIT Press; 2000. doi:10.7551/mitpress/1754.001.0001.

[2] Hoyer P, Janzing D, Mooij JM, Peters J, Schölkopf B. Nonlinear causal discovery with additive noise models. Adv Neur In. 2008;21:689–96.

[3] Peters J, Mooij J, Janzing D, Schölkopf B. Causal discovery with continuous additive noise models. J Mach Learn Res. 2014;15:2009–53.

[4] Nowzohour C, Bühlmann P. Score-based causal learning in additive noise models. Statistics. 2016;50(3):471–85. doi:10.1080/02331888.2015.1060237.

[5] Bühlmann P, Peters J, Ernest J. CAM: Causal additive models, high-dimensional order search and penalized regression. Ann Statist. 2014;42(6):2526–56. doi:10.1214/14-AOS1260.

[6] Zhang J, Spirtes P. Detection of unfaithfulness and robust causal inference. Minds Mach. 2008;18(2):239–71. doi:10.1007/s11023-008-9096-4.

[7] Shimizu S, Hoyer PO, Hyvärinen A, Kerminen A, Jordan M. A linear non-Gaussian acyclic model for causal discovery. J Mach Learn Res. 2006;7(10):2003–30.

[8] Strobl EV, Lasko TA. Identifying patient-specific root causes with the heteroscedastic noise model. 2022. arXiv:2205.13085. doi:10.1145/3535508.3545553.

[9] Xu S, Mian OA, Marx A, Vreeken J. Inferring cause and effect in the presence of heteroscedastic noise. In: International Conference on Machine Learning. PMLR; 2022. p. 24615–630.

[10] Immer A, Schultheiss C, Vogt JE, Schölkopf B, Bühlmann P, Marx A. On the identifiability and estimation of causal location-scale noise models. 2022. arXiv:2210.09054.

[11] Mooij JM, Peters J, Janzing D, Zscheischler J, Schölkopf B. Distinguishing cause from effect using observational data: methods and benchmarks. J Mach Learn Res. 2016;17(1):1103–204.

[12] Chickering DM. Optimal structure identification with greedy search. J Mach Learn Res. 2002;3:507–54.

Received: 2022-10-20
Revised: 2023-03-14
Accepted: 2023-03-29
Published Online: 2023-05-11

© 2023 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
