Abstract
We consider likelihood score-based methods for causal discovery in structural causal models. In particular, we focus on Gaussian scoring and analyze the effect of model misspecification in the form of a non-Gaussian error distribution. We present a surprising negative result for Gaussian likelihood scoring in combination with nonparametric regression methods.
1 Introduction
We consider the problem of finding the causal structure of a set of random variables $X_1,\ldots,X_p$.
Here, we focus on the so-called additive noise model (ANM)
$$X_j = f_j(X_{\mathrm{pa}(j)}) + \varepsilon_j, \quad j = 1,\ldots,p, \tag{1}$$
where the functions $f_j$ describe the causal effects, the $\varepsilon_j$ are mutually independent noise variables, and $\mathrm{pa}(j)$ denotes the parents of node $j$ in the underlying directed acyclic graph (DAG).
For an arbitrary DAG $G$, one can, in principle, fit an ANM with structure $G$ and check whether the implied restrictions, such as independence of the resulting residuals, hold. Obviously, an exhaustive search over all DAGs in this way quickly becomes computationally infeasible as the number of variables grows.
A more reasonable algorithm based on greedy search is presented by Peters et al. [3]. They also introduce regression with subsequent independence test (RESIT), which iteratively detects sink nodes. Assuming perfect regressors and independence tests, it is guaranteed to find the true DAG. It involves regressing each candidate sink node on the remaining variables and testing whether the resulting residuals are independent of those variables.
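To make this concrete, the following is a minimal sketch of the RESIT idea (ours, not the implementation of Peters et al. [3]); the random forest regressor and the Spearman-correlation surrogate for an independence test are placeholder choices, and a proper nonparametric test such as HSIC would be used in practice.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from scipy.stats import spearmanr

def resit_order(X):
    """Return a causal order (sources first) by iteratively removing sink nodes.

    X: (n, p) data matrix. The node whose residuals look "most independent"
    of the remaining variables is declared a sink and removed. A crude
    surrogate test (smallest Spearman p-value across regressors) is used here;
    RESIT as described by Peters et al. uses a proper independence test.
    """
    remaining = list(range(X.shape[1]))
    order = []
    while len(remaining) > 1:
        best_node, best_score = None, -np.inf
        for j in remaining:
            others = [k for k in remaining if k != j]
            reg = RandomForestRegressor(n_estimators=200, random_state=0)
            reg.fit(X[:, others], X[:, j])
            resid = X[:, j] - reg.predict(X[:, others])
            # surrogate independence score: worst dependence with any regressor
            pvals = [spearmanr(resid, X[:, k]).pvalue for k in others]
            score = min(pvals)  # larger means "looks more independent"
            if score > best_score:
                best_node, best_score = j, score
        order.append(best_node)        # sink found: goes last in the causal order
        remaining.remove(best_node)
    order.append(remaining[0])
    return order[::-1]                 # sources first
```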
Instead of performing independence tests, one can compare the likelihood scores of different graphs [4,5]. For a candidate DAG $G$, let $\mathrm{pa}_G(j)$ denote the parents of node $j$ in $G$. If one additionally assumes that the errors are Gaussian, the expected negative log-likelihood of $G$ is, up to additive and multiplicative constants,
$$\ell(G) = \sum_{j=1}^{p} \log \mathbb{E}\big[(X_j - \mathbb{E}[X_j \mid X_{\mathrm{pa}_G(j)}])^2\big],$$
where the conditional expectations serve as the population version of the (nonparametric) regression fits. Under such a normality assumption for the errors, one selects the graph with the lowest score. Then, one has to assume that the true causal graph attains a strictly lower score than every graph that does not encode the true causal order. That is, the lowest possible expected negative Gaussian log-likelihood with any graph that is incompatible with the causal order must exceed that of the true causal graph; we refer to this as the gap condition (A1).
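For illustration, an empirical counterpart of this score can be computed by summing the log residual variances of nonparametric regressions of each node on its parents. The sketch below is ours (a random forest stands in for a generic regressor) and is not tied to any particular software used in the article.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def gaussian_score(X, parents):
    """Empirical Gaussian score of a DAG: sum_j log(residual variance of X_j given its parents).

    X: (n, p) data matrix; parents: dict mapping node j to a list of parent indices.
    Lower is better (smaller negative Gaussian log-likelihood, up to constants).
    """
    score = 0.0
    for j in range(X.shape[1]):
        pa = parents.get(j, [])
        if not pa:                      # source node: residual is deviation from the mean
            resid = X[:, j] - X[:, j].mean()
        else:
            reg = RandomForestRegressor(n_estimators=200, random_state=0)
            reg.fit(X[:, pa], X[:, j])
            resid = X[:, j] - reg.predict(X[:, pa])
        score += np.log(np.mean(resid ** 2))
    return score

# usage, comparing the two directions in a bivariate example:
# gaussian_score(X, {0: [], 1: [0]}) vs. gaussian_score(X, {1: [], 0: [1]})
```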
2 Data-generating linear model
We begin with data-generating linear models in which all causal effects in (1) are linear functions of the parental variables.
For these, we obtain the explicit result in Theorem 1. The intuition for this result carries over to a range of nonlinear ANMs (1), especially when the causal effects are close to linear; we present corresponding examples in Section 3.
If all $\varepsilon_j$ are Gaussian, the true DAG can be identified only up to its Markov equivalence class. This remains true when the errors are non-Gaussian but all regression functions are fitted as linear functions (or evaluated as linear projections in the population), so that Gaussian scoring still recovers the Markov equivalence class.
If the data-generating model is not known to be linear, so that nonparametric regression methods, or the conditional mean as their population version, are applied, this generalization does not hold anymore, as laid out in the following theorem. Let a causal order be any permutation of the variables, and let the corresponding full graph be the DAG in which every variable has all of its predecessors in that order as parents.
Theorem 1
Let $X_1,\ldots,X_p$ be generated from a linear structural causal model,
$$X_j = \sum_{k \in \mathrm{pa}(j)} \beta_{jk} X_k + \varepsilon_j, \quad j = 1,\ldots,p,$$
with mutually independent, mean-zero errors $\varepsilon_j$ of finite variance. Then, for every causal order, the expected negative Gaussian log-likelihood of the corresponding full graph, evaluated with conditional expectations as regression functions, is at most that of the true causal graph.
That is, for every causal order, the corresponding full graph scores at least as well as the true causal graph.
Furthermore, the inequality is strict if at least one of the conditional expectations appearing in the full graph's score is a nonlinear function of the parental variables.
That is, for every causal order, the corresponding full graph scores strictly better than the true causal graph if at least one conditional expectation is nonlinear in the parental variables.
Apart from some pathological cases, the last condition holds for permutations that are not conformable with the true DAG unless all the $\varepsilon_j$ are Gaussian.
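To see where Theorem 1 comes from, here is a compressed version of the argument (a sketch only, not the full proof in Appendix A.1; $\Sigma$ denotes the covariance matrix of $X$ and $P_{\mathrm{lin}}[\,\cdot \mid \cdot\,]$ the best linear predictor, notation introduced here for convenience). In the linear model, $\det \Sigma = \prod_j \mathrm{Var}(\varepsilon_j)$, so the true graph attains the Gaussian score $\log\det\Sigma$ up to constants. For any ordering $\pi$ of the variables, the successive linear least-squares residual variances also multiply to $\det\Sigma$,
$$\prod_{j=1}^{p} \mathrm{Var}\!\big(X_{\pi(j)} - P_{\mathrm{lin}}[X_{\pi(j)} \mid X_{\pi(1)},\ldots,X_{\pi(j-1)}]\big) = \det\Sigma,$$
so the full graph of any order ties with the true graph as long as the fits are linear. Replacing each linear projection by the conditional expectation can only decrease the residual variance,
$$\mathbb{E}\big[(X_{\pi(j)} - \mathbb{E}[X_{\pi(j)} \mid X_{\pi(1)},\ldots,X_{\pi(j-1)}])^2\big] \le \mathrm{Var}\!\big(X_{\pi(j)} - P_{\mathrm{lin}}[X_{\pi(j)} \mid X_{\pi(1)},\ldots,X_{\pi(j-1)}]\big),$$
with strict inequality exactly when the conditional expectation is nonlinear, which yields the two statements of the theorem.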
2.1 Illustrative examples
For illustrative purposes, we restrict ourselves to the two-variable case with $X_2 = \beta X_1 + \varepsilon_2$ and $X_1$ independent of $\varepsilon_2$,
such that the true causal graph is $X_1 \to X_2$.
First, we consider the analytically tractable case where both $X_1$ and $\varepsilon_2$ follow a uniform distribution on $[-1,1]$.
Proposition 1
Let $X_1$ and $\varepsilon_2$ be independent and uniformly distributed on $[-1,1]$, and let $X_2 = \beta X_1 + \varepsilon_2$. Then the expected negative Gaussian log-likelihoods of both directions, and hence their difference, can be computed in closed form as functions of $\beta$.
As argued above,
Figure 1: Two-variable linear model: effect of changing $\beta$. (a) $X_1 \stackrel{\mathcal{D}}{=} \varepsilon_2 \sim \mathrm{Unif}[-1,1]$. (b) $X_1 \sim \mathcal{N}(0,1)$, $\varepsilon_2 \sim \mathrm{Unif}[-1,1]$. (c) $X_1 \sim \mathcal{N}(0,1)$, $\varepsilon_2 + 1 \sim \chi_1^2$.
Next, we consider a similar example but with $X_1 \sim \mathcal{N}(0,1)$, keeping $\varepsilon_2 \sim \mathrm{Unif}[-1,1]$; see Figure 1(b).
Finally, we use an asymmetric error distribution instead, namely, a shifted chi-squared distribution with $\varepsilon_2 + 1 \sim \chi_1^2$; see Figure 1(c).
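The following small simulation (our own sketch; the sample size, $\beta = 2$, and the kNN regressor are arbitrary choices) mimics the setting of Figure 1(a): both directions are fitted nonparametrically and scored by the sum of log variances. In line with Theorem 1, the anti-causal direction should typically attain the lower, i.e., better, score here.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
n, beta = 100_000, 2.0
x1 = rng.uniform(-1, 1, n)            # cause
eps2 = rng.uniform(-1, 1, n)          # non-Gaussian noise
x2 = beta * x1 + eps2                 # effect: linear ANM

def score(cause, effect):
    """log Var(cause) + log E[(effect - E[effect | cause])^2], kNN as regressor."""
    reg = KNeighborsRegressor(n_neighbors=200).fit(cause.reshape(-1, 1), effect)
    resid = effect - reg.predict(cause.reshape(-1, 1))
    return np.log(cause.var()) + np.log(np.mean(resid ** 2))

print("causal      X1 -> X2:", score(x1, x2))
print("anti-causal X2 -> X1:", score(x2, x1))
# With uniform noise, the anti-causal score should come out smaller,
# i.e., Gaussian scoring with flexible regression prefers the wrong direction.
```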
3 Beyond a data-generating linear model
If all causal effects $f_j$ are nonlinear, the ANM (1) is identifiable from the observational distribution under mild conditions [2,3].
For non-Gaussian
The normalization ensures that the variance of $X_2$ does not change with $\nu$.
Consider the case $X_1 \sim \mathcal{N}(0,1)$ and $\varepsilon_2 \sim \mathrm{Unif}[-1,1]$, shown in Figure 2(a).
As in the linear case, we consider the effect of an asymmetric error distribution, namely, a scaled and centered chi-squared distribution. We show this in Figure 2(b). The factor $\sqrt{6}$ ensures that $\mathrm{Var}(\varepsilon_2) = 1/3$, matching the uniform case.
Figure 2: Two-variable nonlinear model (4): effect of changing $\nu$ for $\beta = 0.5$ (solid blue curve), $\beta = 1$ (dashed red curve), and $\beta = 2$ (dotted black curve). (a) $X_1 \sim \mathcal{N}(0,1)$, $\varepsilon_2 \sim \mathrm{Unif}[-1,1]$. (b) $X_1 \sim \mathcal{N}(0,1)$, $\sqrt{6}\,\varepsilon_2 + 1 \sim \chi_1^2$. (c) $X_1 \stackrel{\mathcal{D}}{=} \sqrt{3}\,\varepsilon_2 \sim \mathcal{N}(0,1)$.
We show the behavior for a correctly specified model in Figure 2(c). For $\nu = 0$, the model is linear Gaussian and both directions score equally, whereas for $\nu \neq 0$, the true causal direction attains the strictly better score.
3.1 Monotonicity of $f_2(\cdot)$
The nonlinearities discussed here are designed to be slight deviations from the linear model and, thus, chosen to be strictly monotone. Notably, for non-monotone functions, the intuition that the anti-causal model is harder to fit is more applicable. In particular, if $f_2$ is strongly non-monotone, the conditional expectation in the anti-causal direction is not very informative, so the anti-causal model cannot achieve a small residual variance.
Thus, the gap condition (A1) is satisfied regardless of the error distribution.
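As a concrete instance (our illustrative example; the quadratic effect and the symmetry of $X_1$ are assumptions made here for simplicity), take $X_2 = X_1^2 + \varepsilon_2$ with $X_1$ symmetric around zero and independent of $\varepsilon_2$. The conditional law of $X_1$ given $X_2$ is then symmetric around zero, so $\mathbb{E}[X_1 \mid X_2] = 0$ almost surely, and the anti-causal score becomes
$$\log \mathrm{Var}(X_2) + \log \mathbb{E}\big[(X_1 - \mathbb{E}[X_1 \mid X_2])^2\big] = \log \mathrm{Var}(X_2) + \log \mathrm{Var}(X_1) > \log \mathrm{Var}(X_1) + \log \mathrm{Var}(\varepsilon_2),$$
since $\mathrm{Var}(X_2) = \mathrm{Var}(X_1^2) + \mathrm{Var}(\varepsilon_2) > \mathrm{Var}(\varepsilon_2)$ whenever $X_1^2$ is nondegenerate. Hence the causal direction scores strictly better, whatever the error distribution.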
4 Heteroskedastic noise model
A simple extension of model (1) that has recently gained some attention is the heteroskedastic noise model, also referred to as the location-scale noise model [8–10],
$$X_j = f_j(X_{\mathrm{pa}(j)}) + g_j(X_{\mathrm{pa}(j)})\,\varepsilon_j, \quad j = 1,\ldots,p,$$
with some nonnegative scale functions $g_j(\cdot)$.
Let $\mu_j(\cdot) = \mathbb{E}[X_j \mid X_{\mathrm{pa}_G(j)} = \cdot\,]$, $\sigma_j^2(\cdot) = \mathrm{Var}(X_j \mid X_{\mathrm{pa}_G(j)} = \cdot\,)$, and $R_j = X_j - \mu_j(X_{\mathrm{pa}_G(j)})$ be the conditional means, conditional variances, and residuals according to any, potentially wrong, DAG $G$.
Then, one obtains, by Jensen's inequality, $\mathbb{E}[\log \sigma_j^2(X_{\mathrm{pa}_G(j)})] \le \log \mathbb{E}[\sigma_j^2(X_{\mathrm{pa}_G(j)})]$ for every node $j$; that is, the expected negative Gaussian log-likelihood of the location-scale fit is never larger than that of the homoskedastic fit.
Thus, when fitting heteroskedastic models, the likelihood score can only increase compared to the homoskedastic fit. This can further increase the difficulty of finding the correct direction under non-Gaussian noise. Even if the true forward model is homoskedastic, i.e., $g_j(\cdot)$ is constant, the model in the anti-causal direction is typically heteroskedastic, so the additional flexibility tends to benefit the wrong direction more.
In Figure 3, we revisit the examples from Figures 1(a) and 2(c) and see how allowing for a heteroskedastic fit makes the problem harder. For the sake of comparison, we look at the same score differences between the two directions as before, now allowing for location-scale fits.
In terms of the location-scale noise model, the data-generating model shown in Figure 3(a) is unidentifiable, as a valid location-scale model exists in the anti-causal direction as well.
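One simple way to estimate such a location-scale score in practice (a sketch under our own choices: kNN regression for both the conditional mean and the conditional variance; this is not the procedure behind Figure 3) is to average the fitted log conditional variances:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def ls_gaussian_score(cause, effect, k=100):
    """Heteroskedastic (location-scale) Gaussian score for cause -> effect:
    log Var(cause) + E[log sigma^2(cause)], with the conditional mean and the
    conditional variance both estimated by kNN regression. Lower is better."""
    c = cause.reshape(-1, 1)
    mean_fit = KNeighborsRegressor(n_neighbors=k).fit(c, effect)
    resid2 = (effect - mean_fit.predict(c)) ** 2
    var_fit = KNeighborsRegressor(n_neighbors=k).fit(c, resid2)
    sigma2 = np.clip(var_fit.predict(c), 1e-12, None)   # fitted conditional variance
    return np.log(cause.var()) + np.mean(np.log(sigma2))
```

By Jensen's inequality, the average of the fitted log variances is at most the log of the average squared residual, which is the empirical analogue of the statement above that the heteroskedastic fit can only improve the score.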
5 Discussion
5.1 Data applications
For an extensive comparison between methods relying on Gaussian scoring and nonparametric independence tests in ANMs or heteroskedastic noise models, we refer to the study by Immer et al. [10], where several fitting methods are considered, combined with both approaches, and evaluated on a variety of benchmark cause and effect pairs. Those pairs include both real and artificial data. For some of the considered data sources, using independence tests clearly improved the success rate for inferring the causal direction as compared to using the Gaussian score.
Let us consider two specific examples from the Tübingen data by Mooij et al. [11]. Details on the data can be found in Section D.11 of their article. Both examples have the temperature as the effect variable, while the cause is the day of the year or the intensity of the solar radiation, respectively. The corresponding scatter plots and the contour lines of the density estimates are shown in Figure 4. It is evident that in neither case is the cause variable normally distributed: the days are perfectly uniformly distributed, while the solar radiation is right-skewed. Therefore, the assumptions for Gaussian scoring to infer the true causal direction are not fulfilled. For the first dataset, we restricted the numerical analysis to the time frame 1st April to 30th September (31st March to 29th September in leap years) to circumvent the issue that the data are circular. This is indicated by the black dotted lines.

Figure 4: Scatter plots and contour lines of the density estimates for two selected pairs from the Tübingen data.
To evaluate the Gaussian scores, we estimate the conditional expectation for either direction with a smoothing spline. In the first case, the causal effect is non-monotone, and the conditional mean in the anti-causal direction is not very informative to predict the day of the year. Therefore, we obtain the correct causal direction with Gaussian scoring even though the assumptions are not fulfilled. We obtain the data estimate
The effect of solar radiation on the temperature appears to be monotone, which makes the conditional expectation in the anti-causal direction more informative. In addition, it seems that the conditional expectation in the causal direction is not so far from being linear. This indeed makes the Gaussian scoring algorithm prefer the wrong direction. The estimate is
With RESIT relying on independence testing, we see for both datasets that the hypothesis of residuals being independent of the predictor is rejected in either direction. This indicates that the ANM in (1) is not rich enough to explain the data. However, applying Algorithm 1 from the study by Peters et al. [3], which minimizes the estimated dependence between predictor and residuals, finds the true causal direction for both data pairs.
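For readers who want to reproduce this type of comparison, a minimal sketch is given below. The file name follows the usual pairXXXX.txt naming scheme of the repository linked in the data availability statement, the column roles, the tie-breaking, and the smoothing strength are our assumptions, and the April–September restriction described above is not included, so details may need adapting.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def spline_gaussian_score(cause, effect):
    """log Var(cause) + log mean squared residual of a smoothing-spline fit effect ~ cause."""
    idx = np.argsort(cause, kind="stable")
    x, y = cause[idx].astype(float), effect[idx].astype(float)
    x = x + 1e-9 * np.arange(len(x))             # break ties so that x is increasing
    noise_var = np.mean(np.diff(y) ** 2) / 2.0   # difference-based noise estimate
    fit = UnivariateSpline(x, y, s=len(x) * noise_var)
    resid = y - fit(x)
    return np.log(cause.var()) + np.log(np.mean(resid ** 2))

# hypothetical file name following the repository's pairXXXX.txt naming scheme
data = np.loadtxt("pair0077.txt")
x, y = data[:, 0], data[:, 1]                    # first column: presumed cause
print("score x -> y:", spline_gaussian_score(x, y))
print("score y -> x:", spline_gaussian_score(y, x))
# The direction with the smaller score is the one preferred by Gaussian scoring.
```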
5.2 Conclusion
We discuss causal discovery in structural causal models using Gaussian likelihood scoring and analyze the effect of model misspecification.
In the case where the data-generating distribution comes from a linear structural equation model and linear regression functions are used for estimation, the following holds. When the true error distribution is Gaussian, one can only identify the Markov equivalence class of the underlying data-generating DAG. The same holds true when the error distribution is non-Gaussian, but one wrongly relies on a Gaussian error distribution for estimation.
Thus, popular algorithms like the greedy equivalence search [12] for Gaussian models or the PC algorithm [1] assessing partial correlation are potentially conservative and only infer the Markov equivalence class when the error distributions are non-Gaussian, as they do not exploit the maximal amount of information. But they are safe to use within the domain of data-generating linear structural equation models. We prove here that this does not necessarily hold true when invoking nonparametric regression estimation. In particular, if the true causal model is linear or just “slightly nonlinear,” one may systematically obtain the wrong causal direction under error misspecification. As optimizing Gaussian scores amounts to optimizing squared-error loss, this caveat applies to any score-based method that ranks graphs by their (log) residual variances.
To overcome these issues, one could rely on general nonparametric independence tests, either between the different residuals or between residuals and predictors. Of course, this comes at a higher computational cost and potentially lower statistical efficiency in cases where Gaussian scoring does work, such as in the presence of non-monotone causal effects.
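To indicate what such a residual-based check could look like, here is a small permutation test built on an HSIC-type statistic with Gaussian kernels and the median heuristic for the bandwidths; it is our own minimal sketch, not the test used in the cited implementations.

```python
import numpy as np

def _gram(x, bw):
    """Gaussian kernel Gram matrix for a 1-d sample."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2 * bw ** 2))

def hsic_pvalue(x, y, n_perm=500, seed=0):
    """Permutation p-value for independence of two 1-d samples via a biased HSIC statistic."""
    rng = np.random.default_rng(seed)
    n = len(x)
    bx = np.median(np.abs(x[:, None] - x[None, :])) + 1e-12   # median heuristic bandwidth
    by = np.median(np.abs(y[:, None] - y[None, :])) + 1e-12
    K, L = _gram(x, bx), _gram(y, by)
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ K @ H
    stat = np.sum(Kc * L) / n ** 2                            # biased HSIC: trace(HKH L)/n^2
    perm_stats = np.empty(n_perm)
    for b in range(n_perm):
        idx = rng.permutation(n)                              # permute the y sample
        perm_stats[b] = np.sum(Kc * L[np.ix_(idx, idx)]) / n ** 2
    return (1 + np.sum(perm_stats >= stat)) / (1 + n_perm)
```

One would apply it as, e.g., hsic_pvalue(residuals, predictor) in both directions and prefer the direction showing the less significant dependence.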
- Funding information: The project leading to this application has received funding from the European Research Council under the European Union’s Horizon 2020 research and innovation program (grant agreement no. 786461).
- Author contributions: All authors have accepted responsibility for the entire content of this manuscript and approved its submission. The authors applied the SDC approach for the sequence of authors.
- Conflict of interest: The authors state that there is no conflict of interest.
- Data availability statement: The datasets analyzed during this study are available in the database with cause-effect pairs, https://webdav.tuebingen.mpg.de/cause-effect/. We consider pairs 42 and 77.
Appendix A Proofs
A.1 Proof of Theorem 1
It is well known that, for jointly Gaussian variables, all conditional expectations are linear functions of the conditioning variables and the corresponding conditional variances do not depend on the values of the conditioning variables.
For every possible multivariate distribution with existing and bounded moment matrix
If for some variable
A.2 Proof of Proposition 1
The variances of
The last term requires some more work. Due to the symmetry, we can assume without loss of generality that
For notational simplicity, we define random variables
Finally, we are interested in
Assume first
where we applied the change of variable
Alternatively, if
where we applied the change of variable
B Derivations for the figures
Assume model (4), which has model (3) as a special case for $\nu = 0$.
As before,
B.1 Gaussian and uniform
For
i.e., given
Finally,
is obtained by numerically integrating over (a sufficient part of) the real line.
B.2 Two Gaussian random variables, or Gaussian and $\chi_1^2$
Except for
B.3 Two uniform random variables with heteroskedastic fitting
We can mainly follow the derivation in Appendix A.2. Instead of
References
[1] Spirtes P, Glymour CN, Scheines R, Heckerman D. Causation, prediction, and search. Cambridge, MA, USA: MIT Press; 2000. doi:10.7551/mitpress/1754.001.0001.
[2] Hoyer P, Janzing D, Mooij JM, Peters J, Schölkopf B. Nonlinear causal discovery with additive noise models. Adv Neural Inf Process Syst. 2008;21:689–96.
[3] Peters J, Mooij J, Janzing D, Schölkopf B. Causal discovery with continuous additive noise models. J Mach Learn Res. 2014;15:2009–53.
[4] Nowzohour C, Bühlmann P. Score-based causal learning in additive noise models. Statistics. 2016;50(3):471–85. doi:10.1080/02331888.2015.1060237.
[5] Bühlmann P, Peters J, Ernest J. CAM: Causal additive models, high-dimensional order search and penalized regression. Ann Statist. 2014;42(6):2526–56. doi:10.1214/14-AOS1260.
[6] Zhang J, Spirtes P. Detection of unfaithfulness and robust causal inference. Minds Mach. 2008;18(2):239–71. doi:10.1007/s11023-008-9096-4.
[7] Shimizu S, Hoyer PO, Hyvärinen A, Kerminen A, Jordan M. A linear non-Gaussian acyclic model for causal discovery. J Mach Learn Res. 2006;7(10):2003–30.
[8] Strobl EV, Lasko TA. Identifying patient-specific root causes with the heteroscedastic noise model. 2022. arXiv:2205.13085. doi:10.1145/3535508.3545553.
[9] Xu S, Mian OA, Marx A, Vreeken J. Inferring cause and effect in the presence of heteroscedastic noise. In: International Conference on Machine Learning. PMLR; 2022. p. 24615–24630.
[10] Immer A, Schultheiss C, Vogt JE, Schölkopf B, Bühlmann P, Marx A. On the identifiability and estimation of causal location-scale noise models. 2022. arXiv:2210.09054.
[11] Mooij JM, Peters J, Janzing D, Zscheischler J, Schölkopf B. Distinguishing cause from effect using observational data: methods and benchmarks. J Mach Learn Res. 2016;17(1):1103–204.
[12] Chickering DM. Optimal structure identification with greedy search. J Mach Learn Res. 2002;3:507–54.
© 2023 the author(s), published by De Gruyter
This work is licensed under the Creative Commons Attribution 4.0 International License.