Article: Open Access

Causal additive models with smooth backfitting

Asger B. Morville and Byeong U. Park
Published/Copyright: 16 October 2025

Abstract

A fully nonparametric approach to learning causal structures from observational data is proposed. The method is described in the setting of additive structural equation models with a link to causal inference. The estimation procedure of the additive structural equation functions is based on a novel application of the smooth backfitting (SBF) approach. The flexibility of the nonparametric procedure results in strong theoretical properties in the estimation of the variable ordering. It is shown that under mild conditions, the ordering estimate is consistent. Through simulations, it is demonstrated that our method is superior to the state-of-the-art approaches to causal learning. In particular, the SBF approach shows robustness when the noise is heteroscedastic.

MSC 2020: 62G05; 62G08

1 Introduction

In statistical analyses, one is often interested in finding causal relationships among observed variables. In traditional statistical models, one is typically not able to infer causal structures using only the joint distribution of the variables. This is because there are often multiple data generating mechanisms that lead to an identical joint distribution. One way of finding causal links among variables is to construct interventional experiments. This technique is used in contexts such as clinical trials and A/B-testing on user platforms and is considered the gold standard for finding causal relationships. In many cases, however, such interventional experiments are expensive to carry out. Furthermore, they may be infeasible to conduct from an ethical point of view. As such, for observational data, the conventional practice is to simply avoid causal conclusions, making correct probabilistic statements such as “Variable X is positively correlated with Variable Y ” instead of a causal statement such as “Large values of Variable X cause large values of Variable Y .”

An approach to causal discovery with observational data that has drawn recent attention is to formulate a causal model where each joint distribution in the model is associated with a unique data generating mechanism. In this approach, a causal model represents a family of data generating mechanisms, and each data generating mechanism corresponds to a directed acyclic graph (DAG) with a set of structural equations linking the "cause" and "effect" variables in the DAG. By estimating the underlying DAG (and the associated set of structural equations) from observational data, one can deduce legitimate causal relationships among variables. This approach was recently studied by Peters et al. [1], who showed that the underlying DAG is identifiable if the associated structural equations are non-linear in the variables and the additive error terms have a density with non-zero support on the real line. It was further developed by Bühlmann et al. [2], where the structural equations are assumed to be additive in the variables as well as in the error terms. There, the errors are assumed to follow a Gaussian distribution.

The objective of this article is to propose a new nonparametric approach to the estimation of the underlying DAG for $p$ real-valued random variables $(X_1, \ldots, X_p)$ in the causal additive model considered in the study of Bühlmann et al. [2]. The general inference scheme is to first determine the ordering of the $X_j$ among all possible permutations of the variables and then perform edge selection, which is similar to the method proposed in Bühlmann et al. [2]. While they use a spline method to estimate the structural equations, we employ a smooth backfitting (SBF) technique. The SBF is a general nonparametric method that can be used to estimate a multivariate regression function composed of lower dimensional component functions. The idea has been proven to lead to a powerful technique in various regression setups [3–7]. Recently, it has been further extended to non-Euclidean data cases [8–11]. One of the main obstacles in using the SBF technique is that it is applicable to compactly supported covariates, while all variables involved in the structural equation models (SEMs) we need to estimate are supported on the entire real line. We develop a novel approach to overcoming this obstacle (Section 3.2). Our approach to dealing with unboundedly supported covariates can also be applied to other problems that use the SBF idea.

We show that our approach based on the SBF is consistent under weak conditions when estimating the variable ordering. Furthermore, our numerical results demonstrate that it is superior to the state-of-the-art algorithms for causal discovery. We also extend the model to incorporate heterogeneous variance, and we provide the conditions under which the ordering estimator is consistent in this setting. Importantly, we note that the theoretical developments in the study of Bühlmann et al. [2] do not address the case of heterogeneous variance. Research focusing on this setting has primarily been conducted in simple two-variable set-ups, see, e.g., [12].

There exists a large body of research regarding the estimation of DAG structures from observational data. Constraint-based methods directly make a set of conditional independence statements through statistical tests. The methods then identify a set of DAGs consistent with this set of conditional independence statements, see, e.g., [13] and [14]. Methods that are more related to our approach construct a score for each possible DAG structure. The DAG that maximizes the score function is then found by traversing through the space of DAGs in a computationally feasible way, see, e.g., [15–17]. Specifically, the general ordering-based search method of Wang et al. [17] has received much attention in the DAG estimation literature. This is due to the fact that once an ordering among the variables has been established, finding the best scoring DAG consistent with this ordering reduces to simple variable selection, see, e.g., [2,18,19]. Other alternative approaches to DAG estimation include gradient-based optimization techniques [20,21] and reinforcement learning [19,22].

DAG estimation within causal models from observational data is performed in models where the underlying DAG is identifiable. Examples include the nonlinear Gaussian model of Peters et al. [1], the linear non-Gaussian model of Shimizu et al. [23], the post-nonlinear model of Zhang and Hyvärinen [24] and the equal-variance linear Gaussian model of Peters et al. [25]. We refer to Glymour et al. [26] for a more comprehensive review of causal discovery algorithms.

Throughout the article, we consider a random vector $X = (X_1, \ldots, X_p) \in \mathbb{R}^p$. We define a directed graph $\mathcal{G}$ over the variables as a tuple $\mathcal{G} = (V, E)$, where $V = \{1, \ldots, p\}$ is the index set of the variables in $X$ and $E \subseteq \{1, \ldots, p\} \times \{1, \ldots, p\}$ is the set of directed edges that connect the variables. That is, $(i, j) \in E$ is interpreted as an edge from node $i$ to node $j$, and we may also write this as $X_i \to X_j$. A directed path from $i_1$ to $i_n$ in $\mathcal{G}$ is a sequence of edges $\{(i_1, i_2), (i_2, i_3), \ldots, (i_{n-1}, i_n)\} \subseteq E$. The directed graph $\mathcal{G}$ is then a DAG if there exists no directed path from a variable to itself. We denote the parents of $j$ in $\mathcal{G}$ as $\mathrm{pa}(j)$, and thus $i \in \mathrm{pa}(j)$ if and only if $(i, j) \in E$. If $i \in \mathrm{pa}(j)$, we will call $X_i$ a parent variable of $X_j$. If $(X_{l_1}, \ldots, X_{l_m})$ are the parent variables of $X_j$, we let $X_{\mathrm{pa}(j)} := (X_{l_1}, \ldots, X_{l_m})$. Finally, we say that $X_i$ is a descendant of $X_j$ if there exists a directed path from $j$ to $i$ in $\mathcal{G}$.
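For readers who prefer code to notation, the following minimal Python sketch (our own illustration, not part of the proposed method) encodes a directed graph by its edge set, recovers parent sets, and checks acyclicity by depth-first search.

```python
from collections import defaultdict

def parents(edges, j):
    """Return pa(j): all nodes i with an edge (i, j)."""
    return {i for (i, k) in edges if k == j}

def is_dag(num_nodes, edges):
    """True if no directed path leads from a node back to itself (DFS with three colors)."""
    children = defaultdict(list)
    for i, j in edges:
        children[i].append(j)
    color = {v: "white" for v in range(1, num_nodes + 1)}

    def visit(v):
        color[v] = "grey"              # v is currently on the DFS stack
        for w in children[v]:
            if color[w] == "grey":     # back edge => directed cycle
                return False
            if color[w] == "white" and not visit(w):
                return False
        color[v] = "black"
        return True

    return all(visit(v) for v in color if color[v] == "white")

# Example: V = {1, 2, 3, 4}, E = {(1, 2), (1, 3), (2, 4), (3, 4)}
E = {(1, 2), (1, 3), (2, 4), (3, 4)}
print(parents(E, 4))   # {2, 3}
print(is_dag(4, E))    # True
```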

We proceed as follows. In Section 2, we formulate an additive SEM along with some basic assumptions that make the model identifiable and then describe a general estimation procedure of recovering the underlying DAG. In Section 3, we present our proposal of estimating a true ordering of variables based on the SBF technique together with its supporting theory. Section 4 is devoted to extending the method and theory to the case where the additive error terms in the SEM have heterogeneous variances depending on parent variables. In Sections 5 and 6, we report the results of numerical studies demonstrating the efficiency of our proposal and conclude in Section 7 with some discussion. All technical proofs for the theoretical results presented in this article and additional numerical results are deferred to the Appendix.

2 Problem set-up

2.1 The additive structural equation family

We define a general SEM over p random variables { X 1 , , X p } as an ordered triple C = ( G , S , ε ) , where G is a DAG over the p variables, S is a set of p functions relating X j to its parents in G through

(1) $X_j = f_j(X_{\mathrm{pa}(j)}, \varepsilon_j), \quad j = 1, \ldots, p,$

and $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_p)$ is a vector of random variables representing the noise. We assume that all variables $X_j$ are observed, i.e., no latent variables are allowed. We assume that the $f_j$ are measurable and thus $X = (X_1, \ldots, X_p)$ is a random vector with an associated probability measure $P_C$. We let $\mathcal{C}$ denote the set of all SEMs. As each $C \in \mathcal{C}$ gives rise to an associated $P_C$, we may associate a statistical model to each subset $\mathcal{C}_0 \subseteq \mathcal{C}$. In this perspective, $\mathcal{C}_0$ takes the role of an index set of the statistical model $\{P_C : C \in \mathcal{C}_0\}$.

We define the additive family $\mathcal{C}_{\mathrm{Add}} \subseteq \mathcal{C}$ by letting $C = (\mathcal{G}, S, \varepsilon) \in \mathcal{C}_{\mathrm{Add}}$ if the structural equations in (1) are of the form

(2) $X_j = f_j(X_{\mathrm{pa}(j)}, \varepsilon_j) = \sum_{k \in \mathrm{pa}(j)} g_{j,k}(X_k) + \varepsilon_j,$

where the errors $\varepsilon_j \sim N(0, \sigma_j^2)$ are independent and $g_{j,k} : \mathbb{R} \to \mathbb{R}$ are nonlinear transformations. In this setting, if $\mathrm{pa}(j) = \emptyset$, then $X_j = \varepsilon_j$. The assumption that the $g_{j,k}$ are non-linear is made in order to ensure identifiability of the model. Below, we collect the list of properties that characterize $\mathcal{C}_{\mathrm{Add}}$. In Assumption 3, $p_X$ denotes the density of $X$.

Assumption 1

The functions $g_{j,k}$ have finite fourth moments, $\int g_{j,k}(X_k)^4 \, \mathrm{d}P_C < \infty$.

Assumption 2

The functions g j , k are two times differentiable and non-linear.

Assumption 3

For any $k \in \{1, \ldots, p\}$ and $\ell_1, \ldots, \ell_d \in \{1, \ldots, p\} \setminus \{k\}$ with $d < p$, the second-order derivatives

$$\frac{\partial^2}{\partial x_k^2} \int p_X(x_1, \ldots, x_p) \, \mathrm{d}(x_{\ell_1}, \ldots, x_{\ell_d}), \qquad \frac{\partial^2}{\partial x_k \partial x_j} \int p_X(x_1, \ldots, x_p) \, \mathrm{d}(x_{\ell_1}, \ldots, x_{\ell_d})$$

exist. In addition, the marginal density $\int p_X(x_1, \ldots, x_p) \, \mathrm{d}(x_{\ell_1}, \ldots, x_{\ell_d})$ is jointly continuous.

Assumption 4

The variances of the noise variables are strictly positive, i.e., $\sigma_j^2 > 0$, $j = 1, \ldots, p$.

Note that the technical condition in Assumption 3 is met when the component functions $g_{j,k}$ and their derivatives are bounded, allowing for differentiation under the integral sign. A variant of this model was proposed in Bühlmann et al. [2]. The set-up of the problem is that we observe iid copies $X^i$, $1 \le i \le n$, of the random vector $X \in \mathbb{R}^p$. The vector $X$ has a distribution $P_{C_0}$, where $P_{C_0}$ is entailed by the SEM $C_0 = (\mathcal{G}_0, S_0, \varepsilon) \in \mathcal{C}_{\mathrm{Add}}$. The goal is to recover the underlying graph $\mathcal{G}_0$ using only the observations $\{X^i\}_{i=1}^n$. The key that makes the estimation of $\mathcal{G}_0$ possible is that it is identifiable, i.e., if $C_1, C_2 \in \mathcal{C}_{\mathrm{Add}}$ have distinct associated graphs $\mathcal{G}_1 \ne \mathcal{G}_2$, then $P_{C_1} \ne P_{C_2}$. This follows from Corollary 31 in the study of Peters et al. [1]. This property is not enjoyed by the traditional linear Gaussian model.
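As a concrete illustration of the data generating mechanism in (2), the following sketch draws iid samples from a toy three-variable additive SEM with DAG $X_1 \to X_2 \to X_3$ and $X_1 \to X_3$; the particular nonlinear functions and noise variances are arbitrary choices of ours and are unrelated to the simulation design of Section 5.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sem(n):
    """Draw n iid samples from a toy additive SEM in C_Add with
    pa(1) = {}, pa(2) = {1}, pa(3) = {1, 2} (illustrative choice)."""
    x1 = rng.normal(0.0, 1.0, n)                               # X1 = eps1
    x2 = np.sin(2.0 * x1) + rng.normal(0.0, 0.7, n)            # X2 = g_{2,1}(X1) + eps2
    x3 = x1**2 / 2.0 + np.tanh(x2) + rng.normal(0.0, 0.5, n)   # X3 = g_{3,1}(X1) + g_{3,2}(X2) + eps3
    return np.column_stack([x1, x2, x3])

X = sample_sem(500)    # each row is one iid copy of (X1, X2, X3)
print(X.shape)         # (500, 3)
```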

Remark 1

An SEM ( G , S , ε ) is interpreted as the specification of a data generating mechanism behind X i . In this perspective, the set C 0 is promoted from being a simple index set for the associated statistical model to the collection of possible underlying mechanisms that generate the observed data. The expression in (1) is therefore interpreted as an assignment rather than an equation. In other words, the value of X j is a direct effect of the outcomes of X pa ( j ) and ε j by the mechanism f j . This interpretation allows for causal statements and we may construct interventional and counterfactual distributions [27].

In Section 2.2, we describe a general estimation technique that is employed to estimate the underlying graph G 0 . In Section 2.3, we provide theoretical results that underpin the estimation procedure described in Section 2.2.

2.2 General estimation procedure

We fix an SEM $C_0 = (\mathcal{G}_0, S_0, \varepsilon) \in \mathcal{C}_{\mathrm{Add}}$ which we regard as giving rise to the true distribution of our observations. We define an ordering of the variables $(X_1, \ldots, X_p)$ to be a permutation $\pi \in \Pi$, where $\Pi$ is the set of all permutations of the indices $\{1, \ldots, p\}$. For any ordering $\pi \in \Pi$, there is an associated "fully connected" DAG $\mathcal{G}^\pi$ which is unique and characterized by having an edge $\pi(i) \to \pi(j)$ if and only if $i < j$. Figure 1 illustrates the fully connected DAG corresponding to an ordering $\pi$. As in the study of Bühlmann et al. [2], we define for the fixed DAG $\mathcal{G}_0$ the corresponding set of true orderings $\Pi_0 \subseteq \Pi$ as

$$\Pi_0 := \{\pi \in \Pi : \mathcal{G}_0 \subseteq \mathcal{G}^\pi\}.$$

In words, $\pi \in \Pi$ is a true ordering if the associated fully connected DAG $\mathcal{G}^\pi$ contains all the edges in the true underlying DAG $\mathcal{G}_0$.

Figure 1

The left panel illustrates the fully connected graph G π corresponding to the permutation π given by π ( 1 ) = 2 , π ( 2 ) = 3 , π ( 3 ) = 1 , and π ( 4 ) = 4 . The right panel displays the same graph G π but using the X π ( j ) notation. The different edge types represent different g j π -functions. For example, the two dashed edges encode the function g 3 π , which is a function of X π ( 1 ) and X π ( 2 ) .
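The fully connected DAG $\mathcal{G}^\pi$ can be built mechanically from a permutation: an edge $\pi(i) \to \pi(j)$ is included whenever $i < j$. The short sketch below (our own illustration) reproduces the edge set of the graph shown in Figure 1.

```python
from itertools import combinations

def fully_connected_dag(pi):
    """Edges of G^pi: pi(i) -> pi(j) for every pair of positions i < j.
    `pi` maps positions 1..p to variable indices, e.g. {1: 2, 2: 3, 3: 1, 4: 4}."""
    return {(pi[i], pi[j]) for i, j in combinations(sorted(pi), 2)}

# Permutation from Figure 1: pi(1)=2, pi(2)=3, pi(3)=1, pi(4)=4
pi = {1: 2, 2: 3, 3: 1, 4: 4}
print(sorted(fully_connected_dag(pi)))
# [(1, 4), (2, 1), (2, 3), (2, 4), (3, 1), (3, 4)]
```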

We define the functions $g_j^\pi$ for $1 \le j \le p$ as the conditional expectation of $X_{\pi(j)}$ given the parent variables $X_{\pi(1)}, \ldots, X_{\pi(j-1)}$ of $X_{\pi(j)}$ in $\mathcal{G}^\pi$,

(3) $g_j^\pi(\cdot) := E_{C_0}\big(X_{\pi(j)} \mid (X_{\pi(1)}, \ldots, X_{\pi(j-1)}) = \cdot\big).$

The subscript $C_0$ indicates that we take the conditional expectation with respect to $P_{C_0}$. In Figure 1, the edges in $\mathcal{G}^\pi$ associated with $g_j^\pi$ for different $j$ are depicted in different line types. The domain of $g_j^\pi$ is $\mathbb{R}^{j-1}$, and $g_1^\pi := E_{C_0}(X_{\pi(1)})$. Finally, we define the constants

(4) $\sigma_j^{2,\pi} := E_{C_0}\big[\big(X_{\pi(j)} - g_j^\pi(X_{\pi(1)}, \ldots, X_{\pi(j-1)})\big)^2\big], \quad j = 1, \ldots, p.$

The general estimation procedure of G 0 is then first to pick a true ordering such that

(5) $\pi^* \in \operatorname{argmin}_{\pi \in \Pi} \sum_{j=1}^p \log \sigma_j^{2,\pi}.$

In Section 2.3, we will show why this implies $\pi^* \in \Pi_0$. With this choice, we have identified a graph $\mathcal{G}^{\pi^*}$ such that $\mathcal{G}_0 \subseteq \mathcal{G}^{\pi^*}$. We prune the fully connected graph $\mathcal{G}^{\pi^*}$ by deleting the edge $X_{\pi^*(k)} \to X_{\pi^*(j)}$ if the function $g_j^{\pi^*}$ is constant as a function of its $k$th component. In Section 2.3, it is shown how this pruning process leads to the correct graph $\mathcal{G}_0$. Now, as we do not have access to the constants $\sigma_j^{2,\pi}$ and the functions $g_j^\pi$ for $\pi \in \Pi$, these quantities need to be estimated. This estimation process is described in Section 3.

2.3 Ordering identification and graph pruning

With $C_0 \in \mathcal{C}_{\mathrm{Add}}$ remaining fixed and the definitions given in Section 2.2, for any ordering $\pi$ we may form an SEM $C^\pi = (\mathcal{G}^\pi, S^\pi, \varepsilon^\pi) \in \mathcal{C}$ with structural equations given by

(6) $X_j^\pi = g_j^\pi(X_1^\pi, \ldots, X_{j-1}^\pi) + \varepsilon_j^\pi, \quad j = 1, \ldots, p,$

where $\varepsilon_j^\pi \sim N(0, \sigma_j^{2,\pi})$. We let $P^\pi$ denote the probability measure associated with the SEM $C^\pi$, i.e., $P^\pi$ is the probability measure associated with the random vector with the $\pi(j)$'th element given by $X_j^\pi$. Note that we do not necessarily have $C^\pi \in \mathcal{C}_{\mathrm{Add}}$ when $\pi \notin \Pi_0$. However, in Proposition 1, we show that we may associate $C^\pi \in \mathcal{C}_{\mathrm{Add}}$ with each $\pi \in \Pi_0$.

Proposition 1

Let $C_0 \in \mathcal{C}_{\mathrm{Add}}$. If $\pi$ is a correct ordering, i.e., $\pi \in \Pi_0$, then for the structural equations in $C^\pi$ it holds that

$$X_j^\pi = \sum_{k : \pi(k) \in \mathrm{pa}(\pi(j))} g_{\pi(j), \pi(k)}(X_k^\pi) + \varepsilon_j^\pi, \quad j = 1, \ldots, p,$$

where $\varepsilon_j^\pi \sim N(0, \sigma_j^{2,\pi})$ with $\sigma_j^{2,\pi} = \sigma_{\pi(j)}^2 > 0$.

In the following, we apply Proposition 1 to show that for a true ordering $\pi \in \Pi_0$, $(X_{\pi(1)}, \ldots, X_{\pi(p)}) \stackrel{d}{=} (X_1^\pi, \ldots, X_p^\pi)$, and thus the entailed $P^\pi$ from the SEM $C^\pi$ is equal to the probability measure entailed by $C_0$.

Proposition 2

Let $C_0 \in \mathcal{C}_{\mathrm{Add}}$. If $\pi \in \Pi_0$, then $P^\pi = P_{C_0}$.

We define

(7) $\xi := \min_{\pi \notin \Pi_0} \mathrm{KL}(P_{C_0}, P^\pi).$

If two measures $P_1$ and $P_2$ have densities $p_1$ and $p_2$ with a common support, we may define the KL-divergence from $P_1$ to $P_2$ by $\mathrm{KL}(P_1, P_2) := E_{P_1}[\log(p_1(X)/p_2(X))]$. Note that $\xi$ depends on the underlying SEM $C_0$, but we suppress this dependency in the notation. By Proposition 2, we have that $\mathrm{KL}(P_{C_0}, P^\pi) = 0$ for all $\pi \in \Pi_0$. Below in Proposition 4, we show that $\mathrm{KL}(P_{C_0}, P^\pi) > 0$ for all $\pi \notin \Pi_0$ so that $\xi > 0$. For the latter, we use a slight modification of Corollary 31 in the study of Peters et al. [1], which is based on the assumption that the $g_j^\pi$ are twice differentiable. This smoothness requirement for $g_j^\pi$ is met for $\pi \in \Pi_0$ by Proposition 1. The next proposition extends it to all $\pi \in \Pi$.

Proposition 3

Let $C_0 \in \mathcal{C}_{\mathrm{Add}}$. For any ordering $\pi \in \Pi$ and $1 \le j \le p$, there exists a version of $g_j^\pi$ that is twice differentiable in each argument.

The following proposition is a consequence of Proposition 3 and a modification of Corollary 31 in the study of Peters et al. [1].

Proposition 4

For any $C_0 \in \mathcal{C}_{\mathrm{Add}}$, it holds that $\xi > 0$.

Proposition 4 gives a way of separating correct orderings from incorrect orderings in the sense that $\mathcal{G}_0 \subseteq \mathcal{G}^\pi$ if and only if $\mathrm{KL}(P_{C_0}, P^\pi) = 0$. Below, we derive a closed-form expression of $\xi$ defined at (7).

Proposition 5

Fix an SEM $C_0 \in \mathcal{C}_{\mathrm{Add}}$. For any $\pi_0 \in \Pi_0$ and $\pi_1 \in \Pi$, it holds that

$$\mathrm{KL}(P^{\pi_0}, P^{\pi_1}) = \frac{1}{2} \sum_{j=1}^p \log \sigma_j^{2,\pi_1} - \frac{1}{2} \sum_{j=1}^p \log \sigma_j^{2,\pi_0}.$$

In particular, $\xi = \min_{\pi_1 \notin \Pi_0} \mathrm{KL}(P^{\pi_0}, P^{\pi_1})$.

Combining Propositions 2–5, we obtain that $\mathcal{G}_0 \subseteq \mathcal{G}^{\pi^*}$, i.e., $\pi^* \in \Pi_0$, if and only if $\pi^*$ fulfills (5). A slight variant of this characterization of a true ordering was presented in the study of Bühlmann et al. [2].

Assume now that $\pi$ is chosen such that $\mathcal{G}_0 \subseteq \mathcal{G}^\pi$. By Proposition 1, if there exists an edge from $X_{\pi(k)}$ to $X_{\pi(j)}$ in $\mathcal{G}^\pi$ but no edge from $X_{\pi(k)}$ to $X_{\pi(j)}$ exists in the true $\mathcal{G}_0$, then the corresponding function $g_j^\pi$ is a constant function of $X_{\pi(k)}$. By the same proposition, if there exists an edge from $X_{\pi(k)}$ to $X_{\pi(j)}$ in $\mathcal{G}_0$, then $g_j^\pi$ is a non-constant function of $X_{\pi(k)}$. Thus, removing the edge $X_{\pi(k)} \to X_{\pi(j)}$ from $\mathcal{G}^\pi$ whenever $g_j^\pi$ is constant as a function of its $k$th component leads to the true underlying $\mathcal{G}_0$.
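As a purely illustrative rendering of this pruning rule (the actual variable selection used later relies on a functional lasso, see Section 5), one may drop the edge $X_{\pi(k)} \to X_{\pi(j)}$ whenever the fitted $k$th additive component of $g_j^\pi$ is numerically constant over its evaluation grid; the container `component_estimates` below is a hypothetical output of such a fit.

```python
import numpy as np

def prune_ordering(pi, component_estimates, tol=1e-2):
    """Keep the edge pi(k) -> pi(j) only if the fitted k-th additive component of
    g_j^pi varies by more than `tol` over its evaluation grid.
    `pi[k]` is the variable at position k (1-based); `component_estimates[(j, k)]`
    is an array of fitted component values on a grid (hypothetical SBF output)."""
    edges = []
    p = len(pi)
    for j in range(2, p + 1):
        for k in range(1, j):
            values = component_estimates[(j, k)]
            if np.max(values) - np.min(values) > tol:   # non-constant component
                edges.append((pi[k], pi[j]))
    return edges
```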

3 Estimation of a true ordering

In this section, we detail how a true ordering $\pi^*$ at (5) is estimated from iid copies $X^i$, $1 \le i \le n$, of $X$. In order to construct an estimator of $\pi^*$ at (5), we need to estimate the conditional expectations $g_j^\pi$ at (3) for all permutations $\pi \in \Pi$. Here, we describe how these are estimated by an SBF procedure. With the estimators, say $\hat{g}_j^\pi$, we may estimate the variances $\sigma_j^{2,\pi}$ at (4) by their empirical versions,

(8) $\hat{\sigma}_j^{2,\pi} := \frac{1}{n} \sum_{i=1}^n \big(X_{\pi(j)}^i - \hat{g}_j^\pi(X_{\pi(1)}^i, \ldots, X_{\pi(j-1)}^i)\big)^2.$

With these estimates, we may construct the estimator of a true ordering $\pi^*$ by

(9) $\hat{\pi} \in \operatorname{argmin}_{\pi \in \Pi} \sum_{j=1}^p \log \hat{\sigma}_j^{2,\pi}.$

With the ordering $\hat{\pi}$, we estimate the true graph $\mathcal{G}_0$ by performing variable selection among $X_{\hat{\pi}(1)}, \ldots, X_{\hat{\pi}(j-1)}$ for each $j = 1, \ldots, p$.
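For small $p$, the estimator (9) can be computed by exhaustive search over all permutations once a regression estimator of $g_j^\pi$ is available. The sketch below is our own outline of this step; `fit_additive_regression` is a hypothetical placeholder for the SBF fit described in the next subsection, assumed to return fitted values of an additive regression.

```python
import numpy as np
from itertools import permutations

def estimate_ordering(X, fit_additive_regression):
    """X: (n, p) data matrix. fit_additive_regression(Z, y) is assumed to return
    fitted values y_hat of an additive regression of y on the columns of Z."""
    n, p = X.shape
    best_score, best_pi = np.inf, None
    for pi in permutations(range(p)):           # feasible only for small p
        score = 0.0
        for j in range(p):
            y = X[:, pi[j]]
            if j == 0:
                resid = y - y.mean()            # no parents: g_1^pi is a constant
            else:
                Z = X[:, list(pi[:j])]
                resid = y - fit_additive_regression(Z, y)
            score += np.log(np.mean(resid**2))  # log of sigma_hat^2 as in (8)-(9)
        if score < best_score:
            best_score, best_pi = score, pi
    return best_pi
```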

3.1 SBF

We describe the SBF technique in a general regression setting. Let $D = D_1 \times \cdots \times D_d \subseteq \mathbb{R}^d$ for some compact intervals $D_j \subseteq \mathbb{R}$. Suppose that $Z = (Z_1, \ldots, Z_d)$ is a random vector taking values in $D$ with a density $q$ with respect to the Lebesgue measure. Let $m : \mathbb{R}^d \to \mathbb{R}$ be the regression function that relates $Z$ to a response variable $Y$ by

(10) $Y = m(Z) + \varepsilon,$

where $\varepsilon$ is a random variable with $E(\varepsilon \mid Z) = 0$. The structural equations at (6) can be expressed as (10) by a specialization of $(Z, Y, \varepsilon)$. Let $\mathcal{M}_{\mathrm{add}}(q)$ denote the space of additive functions $f \in L^2(q)$ of the form $f(z) = f_1(z_1) + \cdots + f_d(z_d)$. Put

(11) $m_+ := \mathrm{proj}(m \mid \mathcal{M}_{\mathrm{add}}(q)),$

the projection of $m$ onto $\mathcal{M}_{\mathrm{add}}(q)$. Thus, there exist $m_j : \mathbb{R} \to \mathbb{R}$ such that $m_+(z) = m_1(z_1) + \cdots + m_d(z_d)$. Define $\mu_j(\cdot) := E(Y \mid Z_j = \cdot)$. Let $q_j$ and $q_{jk}$ be the densities of $Z_j$ and $(Z_j, Z_k)$, respectively. By the projection property of $m_+$, we have $E(m(Z) - m_+(Z) \mid Z_j = \cdot) = 0$, so that (10) implies a set of integral equations,

(12) $m_j(z_j) = \mu_j(z_j) - \sum_{k \ne j}^{d} \int_{D_k} m_k(z_k) \frac{q_{jk}(z_j, z_k)}{q_j(z_j)} \, \mathrm{d}z_k, \quad z_j \in D_j, \ 1 \le j \le d.$

Assume now that we observe iid copies $(Z^i, Y^i)$ of $(Z, Y)$ for $1 \le i \le n$. Replacing $\mu_j(z_j)$, $q_j(z_j)$, and $q_{jk}(z_j, z_k)$ with nonparametric estimators obtained from the observations $(Z^i, Y^i)$ gives rise to the SBF estimator of $m_+$. Below, we describe how the nonparametric estimators are constructed.

We first define the normalized kernel weight functions $K_{h_j} : D_j \times D_j \to \mathbb{R}$ by

$$K_{h_j}(z_j, u_j) := \left( \int_{D_j} K\!\left(\frac{v - u_j}{h_j}\right) \mathrm{d}v \right)^{-1} K\!\left(\frac{z_j - u_j}{h_j}\right).$$

Here, the function K is taken to be a symmetric, bounded, nonnegative baseline kernel function with compact support. From these, we estimate the marginal densities q j and q j k , respectively, by

(13) $\hat{q}_j(z_j) := \frac{1}{n} \sum_{i=1}^n K_{h_j}(z_j, Z_j^i), \qquad \hat{q}_{jk}(z_j, z_k) := \frac{1}{n} \sum_{i=1}^n K_{h_j}(z_j, Z_j^i) K_{h_k}(z_k, Z_k^i).$

From the marginal density estimators $\hat{q}_j$, we construct the Nadaraya-Watson (NW) estimators of the conditional expectations $E(Y \mid Z_j = z_j)$,

(14) $\hat{\mu}_j(z_j) := \frac{1}{\hat{q}_j(z_j)} \cdot \frac{1}{n} \sum_{i=1}^n Y^i K_{h_j}(z_j, Z_j^i).$

We then define the NW-SBF estimator of $m_+$ by $\hat{m}_+ := \hat{m}_1 + \cdots + \hat{m}_d$, where $\{\hat{m}_j\}_{j=1}^d$ solves the system of integral equations

(15) $\hat{m}_j(z_j) = \hat{\mu}_j(z_j) - \sum_{k \ne j}^{d} \int_{D_k} \hat{m}_k(z_k) \frac{\hat{q}_{jk}(z_j, z_k)}{\hat{q}_j(z_j)} \, \mathrm{d}z_k, \quad 1 \le j \le d.$

The existence, uniqueness, and consistency of $\hat{m}_+$ are ensured under weak conditions. We note that (15) is obtained by substituting the estimators at (13)–(14) for $q_j$, $q_{jk}$, and $\mu_j$ in (12). When $m$ is an additive function, i.e., $m \in \mathcal{M}_{\mathrm{add}}(q)$, then $m_+ = m$, so that the SBF estimator $\hat{m}_+$ is considered an estimator of the underlying regression function $m$ at (10). We impose the following conditions for the statistical properties of $\hat{m}_+$.

Assumption (B1)

The joint density q of Z is continuous and bounded away from zero and infinity on the domain D .

Assumption (B2)

The marginal regression functions μ j are continuous on D j .

Assumption (B3)

$E|Y|^\alpha < \infty$ holds for some $\alpha > 2$.

Assumption (C1)

The baseline kernel function K has compact support, is bounded, symmetric about zero and Lipschitz continuous.

Assumption (C2)

The bandwidths satisfy that $h_j \to 0$ and $n^{\tau_j} h_j$ are bounded away from zero for some $\tau_j > 0$ with $\tau_j < (\alpha - 2)/\alpha$ for all $1 \le j \le d$ and $\tau_j + \tau_k < 1$ for all $1 \le j \ne k \le d$, where $\alpha$ is the constant in Assumption (B3).

In the SBF literature, it is typically assumed that q is partially continuously differentiable on D and each component m j of m + is twice continuously differentiable. In our work, however, we make weaker assumptions as stated in Assumptions (B1) and (B2), since we need only the consistency of q ˆ j , q ˆ j k and μ ˆ j as the estimators of q j , q j k and μ j . The assumptions stated in Assumptions (B3), (C1) and (C2) are commonly made in the SBF literature.

Proposition 6

Under the conditions in Assumptions (B1), (C1), and (C2), there exists a unique solution $\hat{m}_+$ that solves (15) with probability tending to one. Also, under the conditions in Assumptions (B1)–(B3) and (C1)–(C2), it holds that the SBF estimator $\hat{m}_+$ converges uniformly to $m_+$ at (11), i.e., $\sup_{z \in D} |\hat{m}_+(z) - m_+(z)| = o_p(1)$.

There is a geometric interpretation of the SBF estimator $\hat{m}_+$. Let $\mathcal{M}_{\mathrm{add}}(\hat{q})$ be the empirical version of $\mathcal{M}_{\mathrm{add}}(q)$, where $q$ is replaced by $\hat{q}$ with $\hat{q}(z) := n^{-1} \sum_{i=1}^n \prod_{j=1}^d K_{h_j}(z_j, Z_j^i)$. Then, in the $n$-product space $L^2(\hat{q}) \times \cdots \times L^2(\hat{q})$ endowed with a suitable norm, the $n$ copies of the SBF estimator, $(\hat{m}_+, \ldots, \hat{m}_+)$, constitute the projection of $(Y^1, \ldots, Y^n)$ onto $\mathcal{M}_{\mathrm{add}}(\hat{q}) \times \cdots \times \mathcal{M}_{\mathrm{add}}(\hat{q})$. The imposed additive structure frees the estimator from the curse of dimensionality by which the full-dimensional NW estimator is plagued. We refer to the literature [3] and [8] for a more comprehensive discussion on the SBF estimator.
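In practice, the system (15) is solved by iterating the backfitting updates until the components stabilize, with all integrals evaluated numerically on grids. The following simplified implementation on the unit cube is our own sketch of this scheme and not the authors' code: the Epanechnikov kernel, the grid size, the fixed number of sweeps, and the centering of each component are illustrative choices.

```python
import numpy as np

def epanechnikov(t):
    return 0.75 * np.maximum(1.0 - t**2, 0.0)

def sbf_fit(Z, Y, h, n_grid=101, n_iter=50):
    """NW-SBF on [0,1]^d: returns the grid, the fitted component functions on the
    grid, and fitted values at the observations.
    Z: (n, d) covariates rescaled to [0,1]; Y: (n,) responses; h: (d,) bandwidths."""
    n, d = Z.shape
    grid = np.linspace(0.0, 1.0, n_grid)
    dz = grid[1] - grid[0]

    def integrate(vals, axis):
        return vals.sum(axis=axis) * dz            # Riemann approximation over a grid axis

    # Normalized kernel weights K_{h_j}(z, Z_j^i): rows = grid points, columns = observations.
    W = []
    for j in range(d):
        raw = epanechnikov((grid[:, None] - Z[None, :, j]) / h[j])
        W.append(raw / integrate(raw, axis=0)[None, :])

    # Marginal densities and NW estimators, cf. (13) and (14); small floor avoids division by zero.
    q_hat = [np.maximum(W[j].mean(axis=1), 1e-12) for j in range(d)]
    mu_hat = [(W[j] @ Y) / n / q_hat[j] for j in range(d)]

    Ybar = Y.mean()
    m_hat = [np.zeros(n_grid) for _ in range(d)]
    for _ in range(n_iter):                         # backfitting sweeps solving (15)
        for j in range(d):
            update = mu_hat[j] - Ybar
            for k in range(d):
                if k == j:
                    continue
                q_jk = (W[j] @ W[k].T) / n          # bivariate density on the grid, cf. (13)
                update -= integrate(q_jk * m_hat[k][None, :], axis=1) / q_hat[j]
            m_hat[j] = update - integrate(update * q_hat[j], axis=0)   # center the component
    fitted = Ybar + sum(np.interp(Z[:, j], grid, m_hat[j]) for j in range(d))
    return grid, m_hat, fitted
```

For the ordering step of Section 3.2, `Z` would hold the filtered parent observations rescaled to the unit cube and `Y` the corresponding response values.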

3.2 Estimation of a true ordering

In this subsection, we describe an ordering estimator as prescribed at (9) and show that it is a true ordering with probability tending to one under weak conditions. Recall that the SBF technique is applicable to compactly supported covariates. Since the variables in $X$ are not compactly supported, we consider a Cartesian product $C = C_1 \times \cdots \times C_p \subseteq \mathbb{R}^p$, where the $C_j$ are some compact intervals with non-empty interior, and then estimate the functions $g_j^\pi$ defined at (3) on the respective sets

(16) $C_j^\pi := \prod_{k=1}^{j-1} C_{\pi(k)}$

by the SBF method. To this end, we construct the random variables

$$(\tilde{X}_{\pi(1)}, \ldots, \tilde{X}_{\pi(j-1)}; \tilde{X}_{\pi(j)}) \stackrel{d}{=} (X_{\pi(1)}, \ldots, X_{\pi(j-1)}; X_{\pi(j)}) \,\Big|\, \bigcap_{k=1}^{j-1} \{X_{\pi(k)} \in C_{\pi(k)}\},$$

and we note that $g_j^\pi(\cdot) = E_{C_0}[\tilde{X}_{\pi(j)} \mid (\tilde{X}_{\pi(1)}, \ldots, \tilde{X}_{\pi(j-1)}) = \cdot]$ on $C_j^\pi$. We may construct an iid sequence

(17) $\{(\tilde{X}_{\pi(1)}^i, \ldots, \tilde{X}_{\pi(j-1)}^i, \tilde{X}_{\pi(j)}^i)\}_{i \ge 1},$

where $(\tilde{X}_{\pi(1)}^i, \ldots, \tilde{X}_{\pi(j-1)}^i, \tilde{X}_{\pi(j)}^i) \stackrel{d}{=} (\tilde{X}_{\pi(1)}, \ldots, \tilde{X}_{\pi(j-1)}, \tilde{X}_{\pi(j)})$. The construction simply amounts to discarding the observations for which $(X_{\pi(1)}^i, \ldots, X_{\pi(j-1)}^i) \notin C_j^\pi$. Note that $\tilde{X}_{\pi(j)}$ is not required to land within the set $C_{\pi(j)}$. We call the observations at (17) the filtered observations for the response $X_{\pi(j)}$. Importantly, we note that the indices of the filtered observations at (17) for different $j \in \{1, \ldots, p\}$ are likely to be different. These filtered observations are well-defined since $P((X_{\pi(1)}, \ldots, X_{\pi(j-1)}) \in C_j^\pi) > 0$.
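The filtering step itself is straightforward: for the response $X_{\pi(j)}$, keep exactly those rows whose first $j-1$ ordered covariates fall into their respective intervals. A short sketch (our own illustration; `C` is any list of per-variable intervals, for instance the observed ranges suggested at the end of this section):

```python
import numpy as np

def filtered_observations(X, pi, j, C):
    """Return the rows of X (an (n, p) matrix) used to estimate g_j^pi: those i with
    X_{pi(k)}^i in C_{pi(k)} for every k < j.
    `pi` is a tuple of 0-based variable indices, `j` is the 1-based position in the
    ordering, and `C[v] = (low, high)` is the interval for variable v."""
    keep = np.ones(len(X), dtype=bool)
    for k in range(j - 1):                    # the response X_{pi(j)} itself is not restricted
        low, high = C[pi[k]]
        keep &= (X[:, pi[k]] >= low) & (X[:, pi[k]] <= high)
    covariates = X[keep][:, [pi[k] for k in range(j - 1)]]
    response = X[keep, pi[j - 1]]
    return covariates, response
```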

Using these observations, we obtain the ordering estimator by applying the SBF method described in Section 3.1 to estimating

(18) $g_{j,+}^\pi := \mathrm{proj}(g_j^\pi \mid \mathcal{M}_{\mathrm{add}}(p_{j,C}^\pi))$

for each $\pi$ and $j$, where $p_{j,C}^\pi$ is the joint density of $(\tilde{X}_{\pi(1)}, \ldots, \tilde{X}_{\pi(j-1)})$. Note that the functions $g_{j,+}^\pi$ depend on the compact set $C$, but this dependence is suppressed in the notation. Furthermore, if $\pi$ is a true ordering and the underlying SEM satisfies $C_0 \in \mathcal{C}_{\mathrm{Add}}$ with structural equations (2), then $g_j^\pi$ is additive by Proposition 1, so that $g_{j,+}^\pi = g_j^\pi$.

With the filtered observations for the response $X_{\pi(j)}$, we let $\hat{g}_{j,C}^\pi$ denote the SBF estimator of $g_{j,+}^\pi$ at (18), where $\tilde{X}_{\pi(j)}$ takes the role of the response $Y$ and $(\tilde{X}_{\pi(1)}, \ldots, \tilde{X}_{\pi(j-1)})$ takes the role of the covariate $Z$. This involves constructing the marginal density estimators for the variables in $(\tilde{X}_{\pi(k)} : 1 \le k \le j-1)$ according to (13) and the NW estimators of $E_{C_0}[\tilde{X}_{\pi(j)} \mid \tilde{X}_{\pi(k)} = \cdot]$ for $k \in \{1, \ldots, j-1\}$ according to (14), and then solving the resulting SBF equation at (15). The procedure yields the estimators $\hat{\sigma}_{j,C}^{2,\pi}$ of $\sigma_{j,+}^{2,\pi} := E_{C_0}\big[\big(\tilde{X}_{\pi(j)} - g_{j,+}^\pi(\tilde{X}_{\pi(1)}, \ldots, \tilde{X}_{\pi(j-1)})\big)^2\big]$ according to the formula at (8), i.e.,

(19) $\hat{\sigma}_{j,C}^{2,\pi} = \frac{1}{|I_j^\pi|} \sum_{i \in I_j^\pi} \big(\tilde{X}_{\pi(j)}^i - \hat{g}_{j,C}^\pi(\tilde{X}_{\pi(1)}^i, \ldots, \tilde{X}_{\pi(j-1)}^i)\big)^2,$

where $I_j^\pi := \{i : (X_{\pi(1)}^i, \ldots, X_{\pi(j-1)}^i) \in C_j^\pi\}$. This yields the ordering estimator

(20) $\hat{\pi}^C \in \operatorname{argmin}_{\pi \in \Pi} \sum_{j=1}^p \log \hat{\sigma}_{j,C}^{2,\pi}.$

We include the superscript C in the notation π ˆ C to indicate that the ordering estimator depends on the compact set C that we filter the observations by.

We note that, for each $1 \le j \le p$, the density of $(\tilde{X}_{\pi(1)}, \ldots, \tilde{X}_{\pi(j-1)})$ is bounded away from zero and infinity, and thus, it satisfies the condition in Assumption (B1) with $Z = (\tilde{X}_{\pi(1)}, \ldots, \tilde{X}_{\pi(j-1)})$. Furthermore, the condition in Assumption 3 implies that the marginal regression functions $\mu_k$ satisfy Assumption (B2) with $Y = \tilde{X}_{\pi(j)}$ and $Z = (\tilde{X}_{\pi(1)}, \ldots, \tilde{X}_{\pi(j-1)})$. The latter can be proved similarly as in the proof of Proposition 3. Finally, Assumption (B3) holds for such a choice of $(Y, Z)$ by Assumption 1 and the Gaussian error assumption. We formally state the results in the following proposition.

Proposition 7

Let $C_0 \in \mathcal{C}_{\mathrm{Add}}$. For any ordering $\pi \in \Pi$ and $j \in \{1, \ldots, p\}$, Assumptions (B1)–(B3) are satisfied for $\tilde{X}_{\pi(j)}$ as the response variable $Y$ and $(\tilde{X}_{\pi(1)}, \ldots, \tilde{X}_{\pi(j-1)})$ as the predictor vector $Z$.

A consequence of Proposition 7 is that we need only Assumptions (C1) and (C2) for the uniform convergence result in Proposition 6 with $m_+ = g_{j,+}^\pi$ and $\hat{m}_+ = \hat{g}_{j,C}^\pi$. For the theoretical properties of the ordering estimator $\hat{\pi}^C$, we start with its existence. The following proposition is a direct consequence of the first part of Proposition 6.

Proposition 8

Assume that $\{X^i\}_{1 \le i \le n}$ are generated from $P_{C_0}$ with $C_0 \in \mathcal{C}_{\mathrm{Add}}$, and assume that Assumptions (C1) and (C2) hold. For any choice of compact set $C = C_1 \times \cdots \times C_p \subseteq \mathbb{R}^p$ where each $C_j$ is a non-trivial interval, the ordering estimator $\hat{\pi}^C$ in (20) exists with probability tending to one.

According to our discussion in Section 2.3, the key element for the consistency of $\hat{\pi}^C$ is the consistency of $\hat{\sigma}_{j,C}^{2,\pi}$ as an estimator of $\sigma_j^{2,\pi}$ for all $j$. Our next result demonstrates that they are consistent for any non-trivial compact set $C$ when $\pi$ is a true ordering. When $\pi$ is not a true ordering, $\hat{\sigma}_{j,C}^{2,\pi}$ cannot be consistent as an estimator of $\sigma_j^{2,\pi}$ since $\hat{g}_{j,C}^\pi$ does not converge to $g_j^\pi$ but to its closest additive approximation $g_{j,+}^\pi$ according to Proposition 6. In the latter case, however, we may pick a large enough compact set $C$ for any arbitrarily small positive number $\eta > 0$ such that each $\hat{\sigma}_{j,C}^{2,\pi}$ converges to a constant greater than $\sigma_j^{2,\pi} - \eta$.

Theorem 1

Assume that $\{X^i\}_{1 \le i \le n}$ are generated from $P_{C_0}$ with $C_0 \in \mathcal{C}_{\mathrm{Add}}$. Assume also that Assumptions (C1) and (C2) hold.

(i) Case 1 (true orderings): Suppose that $\pi \in \Pi_0$. If the sets $C_{\pi(1)}, \ldots, C_{\pi(j-1)}$ are compact with non-empty interior, then it holds that $\hat{\sigma}_{j,C}^{2,\pi} \stackrel{P}{\to} \sigma_j^{2,\pi}$ as $n \to \infty$, where $\hat{\sigma}_{j,C}^{2,\pi}$ is given by (19).

(ii) Case 2 (non-true orderings): For any $\eta > 0$, there exists a compact set $C^* = C_1^* \times \cdots \times C_p^* \subseteq \mathbb{R}^p$, with the $C_j^*$ being non-trivial intervals, such that, for any $\pi \notin \Pi_0$ and $j \in \{1, \ldots, p\}$, it holds that $\hat{\sigma}_{j,C}^{2,\pi} \stackrel{P}{\to} b_j > \sigma_j^{2,\pi} - \eta$ whenever $C_j$ contains $C_j^*$ for all $1 \le j \le p$.

Both cases in Theorem 1 are the consequences of the second part of Proposition 6 regarding the uniform convergence of the SBF estimator on compact domains. Theorem 1, in turn, implies that we can construct a compact set C large enough so that we obtain consistency of the ordering estimator π ˆ C . This is demonstrated in the following theorem.

Theorem 2

Assume that $\{X^i\}_{1 \le i \le n}$ are generated from $P_{C_0}$ with $C_0 \in \mathcal{C}_{\mathrm{Add}}$. Assume also that Assumptions (C1) and (C2) hold. Then, there exists a non-trivial compact set $C^* \subseteq \mathbb{R}^p$ such that for any $C \supseteq C^*$ it holds that $P(\hat{\pi}^C \in \Pi_0) \to 1$ as $n \to \infty$.

We note that the conditions for Theorem 2 are weaker than those for Theorem 1 in Bühlmann et al. [2], which in part relies on an empirical process argument.

In the practical implementation of $\hat{\pi}^C$, we need to choose a compact set $C$. We suggest taking $C_j$ as the smallest closed interval that contains all observations $X_j^i$ across $i$. We make this choice for all $j \in \{1, \ldots, p\}$. There is a slight gap between practice and theory. In practice, the set $C$ is chosen after the observations have been made in order for $C$ to include all observations. In theory, $C$ is chosen before we make the observations, and then when we estimate $g_{j,+}^\pi$ we filter out the observations for which $(X_{\pi(1)}^i, \ldots, X_{\pi(j-1)}^i) \notin C_j^\pi$. This kind of gap, however, is inevitable not only in additive regression but also in nonparametric regression problems in general.

4 Ordering estimation with heterogeneous noise

In this section, we explore the setting where the noise is heterogeneous. In Section 4.1, we introduce a heterogeneous structural equation family which will serve as the statistical model for the observations. In Section 4.2, we state an identifiability result regarding the variable ordering. Finally, in Section 4.3, we present a consistency result for the ordering estimator which is valid under some extra conditions. We note that most of the theory here mirrors that of the previous sections, with a few exceptions that we highlight. In particular, the uniform convergence result of Proposition 6, which is the key theorem underpinning the results of Section 3.2, does not rely on the homogeneity of the error variances.

4.1 Model set-up

We generalize the structural equation family $\mathcal{C}_{\mathrm{Add}}$ as follows. We let $\mathcal{C}_{\mathrm{Add}}^* \subseteq \mathcal{C}$ denote the heterogeneous additive family, characterized by letting $C = (\mathcal{G}, S, \varepsilon) \in \mathcal{C}_{\mathrm{Add}}^*$ if the structural equations in $S$ are of the form

(21) $X_j = f_j(X_{\mathrm{pa}(j)}, \varepsilon_j) = \sum_{k \in \mathrm{pa}(j)} g_{j,k}(X_k) + \sigma_j(X_{\mathrm{pa}(j)}) \varepsilon_j.$

In the model (21), the errors $\varepsilon_j \sim N(0, 1)$ are independent and $\sigma_j : \mathbb{R}^{|\mathrm{pa}(j)|} \to \mathbb{R}$ are functions specifying the variance. We make the same assumptions on the functions $g_{j,k}$ as in Assumptions 1–3. However, we replace Assumption 4 by a more general one.

Assumption 4’

The functions $\sigma_j$ are continuous and there exist constants $0 < c \le C < \infty$ such that $\sigma_j(x_{\mathrm{pa}(j)}) \in [c, C]$ for all $x_{\mathrm{pa}(j)} \in \mathbb{R}^{|\mathrm{pa}(j)|}$.

It follows that $\{P_C : C \in \mathcal{C}_{\mathrm{Add}}\} \subseteq \{P_C : C \in \mathcal{C}_{\mathrm{Add}}^*\}$ since Assumption 4' holds for homogeneous variances $\sigma_j^2 > 0$.

4.2 Identifiability of the true variable ordering

In a similar manner as in the case with homogeneous noise, for a fixed $C_0 \in \mathcal{C}_{\mathrm{Add}}^*$, we can define an SEM $C^\pi$ for each $\pi \in \Pi$ as follows. We define the conditional expectations $g_j^\pi(\cdot)$ as at (3), but now we define the variance functions $\sigma_j^{2,\pi} : \mathbb{R}^{j-1} \to (0, \infty)$ by

(22) $\sigma_j^{2,\pi}(\cdot) := \mathrm{Var}_{C_0}\big(X_{\pi(j)} \mid (X_{\pi(1)}, \ldots, X_{\pi(j-1)}) = \cdot\big) = E_{C_0}\big[\big(X_{\pi(j)} - g_j^\pi(X_{\pi(1)}, \ldots, X_{\pi(j-1)})\big)^2 \mid (X_{\pi(1)}, \ldots, X_{\pi(j-1)}) = \cdot\big].$

The associated SEM $C^\pi \in \mathcal{C}$ is then the SEM with the DAG given by $\mathcal{G}^\pi$ and structural equations of the form $X_j^\pi = g_j^\pi(X_1^\pi, \ldots, X_{j-1}^\pi) + \sigma_j^\pi(X_1^\pi, \ldots, X_{j-1}^\pi) \varepsilon_j^\pi$ with $\varepsilon_j^\pi \sim N(0, 1)$. We let $\xi$ be defined as at (7). We have the following variant of Proposition 4.

Proposition 9

If $C_0 \in \mathcal{C}_{\mathrm{Add}}^*$ and the functions $g_{j,k}$ are not of the form $g_{j,k} = Q_2/Q_1$, where $Q_1$ is a polynomial of degree two with $Q_1 > 0$ and $Q_2$ is a polynomial of degree two or less, then $\xi > 0$.

The proof is based on an identifiability result from the study of Immer et al. [12]. The moral of Proposition 9 is that only in pathological cases do we have $\xi = 0$, so that assuming $\xi > 0$ is not restrictive in practice. There are other ways of imposing restrictions on the functions $g_{j,k}$ that also imply $\xi > 0$, see [12] and [28]. Furthermore, we have the following closed-form expression of $\xi$, which resembles the one in Proposition 5.

Proposition 10

Fix an SEM $C_0 \in \mathcal{C}_{\mathrm{Add}}^*$. For any $\pi_0 \in \Pi_0$ and $\pi_1 \in \Pi$, it holds that

$$\mathrm{KL}(P^{\pi_0}, P^{\pi_1}) = \frac{1}{2} \sum_{j=1}^p E_{C_0}\big[\log \sigma_j^{2,\pi_1}(X_{\pi_1(1)}, \ldots, X_{\pi_1(j-1)})\big] - \frac{1}{2} \sum_{j=1}^p E_{C_0}\big[\log \sigma_j^{2,\pi_0}(X_{\pi_0(1)}, \ldots, X_{\pi_0(j-1)})\big].$$

In particular, $\xi = \min_{\pi_1 \notin \Pi_0} \mathrm{KL}(P^{\pi_0}, P^{\pi_1})$.

In a similar way to the homogeneous variance case, we have that the ordering $\pi_0 \in \Pi$ is a correct ordering if and only if

(23) $\pi_0 \in \operatorname{argmin}_{\pi \in \Pi} \sum_{j=1}^p E_{C_0}\big[\log \sigma_j^{2,\pi}(X_{\pi(1)}, \ldots, X_{\pi(j-1)})\big].$

4.3 Consistency of the ordering estimator

We do not have access to the functions σ j 2 , π in (23). However, by Jensen’s inequality,

$$E_{C_0}\big[\log \sigma_j^{2,\pi}(X_{\pi(1)}, \ldots, X_{\pi(j-1)})\big] \le \log E_{C_0}\big[\sigma_j^{2,\pi}(X_{\pi(1)}, \ldots, X_{\pi(j-1)})\big], \quad j = 1, \ldots, p.$$

We consider an alternative ordering π ¯ that minimizes the upper bound of the sum at (23),

(24) $\bar{\pi} \in \operatorname{argmin}_{\pi \in \Pi} \sum_{j=1}^p \log E_{C_0}\big[\sigma_j^{2,\pi}(X_{\pi(1)}, \ldots, X_{\pi(j-1)})\big].$

The estimated ordering π ˆ C defined at (20) converges to π ¯ in the case of heteroscedastic noises, which leads to the following consistency result.

Theorem 3

Assume that $\{X^i\}_{1 \le i \le n}$ are generated from $P_{C_0}$ with $C_0 \in \mathcal{C}_{\mathrm{Add}}^*$. Assume also that Assumptions (C1) and (C2) hold. If $\bar{\pi} \in \Pi_0$ for all choices of $\bar{\pi}$ at (24), then there exists a non-trivial compact set $C^* \subseteq \mathbb{R}^p$ such that for any $C \supseteq C^*$ it holds that $P(\hat{\pi}^C \in \Pi_0) \to 1$ as $n \to \infty$, where $\hat{\pi}^C$ is the ordering estimator defined at (20).

As shown in Proposition 9, requiring ξ > 0 is a mild condition. In Appendix A.16, we investigate the condition π ¯ Π 0 in a simulated example. In particular, we shall see how the heterogeneity of the variance and the non-linearity of the functions g j , k affect the condition π ¯ Π 0 .

5 Simulation study

The structure of this section is as follows. Section 5.1 details the simulation setup of the subsequent numerical experiments. In Section 5.2, we compare our proposed SBF estimator against some of the state-of-the-art algorithms introduced in Section 1 for recovering the variable ordering. In Section 5.3, we include the variable selection step in the comparison, where the goal is the recovery of the underlying DAG, not only of the variable ordering. Section 5.4 contains some remarks on the comparison of different methods.

Throughout all the experiments in this section, we let the number of observed variables p be 10 and the experiments are all repeated 100 times. Some additional simulation results with p = 20 are contained in Appendix A.17. Extending the method to an even larger p would require either a preliminary neighborhood selection such as in the study of Bühlmann et al. [2], or a more efficient search algorithm (e.g., the hill climbing algorithm).

5.1 Simulation set-up

When evaluating the quality of the variable ordering estimate, we use the structural intervention distance (SID) introduced in the study of Peters and Bühlmann [29]. This is a pre-distance measure over DAGs which has the property that if $\mathcal{G}_1 \subseteq \mathcal{G}_2$ then $\mathrm{SID}(\mathcal{G}_1, \mathcal{G}_2) = 0$. The other graph distance measure we use is the structural Hamming distance (SHD). The SHD between $\mathcal{G}_1$ and $\mathcal{G}_2$ is the number of arrow flips, deletions, and insertions that are required to transform $\mathcal{G}_1$ into $\mathcal{G}_2$. We note that an estimated variable ordering consists of a fully connected graph $\mathcal{G}^{\hat{\pi}}$. It is thus not appropriate to use the SHD to compare $\mathcal{G}^{\hat{\pi}}$ with the true underlying graph $\mathcal{G}_0$. Instead, we shall only use the SHD when comparing the pruned estimate of $\mathcal{G}^{\hat{\pi}}$ with the true underlying $\mathcal{G}_0$. We let $\{X_1, X_2, \ldots, X_{p-1}, X_p\}$ be the true variable ordering.

Data generation. When generating a graph consistent with this ordering, there is a total of $p(p-1)/2$ possible edges that can be included. We include each of these edges with some probability $\rho \in [0, 1]$, which results in a graph with an expected number of included edges equal to $\rho \, p(p-1)/2$. After generating the graph, the non-linear functions associated with the included edges are generated as realizations of Gaussian processes with RBF kernels [30] with lengthscale equal to 0.4. These processes are sampled using the Python module scikit-learn from the study of Pedregosa et al. [31]. Furthermore, the generated noise variables are simulated as Gaussian random variables with variances equal to the outcome of a uniform random variable on the interval [0.4, 0.8]. This simulation set-up is similar to that of Rolland et al. [18].

We also include results for the case where the noise is heterogeneous. In this case, we simulate data according to the model in (21). As in the homogeneous variance case, the functions $g_{j,k}$ are realizations of Gaussian processes. The variance functions $\sigma_j(x)$ are given in an additive form, $\sigma_j(x) = 1 + \sum_{k \in \mathrm{pa}(j)} U_{k,j} \, \sigma_{j,k}(x_k)$, where the $U_{k,j}$ are iid Bernoulli random variables with $P(U_{k,j} = 1) = 0.3$ and each component function is given by $\sigma_{j,k}(u) = (1 + e^{u+2})^{-1} - (1 + e^{u-2})^{-1} + 1$. The resulting data exhibit heterogeneous noise in the sense that points lying far from the origin are associated with higher noise. The presence of the Bernoulli variables $U_{k,j}$ ensures that not all edges are associated with heterogeneous variance.
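The following sketch is our own reconstruction of the described heterogeneous-variance design and not the authors' simulation code; the random seeds, the way GP draws are evaluated at the observed parent values, and the return of a single dataset are our assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(1)
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.4))   # prior with lengthscale 0.4

def sigma_component(u):
    """Heteroscedastic component: small near the origin, close to 1 far from it."""
    return 1.0 / (1.0 + np.exp(u + 2.0)) - 1.0 / (1.0 + np.exp(u - 2.0)) + 1.0

def generate_dataset(n=100, p=10, rho=None):
    rho = 2.0 / (p - 1) if rho is None else rho
    # Random DAG consistent with the ordering X_1, ..., X_p.
    edges = [(k, j) for j in range(p) for k in range(j) if rng.uniform() < rho]
    X = np.zeros((n, p))
    for j in range(p):
        parent_idx = [k for (k, jj) in edges if jj == j]
        mean = np.zeros(n)
        sigma = np.ones(n)
        for k in parent_idx:
            # Edge function g_{j,k}: one draw from the GP prior, evaluated at the parent values.
            g_vals = gp.sample_y(X[:, [k]], n_samples=1,
                                 random_state=int(rng.integers(1_000_000))).ravel()
            mean += g_vals
            if rng.uniform() < 0.3:               # Bernoulli U_{k,j}: heteroscedastic edge
                sigma += sigma_component(X[:, k])
        X[:, j] = mean + sigma * rng.normal(0.0, 1.0, n)
    return X, edges
```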

Estimation of variable ordering and underlying graphs. For the estimation of the variable ordering by the estimator defined in (20), we need to cycle through all $p!$ possible orderings in $\Pi$ to identify the ordering estimate. However, this quickly becomes computationally infeasible as $p$ increases. As in the study of Bühlmann et al. [2], we instead perform a greedy search, sketched below. Starting with an empty DAG, we iteratively add the edge that yields the biggest increase in log-likelihood. Once an edge has been added, we update the set of possible edges that may be added in the next step. This update step is performed in order to ensure that the graph estimate remains acyclic.
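The greedy search can be outlined as follows (our own sketch; `score_gain(edges, edge, data)` is a hypothetical callback returning the log-likelihood gain from adding `edge` to the current edge set, e.g. the resulting decrease in $\sum_j \log\hat\sigma_j^2$):

```python
def creates_cycle(edges, new_edge):
    """True if adding new_edge = (i, j) would close a directed cycle,
    i.e. if i is already reachable from j along existing edges."""
    i, j = new_edge
    stack, seen = [j], set()
    while stack:
        v = stack.pop()
        if v == i:
            return True
        if v in seen:
            continue
        seen.add(v)
        stack.extend(w for (u, w) in edges if u == v)
    return False

def greedy_dag_search(nodes, data, score_gain, max_edges=None):
    """Start from the empty DAG and repeatedly add the admissible edge with the
    largest positive log-likelihood gain, stopping when no gain remains."""
    edges = set()
    while max_edges is None or len(edges) < max_edges:
        candidates = [(i, j) for i in nodes for j in nodes
                      if i != j and (i, j) not in edges
                      and not creates_cycle(edges, (i, j))]
        if not candidates:
            break
        gains = {e: score_gain(edges, e, data) for e in candidates}
        best = max(gains, key=gains.get)
        if gains[best] <= 0:
            break
        edges.add(best)
    return edges
```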

For the implementation of the SBF optimization loop, we transform the data such that each variable lies within the unit interval. The fitted functions are then scaled back to the original position after estimation is done. When performing variable selection for the graph pruning step (see the last part of Section 2.3) we apply the functional lasso-method described in Lee et al. [32] which is a generalization of the lasso method of Tibshirani [33] to the SBF setting. Finally, for the SBF method the Epanechnikov kernel is used as the baseline kernel function and the bandwidth parameter is chosen according to Silverman’s rule of thumb in each dimension, see Shimizu et al. [34].

Other methods in comparison. We compare our method with the following methods: CAM in Bühlmann et al. [2], SCORE in Rolland et al. [18], LiNGAM in the study of Shimizu et al. [35], RESIT in the study of Peters et al. [1], and GES in the study of Chickering [16]. Note that all these methods except the GES consist of first establishing an estimate of a true variable ordering and then performing variable selection. The GES does not perform the estimation of a variable ordering, but instead directly estimates the Markov equivalence class of the underlying DAG.

The CAM is fitted in R using the CAM-library provided by the authors. The SCORE is fitted in Python using the code made publicly available on the authors’ Github page. The LiNGAM and RESIT are fitted using the Causal Discovery library provided by Ikeuchi et al. [36]. The method of GES is fitted using the ges package in Python.

We note that for variable selection, the CAM and SCORE both rely on the gam-function from the mgcv-library in R, see Wood [37]. We remove the edges that correspond to p -values larger than α = 0.01 . This choice is made since it showed better performance in practice compared to the default setting of α = 0.001 . For the penalization-parameter in the functional lasso method we set λ = 0.5 , as this generally showed good performance in various settings. The GES method does not provide a single DAG estimate but instead a Markov equivalence class of DAG estimates. In order to obtain a single DAG estimate from the equivalence class we orient the undirected edges so that they are consistent with the true ordering.

We note that comparing the implied ordering estimates of the GES with the other methods in terms of the SID to the true underlying DAG can be slightly misleading. This is because the estimates by the GES are already pruned, and thus, we expect these estimates to be further from the true orderings than the ordering estimates provided by the other methods.

5.2 Recovery of the variable ordering

We set the probability $\rho$ of including each of the $p(p-1)/2$ possible edges equal to $\rho = 2/(p-1)$. This results in a graph with the expected number of included edges equal to $p$. In this sparse setting, the results of fitting the methods outlined in the previous subsection can be seen in Table 1, where the values of $\mathrm{SID}(\mathcal{G}_0, \mathcal{G}^{\hat{\pi}})$ are reported.

Table 1

SID from the true ordering to the estimated ordering when the underlying graph is sparse

Homogeneous variance Heterogeneous variance
Method n = 50 n = 100 n = 500 n = 50 n = 100 n = 500
SBF (our method) 3.2  ±  0.7 0.4  ±  0.2 0.0  ±  0.0 5.5  ±  1.3 0.9 ± 0.2 0.0  ±  0.0
CAM 4.9  ±  1.2 0.4  ±  0.1 0.0  ±  0.0 7.1  ±  1.2 0.8  ±  0.2 0.1  ±  0.0
SCORE 6.7  ±  0.9 3.5  ±  0.5 1.4  ±  0.1 5.9  ±  1.1 3.0  ±  0.4 1.2  ±  0.1
LiNGAM 31.1  ±  2.4 33.1  ±  1.4 36.0  ±  0.6 34.5  ±  2.5 34.3  ±  1.4 34.1  ±  0.6
RESIT 20.6  ±  2.1 17.8  ±  1.3 11.1  ±  0.5 27.4  ±  2.2 27.6  ±  1.6 29.1  ±  0.7
GES* 29.2  ±  2.9 25.5  ±  1.3 22.3  ±  0.6 31.5  ±  2.4 29.1  ±  1.6 25.6  ±  0.7

The smallest values among different methods are bold faced.

We see that the SBF approach performs well in establishing a correct variable ordering in this setting. Only in the case with heterogeneous variance and sample size $n = 100$ does the CAM method beat the SBF approach. As noted in the previous subsection, we expect the ordering estimates associated with the GES method to be further from the ground truth than those of the other methods. This is also shown in Table 1, where we have marked this method with an asterisk (*) to indicate this limitation. In addition to the average distances, we have included in the table the standard errors of the estimates.

We note that when the sample size $n$ is equal to 50, the SBF method performs well in both the homogeneous and heterogeneous settings, beating the other models by a large margin. When the sample size increases, both the SBF and CAM methods perform exceptionally well in recovering the ordering, both with an SID below 1. We also glean from Table 1 that the task is harder in the heterogeneous variance cases.

5.3 Recovery of the underlying graph

In this subsection, we set out to recover the underlying graph completely rather than only estimating the variable ordering. To encompass different settings, we report results both for the setting where the underlying graph is sparse and for the setting where it is dense. In the sparse setting, we set the probability of including each of the $p(p-1)/2$ possible edges equal to $\rho = 2/(p-1)$. In the dense setting, we set this probability equal to $\rho = 6/(p-1)$. The two settings result in graphs with expected numbers of edges equal to $p$ and $3p$, respectively.

The results can be seen in Table 2 for the sparse graph setting, and in Table 3 for the dense setting. The SBF method clearly demonstrates good performance throughout the various settings, where it shows large reductions in the distance to the true underlying graph with respect to both the SID and SHD-measure. We glean from Tables 2 and 3 that it is generally harder to estimate the graph in the dense setting, as the graph estimates in general have greater distances to the true underlying graph in comparison to the sparse setting. Only in the sparse setting with homogeneous variance and sample size n = 100 does the CAM method beat the SBF with respect to the SID measure.

Table 2

SID and SHD from the true underlying graph to the estimated graph in the sparse setting

Homogeneous variance Heterogeneous variance
SID SHD SID SHD
Method n = 50 n = 100 n = 50 n = 100 n = 50 n = 100 n = 50 n = 100
SBF (our method) 6.6 ± 0.9 6.5  ±  0.6 4.1 ± 0.3 2.1 ± 0.1 6.5  ±  0.8 5.4  ±  0.5 5.3  ±  0.4 2.3  ±  0.2
CAM 16.9  ±  1.4 6.0  ±  0.5 6.6  ±  0.4 2.7  ±  0.2 18.8  ±  1.5 7.2  ±  0.6 6.8  ±  0.4 2.9  ±  0.2
SCORE 19.1  ±  1.5 10.8  ±  0.9 7.3  ±  0.4 3.9  ±  0.2 19.9  ±  1.6 11.2  ±  1.0 7.3  ±  0.4 4.1  ±  0.2
LiNGAM 30.6  ±  2.1 31.7  ±  1.5 10.0  ±  0.4 10.1  ±  0.3 29.4  ±  2.0 30.3  ±  1.3 9.8  ±  0.4 10.0  ±  0.2
RESIT 28.6  ±  2.0 31.1  ±  1.4 16.3  ±  0.4 17.6  ±  0.3 28.3  ±  1.9 29.6  ±  1.2 16.3  ±  0.4 17.0  ±  0.3
GES 27.7  ±  2.3 25.8  ±  1.6 12.2  ±  0.5 10.7  ±  0.4 25.7  ±  2.0 25.5  ±  1.4 11.3  ±  0.5 10.3  ±  0.4

The smallest values among different methods are bold faced.

Table 3

SID and SHD from the true underlying graph to the estimated graph in the dense setting

Homogeneous variance Heterogeneous variance
SID SHD SID SHD
Method n = 50 n = 100 n = 50 n = 100 n = 50 n = 100 n = 50 n = 100
SBF (our method) 28. ±  1.7 19.1  ±  0.9 12.6  ±  0.6 6.6  ±  0.3 31.0  ±  1.7 20.0  ±  0.9 16.0  ±  0.6 8.5  ±  0.3
CAM 55.8  ±  1.4 30.1  ±  0.9 23.8  ±  0.6 12.8  ±  0.4 59.6  ±  1.5 39.2  ±  0.9 25.2  ±  0.6 17.0  ±  0.4
SCORE 56.7  ±  1.2 35.8  ±  0.8 24.2  ±  0.5 14.8  ±  0.4 58.9  ±  1.4 41.0  ±  0.8 25.1  ±  0.5 17.5  ±  0.4
LiNGAM 74.3  ±  0.9 74.5  ±  0.7 30.4  ±  0.4 29.9  ±  0.3 75.2  ±  1.0 75.8  ±  0.6 30.0  ±  0.4 30.4  ±  0.3
RESIT 72.9  ±  1.1 74.1  ±  0.7 31.9  ±  0.6 31.6  ±  0.4 72.7  ±  1.1 74.6  ±  0.7 30.6  ±  0.5 31.9  ±  0.4
GES 72.6  ±  1.3 73.2  ±  0.9 31.1  ±  0.7 31.1  ±  0.5 73.8  ±  1.2 73.1  ±  0.9 30.7  ±  0.6 30.6  ±  0.5

The smallest values among different methods are bold faced.

5.4 Some remarks

The simulation results show that the SBF method has the best performance among all, followed by the CAM and SCORE methods. We note that the LiNGAM and GES methods do not accommodate the nonlinearity of the structural equations, and a natural consequence is that these two methods perform worse than the first three. The RESIT method uses independence testing when identifying the variable ordering. It is seen in the study of Bühlmann et al. [2] that independence testing is inferior to the general greedy ordering search framework that is used in the CAM and in our method. The SCORE algorithm is developed with the intention of optimizing computational speed, and thus it may not beat the existing methods in terms of accuracy. In comparison with the CAM method, our approach of using the functional lasso in the variable selection step, in particular, turns out to be beneficial. This can be seen by comparing Table 1 with Table 2. In Table 1, where we only estimate the ordering and do not perform graph pruning, we see that the CAM and SBF estimates exhibit a similar performance, although the SBF method has a slight edge in the $n = 50$ case. However, in Table 2, where we also perform graph pruning, we find that the SBF method is superior to the CAM by a larger margin.

6 Causal pathways for prostate cancer

Identifying causal genes in disease development is crucial for understanding the mechanisms that drive pathological changes. In particular, this may enable early diagnosis, disease prevention, and development of specific gene therapies. It is an active area of research, see, e.g., [38–40].

In the study of Wang et al. [41], the gene expression levels of more than 13,000 mRNA genes were sampled across 125 patients with varying levels of prostate cancer. They identified 20 hub genes that were highly interconnected with other genes. These hub genes were suspected to form a causal pathway for the development of prostate cancer.

The dataset with ID GSE20161 was obtained from NCBI’s Gene Expression Omnibus database. The data were extracted from the database in R using the GEOquery package by Davis and Meltzer [42] from the Bioconductor website. The GEO platform annotation with ID GPL6102 was used to translate the probe IDs to gene names.

The specific preprocessing of the data is described in the study of Wang et al. [41]. We standardize the data to have mean zero and variance one. Further, in the SBF estimation scheme, the data are transformed so that the gene expression levels lie within the unit interval. The fitted functions and data are then scaled back to the standardized scale after estimation is done.
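As an illustration of this preprocessing, the sketch below standardizes each gene expression column, maps it into the unit interval with a min–max transform for the SBF fit, and provides the inverse map used to return results to the standardized scale. The function names are ours, and the exact transform in our code may differ in detail.

```python
import numpy as np

def standardize(X):
    """Center each column to mean zero and scale to variance one."""
    mu, sd = X.mean(axis=0), X.std(axis=0)
    return (X - mu) / sd, mu, sd

def to_unit_interval(Z):
    """Map each standardized column into [0, 1] via a min-max transform."""
    lo, hi = Z.min(axis=0), Z.max(axis=0)
    return (Z - lo) / (hi - lo), lo, hi

def back_to_standardized(U, lo, hi):
    """Undo the min-max transform, returning values to the standardized scale."""
    return U * (hi - lo) + lo
```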

Table 4 shows the estimated variable ordering before variable selection is performed. This ordering estimate suggests that the CDC2 gene plays the role of the root node, with none of the other hub genes affecting its value. In contrast, CDCA2 is estimated to be the last gene in the ordering, which suggests that performing an intervention on this variable has no effect on the other variables under observation.

Table 4

Estimated variable ordering by the SBF method before performing variable selection

Order Gene Order Gene Order Gene Order Gene
1 CDC2 6 NCAPG 11 ATAD2 16 TOP2A
2 CCNA2 7 TTK 12 MELK 17 HJURP
3 CDCA5 8 CCNB2 13 PLK4 18 Ska1
4 ORC1L 9 CEP55 14 DTL 19 AURKA
5 KIF15 10 TPX2 15 KIF23 20 CDCA2

These are the hub genes suspected to form a causal pathway for the development of prostate cancer. The CDC2 hub gene is seen to constitute a root node, and according to this estimate, modifying CDC2 would affect the other genes downstream.

In addition to estimating the variable ordering, we also perform variable selection to obtain a final graph estimate G ˆ of the causal mechanism that drives the development of prostate cancer. As in Section 5.3, variable selection is performed with the SBF extension of the Lasso-method.

After performing variable selection, only 32 edges remained out of the 190 edges in the estimated fully connected DAG. In the following, we develop a measure to gauge the causal strength of each of the 32 included edges. In Figure 2, the final graph estimate G ˆ is displayed, where we have only included the 20 edges with the highest causal strengths out of the 32 edges.

Figure 2

Estimated causal graph G ˆ of the hub-genes after variable selection. To avoid clutter, we have only included the 20 edges associated with the largest causal strength.

In order to assess the strength of each of the 32 included edges in $\hat{\mathcal{G}}$, we define the function $S : \mathcal{D} \to \mathbb{R}$ over the set $\mathcal{D}$ of all DAGs, which assigns to a DAG $\mathcal{G}$ the score $S(\mathcal{G}) = -\sum_{j=1}^{p} \log \hat{\sigma}_{j,C}^{2,\mathcal{G}}$, where $\hat{\sigma}_{j,C}^{2,\mathcal{G}}$ is given by (19) for an ordering $\pi$ such that $\mathcal{G} \subseteq \mathcal{G}^{\pi}$, with $g_j^{\pi}$ and $\tilde{X}_{\mathrm{pa}(j)}^{\pi}$ restricted to include only the parents of $\tilde{X}_j^{\pi}$ in $\mathcal{G}$. We then calculate the drop in score incurred by removing the edge $e = (i,j)$ from the estimated graph $\hat{\mathcal{G}}$, divided by the drop in score incurred by removing all the edges. We denote this relative drop in score by $\hat{L}_e$. Letting $\mathcal{G}_{\varnothing}$ denote the graph with no edges, and letting $\hat{\mathcal{G}}_{-e}$ denote $\hat{\mathcal{G}}$ with the edge $e$ removed, we have

(25) $\hat{L}_e \equiv \dfrac{S(\hat{\mathcal{G}}) - S(\hat{\mathcal{G}}_{-e})}{S(\hat{\mathcal{G}}) - S(\mathcal{G}_{\varnothing})}.$

Given that our estimate $\hat{\mathcal{G}}$ is correct, i.e., $\hat{\mathcal{G}} = \mathcal{G}_0$, a large $\hat{L}_e$ for an edge $e = (i,j)$ indicates that the variable $X_i$ explains a large amount of the variability in $X_j$ relative to other causal relationships. We can therefore think of edges $e$ with large $\hat{L}_e$ as having greater causal strength than edges for which $\hat{L}_e$ is small. However, as $\hat{\mathcal{G}}$ is only an estimate of $\mathcal{G}_0$, and the data themselves have been used in the estimation of $\hat{\mathcal{G}}$, care needs to be taken in the interpretation of the numbers $\hat{L}_e$.
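A minimal sketch of how the edge strengths in (25) can be computed is given below. The score function follows our reading of $S(\mathcal{G}) = -\sum_j \log \hat{\sigma}_{j,C}^{2,\mathcal{G}}$ (the ratio in (25) is unchanged if the sign convention differs), and residual_var is a placeholder for the SBF-based estimate in (19).

```python
import numpy as np

def score(graph, residual_var):
    """S(G) = -sum_j log residual variance of node j given its parents in G.
    `graph` maps each node to its parent set; `residual_var(j, parents)` is a
    placeholder for the estimate in (19)."""
    return -sum(np.log(residual_var(j, parents)) for j, parents in graph.items())

def edge_strength(graph, edge, residual_var):
    """Relative score drop (25) from removing `edge` = (i, j) from `graph`."""
    i, j = edge
    reduced = {k: (ps - {i} if k == j else ps) for k, ps in graph.items()}
    empty = {k: set() for k in graph}
    s_full, s_red, s_empty = (score(g, residual_var) for g in (graph, reduced, empty))
    return (s_full - s_red) / (s_full - s_empty)
```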

In Table 5, the $\hat{L}_e$ scores are listed for the 20 edges shown in Figure 2. It is seen that the edges CCNA2 → CDCA5 and CDC2 → CCNA2 carry the largest causal strengths, with a large drop-off in the $\hat{L}_e$ score to the remaining edges. We also note that the node CCNA2 affects a large number of the other hub genes, with five of its edges ranking among the top 20 in terms of causal strength as measured by $\hat{L}_e$. Meanwhile, the only hub gene affecting CCNA2 is the root node CDC2.

Table 5

Strength of the 20 edges associated with the largest causal strengths from the graph G ˆ measured in terms of the scores L ˆ e defined in equation (25)

Parent Child L ˆ e Parent Child L ˆ e Parent Child L ˆ e Parent Child L ˆ e
CCNA2 CDCA5 3.96 CCNA2 MELK 1.05 CEP55 MELK 0.80 CCNB2 CEP55 0.73
CDC2 CCNA2 3.21 CCNA2 DTL 1.02 TTK CCNB2 0.77 CCNA2 TTK 0.72
KIF15 NCAPG 1.50 ATAD2 MELK 0.94 HJURP CDCA2 0.76 KIF15 CEP55 0.72
ORC1L KIF15 1.23 CEP55 TOP2A 0.93 CCNA2 HJURP 0.74 TOP2A AURKA 0.71
CDCA5 ORC1L 1.07 AURKA CDCA2 0.83 ORC1L HJURP 0.74 CCNB2 AURKA 0.69

Finally, we also test the robustness with respect to the choice of the underlying compact subset $C$ on which the conditional expectations are estimated. To facilitate this analysis, we limit ourselves to the relationship between the genes CDC2 and CCNA2, which constitute the first and second variables in the ordering estimate of Table 4. For $q \in [0,1]$, we let $C_q = C_q^1 \times C_q^2 \subset \mathbb{R}^2$, where $C_q^j$ denotes the $q$-th interquantile range of the observations $X_j^i$, $1 \le i \le n$. In words, if $q = 0.5$ and $Q_{0.25}$ and $Q_{0.75}$ denote the 0.25 and 0.75 quantiles of the observations $X_j^i$, then $C_q^j = C_{0.5}^j = [Q_{0.25}, Q_{0.75}]$.

We then investigate how well we recover the relationship $\pi_0$: CDC2 → CCNA2 found in $\hat{\mathcal{G}}$ over the reverse relationship $\pi_1$: CCNA2 → CDC2 for varying choices of $q \in [0,1]$. To investigate this, we construct the quantity $\hat{L}_2(q)$ given by

(26) $\hat{L}_2(q) = \sum_{j=1}^{2} \log \hat{\sigma}_{j,C_q}^{2,\pi_1} - \sum_{j=1}^{2} \log \hat{\sigma}_{j,C_q}^{2,\pi_0}.$

Note that $\hat{L}_2(q)$ reflects the score gap between the orderings $\pi_0$ and $\pi_1$. For this two-variable subset analysis to be consistent with $\hat{\mathcal{G}}$ for all choices of $q$, we require $\hat{L}_2(q)$ to be positive, since only in this case do we recover the ordering $\pi_0$ over the ordering $\pi_1$. However, in Table 6, we see that we only recover the ordering $\pi_0$ for $q \in \{0.8, 0.9, 1.0\}$. This underscores the need for care in concluding that $\hat{\mathcal{G}}$ is the true underlying graph $\mathcal{G}_0$, since the gap $\hat{L}_2(q)$ is positive only for larger values of $q$.
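The following sketch illustrates this robustness check: it forms the $q$-th interquantile ranges $C_q^j$ from the data and evaluates the score gap (26) for a given $q$. The residual_var argument is again a placeholder for the SBF-based variance estimate restricted to $C_q$; the sign convention follows (26).

```python
import numpy as np

def interquantile_range(x, q):
    """C_q^j: the interval between the (1-q)/2 and (1+q)/2 sample quantiles of x."""
    lo, hi = np.quantile(x, [(1 - q) / 2, (1 + q) / 2])
    return lo, hi

def score_gap(x1, x2, q, residual_var):
    """L_2(q) in (26): score of the ordering X2 -> X1 minus that of X1 -> X2,
    computed on the subsample falling inside C_q = C_q^1 x C_q^2."""
    (l1, h1), (l2, h2) = interquantile_range(x1, q), interquantile_range(x2, q)
    keep = (x1 >= l1) & (x1 <= h1) & (x2 >= l2) & (x2 <= h2)
    u1, u2 = x1[keep], x2[keep]
    score_pi0 = np.log(np.var(u1)) + np.log(residual_var(u2, u1))  # CDC2 -> CCNA2
    score_pi1 = np.log(np.var(u2)) + np.log(residual_var(u1, u2))  # CCNA2 -> CDC2
    return score_pi1 - score_pi0
```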

Table 6

Consistency of the choice of compact C in terms of the gap L ˆ 2 ( q ) defined in (26)

q 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
L̂₂(q) −0.4199 −0.3892 −0.3443 −0.2504 −0.2344 −0.1342 −0.0704 0.0098 0.0208 0.0025

7 Conclusion

We applied the nonparametric SBF approach to fitting non-linear SEMs in order to obtain an estimate of the variable ordering. Compared with existing estimators, the proposed estimator possesses strong theoretical guarantees of convergence toward the true variable ordering associated with the data. Furthermore, it shows strong performance in a simulation setup against the state-of-the-art methods in the field of causal discovery. We also extended the model and estimator to the case where the noise is heterogeneous.

Due to the generality of the SBF estimator, it would be interesting to extend the framework to cases where the observed variables are non-Euclidean. In particular, the setting where the variables are Hilbert-space valued has direct applications to the field of neuroscience. Here, the neuron activity is frequently recorded over time, resulting in function-valued data. Such an extension could provide significant insights into brain function and neural interactions.

The setting where p is large involves a heavy computational burden, as mentioned at the beginning of Section 5. An ad hoc procedure that has shown success in practice is to perform a preliminary neighborhood selection [2]. This strategy consists of first screening candidate parents for each variable: one regresses each variable on all the others and applies a variable selection method (e.g., the Lasso) to retain a small candidate parent set for every node. An ordering search is then run only over the candidate parents. Because the strategy leaves far fewer potential parents for each node, the search becomes much faster. In general, dealing with the large p case is challenging. Solving the computational challenges associated with large p, together with developing relevant theory for diverging p, could be an interesting direction for future study.
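A minimal sketch of such a preliminary neighborhood selection, using the Lasso from scikit-learn [31] to screen candidate parents for each node, might look as follows; the cross-validated penalty and the cap on the number of candidates are illustrative choices of ours, not those of [2].

```python
import numpy as np
from sklearn.linear_model import LassoCV

def candidate_parents(X, max_parents=10):
    """Screen candidate parents: regress each variable on all others with the
    Lasso and keep the predictors with the largest nonzero coefficients."""
    n, p = X.shape
    parents = {}
    for j in range(p):
        others = [k for k in range(p) if k != j]
        fit = LassoCV(cv=5).fit(X[:, others], X[:, j])
        coef = np.abs(fit.coef_)
        ranked = [others[i] for i in np.argsort(coef)[::-1] if coef[i] > 0]
        parents[j] = ranked[:max_parents]
    return parents
```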

  1. Funding information: This research was supported by the National Research Foundation (NRF) grant funded by the Korea government (MSIP) (No. RS-2024-00338150).

  2. Author contributions: Both authors have accepted responsibility for the entire content of this manuscript and consented to its submission to the journal, reviewed all the results, and approved the final version of the manuscript. The study and manuscript have received significant contributions from both authors.

  3. Conflict of interest: The authors state no conflict of interest.

  4. Data availability statement: Empirical results of the current study can be generated using the code from https://github.com/AsgerMorville/sbf-causal-discovery/. The real world dataset was obtained from NCBI’s Gene Expression Omnibus database with ID GSE20161.

Appendix

A.1 Proof of Proposition 1

For $1 \le k \le j-1$, we claim that if $X_{\pi(k)}$ is not a parent of $X_{\pi(j)}$ in $\mathcal{G}_0$, then $X_{\pi(k)}$ is conditionally independent of $X_{\pi(j)}$ given the parents of $X_{\pi(j)}$ in $\mathcal{G}_0$. To see why the claim implies the proposition, observe that the claim entails

$E_{C_0}[X_{\pi(j)} \mid (X_{\pi(1)}, \ldots, X_{\pi(j-1)})] = E_{C_0}[X_{\pi(j)} \mid X_{\mathrm{pa}(\pi(j))}] \quad \text{a.s.}$

By this with (2) and (6), we obtain that g j π can be picked such that

(A1) g j π ( ) = k : π ( k ) pa ( π ( j ) ) g π ( j ) , π ( k ) ( ) .

Thus,

σ j 2 , π = E C 0 [ ( X π ( j ) g j π ( X π ( 1 ) , , X π ( j 1 ) ) ) 2 ] = E C 0 [ ( X π ( j ) k : π ( k ) pa ( π ( j ) ) g π ( j ) , π ( k ) ( X π ( k ) ) ) 2 ] = σ π ( j ) 2 ,

which shows that ε j π has the same distribution as ε π ( j ) . In addition, (A1) also shows that

X j π = k : π ( k ) pa ( π ( j ) ) g π ( j ) , π ( k ) ( X k π ) + ε j π , j = 1 , , p .

This gives the proposition. To prove the claim, suppose that it is false, i.e., there exists $X_{\pi(k)}$ with $1 \le k \le j-1$ such that $X_{\pi(k)}$ is conditionally dependent on $X_{\pi(j)}$ given $X_{\mathrm{pa}(\pi(j))}$ while it is not a parent of $X_{\pi(j)}$ in $\mathcal{G}_0$. Then, $X_{\pi(k)}$ has to be a descendant of $X_{\pi(j)}$ in $\mathcal{G}_0$, and thus there would exist a directed path from $X_{\pi(j)}$ to $X_{\pi(k)}$ in $\mathcal{G}_0$. As $\mathcal{G}_0 \subseteq \mathcal{G}^{\pi}$, the same path from $X_{\pi(j)}$ to $X_{\pi(k)}$ also exists in $\mathcal{G}^{\pi}$. Since $X_{\pi(k)} \in (X_{\pi(1)}, \ldots, X_{\pi(j-1)})$, i.e., $X_{\pi(k)}$ is a parent of $X_{\pi(j)}$ in $\mathcal{G}^{\pi}$, we would have a cycle in $\mathcal{G}^{\pi}$, which is a contradiction.

A.2 Proof of Proposition 2

Fix π Π 0 . Note first that by (2),

(A2) X π ( ) = k : π ( k ) pa ( π ( ) ) g π ( ) , π ( k ) ( X π ( k ) ) + ε π ( ) , ε π ( j ) N ( 0 , σ π ( ) 2 ) ,

for all { 1 , , p } . Similarly, by Proposition 1, we have

(A3) X π = k : π ( k ) pa ( π ( ) ) g π ( ) , π ( k ) ( X k π ) + ε π , ε π N ( 0 , σ π ( ) 2 ) , = 1 , , p .

For Z = ( Z 1 , , Z p ) , let π ( Z ) ( Z π ( 1 ) , , Z π ( p ) ) . According to this notation, the probability measure associated with ( X π ( 1 ) , , X π ( p ) ) can be expressed as P C 0 π 1 and the one associated with ( X 1 π , , X p π ) as P π π 1 . We prove P π π 1 = P C 0 π 1 , which implies the proposition. The asserted identity follows from (A2) and (A3). Indeed,

d ( P π π 1 ) ( y 1 , , y p ) = j = 1 p 1 2 π σ π ( j ) exp 1 2 σ π ( j ) 2 y j k : π ( k ) pa ( π ( j ) ) g π ( j ) , π ( k ) ( y k ) 2 d y j = d ( P C 0 π 1 ) ( y 1 , , y p ) ,

where the first equality follows from (A3) and the second from (A2).

A.3 Proof of Proposition 3

The density p X of the variables ( X 1 , , X p ) is given by

p X ( x 1 , , x p ) = j = 1 p 1 2 π σ j exp 1 2 σ j 2 ( x j g j ( x pa ( j ) ) ) 2 .

Here, we have defined g j by

g j ( x pa ( j ) ) m pa ( j ) g j , m ( x m ) .

Since the joint density of the variables ( X 1 , , X p ) is strictly positive, there exists a version of g j π such that

(A4) g j π ( x π ( 1 ) , , x π ( j 1 ) ) E C 0 [ X π ( j ) ( X π ( 1 ) , , X π ( j 1 ) ) = ( x π ( 1 ) , , x π ( j 1 ) ) ] = x π ( j ) p j π ( x π ( 1 ) , , x π ( j ) ) d x π ( j ) p j π ( x π ( 1 ) , , x π ( j ) ) d x π ( j )

for every $(x_{\pi(1)}, \ldots, x_{\pi(j-1)}) \in \mathbb{R}^{j-1}$. Here, $p_j^{\pi} : \mathbb{R}^j \to \mathbb{R}$ denotes the marginal density of the variables $(X_{\pi(1)}, \ldots, X_{\pi(j)})$, i.e., the density obtained by marginalizing over the variables $(X_{\pi(j+1)}, \ldots, X_{\pi(p)})$. To obtain our result, we thus need to show that both the numerator and the denominator in the expression for $g_j^{\pi}$ are two times differentiable in each argument. We show that

( x π ( 1 ) , , x π ( j 1 ) ) p j π ( x π ( 1 ) , , x π ( j ) ) d x π ( j )

is two times differentiable in each argument, and note that the same property can be shown for ( x 1 π , , x j 1 π ) x π ( j ) p j π ( x π ( 1 ) , , x π ( j ) ) d x π ( j ) with an identical technique. We note that

p π ( j ) ( x π ( 1 ) , , x π ( j ) ) d x π ( j ) = p X ( x π ( 1 ) , , x π ( p ) ) d ( x π ( j ) , , x π ( p ) ) ,

and it is therefore enough to show that for any choice of marginalizing covariates ( x 1 , , x d ) , the function

x k p X ( x 1 , , x p ) d ( x 1 , , x d ) , k { 1 , , d }

is twice differentiable. This, in turn, holds by Assumption 3.

A.4 Proof of Proposition 4

Fix a variable ordering $\pi \notin \Pi_0$. Recall that the structural functions

$g_j^{\pi} : \mathbb{R}^{j-1} \to \mathbb{R}$

that relate $X_j^{\pi}$ with its parents $X_{\mathrm{pa}(j)}^{\pi} = (X_1^{\pi}, \ldots, X_{j-1}^{\pi})$ in the SEM $C^{\pi}$ can be chosen to be two times differentiable in each component. We may trim the fully connected graph $\mathcal{G}^{\pi}$ as follows. For each $j \in \{1, \ldots, p\}$, if $g_j^{\pi}$ is constant as a function of component $k \in \{1, \ldots, j-1\}$, then delete the edge $X_{\pi(k)} \to X_{\pi(j)}$. We call the trimmed graph $\mathcal{G}_1$, and we thus have $\mathcal{G}_1 \subseteq \mathcal{G}^{\pi}$. The result now follows from Lemma A1. To see why, note that $\mathcal{G}_1$ does not contain $\mathcal{G}_0$, since $\mathcal{G}_0 \not\subseteq \mathcal{G}^{\pi}$ as $\pi \notin \Pi_0$. This entails $P^{\pi} \ne P_{C_0}$ by Lemma A1, since $P^{\pi}$ can be expressed with respect to $\mathcal{G}_1$.

Lemma A1

(Slight modification of Corollary 31 in [1]) Assume that X = ( X 1 , , X p ) is generated by an SEM with an associated DAG G 0 and structural equations of the form

X j = g j ( X pa ( j ) ) + ε j , ε j N ( 0 , σ j 2 ) .

Assume that the functions g j are two times differentiable and non-linear in each argument, and denote the distribution of X by P X . Let now Y = ( Y 1 , , Y p ) be generated by an SEM with associated DAG G 1 such that G 1 G 0 and with the structural equations of the form

Y j = f j ( Y pa ( j ) ) + ε j , ε j N ( 0 , σ ˜ j 2 ) .

Assume that the functions $f_j$ are two times differentiable and non-constant in each argument. Then, $P_X \ne P_Y$.

A.5 Proof of Proposition 5

For any π Π , the probability measure P π is associated with a density of the form

p π ( x 1 π , , x p π ) = ϕ ( x 1 π ; σ 1 2 , π ) ϕ ( x 2 π g 2 π ( x 1 π ) ; σ 2 2 , π ) ϕ ( x p π g p π ( x 1 π , , x p 1 π ) ; σ p 2 , π ) ,

where ϕ ( x ; σ 2 ) denotes the normal density with mean zero and variance σ 2 evaluated at x . In addition, as P π = P C 0 when π Π 0 by Proposition 2, it holds that

(A5) K L ( P π 0 , P π 1 ) = E C 0 log p π 0 ( X π ( 1 ) , , X π ( p ) ) p π 1 ( X π ( 1 ) , , X π ( p ) ) = E C 0 [ log p π 0 ( X π ( 1 ) , , X π ( p ) ) ] E C 0 [ log p π 1 ( X π ( 1 ) , , X π ( p ) ) ]

when π 0 Π 0 . Now, by expression of p π ( x 1 π , , x p π ) above, we have the closed-form expressions for these expectations as

E C 0 [ log p π ( X π ( 1 ) , , X π ( p ) ) ] = j = 1 p 1 2 σ j 2 , π E C 0 [ ( X π ( j ) g j π ( X π ( 1 ) , , X π ( j 1 ) ) ) 2 ] 1 2 log 2 π 1 2 log σ j 2 , π = 1 2 j = 1 p log σ j 2 , π 1 2 + 1 2 log 2 π p .

By (A5), we thus have

K L ( P π 0 , P π 1 ) = 1 2 j = 1 p log σ j 2 , π 1 1 2 j = 1 p log σ j 2 , π 0 .

since the two constant terms cancel out.

A.6 Proof of Proposition 7

The density of ( X ˜ π ( 1 ) , , X ˜ π ( j 1 ) ) is continuous since the density of ( X π ( 1 ) , , X π ( j 1 ) ) is continuous by Assumption 3. It follows that Assumption (B1) holds since ( X ˜ π ( 1 ) , , X ˜ π ( j 1 ) ) is restricted to a compact subset C j π . Assumption (B2) holds due to Assumption 3, and Assumption (B3) holds due to Assumption 1 and the Gaussian error assumption.

A.7 Proof of Proposition 8

This follows directly from Propositions 7 and 6.

A.8 Proof of Theorem 1

Proof of Case 1: π ∈ Π0

Fix C = C 1 × × C p R p . Recall that we define C j π C π ( 1 ) × × C π ( j 1 ) . Next, we claim that σ j 2 , π = E C 0 [ ( X ˜ π ( j ) g j π ( X ˜ π ( 1 ) , , X ˜ π ( j 1 ) ) ) 2 ] . This holds since

g j π ( X π ( 1 ) , , X π ( j 1 ) ) = k : π ( k ) pa ( π ( j ) ) g π ( j ) , π ( k ) ( X π ( k ) ) , π Π 0

as proven in Proposition 1, and thus

σ j 2 , π = E C 0 [ ( X π ( j ) g j π ( X π ( 1 ) , , X π ( j 1 ) ) ) 2 ] = E C 0 [ ε π ( j ) 2 ] = E C 0 [ ε π ( j ) 2 ( X π ( 1 ) , , X π ( j 1 ) ) C j π ] = E C 0 [ ( X ˜ π ( j ) g j π ( X ˜ π ( 1 ) , , X ˜ π ( j 1 ) ) ) 2 ] ,

where for the third equality we have used that $(X_{\pi(1)}, \ldots, X_{\pi(j-1)})$ is independent of $\varepsilon_{\pi(j)}$ for $\pi \in \Pi_0$. Now, recall that the variance estimator $\hat{\sigma}_{j,C}^{2,\pi}$ is given by

σ ˆ j , C 2 , π = 1 I j π i I j π ( X ˜ π ( j ) i g ˆ j , C π ( X ˜ π ( 1 ) i , , X ˜ π ( j 1 ) i ) ) 2 .

By adding and subtracting g j π ( X ˜ π ( 1 ) i , , X ˜ π ( j 1 ) i ) inside the square and expanding, we obtain

σ ˆ j , C 2 , π = 1 I j π i I j π ( X ˜ π ( j ) i g j π ( X ˜ π ( 1 ) i , , X ˜ π ( j 1 ) i ) ) 2 + 1 I j π i I j π ( g j π ( X ˜ π ( 1 ) i , , X ˜ π ( j 1 ) i ) g ˆ j , C π ( X ˜ π ( 1 ) i , , X ˜ π ( j 1 ) i ) ) 2 + 2 I j π i I j π ( X ˜ π ( j ) i g j π ( X ˜ π ( 1 ) i , , X ˜ π ( j 1 ) i ) ) ( g j π ( X ˜ π ( 1 ) i , , X ˜ π ( j 1 ) i ) g ˆ j , C π ( X ˜ π ( 1 ) i , , X ˜ π ( j 1 ) i ) ) .

Note that by the law of large numbers, the first term on the right-hand side converges to its expectation,

1 I j π i I j π ( X ˜ π ( j ) i g j π ( X ˜ π ( 1 ) i , , X ˜ π ( j 1 ) i ) ) 2 P E C 0 [ ( X ˜ π ( j ) g j π ( X ˜ π ( 1 ) , , X ˜ π ( j 1 ) ) ) 2 ] = σ j 2 , π .

Both of the last two terms converge to zero, as by Proposition 6, we have

sup x C j π g ˆ j , C π ( x ) g j π ( x ) = o p ( 1 )

and thus,

0 1 I j π i I j π ( g j π ( X ˜ π ( 1 ) i , , X ˜ π ( j 1 ) i ) g ˆ j , C π ( X ˜ π ( 1 ) i , , X ˜ π ( j 1 ) i ) ) 2 ( sup x C j π g ˆ j , C π ( x ) g j π ( x ) ) 2 = o p ( 1 ) .

This gives that

σ ˆ j , C 2 , π P E C 0 [ ( X ˜ π ( j ) g j π ( X ˜ π ( 1 ) , , X ˜ π ( j 1 ) ) ) 2 ] = σ j 2 , π .

Proof of Case 2: π ∉ Π0

It is enough to show that, for any $\eta > 0$, there exists a non-trivial compact set $C^* \subset \mathbb{R}^p$ such that, for any compact set $C$ containing $C^*$, any $\pi \notin \Pi_0$ and any $j \in \{1, \ldots, p\}$, there exists a constant $b$ such that

$\hat{\sigma}_{j,C}^{2,\pi} \xrightarrow{P} b, \quad \text{where } b > \sigma_j^{2,\pi} - \eta.$

Fix η > 0 . Note that

(A6) σ j 2 , π = E C 0 [ ( X π ( j ) g j π ( X π ( 1 ) , , X π ( j 1 ) ) ) 2 1 ( ( X π ( 1 ) , , X π ( j 1 ) ) C j π ) ] + E C 0 [ ( X π ( j ) g j π ( X π ( 1 ) , , X π ( j 1 ) ) ) 2 1 ( ( X π ( 1 ) , , X π ( j 1 ) ) C j π ) ] .

By Hölder’s inequality, the second term on the right-hand side of (A6) is bounded by

( E C 0 [ ( X π ( j ) g j π ( X π ( 1 ) , , X π ( j 1 ) ) ) 4 ] ) 1 2 ( P C 0 ( X π ( 1 ) , , X π ( j 1 ) ) C j π ) 1 2 .

Using that the fourth moments are finite, i.e. Assumption 1, we thus obtain

E C 0 [ ( X π ( j ) g j π ( X π ( 1 ) , , X π ( j 1 ) ) ) 2 1 ( ( X π ( 1 ) , , X π ( j 1 ) ) C j π ) ] M ( P C 0 ( ( X π ( 1 ) , , X π ( j 1 ) ) C j π ) ) 1 2 .

For the first term on the right-hand side of (A6), note that by definition

E C 0 [ ( X π ( j ) g j π ( X π ( 1 ) , , X π ( j 1 ) ) ) 2 1 ( ( X π ( 1 ) , , X π ( j 1 ) ) C j π ) ] = E C 0 [ ( X ˜ π ( j ) g j π ( X ˜ π ( 1 ) , , X ˜ π ( j 1 ) ) ) 2 ] P C 0 ( ( X π ( 1 ) , , X π ( j 1 ) ) C j π ) .

Let now $C \subset \mathbb{R}^p$ be any compact set such that $M \, \big(P_{C_0}((X_{\pi(1)}, \ldots, X_{\pi(j-1)}) \notin C_j^{\pi})\big)^{1/2} < \eta$. For this choice, it holds that

σ j 2 , π P C 0 ( ( X π ( 1 ) , , X π ( j 1 ) ) C j π ) E C 0 [ ( X ˜ π ( j ) g j π ( X ˜ π ( 1 ) , , X ˜ π ( 1 ) ) ) 2 ] + η < E C 0 [ ( X ˜ π ( j ) g j π ( X ˜ π ( 1 ) , , X ˜ π ( j 1 ) ) ) 2 ] + η .

If we show that σ ˆ j , C 2 , π b for some positive constant b such that

(A7) b E C 0 [ ( X ˜ π ( j ) g j π ( X ˜ π ( 1 ) , , X ˜ π ( j 1 ) ) ) 2 ] ,

then we are done. Recall that σ ˆ j , C 2 , π is defined by

$\hat{\sigma}_{j,C}^{2,\pi} = \frac{1}{|I_j^{\pi}|} \sum_{i \in I_j^{\pi}} \big( \tilde{X}_{\pi(j)}^i - \hat{g}_{j,C}^{\pi}(\tilde{X}_{\pi(1)}^i, \ldots, \tilde{X}_{\pi(j-1)}^i) \big)^2$

and that g j , + π denotes the projection of g j π onto the additive subspace as defined in (18), i.e.

g j , + π = proj ( g j π ( p j , C π ) ) .

It then holds that

σ ˆ j , C 2 , π = 1 I j π i I j π ( X ˜ π ( j ) i g j , + π ( X ˜ π ( 1 ) i , , X ˜ π ( j 1 ) i ) ) 2 + 1 I j π i I j π ( g j , + π ( X ˜ π ( 1 ) i , , X ˜ π ( j 1 ) i ) g ˆ j , C π ( X ˜ π ( 1 ) i , , X ˜ π ( j 1 ) i ) ) 2 + 2 I j π i I j π ( X ˜ π ( j ) i g j , + π ( X ˜ π ( 1 ) i , , X ˜ π ( j 1 ) i ) ) ( g j , + π ( X ˜ π ( 1 ) i , , X ˜ π ( j 1 ) i ) g ˆ j , C π ( X ˜ π ( 1 ) i , , X ˜ π ( j 1 ) i ) ) ,

where I j π { i : ( X π ( 1 ) i , , X π ( j 1 ) i ) C j π } . Note that, by the law of large numbers, the first term on the right-hand side of the above equation converges to its expectation,

1 I j π i I j π ( X ˜ π ( j ) i g j , + π ( X ˜ π ( 1 ) i , , X ˜ π ( j 1 ) i ) ) 2 P E C 0 [ ( X ˜ π ( j ) g j , + π ( X ˜ π ( 1 ) , , X ˜ π ( j 1 ) ) ) 2 ]

since $P_{C_0}((\tilde{X}_{\pi(1)}, \ldots, \tilde{X}_{\pi(j-1)}) \in C_j^{\pi}) > 0$. Both of the last two terms converge to zero since, by Proposition 6, we have

sup x C j π g ˆ j , C π ( x ) g j , + π ( x ) = o p ( 1 ) ,

and thus

0 1 I j π i I j π ( g j , + π ( X ˜ π ( 1 ) i , , X ˜ π ( j 1 ) i ) g ˆ j , C π ( X ˜ π ( 1 ) i , , X ˜ π ( j 1 ) i ) ) 2 ( sup x C j π g ˆ j , C π ( x ) g j , + π ( x ) ) 2 = o p ( 1 ) .

This gives that

σ ˆ j , C 2 , π E C 0 [ ( X ˜ π ( j ) g j , + π ( X ˜ π ( 1 ) , , X ˜ π ( j 1 ) ) ) 2 ]

Now, since $g_j^{\pi}(\cdot) = E_{C_0}[\tilde{X}_{\pi(j)} \mid (\tilde{X}_{\pi(1)}, \ldots, \tilde{X}_{\pi(j-1)}) = \cdot\,]$ on $C_j^{\pi}$, it holds that

$E_{C_0}[(\tilde{X}_{\pi(j)} - g_j^{\pi}(\tilde{X}_{\pi(1)}, \ldots, \tilde{X}_{\pi(j-1)}))^2] \le E_{C_0}[(\tilde{X}_{\pi(j)} - g_{j,+}^{\pi}(\tilde{X}_{\pi(1)}, \ldots, \tilde{X}_{\pi(j-1)}))^2].$

This proves (A7) and the second part of the theorem.□

A.9 Proof of Theorem 2

It suffices to prove the following result: there exists a non-trivial compact set C * such that, for any compact set C containing C * , π 0 Π 0 and π 1 Π 0 , it holds that

(A8) $\sum_{j=1}^{p} \log \hat{\sigma}_{j,C}^{2,\pi_0} \xrightarrow{P} \ell_0, \qquad \sum_{j=1}^{p} \log \hat{\sigma}_{j,C}^{2,\pi_1} \xrightarrow{P} \ell_1 \qquad \text{for some } \ell_0, \ell_1 \text{ with } \ell_0 < \ell_1.$

Let

(A9) $\delta^* \equiv \min_{\pi_0 \in \Pi_0, \, \pi_1 \notin \Pi_0} \left( \sum_{j=1}^{p} \log \sigma_j^{2,\pi_1} - \sum_{j=1}^{p} \log \sigma_j^{2,\pi_0} \right) = 2\xi > 0.$

The equality holds by Proposition 5 and the inequality holds by Proposition 4. By the continuity of $\log$, note that for any $s, \varepsilon > 0$, we can find $\mu > 0$ such that $\log(s - \mu) > \log s - \varepsilon$. Let $\{\eta_j\}_{j=1}^{p}$ be positive numbers such that

$\log(\sigma_j^{2,\pi} - \eta_j) > \log \sigma_j^{2,\pi} - \frac{\delta^*}{p}, \quad j = 1, \ldots, p.$

Put η min 1 j p η j . We apply Theorem 1-(ii) for such η . According to the theorem, there exists a non-trivial compact set C * such that, for any compact set C containing C * and for any π 1 Π 0 , we obtain

(A10) $\sum_{j=1}^{p} \log \hat{\sigma}_{j,C}^{2,\pi_1} \xrightarrow{P} \ell_1 > \sum_{j=1}^{p} \log(\sigma_j^{2,\pi_1} - \eta) \ge \sum_{j=1}^{p} \log(\sigma_j^{2,\pi_1} - \eta_j) > \sum_{j=1}^{p} \log \sigma_j^{2,\pi_1} - \delta^*.$

Now, by Theorem 1-(i) and the continuous mapping theorem, it holds that, for any π 0 Π 0 ,

$\sum_{j=1}^{p} \log \hat{\sigma}_{j,C}^{2,\pi_0} \xrightarrow{P} \sum_{j=1}^{p} \log \sigma_j^{2,\pi_0} \equiv \ell_0 \quad \text{as } n \to \infty.$

For the limiting values $\ell_0$ and $\ell_1$, it follows from the definition of $\delta^*$ at (A9) that

$\ell_1 > \sum_{j=1}^{p} \log \sigma_j^{2,\pi_1} - \delta^* \ge \sum_{j=1}^{p} \log \sigma_j^{2,\pi_0} = \ell_0.$

This completes the proof of (A8).

A.10 Proof of Proposition 9

The proof of Proposition 9 runs along the same lines as the proof of Proposition 4, now using the identifiability result in Theorem 4 of [12].

A.11 Proof of Proposition 10

The proof of Proposition 10 runs along the same lines as the proof of Proposition 5 and thus is omitted here.

A.12 Proof of Theorem 3

Define first $\bar{\sigma}_j^{2,\pi} \equiv E_{C_0}[\sigma_j^{2,\pi}(X_{\mathrm{pa}(j)}^{\pi})]$, and let $\bar{\Pi}_0$ denote the set of $\pi \in \Pi$ that minimize the expression at equation (24), i.e.,

$\pi \in \bar{\Pi}_0 \iff \pi \in \operatorname*{argmin}_{\bar{\pi} \in \Pi} \sum_{j=1}^{p} \log \bar{\sigma}_j^{2,\bar{\pi}}.$

Note that, under the hypothesis that $\bar{\pi} \in \Pi_0$ for all choices of $\bar{\pi}$ in equation (24), we have $\bar{\Pi}_0 \subseteq \Pi_0$. It thus suffices to prove that there exists a non-trivial compact set $C^*$ such that, for any compact set $C$ containing $C^*$, $\pi_0 \in \bar{\Pi}_0$ and $\pi_1 \notin \bar{\Pi}_0$, it holds that

(A11) $\sum_{j=1}^{p} \log \hat{\sigma}_{j,C}^{2,\pi_0} \xrightarrow{P} \ell_0, \qquad \sum_{j=1}^{p} \log \hat{\sigma}_{j,C}^{2,\pi_1} \xrightarrow{P} \ell_1 \qquad \text{for some } \ell_0, \ell_1 \text{ with } \ell_0 < \ell_1.$

Define

(A12) $\delta^* \equiv \min_{\pi_0 \in \bar{\Pi}_0, \, \pi_1 \notin \bar{\Pi}_0} \left( \sum_{j=1}^{p} \log \bar{\sigma}_j^{2,\pi_1} - \sum_{j=1}^{p} \log \bar{\sigma}_j^{2,\pi_0} \right) > 0.$

By the continuity of $\log$, for any $\varepsilon, s > 0$ there exists $\mu > 0$ such that $|u - s| < \mu$ implies $|\log u - \log s| < \varepsilon$. Let then $\{\eta_j^{\pi}\}_{j=1}^{p}$ be positive numbers such that, whenever $|u - \bar{\sigma}_j^{2,\pi}| < \eta_j^{\pi}$,

(A13) $|\log u - \log \bar{\sigma}_j^{2,\pi}| < \frac{\delta^*}{2p}, \quad j = 1, \ldots, p.$

Put $\eta \equiv \min_{1 \le j \le p, \, \pi \in \Pi} \eta_j^{\pi}$. We claim that there exists a non-trivial compact set $C^*$ such that, for any compact set $C$ containing $C^*$, it holds that

(A14) $\hat{\sigma}_{j,C}^{2,\pi} \xrightarrow{P} b_{0,j} \in (\bar{\sigma}_j^{2,\pi} - \eta, \, \bar{\sigma}_j^{2,\pi} + \eta), \quad \pi \in \bar{\Pi}_0; \qquad \hat{\sigma}_{j,C}^{2,\pi} \xrightarrow{P} b_{1,j} > \bar{\sigma}_j^{2,\pi} - \eta, \quad \pi \notin \bar{\Pi}_0,$

where $b_{0,j}$ and $b_{1,j}$ are specified later. The assertion (A11) follows from these two claims. To see why, let $\pi_0 \in \bar{\Pi}_0$ and $\pi_1 \notin \bar{\Pi}_0$. Note that the convergence at (A14) gives that

$\sum_{j=1}^{p} \log \hat{\sigma}_{j,C}^{2,\pi_0} \xrightarrow{P} \sum_{j=1}^{p} \log b_{0,j} \equiv \ell_0, \qquad \sum_{j=1}^{p} \log \hat{\sigma}_{j,C}^{2,\pi_1} \xrightarrow{P} \sum_{j=1}^{p} \log b_{1,j} \equiv \ell_1.$

Then, by the inequalities at (A13) and at (A14), we obtain that

$\ell_0 = \sum_{j=1}^{p} \log b_{0,j} < \sum_{j=1}^{p} \log \bar{\sigma}_j^{2,\pi_0} + \frac{\delta^*}{2} \le \sum_{j=1}^{p} \log \bar{\sigma}_j^{2,\pi_1} - \frac{\delta^*}{2} < \sum_{j=1}^{p} \log(\bar{\sigma}_j^{2,\pi_1} - \eta) < \sum_{j=1}^{p} \log b_{1,j} = \ell_1,$

where for the second inequality we have used the definition of δ * at (A12).

We prove the first claim at (A14). Note that

σ ¯ j 2 , π = E C 0 [ ( X π ( j ) g j π ( X π ( 1 ) , , X π ( j 1 ) ) ) 2 1 ( ( X π ( 1 ) , , X π ( j 1 ) ) C j π ) ] + E C 0 [ ( X π ( j ) g j π ( X π ( 1 ) , , X π ( j 1 ) ) ) 2 1 ( ( X π ( 1 ) , , X π ( j 1 ) ) C j π ) ] .

The value of the second term on the right-hand side can be made arbitrarily small for any π Π by taking C j π sufficiently large. Also, the first term can be made arbitrarily close to E C 0 [ ( X ˜ π ( j ) g j π ( X ˜ π ( 1 ) , , X ˜ π ( j 1 ) ) ) 2 ] for any π Π by taking C j π sufficiently large. These observations give that there exists a non-trivial compact set C * such that, for any compact set C containing C * , it holds that

b 0 , j E C 0 [ ( X ˜ π ( j ) g j π ( X ˜ π ( 1 ) , , X ˜ π ( j 1 ) ) ) 2 ] ( σ ¯ j 2 , π η , σ ¯ j 2 , π + η ) , j = 1 , , p .

Now, σ ˆ j , C 2 , π P E C 0 [ ( X ˜ π ( j ) g j π ( X ˜ π ( 1 ) , , X ˜ π ( j 1 ) ) ) 2 ] by Proposition 6, since π Π ¯ 0 Π 0 and thus g j π = g j , + π on C j π for π Π ¯ 0 . This shows the first part of (A14).

Next, we prove the second part of (A14). By the same argument as in the case π Π ¯ 0 , we may pick a large enough C j π such that

E C 0 [ ( X ˜ π ( j ) g j π ( X ˜ π ( 1 ) , , X ˜ π ( j 1 ) ) ) 2 ] ( σ ¯ j 2 , π η , σ ¯ j 2 , π + η ) .

However, here we do not necessarily have σ ˆ j , C 2 , π P E C 0 [ ( X ˜ π ( j ) g j π ( X ˜ π ( 1 ) , , X ˜ π ( j 1 ) ) ) 2 ] , since we may have g j π g j , + π . Instead, we obtain

σ ˆ j , C 2 , π P b 1 , j E C 0 [ ( X ˜ π ( j ) g j , + π ( X ˜ π ( 1 ) , , X ˜ π ( j 1 ) ) ) 2 ] E C 0 [ ( X ˜ π ( j ) g j π ( X ˜ π ( 1 ) , , X ˜ π ( j 1 ) ) ) 2 ] > σ ¯ j 2 , π η .

A.13 Proof of Proposition 6

By (15), the SBF estimator m ˆ ( z ) = m ˆ 1 ( z 1 ) + + m ˆ d ( z d ) has the implicit representation given by

(A15) m ˆ j ( z j ) = μ ˆ j ( z j ) k j D k m ˆ k ( z k ) q ˆ j k ( z j , z k ) q ˆ j ( z j ) d z k , 1 j d .

Under Assumptions (B1), (C1) and (C2) there exists a constant 0 < C < such that

sup ( z j , z k ) D j × D k q ˆ j , k ( z j , z k ) q ˆ j ( z j ) q ˆ k ( z k ) < C

with probability tending to one. The existence of a solution to the backfitting equations in (A15) then follows from Theorem 2.1 in the study of Park [43], see also Theorem 3.4 in the study of Jeon and Park [8].

Now, we prove that, uniformly for z j D j ,

(A16) m ˆ j ( z j ) m j ( z j ) = μ ˆ j ( z j ) μ j ( z j ) k j D k ( m ˆ k ( z k ) m k ( z k ) ) q ˆ j k ( z j , z k ) q ˆ j ( z j ) d z k + o p ( 1 ) , 1 j d .

To prove (A16), suppose that the baseline kernel function K is supported on [ s K , s K ] for some s K > 0 . Put D j , n ( z j ) { v : z j 2 h j v D j } [ s K , s K ] . Define D j o , the “interior” of D j , by z j D j o if and only if D j , n ( z j ) = [ s K , s K ] . Note that

K h j ( z j , u j ) = 1 h j K z j u j h j for all z j D j o and u j D j , D j K h j ( z j , u j ) d u j = 1 for all z j D j o .

Also, we have

(A17) Leb ( D j \ D j o ) = 4 s K h j = o ( 1 ) as n ,

where “Leb” denotes the Lebesgue measure on R . The stochastic expansion (A16) follows from (12), (A15), and (A17) if we prove

(A18) sup z j D j , z k D k o q ˆ j k ( z j , z k ) q ˆ j ( z j ) q j k ( z j , z k ) q j ( z j ) = o p ( 1 ) , sup z j D j , z k D k \ D k o q ˆ j k ( z j , z k ) q ˆ j ( z j ) q j k ( z j , z k ) q j ( z j ) = O p ( 1 ) .

To show the convergence results in (A18), we observe that

sup z j D j q ˆ j ( z j ) q j ( z j ) D j K h j ( z j , u j ) d u j = o p ( 1 ) , sup ( z j , z k ) D j × D k q ˆ j k ( z j , z k ) q j k ( z j , z k ) D j K h j ( z j , u j ) d u j D k K h k ( z k , u k ) d u k = o p ( 1 ) .

This implies

sup ( z j , z k ) D j × D k q ˆ j k ( z j , z k ) q ˆ j ( z j ) q j k ( z j , z k ) q j ( z j ) D k K h k ( z k , u k ) d u k = o p ( 1 ) .

From this and the facts that D k K h k ( z k , u k ) d u k = 1 for all z k D k o and it is bounded by an absolute positive constant for all z k D k \ D k o , we obtain (A18) and thus (A16).

We now define the spaces $L_2(\hat{q})$ and $(\hat{q})$ as the empirical versions of the spaces $L_2(q)$ and $(q)$, where $q$ is replaced by $\hat{q}$. We further define the space of univariate functions,

L 2 ( q ˆ j ) { f L 2 ( q ˆ ) : f ( z ) = f j ( z j ) for some univariate f : D j R } .

Since L 2 ( q ˆ ) is a Hilbert space and L 2 ( q ˆ j ) are closed subspaces of L 2 ( q ˆ ) , see [8], we may form the empirical orthogonal projections

Π ˆ j : L 2 ( q ˆ ) L 2 ( q ˆ j )

that project functions f L 2 ( q ˆ ) orthogonally onto L 2 ( q ˆ j ) for j { 1 , , d } . We denote by Ψ ˆ j = I Π ˆ j the orthogonal projection onto the orthogonal complement of L 2 ( q ˆ j ) and we define the sequential projection operator T ˆ by

T ˆ = Ψ ˆ d Ψ ˆ d 1 Ψ ˆ 1 .

Then, from (A16), it can be shown that, uniformly in z D , m ˆ m + can be represented by

m ˆ ( z ) m + ( z ) = s = 0 T ˆ s τ ˆ ( z ) + o p ( 1 ) ,

where τ ˆ is the function given by

τ ˆ = μ ˆ d μ d + Ψ ˆ d ( μ ˆ d 1 μ d 1 ) + + Ψ ˆ d Ψ ˆ 2 ( μ ˆ 1 μ 1 ) ,

see, e.g. [3]. It is therefore enough to show

(A19) sup z D s = 0 T ˆ s τ ˆ ( z ) = o p ( 1 ) .

For an element f L 2 ( q ) and an operator L : ( q ) ( q ) , define

f q = f ( z ) 2 q ( z ) d z , L op = sup { L f q : f ( q ) , f q 1 } .

Consider now the following four claims:

(A20) sup z D τ ˆ ( z ) = o p ( 1 )

(A21) sup z D T ˆ τ ˆ ( z ) = o p ( 1 )

(A22) sup z D T ˆ v ( z ) = O p ( 1 ) for any v ( q ˆ ) st. v q 1

(A23) s = 1 T ˆ s τ ˆ q = o p ( 1 )

Note that in (A22), we suppress the dependence of v on n in the notation. If we show these claims hold, then (A19) follows. To see this, note that

sup z D s = 0 T ˆ s τ ˆ ( z ) sup z D τ ˆ ( z ) + sup z D T ˆ τ ˆ ( z ) + sup z D s = 2 T ˆ s τ ˆ ( z ) .

The first and second terms on the RHS are $o_p(1)$ by (A20) and (A21), respectively. For the last term, note that

sup z D s = 2 T ˆ s τ ˆ ( z ) = sup z D T ˆ s = 1 T ˆ s τ ˆ ( z ) s = 1 T ˆ s τ ˆ q s = 1 T ˆ s τ ˆ q .

The first factor in the quantity on the right is O p ( 1 ) by (A22) and the second factor is o p ( 1 ) by (A23). Thus, (A19) follows.

A.14 Proof of (A20)

By ordinary kernel smoothing theory, due to Assumptions (B1)–(B3) and (C1)–(C2),

(A24) sup z j D j μ ˆ j ( z j ) μ j ( z j ) = o p ( 1 ) for all 1 j d .

Letting $f_j = \hat{\mu}_j - \mu_j$, we have that $\hat{\tau}(z)$ is a sum of functions of the form

Π ˆ k 1 Π ˆ k l f j where 0 l d 1 and k l j ,

where j ranges through 1 to d . It is thus enough to show that for any fixed j ,

(A25) sup z k 1 D k 1 Π ˆ k 1 Π ˆ k l f j ( z k 1 ) = o p ( 1 ) where 0 l d 1 and k l j

When $l = 0$, this follows directly from (A24). When $l = 1$, it follows from Lemma A2. When $l = 2$, we have, by Hölder’s inequality,

sup z k D k Π ˆ k Π ˆ k f j ( z k ) = sup z k D k D k ( Π ˆ k f j ) ( z k ) q ˆ k , k ( z k , z k ) q ˆ k ( z k ) d z k Π ˆ k f j q sup z k D k D k q ˆ k , k ( z k , z k ) q ˆ k ( z k ) q k ( z k ) 2 q k ( z k ) d z k 1 2 Π ˆ k f j q M ,

where $M$ is a constant. By Lemma A2, $\|\hat{\Pi}_k f_j\|_q = o_p(1)$, which shows (A25) when $l = 2$. Using that $\|\hat{\Pi}_k\|_{\mathrm{op}} \le 1$ with probability tending to one, we can continue in this way until $l = d-1$, and thus conclude that (A25) holds for all $l \in \{0, \ldots, d-1\}$.

A.15 Proof of (A21)

Note that, by Lemma A4, $\hat{T}\hat{\tau}(z)$ is a sum of functions of the form

Π ˆ k Π ˆ k 1 Π ˆ k l f j where 0 l d 1 and k , k l j .

where $j$ ranges from 1 to $d$. Thus, (A21) can be shown in a similar way as (A20).

A.16 Proof of (A22)

Since v ( q ˆ ) and v q 1 for all n 1 , we may represent v by

v ( z ) = v 1 ( z 1 ) + + v d ( z d ) ,

where for each $j \in \{1, \ldots, d\}$ we have $\|v_j\|_q \le 1$. By Lemma A4, $\hat{T}v(z)$ is a sum of functions of the form

Π ˆ k Π ˆ k 1 Π ˆ k l v j where 0 l d 1 and k , k l j ,

where $j$ ranges from 1 to $d$. We then show that, for each $l \in \{0, \ldots, d-1\}$,

sup z k D k Π ˆ k Π ˆ k 1 Π ˆ k l v j ( z k ) = O p ( 1 ) .

For l = 0 this follows by Lemma A3. For l = 1 this follows by a similar argument as in (A20), i.e.

sup z k D k Π ˆ k Π ˆ k v j ( z k ) = sup z k D k D k ( Π ˆ k v j ) ( z k ) q ˆ k , k ( z k , z k ) q ˆ k ( z k ) d z k Π ˆ k v j q sup z k D k D k q ˆ k , k ( z k , z k ) q ˆ k ( z k ) q k ( z k ) 2 q k ( z k d z k ) 1 2 Π ˆ k v j q M ,

where now we have Π ˆ k v j q = O p ( 1 ) by Lemma A3. We may continue in this manner until l = d 1 , thus showing (A22).

A.17 Proof of (A23)

Note that by Lemma A5, with probability tending to one it holds that

T ˆ s + 1 τ ˆ q γ s T ˆ τ ˆ q ,

for some γ ( 0 , 1 ) . Thus

s = 1 T ˆ s τ ˆ q s = 0 γ s T ˆ τ ˆ q = 1 1 γ T ˆ τ ˆ q ,

with probability tending to one. We show that $\|\hat{T}\hat{\tau}\|_q = o_p(1)$. By Lemma A4, $\hat{T}\hat{\tau}$ is a sum of functions of the form

Π ˆ k Π ˆ k 1 Π ˆ k l f j where 0 l d 1 and k , k l j .

where $j$ ranges from 1 to $d$. Similarly to the other proofs, we show that

(A26) Π ˆ k Π ˆ k 1 Π ˆ k l f j q = o p ( 1 )

for each l { 0 , , d 1 } . For l = 0 (A26) holds by Lemma A2. For general l { 1 , , d 1 }

Π ˆ k Π ˆ k 1 Π ˆ k l f j q Π ˆ k op Π ˆ k l f j q Π ˆ k l f j q = o p ( 1 ) .

since $\|\hat{\Pi}_k\|_{\mathrm{op}} \le 1$ with probability tending to one. This concludes the proof; the lemmas required in the above arguments are collected in Section A.19.

A.18 Proof of Lemma A1

Note that the proof of this slightly more general statement is almost identical to the proof of Corollary 31 in [1], but we include it here for completeness. Assume, for contradiction, that the distribution of $X$ can be represented by both SEMs. By Proposition 29(i) in [1], we may pick a node pair $(i,j)$ such that $X_i \to X_j$ in $\mathcal{G}_0$ and $X_j \to X_i$ in $\mathcal{G}_1$, and such that the following holds:

  (a) The parents of $X_i$ in $\mathcal{G}_1$ (excluding $X_j$) are all non-descendants of $X_i$ in $\mathcal{G}_0$.

  (b) The parents of $X_j$ in $\mathcal{G}_0$ (excluding $X_i$) are all non-descendants of $X_j$ in $\mathcal{G}_1$.

See Figure A1 for an illustration of this choice of $X_i$ and $X_j$. The reason for picking such a node pair is as follows. Let $S$ denote the parents of $X_i$ in $\mathcal{G}_1$ excluding $X_j$, and let $Q$ denote the parents of $X_j$ in $\mathcal{G}_0$ excluding $X_i$. With this choice, we may form the conditional distribution of $(X_i, X_j)$ given $(S, Q) = (s, q)$. By the representation using $\mathcal{G}_1$, we have

$X_i = f(X_j, S) + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2).$

By a disintegration argument, one can show that the conditional distribution of

$f(X_j, S) + \varepsilon_i \ \text{ given } \ (S, Q) = (s, q)$

is equal to the distribution of

$f(X_j, s) + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2).$

The argument relies on the fact that $\varepsilon_i$ is independent of the conditioning pair $(S, Q)$; this independence follows from choice (a) above of the conditioning variables. A similar argument holds for $g(X_i, q) + \varepsilon_j$, which relies on choice (b). Thus, we have the following representations:

(A27) $X_j = g(X_i, q) + \varepsilon_j, \quad \varepsilon_j \perp\!\!\!\perp X_i,$

(A28) $X_i = f(X_j, s) + \varepsilon_i, \quad \varepsilon_i \perp\!\!\!\perp X_j.$

Both ε i and ε j follow a Gaussian distribution. Thus, with a change of notation, we may write (A27) and (A28) as

(A29) $Y = g(X) + \varepsilon_y, \quad \varepsilon_y \perp\!\!\!\perp X,$

(A30) $X = f(Y) + \varepsilon_x, \quad \varepsilon_x \perp\!\!\!\perp Y.$

By the lemma assumptions, g is a non-linear transformation and f is a non-constant transformation. Note that by the independence of ε x and Y , p Y , ε x ( y , e x ) = p Y ( y ) p ε x ( e x ) and thus

(A31) $\dfrac{\partial^2 \log p_{Y,\varepsilon_x}(y, e_x)}{\partial e_x \, \partial y} = \dfrac{\partial^2 [\log p_Y(y) + \log p_{\varepsilon_x}(e_x)]}{\partial e_x \, \partial y} = 0.$

Now, if $(Y, \varepsilon_x) = (y, e_x)$, then according to the representation in (A30), we have $X = f(y) + e_x$. The representation in (A29) then gives $\varepsilon_y = y - g(f(y) + e_x)$. Thus, we have

$p_{Y,\varepsilon_x}(y, e_x) = p_{X,\varepsilon_y}\big(f(y) + e_x, \, y - g(f(y) + e_x)\big) \equiv p_{X,\varepsilon_y}\big(x(y, e_x), \, e_y(y, e_x)\big).$

Combining this equation with (A31), and using that $X$ and $\varepsilon_y$ are independent, we obtain

(A32) $0 = \dfrac{\partial^2 \log p_{X,\varepsilon_y}(x(y,e_x), e_y(y,e_x))}{\partial e_x \, \partial y} = \dfrac{\partial^2 \log p_X(x(y,e_x))}{\partial e_x \, \partial y} + \dfrac{\partial^2 \log p_{\varepsilon_y}(e_y(y,e_x))}{\partial e_x \, \partial y}.$

By the chain rule, it follows that

$\dfrac{\partial^2 \log p_X(x(y,e_x))}{\partial e_x \, \partial y} = \dfrac{\partial^2 \log p_X(x)}{\partial x^2} \, f'(y).$

It also follows that

$\dfrac{\partial^2 \log p_{\varepsilon_y}(e_y(y,e_x))}{\partial e_x \, \partial y} = -\dfrac{1}{\sigma_y^2} g'(x)^2 f'(y) + \dfrac{1}{\sigma_y^2} g'(x) + \dfrac{y - g(x)}{\sigma_y^2} g''(x) f'(y).$

Combining these expressions yields

$0 = \dfrac{\partial^2 \log p_X(x)}{\partial x^2} f'(y) - \dfrac{1}{\sigma_y^2} g'(x)^2 f'(y) + \dfrac{1}{\sigma_y^2} g'(x) + \dfrac{y - g(x)}{\sigma_y^2} g''(x) f'(y).$

If $f'(y) = 0$ for some $y \in \mathbb{R}$, then this expression implies that $g'(x) = 0$ for all $x \in \mathbb{R}$, which is a contradiction. Thus, $f'(y) \ne 0$ for all $y \in \mathbb{R}$. We can exchange the roles of $y$ and $x$ to also obtain that $g'(x) \ne 0$ for all $x \in \mathbb{R}$. We may thus divide the above expression by $f'(y) g'(x)$ to obtain

$0 = \dfrac{\partial^2 \log p_X(x)}{\partial x^2} \dfrac{1}{g'(x)} - \dfrac{1}{\sigma_y^2} g'(x) + \dfrac{1}{\sigma_y^2} \dfrac{1}{f'(y)} + \dfrac{y - g(x)}{\sigma_y^2} \dfrac{g''(x)}{g'(x)}.$

Taking the derivative of this expression with respect to $y$ yields that, for some constant $a_1$,

$\dfrac{g''(x)}{g'(x)} = a_1 \quad \text{for all } x \in \mathbb{R},$

and this implies that $g'(x) = a_2 \exp(a_1 x)$ with $a_1, a_2 \ne 0$, since $g(x)$ is non-linear. Since $g''(x)/g'(x) = a_1$, it follows that

$\dfrac{1}{f'(y)} + a_1 y = a_3 \quad \text{for all } y \in \mathbb{R}$

for some constant $a_3$. As $a_1 g(x) = g'(x) + a_4$ for some constant $a_4$, we have

$0 = \dfrac{\partial^2 \log p_X(x)}{\partial x^2} - \dfrac{2}{\sigma_y^2} g'(x)^2 + \dfrac{a_3 - a_4}{\sigma_y^2} g'(x).$

This expression implies that $\log p_X(x) \to \infty$ as either $x \to \infty$ or $x \to -\infty$, which is a contradiction. This proves the lemma.

Figure A1

Illustration of the choice of X i and X j in the proof of Lemma A1.

A.19 Some lemmas from the SBF literature

This section contains lemmas from the theory of SBF that are needed in the proof of Proposition 6. As these lemmas are well known, see, e.g., [3] and [8], the proofs are omitted.

Lemma A2

If f j L 2 ( q ˆ j ) satisfies sup z j D j f j ( z j ) = o p ( 1 ) , then it holds

sup z k D k Π ˆ k f j ( z k ) = o p ( 1 ) and Π ˆ k f j q = o p ( 1 ) .

Lemma A3

If v j L 2 ( q ˆ j ) satisfies v j q 1 , then it holds

sup z k D k Π ˆ k v j ( z k ) = O p ( 1 ) and Π ˆ k v j q = O p ( 1 ) .

Lemma A4

The sequential projection operator T ˆ has the property that if f ( ) = f 0 + f 1 ( ) + + f d ( ) ( q ˆ ) then

T ˆ f = j = 1 d g j

where each g j ( q ˆ ) is a sum of functions on the form

Π ˆ j Π ˆ j 1 Π ˆ j r f k , k = 1 , , d ,

where j 1 < < j r < j , 0 r j 1 , and j r , j k .

Lemma A5

Under Assumption (B1), for the empirical projection operators $\hat{\Pi}_j$ and $\hat{\Psi}_j$, it holds that

$\|\hat{\Pi}_j\|_{\mathrm{op}} \le 1, \qquad \|\hat{\Psi}_j\|_{\mathrm{op}} \le 1$

with probability tending to one. Furthermore, for the sequential projection operator $\hat{T}$, there exists $\gamma \in (0,1)$ such that

$\|\hat{T}\|_{\mathrm{op}} < \gamma$

with probability tending to one.

A.20 The role of heterogeneity and non-linearity

To understand the consequences of heterogeneous variance, we give in this section an example where the estimator in (24) might fail to converge to the true variable ordering. We simulate a two-variable setup, where $X_1 \sim N(0,1)$ and

(A33) $X_2 = 0.5 X_1 + \sigma_v(X_1) \varepsilon, \quad \varepsilon \sim N(0,1),$

where

$\sigma_v(x) = 0.1 + v x^2$

for various settings of $v \in [0,1]$. The parameter $v$ represents the degree of heterogeneity, and high values of $v$ are associated with a high degree of heterogeneity. Letting $\pi_0$ denote the correct ordering, i.e., $X_1 \to X_2$, and $\pi_1$ the incorrect ordering, i.e., $X_2 \to X_1$, we numerically calculate, for each $v \in [0,1]$,

(A34) $\tilde{\xi} = \sum_{j=1}^{2} \log \sigma_j^{2,\pi_1} - \sum_{j=1}^{2} \log \sigma_j^{2,\pi_0}.$

We require that $\tilde{\xi} > 0$ for our estimator to be consistent; however, we shall see that this is not always the case. When $v = 0$, no heterogeneity is present and the joint distribution of $(X_1, X_2)$ is Gaussian due to the linear relationship between $X_1$ and $X_2$. Hence both conditional distributions $X_1 \mid X_2$ and $X_2 \mid X_1$ are Gaussian, so we can represent the joint distribution correctly by (6) for both $\pi = \pi_0$ and $\pi = \pi_1$, which in turn implies that $\tilde{\xi} = 0$ when $v = 0$.

It turns out that when we increase the degree of heterogeneity $v$, $\tilde{\xi}$ decreases below zero; see the left panel of Figure A2. This implies that the estimator $\bar{\pi}$ from (24) satisfies $\bar{\pi} \notin \Pi_0$, leading asymptotically to the incorrect ordering estimate $X_2 \to X_1$.
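For completeness, a Monte Carlo sketch of how $\tilde{\xi}$ can be approximated for the model (A33) is given below; conditional means are estimated by Nadaraya–Watson smoothing on a large simulated sample. This is one possible way to produce curves like those in Figure A2 (adding a $u X_1^2$ term gives the model (A35)), not necessarily the computation we used.

```python
import numpy as np

def nw_conditional_mean(x, y, h=0.2):
    """Nadaraya-Watson estimate of E[Y | X = x_i] at the sample points."""
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return w @ y / w.sum(axis=1)

def xi_tilde(v, n=2000, h=0.2, seed=0):
    """Monte Carlo approximation of the score gap (A34) under the model (A33)."""
    rng = np.random.default_rng(seed)
    x1 = rng.standard_normal(n)
    x2 = 0.5 * x1 + (0.1 + v * x1 ** 2) * rng.standard_normal(n)
    r21 = x2 - nw_conditional_mean(x1, x2, h)   # residual of X2 given X1 (pi_0)
    r12 = x1 - nw_conditional_mean(x2, x1, h)   # residual of X1 given X2 (pi_1)
    score_pi0 = np.log(np.var(x1)) + np.log(np.var(r21))
    score_pi1 = np.log(np.var(x2)) + np.log(np.var(r12))
    return score_pi1 - score_pi0
```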

Figure A2

On the left panel, ξ ˜ is plotted as a function of v in the model (A33). On the right panel, ξ ˜ is plotted as a function of u in the model (A35).

Note, however, that in this setting the structural function $g_{2,1}$ that relates $X_1$ with $X_2$ is linear. To see what happens when this relationship gradually becomes non-linear, we construct the family of models where $X_1 \sim N(0,1)$ as before, but where

(A35) $X_2 = 0.5 X_1 + u X_1^2 + \sigma_{0.4}(X_1) \varepsilon, \quad \varepsilon \sim N(0,1),$

and where $u \in [0, 0.4]$. The parameter $u$ determines the degree of non-linearity of the function $g_{2,1}$ that relates $X_1$ with $X_2$. When $u = 0$, this reduces to the setting where $g_{2,1}$ is linear as in the model in equation (A33), and thus $\tilde{\xi}$ is negative, which we see from the left panel of Figure A2 at $v = 0.4$. However, when the degree of non-linearity $u$ is increased, we see in the right panel of Figure A2 that $\tilde{\xi}$ rapidly becomes positive, and thus we regain consistency of our ordering estimator. This example shows the role of the non-linearity of the functions $g_{jk}$: the higher the degree of non-linearity, the more confident we are that the variable ordering is identifiable. See Schultheiss and Bühlmann [44] for a further discussion of the possible pitfalls of using Gaussian likelihood scoring for ordering estimation when the noise is non-Gaussian.

A.21 Simulation results for large p

We include here the simulation results for the case where p = 20 in Table A1. We have limited ourselves to the case of a sparse underlying graph and homogeneous variance. For computational reasons, the RESIT method has been left out of this simulation study. We see the same pattern emerge as in Table 2: the SBF method outperforms the other methods, followed by the CAM method. Additionally, we see that the distances to the true graphs are generally larger in this p = 20 case than in the p = 10 case. This is to be expected, as the number of potential errors becomes larger in this larger DAG space.

Table A1

SID and SHD from the true graph to the estimated graph, in the sparse setting (homogeneous error variance) with p = 20

Homogeneous variance
SID SHD
Method n = 50 n = 100 n = 50 n = 100
SBF (our method) 25.1  ±  2.7 17.0  ±  1.2 13.5  ±  0.7 4.6  ±  0.2
CAM 59.7  ±  4.2 25.9  ±  1.7 23.5  ±  0.8 11.6  ±  0.4
SCORE 63.7  ±  4.5 45.5  ±  2.5 23.6  ±  0.8 14.4  ±  0.4
LiNGAM 89.0  ±  5.0 85.7  ±  3.8 20.3  ±  0.6 19.8  ±  0.4
GES 90.0  ±  5.9 80.0  ±  3.9 33.9  ±  0.9 27.4  ±  0.6

A.22 Simulation results for a different choice of bandwidth

Throughout the numerical studies in Sections 5 and 6, we used Silverman’s rule of thumb to automatically select the bandwidth. This particular bandwidth selector was chosen because it is computationally fast and yields good results. Other choices are possible; an example is the selector introduced in Section 4.2 of [45]. To see how sensitive the SID and SHD performance is to the choice of bandwidth parameter, we also ran the SBF method with the latter selector, where we multiplied the resulting bandwidth by 0.5 to obtain better results. The results are reported in Table A2. We find that the two bandwidth selectors are comparable in terms of SID and SHD performance.
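For reference, Silverman’s rule of thumb [34] can be computed as in the sketch below; the constant 0.9 and the use of the smaller of the standard deviation and the scaled interquartile range follow the common textbook form, and the exact variant used in our code may differ slightly.

```python
import numpy as np

def silverman_bandwidth(x):
    """Silverman's rule-of-thumb bandwidth for a univariate sample x."""
    n = len(x)
    sd = np.std(x, ddof=1)
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    return 0.9 * min(sd, iqr / 1.34) * n ** (-1 / 5)
```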

Table A2

Comparison of two bandwidth selectors for the SBF method

Homogeneous variance, sparse setting
SID SHD
Method n = 50 n = 100 n = 50 n = 100
SBF, Silverman’s bandwidth 7.3  ±  0.7 5.5  ±  0.6 4.0  ±  0.3 1.9  ±  0.2
SBF, Fan and Gijbels’ bandwidth 8.2  ±  0.8 7.2  ±  0.7 3.8  ±  0.3 2.4  ±  0.2

A.23 Comparison with the CAM and SCORE methods for the gene data

In this section, we compare the results of running the SBF method on the gene data set of Section 6 with those of running the CAM and SCORE methods on the same data set. To facilitate the comparison, we perform the same sub-sampling approach as the one outlined in the study of Bühlmann et al. [2]. We therefore sample n∕2 = 63 data points and estimate the underlying DAG using only these data points. Following Bühlmann et al. [2], this is repeated 100 times, and the 20 most frequently included edges for each of the methods are listed in Tables A3, A4, and A5.
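A sketch of this sub-sampling stability analysis might look as follows; estimate_dag is a hypothetical interface standing in for any of the three structure-learning methods, returning the set of directed edges of the estimated DAG.

```python
import numpy as np
from collections import Counter

def edge_frequencies(X, estimate_dag, n_rep=100, seed=0):
    """Repeatedly estimate the DAG on random half-samples and count each edge."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    counts = Counter()
    for _ in range(n_rep):
        idx = rng.choice(n, size=n // 2, replace=False)
        counts.update(estimate_dag(X[idx]))
    return counts.most_common(20)
```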

Table A3

The 20 most frequently included edges for the SBF method

Edge Edge Edge Edge
1. CCNA2 CDC2 6. PLK4 CDCA2 11. CEP55 MELK 16. KIF15 NCAPG
2. CEP55 TOP2A 7. CEP55 ATAD2 12. CCNA2 MELK 17. CCNB2 AURKA
3. CEP55 Ska1 8. CCNA2 DTL 13. TOP2A AURKA 18. HJURP Ska1
4. CEP55 KIF23 9. CCNA2 CDCA5 14. DTL MELK 19. CCNA2 HJURP
5. KIF15 ORC1L 10. HJURP ORC1L 15. TTK CCNB2 20. CDCA5 ORC1L
Table A4

The 20 most frequently included edges for the CAM method

Edge Edge Edge Edge
1. CCNA2 CDC2* 6. KIF15 ORC1L* 11. MELK PLK4 16. CEP55 Ska1*
2. CCNA2 CDCA5* 7. DTL MELK* 12. ORC1L KIF15 17. TOP2A AURKA*
3. CCNB2 AURKA* 8. KIF15 NCAPG* 13. AURKA CDCA2 18. TTK NCAPG
4. CCNB2 CDC2 9. NCAPG KIF15 14. CDCA5 ORC1L* 19. ORC1L HJURP
5. CEP55 TOP2A* 10. NCAPG CDCA5 15. MELK DTL 20. HJURP TOP2A
Table A5

The 20 most frequently included edges for the SCORE method

Edge Edge Edge Edge
1. CCNA2 CDC2* 6. CDCA5 CCNA2 11. ORC1L CDCA5 16. DTL MELK*
2. CCNB2 CDC2 7. ORC1L KIF15 12. AURKA CDC2 17. DTL KIF15
3. CCNA2 CDCA5* 8. KIF15 ORC1L* 13. DTL TPX2 18. MELK PLK4
4. MELK DTL 9. CEP55 TOP2A* 14. AURKA CDCA2 19. CEP55 Ska1*
5. KIF15 NCAPG* 10. CCNB2 AURKA* 15. CCNA2 CDCA2 20. HJURP ORC1L*

In Tables A4 and A5, we put the *-symbol to mark the edges that are also chosen by the SBF method as one of its top 20 edges. We see that, for the CAM method, 10 out of the 20 most frequently included edges are also present in the list of 20 edges for the SBF method in Table A3. Meanwhile, for the SCORE method, 9 out of the 20 are present in the list of Table A3. In particular, the eight edges in Table A6 are found by all three methods. This suggests that these eight edges may in fact form causal pathways for prostate cancer.

Table A6

The 8 commonly chosen edges by the SBF, CAM and SCORE methods

Edge Edge
1. CCNA2 CDC2 2. CCNA2 CDCA5 3. CCNB2 AURKA 4. CEP55 TOP2A
5. KIF15 ORC1L 6. DTL MELK 7. KIF15 NCAPG 8. CEP55 Ska1

References

[1] Peters J, Mooij JM, Janzing D, Schölkopf B. Causal discovery with continuous additive noise models. J Mach Learn Res. 2014;15:2009–53. Suche in Google Scholar

[2] Bühlmann P, Peters J, Ernest J. CAM: Causal additive models, high-dimensional order search and penalized regression. Ann Statist. 2014;42:2526–56. 10.1214/14-AOS1260Suche in Google Scholar

[3] Mammen E, Linton O, Nielsen J. The existence and asymptotic properties of a backfitting projection algorithm under weak conditions. Ann Statist. 1999;27:1443–90. 10.1214/aos/1017939138Suche in Google Scholar

[4] Yu K, Park BU, Mammen E. Smooth backfitting in generalized additive models. Ann Statist. 2008;36:228–60. 10.1214/009053607000000596Suche in Google Scholar

[5] Lee YK, Mammen E, Park BU. Backfitting and smooth backfitting for additive quantile models. Ann Statist. 2010;38:2857–83. 10.1214/10-AOS808Suche in Google Scholar

[6] Yu K, Mammen E, Park BU. Semi-parametric regression: Efficiency gains from modeling the nonparametric part. Bernoulli. 2011;17(2):736–48. 10.3150/10-BEJ296Suche in Google Scholar

[7] Lee YK, Mammen E, Park BU. Flexible generalized varying coefficient regression models. Ann Statist. 2012;40:1906–33. 10.1214/12-AOS1026Suche in Google Scholar

[8] Jeon JM, Park BU. Additive regression with Hilbertian responses. Ann Statist. 2020;48:2671–97. 10.1214/19-AOS1902Suche in Google Scholar

[9] Jeon JM, Park BU, Van Keilegom I. Additive regression for non-Euclidean responses and predictors. Ann Statist. 2021;49:2611–41. 10.1214/21-AOS2048Suche in Google Scholar

[10] Lin Z, Müller HG, Park BU. Additive models for symmetric positive-definite matrices and Lie groups. Biometrika. 2023;110:361–79. 10.1093/biomet/asac055Suche in Google Scholar

[11] Cho S, Jeon JM, Kim D, Yu K, Park BU. Partially Linear Additive Regression with a General Hilbertian Response. J Am Stat Assoc. 2023;119(546):942–56. 10.1080/01621459.2022.2149407Suche in Google Scholar

[12] Immer A, Schultheiss C, Vogt JE, Schölkopf B, Bühlmann P, Marx A. On the identifiability and estimation of causal location-scale noise models. Proceedings of the 40th International Conference on Machine Learning. JMLR.org; 2023. p. 14316–32. Suche in Google Scholar

[13] Spirtes P, Glymour C, Richard NS. Causation, prediction, and search. Cambridge: MIT Press; 2000. 10.7551/mitpress/1754.001.0001Suche in Google Scholar

[14] Colombo D, Maathuis M, Kalisch M, Richardson T. Learning high-dimensional directed acyclic graphs with latent and selection variables. Ann Statist. 2011;40:294–321. 10.1214/11-AOS940Suche in Google Scholar

[15] Heckerman D, Geiger D. Learning Bayesian networks: A unification for discrete and Gaussian domains. Proceedings of the Eleventh Annual Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc.; 1995. p. 274–84. Suche in Google Scholar

[16] Chickering DM. Learning equivalence classes of Bayesian-network structures. J Mach Learn Res. 2002;2:445–98. Suche in Google Scholar

[17] Teyssier M, Koller D. Ordering-based search: A simple and effective algorithm for learning Bayesian networks. Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence. AUAI Press; 2005. p. 584–90. Suche in Google Scholar

[18] Rolland P, Cevher V, Kleindessner M, Russell C, Janzing D, Schölkopf B, et al. Score matching enables causal discovery of nonlinear additive noise models. Proceedings of the 39th International Conference on Machine Learning. JMLR.org; 2022. p. 18741–53. Suche in Google Scholar

[19] Wang X, Du Y, Zhu S, Ke L, Chen Z, Hao J, et al. Ordering-based causal discovery with reinforcement learning. International Joint Conference on Artificial Intelligence. AAAI Press; 2021. p. 3566–73.10.24963/ijcai.2021/491Suche in Google Scholar

[20] Zheng X, Aragam B, Ravikumar P, Xing EP. DAGs with NO TEARS: continuous optimization for structure learning. Proceedings of the 32nd International Conference on Neural Information Processing Systems. Curran Associates Inc.; 2018. p. 9492–503. Suche in Google Scholar

[21] Lachapelle S, Brouillard P, Deleu T, Lacoste-Julien S. Gradient-based neural DAG learning. Eigth International Conference on Learning Representations. Curran Associates Inc.; 2020. Suche in Google Scholar

[22] Zhu S, Ng I, Chen Z. Causal discovery with reinforcement learning. Eigth International Conference on Learning Representations. Curran Associates Inc.; 2020. Suche in Google Scholar

[23] Shimizu S, Hoyer PO, Hyvärinen A, Kerminen A. A linear non-Gaussian acyclic model for causal discovery. J Mach Learn Res. 2006;7:2003–30. Suche in Google Scholar

[24] Zhang K, Hyvärinen A. On the identifiability of the post-nonlinear causal model. Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. AUAI Press; 2009. p. 647–55. Suche in Google Scholar

[25] Peters J, Bühlmann P. Identifiability of Gaussian structural equation models with equal error variances. Biometrika. 2014;101(1):219–28. 10.1093/biomet/ast043Suche in Google Scholar

[26] Glymour C, Zhang K, Spirtes P. Review of causal discovery methods based on graphical models. Front Genetics. 2019:10:00524. 10.3389/fgene.2019.00524Suche in Google Scholar PubMed PubMed Central

[27] Bongers S, Forré P, Peters J, Mooij JM. Foundations of structural causal models with cycles and latent variables. Ann Statist. 2021;49:2885–915. 10.1214/21-AOS2064Suche in Google Scholar

[28] Strobl EV, Lasko TA. Identifying patient-specific root causes with the heteroscedastic noise model. J Comput Sci. 2023;72:102099. 10.1016/j.jocs.2023.102099

[29] Peters J, Bühlmann P. Structural intervention distance for evaluating causal graphs. Neural Comput. 2015;27(3):771–99. 10.1162/NECO_a_00708

[30] Broomhead DS, Lowe D. Multivariable functional interpolation and adaptive networks. Complex Syst. 1988;2:321–55.

[31] Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine learning in Python. J Mach Learn Res. 2011;12:2825–30.

[32] Lee ER, Park S, Mammen E, Park BU. Efficient functional Lasso kernel smoothing for high-dimensional additive regression. Ann Statist. 2024;52:1741–73. 10.1214/24-AOS2415

[33] Tibshirani R. Regression shrinkage and selection via the Lasso. J R Stat Soc Ser B (Methodological). 1996;58(1):267–88. 10.1111/j.2517-6161.1996.tb02080.x

[34] Silverman BW. Density estimation for statistics and data analysis. London: Chapman & Hall; 1986.

[35] Shimizu S, Inazumi T, Sogawa Y, Hyvärinen A, Kawahara Y, Washio T, et al. DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model. J Mach Learn Res. 2011;12:1225–48.

[36] Ikeuchi T, Ide M, Zeng Y, Maeda TN, Shimizu S. Python package for causal discovery based on LiNGAM. J Mach Learn Res. 2023;24:1–8.

[37] Wood S. Generalized additive models: An introduction with R. 2nd edition. London: Chapman and Hall/CRC; 2017. 10.1201/9781315370279

[38] Wang L, Wu M, Wu Y, Zhang X, Li S, He M, et al. Prediction of the disease causal genes based on heterogeneous network and multi-feature combination method. Comput Biol Chem. 2022;97:107639. 10.1016/j.compbiolchem.2022.107639

[39] Kim YA, Wuchty S, Przytycka TM. Identifying causal genes and dysregulated pathways in complex diseases. PLoS Comput Biol. 2011;7:1–13. 10.1371/journal.pcbi.1001095

[40] Auer PL. Finding causal genes underlying risk for coronary artery disease. Nat Genet. 2022;54(12):1768–9. 10.1038/s41588-022-01094-z

[41] Wang L, Tang H, Thayanithy V, Subramanian S, Oberg AL, Cunningham JM, et al. Gene networks and microRNAs implicated in aggressive prostate cancer. Cancer Res. 2009;69:9490–7. 10.1158/0008-5472.CAN-09-2183

[42] Davis S, Meltzer P. GEOquery: a bridge between the gene expression omnibus (GEO) and BioConductor. Bioinformatics. 2007;23:1846–7. 10.1093/bioinformatics/btm254

[43] Park BU. Nonparametric additive regression. Proceedings of the International Congress of Mathematicians. Birkhäuser Basel; 2018. p. 2995–3018. 10.1142/9789813272880_0169

[44] Schultheiss C, Bühlmann P. On the pitfalls of Gaussian likelihood scoring for causal discovery. J Causal Inference. 2023;11(1):1–11. 10.1515/jci-2022-0068

[45] Fan J, Gijbels I. Local polynomial modelling and its applications. 1st edition. New York: Routledge; 1996.

Received: 2024-07-05
Revised: 2025-04-30
Accepted: 2025-07-10
Published Online: 2025-10-16

© 2025 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
