The variance of causal effect estimators for binary v-structures

Jack Kuipers and Giusi Moffa

Article, Open Access. Published: May 25, 2022
Abstract

Adjusting for covariates is a well-established method to estimate the total causal effect of an exposure variable on an outcome of interest. Depending on the causal structure of the mechanism under study, there may be different adjustment sets, equally valid from a theoretical perspective, leading to identical causal effects. However, in practice, with finite data, estimators built on different sets may display different precisions. To investigate the extent of this variability, we consider the simplest non-trivial non-linear model of a v-structure on three nodes for binary data. We explicitly compute and compare the variance of the two possible different causal estimators. Further, by going beyond leading-order asymptotics, we show that there are parameter regimes where the set with the asymptotically optimal variance does depend on the edge coefficients, a result that is not captured by the recent leading-order developments for general causal models. As a practical consequence, the adjustment set selection needs to account for the relative magnitude of the relationships between variables with respect to the sample size and cannot rely on purely graphical criteria.

MSC 2010: 62H22

1 Introduction

As graphical representations of multivariate probability distributions, Bayesian networks are widely used statistical models with an underlying directed acyclic graph (DAG) structure. When taking DAGs to represent causal diagrams [1,2,3,4], we may use a machinery based on the “do” calculus [5] to estimate potential intervention effects of any variable on any other. Different graphical criteria exist to identify valid adjustment sets, among which the back-door criterion [6] is probably the best known, with more general strategies developed more recently [7,8].

A valid adjustment set Z for the effect of X on Y is such that for any probability distribution p compatible with the underlying graphical structure, the probability distribution of Y after intervening on X (setting it to some value) satisfies [9]

(1) $p(Y \mid \mathrm{do}(X)) = \begin{cases} p(Y \mid X) & \text{if } Z = \emptyset \\ \int p(Y \mid X, z)\, p(z)\, \mathrm{d}z & \text{otherwise.} \end{cases}$

For linear Gaussian models, the marginalisation can be simply estimated by regressing Y on X and Z and extracting the coefficient of X , hence the naming of “adjustment” sets. This also holds for linear non-Gaussian causal models [10].

The set of parents of X always satisfies the back-door criterion and is therefore a valid adjustment set, but there may be many more depending on the graphical structure of the DAG [8]. Although all valid adjustment sets provide consistent estimators of the causal effects, for finite-sized data, different adjustment sets can lead to different numerical estimates, and with different precisions.

In evaluating the variance of different estimators, a remarkable result has recently been obtained [10]: the asymptotically optimal adjustment set can be determined solely from graphical criteria, regardless of the edge coefficients. Even more recently, this has been extended to non-parametric estimators [11], and the asymptotically optimal set has been further characterised [12].

To explore the precision of causal estimators for non-linear models, we consider the simplest such case: a DAG with three nodes of binary variables organised in a v-structure with the outcome Y of interest as a collider with parents Z and X (Figure 1) and with the latter being the exposure whose effect we wish to estimate. For binary data and relatively small networks, one can explicitly marginalise over the remaining nodes in the DAG and its parameters [13] to derive interventional distributions as follows:

(2) $p(Y \mid \mathrm{do}(X)) = \sum_{Z} p(Y, Z \mid \mathrm{do}(X)),$

and estimate causal effects from them.

Figure 1: A v-structure on three nodes.

In the simple case of a v-structure (as in Figure 1), there is no confounding of the effect of X on Y (there are no common parents) so that the empty set constitutes a valid (and minimal) adjustment set, and the interventional distribution is simply

(3) $p(Y \mid \mathrm{do}(X)) = p(Y \mid X).$

A valid expression for computing the total causal effect of X on Y , in accordance with equation (1), is then

(4) $F_R = p(Y \mid \mathrm{do}(X=1)) - p(Y \mid \mathrm{do}(X=0)) = p(Y \mid X=1) - p(Y \mid X=0),$

where we used the subscript R for raw, to highlight the fact that the formula only involves raw (or observed) conditional probabilities of $Y$ given $X$, which in this simple scenario are sufficient to identify the desired causal effects.

However, by definition, the conditional distribution of $Y$ given $X$ is also

(5) $p(Y \mid X) = \sum_{Z} p(Y, Z \mid X) = \sum_{Z} p(Y \mid X, Z)\, p(Z).$

Therefore, another valid expression for the total causal effect of X on Y is

(6) $F_M = P(Y \mid X=1, Z=1)P(Z) + P(Y \mid X=1, Z=0)(1-P(Z)) - P(Y \mid X=0, Z=1)P(Z) - P(Y \mid X=0, Z=0)(1-P(Z)),$

where we used the subscript M to highlight the fact that the formula derives from explicitly marginalising $Z$ out from the joint distribution $p(Y, Z \mid X)$. In contrast, one could interpret the formula based on raw conditionals as performing the marginalisation implicitly (with the observations already providing a marginalised sample).
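To make the two expressions concrete, here is a minimal Python sketch (our illustration, not code from the paper or its repository; the parameter values are arbitrary assumptions) evaluating both formulas from the same probability tables:

```python
# A minimal population-level sketch (our illustration, not the authors' code):
# with Z independent of X, the raw conditional p(Y=1|X) of equation (5) is the
# same Z-average that the marginalisation formula (6) computes explicitly.
p_Z = 2/3                                                    # illustrative values
p_Y = {(0, 0): 1/6, (0, 1): 1/2, (1, 0): 1/3, (1, 1): 5/6}   # p(Y=1|X,Z)

def p_y_given_x(x):   # p(Y=1|X=x) = sum_Z p(Y=1|X,Z) p(Z)
    return p_Y[(x, 1)] * p_Z + p_Y[(x, 0)] * (1 - p_Z)

F_R = p_y_given_x(1) - p_y_given_x(0)                        # equation (4)
F_M = (p_Y[(1, 1)] * p_Z + p_Y[(1, 0)] * (1 - p_Z)
       - p_Y[(0, 1)] * p_Z - p_Y[(0, 0)] * (1 - p_Z))        # equation (6)
assert abs(F_R - F_M) < 1e-12   # identical in the population
```

At the population level the two quantities coincide, as equation (5) guarantees; the distinction only matters for the finite-data estimators introduced in Section 2.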

A more general way of understanding equation (6) is by observing that, in the case of the v-structure, $Z$ also constitutes a valid adjustment set (albeit not a minimal one). Then starting from the joint interventional distribution $p(Y, Z \mid \mathrm{do}(X))$, the interventional distribution of $Y$ when intervening on $X$ is also

(7) $p(Y \mid \mathrm{do}(X)) = \sum_{Z} p(Y \mid \mathrm{do}(X), Z)\, p(Z \mid \mathrm{do}(X)) = \sum_{Z} p(Y \mid X, Z)\, p(Z),$

with the latter equality justified by structural and invariance properties and also in agreement with the standardisation formula in equation (1).

Since we see that adjustment by Z is valid, but not necessary, it is natural to ask whether the two estimators differ in terms of precision. To answer the question we compute the variance, for finite sample sizes, of the two different estimators corresponding to the implicit and explicit marginalisation as outlined earlier.

It is instructive to also consider the DAG with the edge from $Z \to Y$ deleted. Since $Y$ would then be independent of $Z$, the marginalisation would reduce to the raw conditionals. The estimator using raw conditionals is therefore the same whether or not the edge from $Z \to Y$ is present, while the approach using marginalisation would give different estimates for the two cases. Intuitively, we would expect that the extent to which estimates differ will depend on the strength of the relationship between $Y$ and $Z$. The underlying rationale is similar to that for the standard practice of adjusting for baseline covariates in models of the outcome in randomised controlled trials [14], where the prognostic factors $Z$ and the (randomised) treatment $X$ can be seen as forming a v-structure with the outcome $Y$ as the collider.

The question of whether adjusting for baseline covariates is justified or even desirable is indeed a recurrent one with a long history in the context of randomised controlled trials [15,16,17,18,19,20,21]. An extensive body of literature exists with several proposals to exploit covariate adjustment in order to build more efficient estimators [22,23,24,25,26,27]. Health authority guidelines also typically recommend adjustment for the sake of improving precision [28,29,30]; however, in the case of non-linear models, special care must be taken to account for the potential non-collapsibility of effect measures [31,32,33] and the fact that the estimand may change depending on the method used for adjustment [34]. Recent work [26] examines the performance of covariate-adjusted estimators in clinical trials with binary, ordinal, or time-to-event outcomes. For binary outcomes, they present simulation results where the baseline covariate represents categories of age and the estimand is the risk difference, for a magnitude of the prognostic value of the covariate mimicking the association found in observational data. Our study considers the question from a theoretical and causal diagram perspective, in a slightly simplified scenario where the treatment, outcome, and prognostic factors are all binary variables, but where we evaluate how things change with the strength of the prognostic value of the covariate.

Therefore, we consider the v-structure since it provides the simplest example of a causal diagram where there is a choice between different adjustment sets. If we add an edge in the graph of Figure 1 connecting $X$ and $Z$, we end up with no choice about adjustment sets: in particular, if we add an edge from $X \to Z$, then $Z$ is not a valid adjustment set and the empty set is the only choice; conversely, if we add an edge from $Z \to X$, then $Z$ is a confounder (a common parent), and it must be adjusted for, making it the only valid adjustment set with the empty set no longer valid.

2 Causal estimates for a binary v-structure

For both causal estimators, we will use the maximum likelihood estimates of probabilities from the observed data. We consider the DAG in Figure 1 with the following probability tables:

(8) $p(X=1) = p_X, \qquad p(Y=1 \mid X=0, Z=0) = p_{Y,0}, \qquad p(Y=1 \mid X=1, Z=0) = p_{Y,2},$
$\phantom{(8)} \; p(Z=1) = p_Z, \qquad p(Y=1 \mid X=0, Z=1) = p_{Y,1}, \qquad p(Y=1 \mid X=1, Z=1) = p_{Y,3}.$

When we generate data, as a collection of N binary vectors, from the DAG in Figure 1, instead of forward sampling along the topological order for this small example, we can sample directly from a multinomial with probabilities

(9) $p_i = p(X)\, p(Z)\, p(Y \mid X, Z)$ for $i = 4X + 2Z + Y$, namely
$p_0 = (1-p_X)(1-p_Z)(1-p_{Y,0}), \quad p_2 = (1-p_X)p_Z(1-p_{Y,1}), \quad p_4 = p_X(1-p_Z)(1-p_{Y,2}), \quad p_6 = p_X p_Z(1-p_{Y,3}),$
$p_1 = (1-p_X)(1-p_Z)\,p_{Y,0}, \quad p_3 = (1-p_X)p_Z\,p_{Y,1}, \quad p_5 = p_X(1-p_Z)\,p_{Y,2}, \quad p_7 = p_X p_Z\,p_{Y,3}.$

If we represent with $N_i$ the number of sampled binary vectors indexed by $i = 4X + 2Z + Y$, then the estimator of $F$ from the raw conditionals is simply

(10) $R = R_1 - R_0, \qquad R_1 = \frac{N_5 + N_7}{N_4 + N_5 + N_6 + N_7}, \qquad R_0 = \frac{N_1 + N_3}{N_0 + N_1 + N_2 + N_3}.$

By using the marginalisation, we would have the following estimator:

(11) $M = M_1 - M_0, \qquad M_1 = M_{11} + M_{10}, \qquad M_0 = M_{01} + M_{00},$

with the terms separated for later ease

(12) $M_{11} = \frac{N_7}{(N_6+N_7)} \frac{(N_2+N_3+N_6+N_7)}{N}, \qquad M_{01} = \frac{N_3}{(N_2+N_3)} \frac{(N_2+N_3+N_6+N_7)}{N},$
$\phantom{(12)} \; M_{10} = \frac{N_5}{(N_4+N_5)} \frac{(N_0+N_1+N_4+N_5)}{N}, \qquad M_{00} = \frac{N_1}{(N_0+N_1)} \frac{(N_0+N_1+N_4+N_5)}{N}.$
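In code, the two estimators are direct transcriptions of equations (10)-(12); a short Python sketch (our own, with hypothetical function names, not the repository implementation):

```python
import numpy as np

def R_estimator(n):
    """Raw-conditional estimator, equation (10); n[i] are counts with i = 4X + 2Z + Y."""
    R1 = (n[5] + n[7]) / (n[4] + n[5] + n[6] + n[7])
    R0 = (n[1] + n[3]) / (n[0] + n[1] + n[2] + n[3])
    return R1 - R0

def M_estimator(n):
    """Marginalisation estimator, equations (11)-(12)."""
    N = n.sum()
    pZ1 = (n[2] + n[3] + n[6] + n[7]) / N   # estimate of p(Z = 1)
    pZ0 = (n[0] + n[1] + n[4] + n[5]) / N   # estimate of p(Z = 0)
    M11 = n[7] / (n[6] + n[7]) * pZ1
    M10 = n[5] / (n[4] + n[5]) * pZ0
    M01 = n[3] / (n[2] + n[3]) * pZ1
    M00 = n[1] / (n[0] + n[1]) * pZ0
    return (M11 + M10) - (M01 + M00)
```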

These estimators, as they rely on observed data frequencies, are non-parametric and fit in the recent general framework for arbitrary graphs [11]. The key advance of our derivation with respect to their result is that we consider terms beyond the leading-order asymptotics and compute the variance of the estimators for arbitrary sample sizes, which further enables us to perform more detailed asymptotic analyses.

2.1 Raw conditionals

To compute E [ R ] , we need to average over a multinomial sample

(13) $E[R] = \sum \frac{N!}{N_0! \cdots N_7!}\, p_0^{N_0} \cdots p_7^{N_7}\, R,$

for which we use the fact that $(p_0 + \cdots + p_7)^N$ generates the probability distribution when we perform a multinomial expansion. To obtain the terms needed for the expectation, we define

(14) $S_N = \{[p_0 + p_2 + (p_1 + p_3)w]x + [p_4 + p_6 + (p_5 + p_7)v]z\}^N,$

with four auxiliary generating variables $w, x, v, z$, whose expansion is

(15) $S_N = \sum \frac{N!}{N_0! \cdots N_7!}\, p_0^{N_0} \cdots p_7^{N_7}\, w^{N_1+N_3}\, x^{N_0+N_1+N_2+N_3}\, v^{N_5+N_7}\, z^{N_4+N_5+N_6+N_7}.$

Setting all the generating variables to 1 removes them from consideration and the generating function simplifies to the value 1:

(16) $S_N \big|_{w=x=v=z=1} = 1.$

The advantage of using generating functions [35] is that we can express expectations in terms of differential and integral operators. For example, the operator $v\frac{\partial}{\partial v}$ acting on $S_N$ will bring down a factor of $(N_5+N_7)$ from the power of $v$ through differentiation, and we then multiply by $v$ to leave the power unchanged (a useful feature to apply multiple operators later). The effect of the operator is easiest to see when we apply it to the expanded form of $S_N$ from equation (15):

(17) $v\frac{\partial}{\partial v} S_N = \sum \frac{N!}{N_0! \cdots N_7!}\, p_0^{N_0} \cdots p_7^{N_7}\, (N_5+N_7)\, w^{N_1+N_3}\, x^{N_0+N_1+N_2+N_3}\, v^{N_5+N_7}\, z^{N_4+N_5+N_6+N_7}.$

Removing the generating variables after applying the operator leads to

(18) $\left. v\frac{\partial}{\partial v} S_N \right|_{w=x=v=z=1} = \sum \frac{N!}{N_0! \cdots N_7!}\, p_0^{N_0} \cdots p_7^{N_7}\, (N_5+N_7) = E[N_5+N_7],$

which is an expectation over the multinomial probability distribution of a binary v-structure. To actually perform the differentiation, we employ the compact form of $S_N$ from equation (14) to easily obtain the result of $N(p_5+p_7)$.
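The mechanics are easy to verify symbolically; a small sketch of this check with sympy (our own illustration, not part of the paper):

```python
import sympy as sp

# Sketch: symbolic check that v d/dv applied to the compact S_N of equation
# (14), with the generating variables then set to 1, returns E[N_5 + N_7].
p0, p1, p2, p3, p4, p5, p6, p7 = sp.symbols('p0:8', positive=True)
w, x, v, z, N = sp.symbols('w x v z N')
S = ((p0 + p2 + (p1 + p3)*w)*x + (p4 + p6 + (p5 + p7)*v)*z)**N
expr = (v * sp.diff(S, v)).subs({w: 1, x: 1, v: 1, z: 1})
print(sp.simplify(expr))
# N*(p5 + p7)*(p0 + ... + p7)**(N - 1), i.e. N*(p5 + p7) since the p_i sum to 1
```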

The integral operator $\int \frac{\mathrm{d}z}{z}$ will introduce a denominator of $(N_4+N_5+N_6+N_7)$, so by combining operators, we derive our first expectation of interest:

(19) $E[R_1] = \left. \int \frac{\mathrm{d}z}{z}\, v\frac{\partial}{\partial v} S_N \right|_{w=x=v=z=1} = \left. \frac{v(p_5+p_7)}{p_4+p_6+(p_5+p_7)v}\, S_N \right|_{w=x=v=z=1},$

which follows from applying the operators to equation (14). When we substitute for the generating variables (which sets S N = 1 ) and perform the same steps for R 0 , we obtain

(20) $E[R] = \frac{p_5+p_7}{p_4+p_5+p_6+p_7} - \frac{p_1+p_3}{p_0+p_1+p_2+p_3} = \frac{p_5+p_7}{p_X} - \frac{p_1+p_3}{1-p_X}.$

2.1.1 The variance

To compute the variance

(21) $V[R] = V[R_1] - 2C[R_1, R_0] + V[R_0],$

we first show that the covariance is 0

(22) $E[R_1 R_0] = \left. \int\int \frac{\mathrm{d}x}{x} \frac{\mathrm{d}z}{z}\, w\frac{\partial}{\partial w}\, v\frac{\partial}{\partial v} S_N \right|_{w=x=v=z=1} = \left. \int \frac{\mathrm{d}x}{x}\, w\frac{\partial}{\partial w}\, \frac{v(p_5+p_7)}{p_4+p_6+(p_5+p_7)v}\, S_N \right|_{w=x=v=z=1}$
$= \left. \frac{v(p_5+p_7)}{p_4+p_6+(p_5+p_7)v}\, \frac{w(p_1+p_3)}{p_0+p_2+(p_1+p_3)w}\, S_N \right|_{w=x=v=z=1} = E[R_1]\, E[R_0].$

The last equality follows by comparing the values of E [ R 1 ] and E [ R 0 ] computed earlier.

The trickier terms are as follows:

(23) $E[R_1^2] = \left. \int\int \frac{\mathrm{d}z}{z} \frac{\mathrm{d}z}{z}\, v\frac{\partial}{\partial v}\, v\frac{\partial}{\partial v} S_N \right|_{w=x=v=z=1} = \left. \int \frac{\mathrm{d}z}{z}\, v\frac{\partial}{\partial v} \left[\frac{v(p_5+p_7)}{p_4+p_6+(p_5+p_7)v}\, S_N\right] \right|_{w=x=v=z=1}$
$= \left. \frac{v(p_5+p_7)}{p_4+p_6+(p_5+p_7)v} \int \left[\frac{(p_4+p_6)}{p_4+p_6+(p_5+p_7)v}\, \frac{S_N}{z} + \frac{1}{z}\, v\frac{\partial}{\partial v} S_N \right] \mathrm{d}z \right|_{w=x=v=z=1} = \frac{(p_5+p_7)(p_4+p_6)}{p_X^2} \left. \int (1-p_X+p_X z)^N\, \frac{\mathrm{d}z}{z} \right|_{z=1} + E[R_1]^2.$

The remaining integral can be expressed in terms of hypergeometric functions:

(24) $\left. \int (1-p_X+p_X z)^N\, \frac{\mathrm{d}z}{z} \right|_{z=1} = \sum_{k=1}^{N} \binom{N}{k} \frac{1}{k}\, p_X^k (1-p_X)^{N-k} = N p_X (1-p_X)^{N-1}\, F\left([1,1,1-N],[2,2], -\frac{p_X}{1-p_X}\right),$

where the first step follows from performing a binomial expansion and integrating term by term, while the second step follows from the definition of the hypergeometric function, where we use the notation $F([a_1, \ldots, a_p], [b_1, \ldots, b_q], t)$ for the generalised hypergeometric function ${}_pF_q(a_1, \ldots, a_p; b_1, \ldots, b_q; t)$ with square brackets to help delineate the different arguments. By repeating the calculations for $V[R_0]$, we obtain

(25) $V[R] = \frac{(p_5+p_7)(p_4+p_6)}{p_X}\, N (1-p_X)^{N-1}\, F\left([1,1,1-N],[2,2], -\frac{p_X}{1-p_X}\right) + \frac{(p_1+p_3)(p_0+p_2)}{1-p_X}\, N p_X^{N-1}\, F\left([1,1,1-N],[2,2], -\frac{1-p_X}{p_X}\right).$
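This closed form can be evaluated with standard special-function libraries; for example, a short sketch using mpmath's generalised hypergeometric function (our own code, assuming the cell probabilities $p_i$ of equation (9)):

```python
import mpmath as mp

def var_R(p, N):
    """Exact V[R] of equation (25); p[i] are cell probabilities, i = 4X + 2Z + Y."""
    pX = p[4] + p[5] + p[6] + p[7]
    F1 = mp.hyper([1, 1, 1 - N], [2, 2], -pX / (1 - pX))
    F0 = mp.hyper([1, 1, 1 - N], [2, 2], -(1 - pX) / pX)
    return ((p[5] + p[7]) * (p[4] + p[6]) / pX * N * (1 - pX)**(N - 1) * F1
            + (p[1] + p[3]) * (p[0] + p[2]) / (1 - pX) * N * pX**(N - 1) * F0)

# e.g. for the example parameters of Section 2.3 and N = 100,
# mp.sqrt(var_R(p, 100)) comes out close to the 0.1019 quoted there
```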

We discuss bounds on this variance in Appendix A.

2.2 Marginalisation

To compute the expected value E [ M ] , we define

(26) $T_N = \{a p_0 s + a p_1 s t + b p_2 u + b p_3 u v + a p_4 w + a p_5 w x + b p_6 y + b p_7 y z\}^N,$

where we include extra generating variables for all the terms in our estimators, requiring the ten generating variables $a, b, s, t, u, v, w, x, y, z$. Then

(27) $E[M_{11}] = \frac{1}{N} \left. \int \frac{\mathrm{d}y}{y}\, b z \frac{\partial^2}{\partial b\, \partial z}\, T_N \right|_{a=b=s=t=u=v=w=x=y=z=1} = \left. \frac{b z p_7 (p_2 u + p_3 u v + p_6 y + p_7 y z)}{(p_6 + p_7 z)}\, T_{N-1} \right|_{a=b=s=t=u=v=w=x=y=z=1} = \frac{p_7}{(p_6+p_7)}\, p_Z,$

where the evaluation sets all ten generating variables to 1 (abbreviated $a = \cdots = z = 1$ below),

and similarly for the other terms, leading to

(28) $E[M] = \frac{p_7}{(p_6+p_7)}\, p_Z + \frac{p_5}{(p_4+p_5)}\, (1-p_Z) - \frac{p_3}{(p_2+p_3)}\, p_Z - \frac{p_1}{(p_0+p_1)}\, (1-p_Z).$

To compute the variance, we reapply the operators of equation (27), as detailed in Appendix B.

2.3 Numerical checks

Code to evaluate the variance of the two estimators through simulation, as well as to evaluate the analytical results, is hosted at https://github.com/jackkuipers/Vcausal. As an example, for $p_X = \frac{1}{3}$, $p_Z = \frac{2}{3}$, $p_{Y,0} = \frac{1}{6}$, $p_{Y,1} = \frac{1}{2}$, $p_{Y,2} = \frac{1}{3}$, $p_{Y,3} = \frac{5}{6}$ and $N = 100$, we obtained Monte Carlo estimates of the standard deviation of $R$ and $M$ as 0.101929 and 0.0924014, respectively, from 40 million repetitions. This agrees with the respective analytical results of 0.101932 and 0.0924017.
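For illustration, a minimal Python simulation sketch of this kind of check (independent of the repository code, and with far fewer repetitions):

```python
import numpy as np

rng = np.random.default_rng(0)
p_X, p_Z = 1/3, 2/3
p_Y = [1/6, 1/2, 1/3, 5/6]      # p(Y=1|X,Z) for (X,Z) = (0,0), (0,1), (1,0), (1,1)
# multinomial cell probabilities with i = 4X + 2Z + Y, as in equation (9)
p = np.array([(1-p_X)*(1-p_Z)*(1-p_Y[0]), (1-p_X)*(1-p_Z)*p_Y[0],
              (1-p_X)*p_Z*(1-p_Y[1]),     (1-p_X)*p_Z*p_Y[1],
              p_X*(1-p_Z)*(1-p_Y[2]),     p_X*(1-p_Z)*p_Y[2],
              p_X*p_Z*(1-p_Y[3]),         p_X*p_Z*p_Y[3]])

N, reps = 100, 200_000          # far fewer repetitions than the 40 million in the text
n = rng.multinomial(N, p, size=reps)
# discard the rare draws where a conditioning cell is empty (probability ~ 1e-5 here)
ok = ((n[:, :2].sum(1) > 0) & (n[:, 2:4].sum(1) > 0)
      & (n[:, 4:6].sum(1) > 0) & (n[:, 6:].sum(1) > 0))
n = n[ok]

R = (n[:, 5] + n[:, 7]) / n[:, 4:].sum(1) - (n[:, 1] + n[:, 3]) / n[:, :4].sum(1)
pZ1 = n[:, [2, 3, 6, 7]].sum(1) / N
M = (n[:, 7] / (n[:, 6] + n[:, 7]) - n[:, 3] / (n[:, 2] + n[:, 3])) * pZ1 \
    + (n[:, 5] / (n[:, 4] + n[:, 5]) - n[:, 1] / (n[:, 0] + n[:, 1])) * (1 - pZ1)
print(R.std(), M.std())         # close to the analytical 0.1019 and 0.0924
```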

By having the exact analytical result for any parameter combinations, we can avoid expensive Monte Carlo simulations. This allows us to plot the variances over ranges of parameter values and examine their asymptotic behaviour.

2.4 Relative difference in variances

To explore the difference in variances, we focus mainly on the case where the effect of Z on Y is the same for each X to remove one parameter degree of freedom from the seven in total. In this case, we write the probabilities as follows:

(29) $p_{Y,0} = q_0 - C, \qquad p_{Y,1} = q_0 + C, \qquad p_{Y,2} = q_1 - C, \qquad p_{Y,3} = q_1 + C,$

where $C$ is a measure of the effect of $Z$ on $Y$ (the same for each $X$) and the causal effect of $X$ on $Y$ is $q_1 - q_0$.

We plot the relative difference in variances of the two estimators, $\Delta = \frac{V[M] - V[R]}{V[R]}$. In Figure 2, we leave $p_X$ free, set $p_Z = \frac{2}{3}$, $q_0 = \frac{1}{3}$, $q_1 = \frac{2}{3}$ and plot $\Delta$ for $N = 100$ and $N = 400$. In the plot for $N = 400$, we also scaled $C$ by dividing by 2. The behaviour and rescaled plots are very similar, suggesting an $N^{-\frac{1}{2}}$ scaling.

Figure 2: The relative difference in variance of the two estimators for two sample sizes. (a) $N = 100$ and (b) $N = 400$.

With our general result, we can further allow for an interaction between $Z$ and $X$ and have a different effect of $Z$ on $Y$ for each $X$. To account for the interaction strength, we introduce a parameter $D$, which we multiply by $p_Z$ or $(1-p_Z)$ in the following parameterisation:

(30) $p_{Y,0} = q_0 - C - D p_Z, \qquad p_{Y,1} = q_0 + C + D(1-p_Z), \qquad p_{Y,2} = q_1 - C + D p_Z, \qquad p_{Y,3} = q_1 + C - D(1-p_Z),$

so that the causal effect of $X$ on $Y$ remains unchanged as $q_1 - q_0$. The effect of $Z$ on $Y$ in the probability tables is then $2C + D$ for $X = 0$ and $2C - D$ for $X = 1$, making $D$ a measure of the interaction effect between $Z$ and $X$. In Figure 3, we again leave $p_X$ and $C$ free, set $p_Z = \frac{2}{3}$, $q_0 = \frac{1}{3}$, $q_1 = \frac{2}{3}$, $N = 100$ and plot $\Delta$ for $D = \frac{1}{8}$ and $D = \frac{1}{4}$. We observe, as in Figure 2, a central region where $R$ is the more efficient estimator and tails for larger $C$, where $M$ is the better estimator, but now with a rotation dependent on $D$ and $p_X$.

Figure 3: The relative difference in variance of the two estimators for two interaction strengths. (a) $D = \frac{1}{8}$ and (b) $D = \frac{1}{4}$.

3 Asymptotic behaviour

To examine the asymptotic behaviour of the causal effect estimators in more detail, we return here to the setting of equation (29) with the same effect of $Z$ on $Y$ for each $X$ (or with $D = 0$), and we expand the hypergeometric function as in Appendix C. We treat the general case with interactions ($D \neq 0$) in Appendix D, while without interactions ($D = 0$) we obtain the following formula for the variance of $R$:

(31) $V[R]\, N = \frac{(q_1 + (2p_Z-1)C)(1 - q_1 - (2p_Z-1)C)}{p_X}\left[1 + \frac{(1-p_X)}{N p_X}\right] + \frac{(q_0 + (2p_Z-1)C)(1 - q_0 - (2p_Z-1)C)}{1-p_X}\left[1 + \frac{p_X}{N(1-p_X)}\right] + O(N^{-2}),$

and for the variance of M ,

(32) $V[M]\, N = \frac{q_1(1-q_1) - C^2}{p_X}\left[1 + \frac{2(1-p_X)}{N p_X}\right] - \frac{(2q_1-1)(2p_Z-1)C}{p_X} + \frac{q_0(1-q_0) - C^2}{1-p_X}\left[1 + \frac{2p_X}{N(1-p_X)}\right] - \frac{(2q_0-1)(2p_Z-1)C}{1-p_X} + O(N^{-2}).$

To extract the asymptotic behaviour of the difference in variances of the two estimators, we consider $C \sim N^{-\frac{1}{2}}$ to obtain

(33) $(V[M] - V[R])\, N = \frac{q_1(1-q_1)(1-p_X)}{N p_X^2} + \frac{q_0(1-q_0)\, p_X}{N (1-p_X)^2} - \frac{4 p_Z(1-p_Z)}{p_X(1-p_X)}\, C^2 + O\left(N^{-\frac{3}{2}}\right),$

with root

(34) $C^* = \left[\frac{1}{4 N p_Z(1-p_Z)}\left(\frac{q_1(1-q_1)}{p_X}(1-p_X)^2 + \frac{q_0(1-q_0)}{(1-p_X)}\, p_X^2\right)\right]^{\frac{1}{2}},$

so that

(35) $\lim_{N\to\infty} V[M] - V[R] < 0, \quad C > C^*; \qquad \lim_{N\to\infty} V[M] - V[R] > 0, \quad C < C^*.$

Note that although we used the scaling $C \sim N^{-\frac{1}{2}}$ to extract this result, it holds more generally. For example, for fixed $C \neq 0$, it is trivial to see that $C > C^*$ for some $N$, so that $V[M]$ will become lower than $V[R]$ in the limit $N \to \infty$. The asymptotically optimal adjustment set therefore uses marginalisation rather than raw conditioning, in line with previous results [10,11] from the leading-order asymptotics. For fixed $C = 0$, however, raw conditioning would be better. It is exactly by treating subleading terms, as we do here, that we can examine where the transition occurs and how it depends on the coefficients. For weaker effects of the edge from $Z \to Y$, with $C \sim N^{-\frac{1}{2}}$, the raw conditionals can give a more precise estimate of the causal effect of $X$ on $Y$.

The result in equation (35) therefore shows that the optimal adjustment set, for the effect of X on Y , does not depend solely on graphical criteria, but crucially on the relative effect size of the influence of the other node Z on the outcome Y . The result details, for large but finite sample sizes, the crossover in efficiency from R to M as the relative effect of C increases. Therefore, starting from plausible parameter values, we can use equations (34) and (35) to guide our choice of adjustment.
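For instance, a small sketch evaluating the threshold of equation (34) for the parameter values used in Section 2.3 (our own illustration):

```python
import numpy as np

def C_star(N, p_X, p_Z, q0, q1):
    """Crossover threshold of equation (34)."""
    s = q1 * (1 - q1) * (1 - p_X)**2 / p_X + q0 * (1 - q0) * p_X**2 / (1 - p_X)
    return np.sqrt(s / (4 * N * p_Z * (1 - p_Z)))

print(C_star(100, 1/3, 2/3, 1/3, 2/3))   # about 0.061: stronger edges favour M, weaker favour R
```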

Since for many practical purposes the sample size may be determined by other considerations and we can never take the limit $N \to \infty$, the asymptotic regime developed here, accounting for the relative scale of effects compared to the sample size, is the most relevant. Although derived just for the binary v-structure, this is a counterexample showing that recent leading-order asymptotic results [11] cannot directly extend outside their particular asymptotic limit.

4 Implications for causal discovery

With a larger sample size, we may be able to detect and quantify smaller causal effects. Therefore, we wish to get a feeling for the strength of the edge $Z \to Y$ we would detect from the data, or equivalently for which values of $C$ we would infer the presence of the edge. To do so, we calculate the expected difference in maximised log-likelihoods when including the edge compared to a DAG with the edge deleted:

(36) $E[\Delta l] = \frac{1}{2} + E[N_7] \ln(q_1+C) + E[N_6] \ln(1-q_1-C) - E[N_7] \ln(q_1) - E[N_6] \ln(1-q_1) + \cdots$
$= \frac{1}{2} + N p_X p_Z (q_1+C) \ln\left(1 + \frac{C}{q_1}\right) + N p_X p_Z (1-q_1-C) \ln\left(1 - \frac{C}{1-q_1}\right) + \cdots$
$= \frac{1}{2} + \frac{N}{2}\left[\frac{p_X}{q_1(1-q_1)} + \frac{(1-p_X)}{q_0(1-q_0)}\right] C^2 + O(C^3),$

where the $\frac{1}{2}$ comes from Wilks' theorem [36] for the additional parameter when maximising all the probabilities relative to evaluating with the restriction $C = 0$.

The change in AIC is then

(37) $E[\Delta \mathrm{AIC}] = 2 - 2 E[\Delta l] = 1 - N\left[\frac{p_X}{q_1(1-q_1)} + \frac{(1-p_X)}{q_0(1-q_0)}\right] C^2 + O(C^3).$

There is therefore an asymptotic regime where the edge is strong enough to detect on average using the AIC, but the estimator from raw conditionals that does not use the edge has lower variance:

(38) $N (C^*)^2 \geq N C^2 \geq \left[\frac{p_X}{q_1(1-q_1)} + \frac{(1-p_X)}{q_0(1-q_0)}\right]^{-1},$

which follows from the Cauchy–Schwarz inequality. The regime only vanishes when $p_Z = \frac{1}{2}$ and $q_1(1-q_1)(1-p_X)^2 = q_0(1-q_0)\, p_X^2$, and the two bounds become equal. Utilising the BIC instead ($E[\Delta \mathrm{BIC}] = E[\Delta \mathrm{AIC}] - 2 + \ln(N)$) leads to a large regime where we would not detect the edge on average, but where the estimator using marginalisation that does rely on the edge has lower variance.
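To illustrate the width of this regime, a short sketch computing both bounds of equation (38) for the parameter values used earlier (our own code):

```python
import numpy as np

N, p_X, p_Z, q0, q1 = 100, 1/3, 2/3, 1/3, 2/3
fisher = p_X / (q1 * (1 - q1)) + (1 - p_X) / (q0 * (1 - q0))
C_aic = np.sqrt(1 / (N * fisher))        # E[Delta AIC] = 0 in equation (37)
C_star = np.sqrt((q1 * (1 - q1) * (1 - p_X)**2 / p_X
                  + q0 * (1 - q0) * p_X**2 / (1 - p_X)) / (4 * N * p_Z * (1 - p_Z)))
print(C_aic, C_star)   # about 0.047 < 0.061: edges in between are detectable
                       # on average by the AIC, yet R still has the lower variance
```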

5 Discussion

To evaluate the precision of different estimators targeting the same causal effect in causal diagrams, we considered the simple case of a v-structure for binary data and explicitly computed the variance of the two different estimators for the effect of X on the collider Y , with Z as the other parent.

The results involve combinations of hypergeometric functions, suggesting that exact results for larger DAGs may be rather complex. Which estimator has the lower variance depends, among other parameters, on the relative strength of the edge from Z to Y . In general, estimating the causal effect through marginalisation offers better performance in the presence of a stronger direct effect of Z on Y . When the direct effect is weaker instead, ignoring the edge and estimating the causal effect through the raw conditionals provides higher precision.

In light of our results, it is instructive to go back to the parallel with clinical trials. Even if the simple case of three nodes with only binary variables does not cover the more general and realistic scenarios encountered in randomised clinical trials, the example is still enlightening to show that taking a gain in precision for granted under any circumstances may not always be justified. Asymptotic results suggest that the optimal adjustment set in the sense of efficiency should include the prognostic factor Z . In real-world situations, with finite sample sizes, whether adjusting actually improves efficiency may depend on the relative strength of the prognostic value of Z on the outcome Y with respect to the sample size. This would seem to match the intuition that including stronger prognostic factors will indeed benefit precision, while it may be counterproductive to adjust for weaker ones.

By examining the asymptotic regime of large sample sizes, we could confirm the intuition that for edge strengths statistically detectable by the AIC, accounting for the edge in the estimation should generally lead to lower variance. Conversely, statistically non-detectable edges should be ignored to achieve a lower variance.

Most importantly, we could also discover an asymptotic regime where raw conditional estimates, ignoring the edge, were more precise even in the presence of statistically detectable edges. One way to appreciate the practical relevance of these findings is by observing that we can expect ranges of causal strengths that become statistically detectable from data before we can gain precision by accounting for them in the estimation. Our detailed asymptotic analysis for the v-structure goes beyond the leading-order asymptotic result where the optimal estimator does not depend on the edge coefficients [10,11].

Outside the asymptotic regime, for finite sample sizes, the gain in precision when using marginalisation and, thus explicitly accounting for the edge presence, appears to be linked to its strength. Although the example considered here is the simplest non-trivial DAG, this finding further supports the idea that learning the full structure of the graph, beyond simply identifying a valid adjustment set, may benefit the precision of causal inference. The practical limitation with observational data is that we can only learn structures up to an equivalence class, so that we need to consider the possible range of causal effects across the whole class [37] or implement Bayesian model averaging across DAGs [13].

If we use a more stringent criterion to decide about the presence of edges, such as the BIC, for example, which implements a stronger penalisation with respect to the AIC, we may end up missing edges too weak to detect on average, but whose presence would improve the precision of the causal estimation through marginalisation. In other words, for moderately weak direct effects, the selection of suitable adjustment sets may be relatively sensitive to the choice of the score. Analogously, we may expect that optimal causal estimation may also be sensitive to the choice of learning algorithm, whether constraint-based search [38,39], score-based search [40] or Bayesian sampling [41,42,43]. Quantifying the extent by which the structure learning affects causal estimation constitutes an interesting line of further investigation.

Finally, a limitation of the current study is its relatively reduced scope, treating a very simple situation with only three binary variables. It would be interesting to see if and how the results generalise in the presence of multiple and possibly non-binary covariates, as well as to non-binary outcomes. Unfortunately, extensions of our approach to higher dimensions are technically challenging, especially because the number of parameters grows exponentially for larger graphs with more edges. Similarly, extending to non-binary variables requires re-framing the problem in a relatively different setting. Nevertheless, we feel that even this simple example is of practical value to highlight that paying attention to the relative strength of the prognostic value with respect to the sample size is critical to determining the extent of any gain in precision. This is echoed in simulations [26], with the discussion of ref. [44] further elaborating on the relative importance of the strength of the covariate associations with the outcomes and the sample size. The conclusions we draw from our exact calculations for a non-parametric estimator of the risk difference appear to match finite sample effects in simulation analyses [44] very closely. Furthermore, it seems natural to expect similar considerations to apply to different estimands, such as the risk ratio and the odds ratio, though these are technically more challenging to treat. Notably, simulation studies provide a valuable alternative to gain insights about more realistic scenarios, which are technically intractable, as nicely illustrated in ref. [26].

  1. Funding information: None declared.

  2. Author contributions: All authors have accepted responsibility for the entire content of this manuscript and approved its submission.

  3. Conflict of interest: The authors state no conflict of interest.

  4. Data availability statement: Data sharing is not applicable to this article as no datasets were generated or analysed during the current study.

Appendix A Bounds on the variance of the R estimator

The hypergeometric function in equation (24) has a maximum value of around $1.5^N$, and we note that if we divide by $(k+1)$ instead of $k$ in the sum, we have the simple result

(A1) $\sum_{k=0}^{N} \binom{N}{k} \frac{1}{k+1}\, p_X^k (1-p_X)^{N-k} = \frac{1}{p_X(N+1)} - \frac{(1-p_X)^{N+1}}{p_X(N+1)},$

so that by considering the early terms in the sum, we can bound

(A2) $\sum_{k=1}^{N} \binom{N}{k} \frac{1}{k}\, p_X^k (1-p_X)^{N-k} > \frac{1}{p_X(N+1)}, \qquad p_X > \frac{N - 1 + \sqrt{3N^2 + 4N + 1}}{N(N+3)},$

which we can loosen to $p_X > \frac{1+\sqrt{3}}{N}$. This provides the following lower bound for the variance:

(A3) $V[R] > \frac{(p_5+p_7)(p_4+p_6)}{p_X^3 (N+1)} + \frac{(p_1+p_3)(p_0+p_2)}{(1-p_X)^3 (N+1)}, \qquad \frac{1+\sqrt{3}}{N} < p_X < \frac{N - 1 - \sqrt{3}}{N}.$

To obtain a simple upper bound, we can compute

(A4) $\sum_{k=1}^{N} \binom{N}{k} \frac{1}{k}\, p_X^k (1-p_X)^{N-k} < 2 \sum_{k=1}^{N} \binom{N}{k} \frac{1}{k+1}\, p_X^k (1-p_X)^{N-k} < \frac{2}{p_X(N+1)},$

so that the variance vanishes in the large N limit

(A5) $V[R] < \frac{2(p_5+p_7)(p_4+p_6)}{p_X^3 (N+1)} + \frac{2(p_1+p_3)(p_0+p_2)}{(1-p_X)^3 (N+1)}.$

B The variance of the M estimator

For computing the variance of M , we need to reapply the operators used to obtain the expected value as in equation (27). If they act on different generating variables, they will simply recreate terms like the mean, so we focus on terms where they repeat.

B.1 A variance

For example:

(A6) $E[M_{11}^2]\, N = \left. \int \frac{\mathrm{d}y}{y}\, b z \frac{\partial^2}{\partial b\, \partial z} \left[\frac{b z p_7\, u (p_2 + p_3 v)}{(p_6 + p_7 z)} + b z p_7 y\right] T_{N-1} \right|_{a=\cdots=z=1}.$

For the linear term in y , it is easiest if we rearrange and integrate first

(A7) $\left. b z \frac{\partial^2}{\partial z\, \partial b} \int \mathrm{d}y\; b z p_7\, T_{N-1} \right|_{a=\cdots=z=1} = \frac{p_6 p_7}{(p_6+p_7)^2}\, p_Z + \frac{p_7^2}{(p_6+p_7)} + (N-1) \frac{p_7^2}{(p_6+p_7)}\, p_Z,$

while for the rest of E [ M 11 2 ] , we first differentiate with respect to b

(A8) $\left. b \frac{\partial}{\partial b}\, \frac{b z p_7\, u (p_2 + p_3 v)}{(p_6 + p_7 z)}\, T_{N-1} \right|_{a=b=s=t=u=v=w=x=1} = (N-1)\left[\frac{z p_7 (p_2+p_3)^2}{(p_6+p_7 z)} + z p_7 (p_2+p_3)\, y\right] T_{N-2} + \frac{z p_7 (p_2+p_3)}{(p_6+p_7 z)}\, T_{N-1}.$

For the part with the factor of y , we again integrate first with respect to y and then differentiate to obtain

(A9) $\left. z \frac{\partial}{\partial z} \int \frac{\mathrm{d}y}{y}\, (N-1)\, z p_7 (p_2+p_3)\, y\, T_{N-2} \right|_{a=\cdots=z=1} = (p_2+p_3)\left[\frac{p_6 p_7}{(p_6+p_7)^2} + (N-1) \frac{p_7^2}{(p_6+p_7)}\right],$

on the rest we apply the operator for z

(A10) $\left. z \frac{\partial}{\partial z}\, (\cdot) \right|_{z=1} = \frac{p_6 p_7 (p_2+p_3)}{(p_6+p_7)^2}\, T_{N-1} + (N-1)\left[\frac{p_7^2 (p_2+p_3)}{(p_6+p_7)}\, y + \frac{p_6 p_7 (p_2+p_3)^2}{(p_6+p_7)^2}\right] T_{N-2} + (N-1)(N-2)\, y\, \frac{p_7^2 (p_2+p_3)^2}{(p_6+p_7)}\, T_{N-3}.$

The linear terms in y give the following:

(A11) $\frac{p_7^2}{(p_6+p_7)^2}\, (p_2+p_3) + (N-1) \frac{p_7^2}{(p_6+p_7)^2}\, (p_2+p_3)^2,$

while the integrals lead to

(A12) $\frac{p_6 p_7 (p_2+p_3)}{(p_6+p_7)}\, (N-1) (1-p_6-p_7)^{N-2}\, F\left([1,1,2-N],[2,2], -\frac{p_6+p_7}{1-p_6-p_7}\right)$
$+ \frac{p_6 p_7 (p_2+p_3)^2}{(p_6+p_7)}\, (N-1)(N-2) (1-p_6-p_7)^{N-3}\, F\left([1,1,3-N],[2,2], -\frac{p_6+p_7}{1-p_6-p_7}\right).$

Combining all the terms, subtracting the mean part squared and simplifying slightly, we obtain

(A13) $V[M_{11}]\, N = \frac{p_6 p_7 (p_2+p_3)}{(p_6+p_7)}\, (N-1) (1-p_6-p_7)^{N-2}\, F\left([1,1,2-N],[2,2], -\frac{p_6+p_7}{1-p_6-p_7}\right)$
$+ \frac{p_6 p_7 (p_2+p_3)^2}{(p_6+p_7)}\, (N-1)(N-2) (1-p_6-p_7)^{N-3}\, F\left([1,1,3-N],[2,2], -\frac{p_6+p_7}{1-p_6-p_7}\right)$
$+ \frac{p_6 p_7}{(p_6+p_7)^2}\, (p_2+p_3+p_Z) + \frac{p_7^2}{(p_6+p_7)^2}\, p_Z(1-p_Z).$

B.2 The covariances

For the covariances where separate generating variables are used

(A14) $E[M_{11} M_{10}] = \frac{1}{N} \left. \int \frac{\mathrm{d}w}{w}\, a x \frac{\partial^2}{\partial a\, \partial x}\, \frac{b z p_7 (p_2 u + p_3 u v + p_6 y + p_7 y z)}{(p_6 + p_7 z)}\, T_{N-1} \right|_{a=\cdots=z=1},$

it is easy to see that the operators act on T N 1 rather than the prefactor, so we repeat the calculation for the mean with N replaced by ( N 1 ) to obtain

(A15) $C[M_{11}, M_{10}] = -\frac{1}{N} E[M_{11}] E[M_{10}], \qquad C[M_{01}, M_{10}] = -\frac{1}{N} E[M_{01}] E[M_{10}],$
$\phantom{(A15)} \; C[M_{11}, M_{00}] = -\frac{1}{N} E[M_{11}] E[M_{00}], \qquad C[M_{01}, M_{00}] = -\frac{1}{N} E[M_{01}] E[M_{00}].$

The more complicated cases are where the generating variables reoccur

(A16) $E[M_{11} M_{01}] = \frac{1}{N} \left. \int \frac{\mathrm{d}u}{u}\, b v \frac{\partial^2}{\partial b\, \partial v} \left[\frac{b z p_7\, u (p_2 + p_3 v)}{(p_6 + p_7 z)} + b z p_7 y\right] T_{N-1} \right|_{a=\cdots=z=1}.$

For the term linear in u , we first integrate then differentiate with respect to v . For the other term, we first differentiate then integrate to give

(A17) $E[M_{11} M_{01}] = \frac{1}{N} \left. b \frac{\partial}{\partial b} \left[\frac{b p_3 p_7}{(p_6+p_7)} + \frac{b p_3 p_7}{(p_2+p_3)}\right] T_{N-1} \right|_{a=\cdots=z=1} = E[M_{11}] E[M_{01}] + \frac{p_3 p_7}{N (p_2+p_3)(p_6+p_7)}\, p_Z(1-p_Z),$

and

(A18) $C[M_{11}, M_{01}] = \frac{1}{N} \frac{p_3 p_7}{(p_2+p_3)(p_6+p_7)}\, p_Z(1-p_Z), \qquad C[M_{10}, M_{00}] = \frac{1}{N} \frac{p_1 p_5}{(p_0+p_1)(p_4+p_5)}\, p_Z(1-p_Z).$

B.3 The variance

Since the terms from the covariances simplify, the complete variance is

(A19) $V[M]\, N = \frac{p_6 p_7 (p_2+p_3)}{(p_6+p_7)}\, (N-1) (1-p_6-p_7)^{N-2}\, F\left([1,1,2-N],[2,2], -\frac{p_6+p_7}{1-p_6-p_7}\right)$
$+ \frac{p_6 p_7 (p_2+p_3)^2}{(p_6+p_7)}\, (N-1)(N-2) (1-p_6-p_7)^{N-3}\, F\left([1,1,3-N],[2,2], -\frac{p_6+p_7}{1-p_6-p_7}\right)$
$+ \frac{p_4 p_5 (p_0+p_1)}{(p_4+p_5)}\, (N-1) (1-p_4-p_5)^{N-2}\, F\left([1,1,2-N],[2,2], -\frac{p_4+p_5}{1-p_4-p_5}\right)$
$+ \frac{p_4 p_5 (p_0+p_1)^2}{(p_4+p_5)}\, (N-1)(N-2) (1-p_4-p_5)^{N-3}\, F\left([1,1,3-N],[2,2], -\frac{p_4+p_5}{1-p_4-p_5}\right)$
$+ \frac{p_2 p_3 (p_6+p_7)}{(p_2+p_3)}\, (N-1) (1-p_2-p_3)^{N-2}\, F\left([1,1,2-N],[2,2], -\frac{p_2+p_3}{1-p_2-p_3}\right)$
$+ \frac{p_2 p_3 (p_6+p_7)^2}{(p_2+p_3)}\, (N-1)(N-2) (1-p_2-p_3)^{N-3}\, F\left([1,1,3-N],[2,2], -\frac{p_2+p_3}{1-p_2-p_3}\right)$
$+ \frac{p_0 p_1 (p_4+p_5)}{(p_0+p_1)}\, (N-1) (1-p_0-p_1)^{N-2}\, F\left([1,1,2-N],[2,2], -\frac{p_0+p_1}{1-p_0-p_1}\right)$
$+ \frac{p_0 p_1 (p_4+p_5)^2}{(p_0+p_1)}\, (N-1)(N-2) (1-p_0-p_1)^{N-3}\, F\left([1,1,3-N],[2,2], -\frac{p_0+p_1}{1-p_0-p_1}\right)$
$+ \frac{p_6 p_7}{(p_6+p_7)^2}\, (p_2+p_3+p_Z) + \frac{p_4 p_5}{(p_4+p_5)^2}\, (p_0+p_1+1-p_Z)$
$+ \frac{p_2 p_3}{(p_2+p_3)^2}\, (p_6+p_7+p_Z) + \frac{p_0 p_1}{(p_0+p_1)^2}\, (p_4+p_5+1-p_Z)$
$+ \left[\frac{p_7}{(p_6+p_7)} - \frac{p_5}{(p_4+p_5)} - \frac{p_3}{(p_2+p_3)} + \frac{p_1}{(p_0+p_1)}\right]^2 p_Z(1-p_Z).$

We note that the hypergeometric functions can be written solely in terms of $p_X$ and $p_Z$, so that the variance is actually quadratic in the $p_{Y,i}$.

C Asymptotics of the hypergeometric functions

We utilise the following asymptotic expansions of our hypergeometric functions:

(A20) $N^2 z^2 (1-z)^{N-1}\, F\left([1,1,1-N],[2,2], -\frac{z}{1-z}\right) = 1 + \frac{(1-z)}{N z} + \cdots$

and

(A21) $N(N-1)\, z^2 (1-z)^{N-2}\, F\left([1,1,2-N],[2,2], -\frac{z}{1-z}\right) = 1 + \frac{1}{N z} + \cdots$
$(N-1)(N-2)\, z^2 (1-z)^{N-3}\, F\left([1,1,3-N],[2,2], -\frac{z}{1-z}\right) = 1 + \frac{1}{N z} + \cdots$
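These expansions are straightforward to check numerically, for instance for (A20) (a quick sketch using mpmath; our own illustration):

```python
import mpmath as mp

N, z = 200, mp.mpf(1) / 3
lhs = N**2 * z**2 * (1 - z)**(N - 1) * mp.hyper([1, 1, 1 - N], [2, 2], -z / (1 - z))
print(lhs, 1 + (1 - z) / (N * z))   # both close to 1.01
```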

D Asymptotic behaviour with interactions

When there is an interaction term and $D \neq 0$, we repeat the computations of Section 3. The variance of $R$ does not depend on $D$ and we retain equation (31), while for the variance of $M$, we have

(A22) $V[M]\, N = \frac{q_1(1-q_1) - C^2}{p_X}\left[1 + \frac{2(1-p_X)}{N p_X}\right] - \frac{(2q_1-1)(2p_Z-1)}{p_X}\left[C + \frac{D(1-p_X)}{N p_X}\right]$
$+ \frac{q_0(1-q_0) - C^2}{1-p_X}\left[1 + \frac{2p_X}{N(1-p_X)}\right] - \frac{(2q_0-1)(2p_Z-1)}{(1-p_X)}\left[C - \frac{D p_X}{N(1-p_X)}\right]$
$+ p_Z(1-p_Z)\, \frac{(4C-D)D}{p_X} - p_Z(1-p_Z)\, \frac{(4C+D)D}{1-p_X} + \frac{(2C-D)D(1-p_X)}{N p_X^2} - \frac{(2C+D)D\, p_X}{N(1-p_X)^2}$
$+ 2 p_Z(1-p_Z)\, D^2 \left[2 + \frac{(1-p_X)}{N p_X^2} + \frac{p_X}{N(1-p_X)^2}\right] + O(N^{-2}),$

to replace equation (32). In the scaling limit $C \sim N^{-\frac{1}{2}}$ and $D \sim N^{-\frac{1}{2}}$, we obtain

(A23) $(V[M] - V[R])\, N = \frac{q_1(1-q_1)(1-p_X)}{N p_X^2} + \frac{q_0(1-q_0)\, p_X}{N (1-p_X)^2} - \frac{p_Z(1-p_Z)}{p_X(1-p_X)}\, [2C + (2p_X-1)D]^2 + O\left(N^{-\frac{3}{2}}\right),$

where the only difference to equation (33) for $D = 0$ is the replacement of $[2C]^2$ by $[2C + (2p_X-1)D]^2$. This explains the rotation seen in Figure 3, with the ridge asymptotically approaching the line $2C = (1-2p_X)D$. Along this line, the $R$ estimator has a lower variance than the $M$ estimator in the asymptotic limit $N \to \infty$, even though there is an edge from $Z$ to $Y$ (with the exception of the special point at $C = D = 0$, where $Z$ is independent of $\{X, Y\}$).

References

[1] Greenland S, Pearl J, Robins JM. Causal diagrams for epidemiologic research. Epidemiology. 1999;10:37–48. doi:10.1097/00001648-199901000-00008.

[2] Pearl J. Causality: models, reasoning and inference. Cambridge, UK: Cambridge University Press; 2000.

[3] Hernán MA, Robins JM. Instruments for causal inference: An epidemiologist's dream? Epidemiology. 2006;17:360–72. doi:10.1097/01.ede.0000222409.00878.37.

[4] VanderWeele TJ, Robins JM. Four types of effect modification: A classification based on directed acyclic graphs. Epidemiology. 2007;18:561–8. doi:10.1097/EDE.0b013e318127181b.

[5] Pearl J. Causal diagrams for empirical research. Biometrika. 1995;82:669–88. doi:10.1093/biomet/82.4.669.

[6] Pearl J. [Bayesian analysis in expert systems]: Comment: graphical models, causality and intervention. Statist Sci. 1993;8:266–9. doi:10.1214/ss/1177010894.

[7] Shpitser I, VanderWeele T, Robins JM. On the validity of covariate adjustment for estimating causal effects. In: Twenty-Sixth Conference on Uncertainty in Artificial Intelligence; 2010. p. 527–36.

[8] Perković E, Textor J, Kalisch M, Maathuis MH. Complete graphical characterization and construction of adjustment sets in Markov equivalence classes of ancestral graphs. J Machine Learn Res. 2017;18:8132–93.

[9] Maathuis MH, Colombo D. A generalized back-door criterion. Ann Statist. 2015;43:1060–88. doi:10.1214/14-AOS1295.

[10] Henckel L, Perković E, Maathuis MH. Graphical criteria for efficient total effect estimation via adjustment in causal linear models. J R Stat Soc B. 2022;84:579–99. doi:10.1111/rssb.12451.

[11] Rotnitzky A, Smucler E. Efficient adjustment sets for population average causal treatment effect estimation in graphical models. J Machine Learn Res. 2020;21:1–86.

[12] Witte J, Henckel L, Maathuis MH, Didelez V. On efficient adjustment in causal graphs. J Machine Learn Res. 2020;21:1–45.

[13] Moffa G, Catone G, Kuipers J, Kuipers E, Freeman D, Marwaha S, et al. Using directed acyclic graphs in epidemiological research in psychosis: An analysis of the role of bullying in psychosis. Schizophrenia Bulletin. 2017;43:1273–9. doi:10.1093/schbul/sbx013.

[14] Senn S. Modelling in drug development. In: Christie M, Cliffe A, Dawid P, Senn S, editors. Simplicity, complexity and modelling. Hoboken, New Jersey, US: Wiley; 2011. doi:10.1002/9781119951445.ch3.

[15] Altman DG. Comparability of randomised groups. Statistician. 1985;34:125–36. doi:10.2307/2987510.

[16] Senn SJ. Covariate imbalance and random allocation in clinical trials. Statist Med. 1989;8:467–75. doi:10.1002/sim.4780080410.

[17] Pocock SJ, Assmann SE, Enos LE, Kasten LE. Subgroup analysis, covariate adjustment and baseline comparisons in clinical trial reporting: Current practice and problems. Statist Med. 2002;21:2917–30. doi:10.1002/sim.1296.

[18] Rosenberger WF, Sverdlov O. Handling covariates in the design of clinical trials. Statist Sci. 2008;23:404–19. doi:10.1214/08-STS269.

[19] Austin PC, Manca A, Zwarenstein M, Juurlink DN, Stanbrook MB. A substantial and confusing variation exists in handling of baseline covariates in randomized controlled trials: A review of trials published in leading medical journals. J Clin Epidemiol. 2010;63:142–53. doi:10.1016/j.jclinepi.2009.06.002.

[20] Senn S. Seven myths of randomisation in clinical trials. Statist Med. 2013;32:1439–50. doi:10.1002/sim.5713.

[21] Wang J. Covariate adjustment for randomized controlled trials revisited. Pharmaceut Statist. 2020;19:255–61. doi:10.1002/pst.1988.

[22] Tsiatis AA, Davidian M, Zhang M, Lu X. Covariate adjustment for two-sample treatment comparisons in randomized clinical trials: A principled yet flexible approach. Statist Med. 2008;27:4658–77. doi:10.1002/sim.3113.

[23] Zhang M, Tsiatis AA, Davidian M. Improving efficiency of inferences in randomized clinical trials using auxiliary covariates. Biometrics. 2008;64:707–15. doi:10.1111/j.1541-0420.2007.00976.x.

[24] Rosenblum M, van der Laan MJ. Using regression models to analyze randomized trials: Asymptotically valid hypothesis tests despite incorrectly specified models. Biometrics. 2009;65:937–45. doi:10.1111/j.1541-0420.2008.01177.x.

[25] Ge M, Durham LK, Meyer RD, Xie W, Thomas N. Covariate-adjusted difference in proportions from clinical trials using logistic regression and weighted risk differences. Therapeutic Innovat Regulat Sci. 2011;45:481–93. doi:10.1177/009286151104500409.

[26] Benkeser D, Díaz I, Luedtke A, Segal J, Scharfstein D, Rosenblum M. Improving precision and power in randomized trials for COVID-19 treatments using covariate adjustment, for binary, ordinal, and time-to-event outcomes. Biometrics. 2021;77:1467–81. doi:10.1111/biom.13377.

[27] Morris TP, Walker AS, Williamson EJ, White IR. Planning a method for covariate adjustment in individually-randomised trials: A practical guide. Trials. 2022;23:328. doi:10.1186/s13063-022-06097-z.

[28] EMA. Guideline on adjustment for baseline covariates in clinical trials; 2015.

[29] FDA, Draft Guidance. Adjusting for covariates in randomized clinical trials for drugs and biological products; 2021.

[30] FDA, Guidance for Industry. COVID-19: Developing drugs and biological products for treatment or prevention; 2021.

[31] Freedman DA. Randomization does not justify logistic regression. Statist Sci. 2008;23:237–49. doi:10.1017/CBO9780511815874.015.

[32] Moore KL, van der Laan MJ. Covariate adjustment in randomized trials with binary outcomes: Targeted maximum likelihood estimation. Statist Med. 2009;28:39–64. doi:10.1002/sim.3445.

[33] Daniel R, Zhang J, Farewell D. Making apples from oranges: Comparing noncollapsible effect estimators and their standard errors after adjustment for different covariate sets. Biometric J. 2021;63:528–57. doi:10.1002/bimj.201900297.

[34] Permutt T. Do covariates change the estimand? Statistic Biopharmaceutic Res. 2020;12:45–53. doi:10.1080/19466315.2019.1647874.

[35] Wilf HS. Generatingfunctionology. Boca Raton, Florida, USA: CRC Press; 2005. doi:10.1201/b10576.

[36] Wilks SS. The large-sample distribution of the likelihood ratio for testing composite hypotheses. Ann Math Statist. 1938;9:60–2. doi:10.1214/aoms/1177732360.

[37] Maathuis MH, Kalisch M, Bühlmann P. Estimating high-dimensional intervention effects from observational data. Ann Statist. 2009;37:3133–64. doi:10.1214/09-AOS685.

[38] Spirtes P, Glymour CN, Scheines R. Causation, prediction, and search. Cambridge, Massachusetts, USA: MIT Press; 2000. doi:10.7551/mitpress/1754.001.0001.

[39] Kalisch M, Bühlmann P. Estimating high-dimensional directed acyclic graphs with the PC-algorithm. J Machine Learn Res. 2007;8:613–36.

[40] Chickering DM. Optimal structure identification with greedy search. J Machine Learn Res. 2002;3:507–54.

[41] Friedman N, Koller D. Being Bayesian about network structure. A Bayesian approach to structure discovery in Bayesian networks. Machine Learn. 2003;50:95–125. doi:10.1023/A:1020249912095.

[42] Kuipers J, Moffa G. Partition MCMC for inference on acyclic digraphs. J Am Statist Assoc. 2017;112:282–99. doi:10.1080/01621459.2015.1133426.

[43] Kuipers J, Suter P, Moffa G. Efficient sampling and structure learning of Bayesian networks. J Comput Graphic Statist. 2022. doi:10.1080/10618600.2021.2020127.

[44] Zhang M, Zhang B. Discussion of “Improving precision and power in randomized trials for COVID-19 treatments using covariate adjustment, for binary, ordinal, and time-to-event outcomes”. Biometrics. 2021;77:1485–8. doi:10.1111/biom.13492.

Received: 2021-05-17
Revised: 2021-12-16
Accepted: 2022-04-07
Published Online: 2022-05-25

© 2022 Jack Kuipers and Giusi Moffa, published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
