Energy balancing of covariate distributions

Jared D. Huling; Simon Mak

doi:10.1515/jci-2022-0029

Artikel Open Access

Energy balancing of covariate distributions

Jared D. Huling und Simon Mak

Veröffentlicht/Copyright: 24. April 2024

Veröffentlicht von

Veröffentlichen auch Sie bei De Gruyter Brill

Manuskript einreichen Informationen für Autor*innen Erkunden Sie dieses Fachgebiet

Aus der Zeitschrift Journal of Causal Inference Band 12 Heft 1

Abstract

Bias in causal comparisons has a correspondence with distributional imbalance of covariates between treatment groups. Weighting strategies such as inverse propensity score weighting attempt to mitigate bias by either modeling the treatment assignment mechanism or balancing specified covariate moments. This article introduces a new weighting method, called energy balancing, which instead aims to balance weighted covariate distributions. By directly targeting distributional imbalance, the proposed weighting strategy can be flexibly utilized in a wide variety of causal analyses without the need for careful model or moment specification. Our energy balancing weights (EBW) approach has several advantages over existing weighting techniques. First, it offers a model-free and robust approach for obtaining covariate balance that does not require tuning parameters, obviating the need for modeling decisions of secondary nature to the scientific question at hand. Second, since this approach is based on a genuine measure of distributional balance, it provides a means for assessing the balance induced by a given set of weights for a given dataset. We demonstrate the effectiveness of this EBW approach in a suite of simulation experiments, and in studies on the safety of right heart catheterization and on three additional studies using electronic health record data.

Keywords: energy distance; causal inference; covariate balance; observational data; weighting

MSC 2010: 62D20; 62G05

1 Introduction

Studying the causal effect of a treatment or intervention is a central goal in many scientific disciplines. In randomized controlled trials, estimation of causal effects is possible since randomization ensures that treated units and control units are comparable [1]. However, for many pressing questions, it is impossible or impractical to randomize treatment assignments. Researchers are thus left with existing sources of observational data to answer these questions. In these observational studies, it is of prime interest to make unconfounded comparisons between treatment groups, a common example being estimation of the average treatment effect (ATE). Yet in observational settings, natural selection processes into the treatment, resulting in imbalance in the covariate distributions between treatment groups. This may then introduce substantial bias in naive comparisons of the outcome of interest between groups.

In observational studies based on complex electronic health records (EHRs) or administrative health data, the differences between treated and untreated units are typically substantial and difficult to characterize. One motivating application for our work is the study by Connors et al. [2], which explored the impact of right heart catheterization (RHC), a diagnostic procedure designed to guide therapy, on mortality among intensive care unit (ICU) patients. In this study, patients who received RHC are different from those who did not receive RHC in highly complex ways. As an example among many such differences, both younger patients and older patients are less likely to receive RHC; thus, a simple correction for the average age does not characterize the differences between those who received RHC and those who did not. Similar complex differences between treated and untreated patients can also be observed in three other motivating studies arising from clinical care and EHRs using the MIMIC-III critical care database [3]; we will demonstrate how the proposed method tackles these four challenging applications later.

There is a vast literature on adjustment methods for correcting for imbalances between treated and untreated units to reduce estimation bias in treatment effects. Weighting methods are a class of adjustment methods that control for confounding by re-weighting the treatment and control groups to look similar. Inverse-probability weighting (IPW) methods [4–9], which have origins in survey sampling, are by far the most commonly used weighting approaches. IPW methods model the treatment assignment mechanism (or propensity score, see Rosenbaum and Rubin [10]), and inverse weight each sample by the probability of receiving its assigned treatment. Inverse weighting by the true underlying propensity score controls for confounding, as it re-weights the covariate distributions of the treatment groups to that of the overall population.

In practice, IPW methods require positing and fitting a model for the propensity score. While model fitting is no unfamiliar task for a statistician, it has been noted in the literature that even mild model misspecifications for the propensity score can result in substantial bias in estimating the treatment effect [11]. Hearkening to George Box’s maxim, “all models are wrong, but some are useful,” it is often quite difficult in practice to obtain a useful propensity score model, especially in the presence of many covariates. Recent work has focused on mitigating this issue, by either (i) including conditions on the estimation of the propensity model which encourage moment balance of the covariates [12] or by (ii) altogether avoiding direct modeling of the propensity score and instead estimating weights that explicitly balance moments of the covariates either exactly [13,14] or approximately [15]. A large class of such estimators is explored Chan et al. [14] and further expanded on by Zhao [16] and has a long history [17]. Recent works [18,19] have attempted to mitigate this issue by focusing on nonparametric approaches to moment balancing, yet they require a careful choice of kernel function and tuning of multiple hyperparameters.

To make progress in addressing these issues, we show in this article that estimation bias for the ATE has a link to imbalance in the covariate distributions. This shows that balancing the full covariate distributions (and not just lower order moments) of the treatment groups to the full population provides a robust way for mitigating confounding. Although this is understood in the literature [16,20], in this article, we establish and make use of this link in a general manner, by (i) introducing a metric that evaluates how well a set of weights mitigates imbalance for a given dataset and (ii) developing a new method for estimating weights by minimizing this metric, yielding good distributional balance for a given dataset. The distributional balancing property of these weights allows for robust empirical performance; they tend to perform well in practice without relying on modeling assumptions for the propensity score or which moments are imbalanced. Our approach is designed to yield good distributional balance between treatment groups without the need of carefully tuning hyperparameters. Hirshberg et al. [21] study the theoretical properties of use of integral probability metrics for construction of weights; however, our work focuses on a specific choice that is suitable to wide use in practice. Despite the lack of need for tuning, our proposal works well empirically in a wide variety of settings.

We introduce weights that are explicitly constructed to balance the covariate distributions of the treatment groups to a target distribution (usually the full population). We do so by leveraging the energy distance presented in Székely and Rizzo [22]. The energy distance is a measure based on powers of the Euclidean distance and was originally introduced as a means to replace standard nonparametric goodness-of-fit tests in high dimensions. The energy distance has an exact duality with a norm on the characteristic functions, enabling its use to compare two (or more) distributions or the distributions of two samples. We show that a weighted energy distance still retains this duality, making it a rigorously justified and reliable metric to compare between multiple sets of weights for a given dataset. From this, we propose the so-called energy balancing weights (EBWs), which are defined as the weights which minimize the weighted energy distance between treatment groups and the full sample, subject to constraints that mitigate variability in the weights. We prove that EBWs asymptotically ensure full distributional balance and result in root- n consistent estimation of the ATE. Our emphasis is on the robust performance of our approach in practice, and our asymptotic analysis serves primarily to justify the use of the proposed weights. We analyze four challenging observational studies and use each of these datasets to conduct realistic and highly challenging simulations, demonstrating the effectiveness of EBWs in practice with minimal user input required. Two of the studies we present in this article and the remaining two are presented in the Supplementary Material.

Although we focus primarily on a simple estimand, namely, the average treatment effect (ATE) the proposed weights can be used for a wide variety of causal estimands since the distributional re-balancing property of EBWs enables flexible control of confounding. We show that they can be used for the estimation of a wide variety of causal quantities, such as the ATE and individualized treatment rules (ITRs) [23,24]. With minor modifications, they can also be used for estimation of the average treatment effect on the treated (ATT) and for estimating treatment effects for multi-category treatments. Despite the fact that EBWs are not specifically designed to match low-order moments, in practice, they often result in better marginal mean balance of covariates than propensity score methods even in high dimensions (50–100 covariates), as seen in Section 6. EBWs are also quite stable in practice, rarely resulting in large weights, an issue that plagues standard propensity score methods [11], making it less critical to impose constraints that induce weight variability. EBWs are constructed without using outcome information; however, as for other weighting approaches, variance can be reduced via augmented estimators that make use of outcome regression models, such as the augmented estimators in the studies by Wong and Chan [18], Zhao [16], and Athey et al. [25]. However, we do not explore such techniques here.

The remainder of this article is organized as follows. Section 2 motivates the need for distributional balance and introduces the weighted energy distance. Section 3 presents the proposed EBWs and discusses their computation and asymptotic properties. Section 5 compares the performance of EBWs with other weighting methods in simulation studies. Section 6 discusses an application of EBWs in a study of RHC and further uses these data to explore highly challenging simulations that use the natural data-generating processes of the study. Section 7 further demonstrates the utility of EBWs by analyzing data from a complex application involving EHR data. Section 8 concludes with a discussion and future work.

2 Distributional balance and weighted energy distance

2.1 Setup

Consider a sample { ( Y i , A i , X i ) } i = 1 n of size n from a population, where Y i is the outcome of the i th unit, A i ∈ { 0 , 1 } is a binary indicator of receiving a treatment, and X i ∈ X = R p is a p -dimensional vector of covariates. Further denote n 1 = ∑ i = 1 n A i and n 0 = n − n 1 . In this article, we are interested in estimating the average causal effect of the treatment on the outcome. A formal definition of a causal effect often involves the use of so-called potential outcomes [26–29]. The potential outcome Y ( a ) is the outcome that would have been observed under level a of the treatment. As each individual only receives one level of the treatment at a given time, only one potential outcome, either Y i ( 0 ) or Y i ( 1 ) , for each individual is observable. We assume the standard stable unit treatment value assumption (SUTVA), which posits that the potential outcomes for each unit are unaffected by the potential outcomes of other units and that only one version of the treatment exists. Under SUTVA, the observed outcome is consistent with the potential outcomes in that Y i = Y i ( A i ) . We further assume the assignment mechanism is strongly unconfounded in the sense that { Y ( 0 ) , Y ( 1 ) } ⊥ ⊥ A ∣ X , which requires that there are no unmeasured confounders. The ⊥ ⊥ notation of Dawid [30] denotes (conditional) independence. We further assume positivity (or probabilistic assignment) in that the propensity score π ( x ) ≡ P ( A = 1 ∣ X = x ) [10] satisfies 0 < π ( x ) < 1 , so that everyone has a chance of receiving the treatment. Positivity, together with no unmeasured confounders, constitute strong ignorability [10].

Let us denote μ a ( X i ) ≡ E ( Y ( a ) ∣ X i ) as the conditional mean function and σ a 2 ( X i ) ≡ V ( Y i ( a ) ∣ X i ) as the conditional variance function of the response for a ∈ { 0 , 1 } . We consider scenarios where data have been collected from an observational study, and thus, the treatment groups are not comparable due to imbalances in the distributions of their baseline covariates. These differences can be characterized through π ( x ) or through F a ( x ) ≡ P ( X ≤ x ∣ A = a ) , the cumulative distribution function (CDF) of covariates X conditional on treatment level a .

Using the notation of potential outcomes, the (population) ATE is defined as τ ≡ E ( Y ( 1 ) − Y ( 0 ) ) . This can be rewritten as follows:

(1) τ = ∫ x ∈ X [ μ 1 ( x ) − μ 0 ( x ) ] d F ( x ) ,

where F ( x ) = P r ( X ≤ x ) is the CDF of covariates X marginalized over the treatment groups. In other words, F ( x ) = F 1 ( x ) P 1 + F 0 ( x ) P 0 , where P a ≡ P ( A = a ) ∈ ( 0 , 1 ) is the probability of being assigned treatment level a ∈ { 0 , 1 } .

2.2 Weighted average estimates and distributional balance

We restrict our focus to weighted averages as estimates of the ATE. Given a vector of weights w = ( w 1 , … , w n ) , we study estimators of the form:

(2) τ ^ w = 1 n 1 ∑ i = 1 n w i Y i A i − 1 n 0 ∑ i = 1 n w i Y i ( 1 − A i ) .

The most commonly used example of (2), inverse propensity score weighting, uses w i = ( A i π ( X i ) n 1 ∕ n + ( 1 − A i ) ( 1 − π ( X i ) ) n 0 ∕ n ) − 1 . As presented in Imai and Ratkovic [12] and Li et al. [20], the weights in (2) are often normalized by treatment group, i.e., ∑ i = 1 n w i I ( A i = a ) = n a for a ∈ { 0 , 1 } , to improve precision [11,16] at the cost of a small bias. When these weights are constructed as mentioned earlier and normalized by treatment group, the resulting estimator is called the Hájek estimator [31] and may yield reduced mean squared error over its nonnormalized version [32].

Given any weight vector w , we can express the error of τ ^ w as follows:

(3) τ ^ w − τ = ∫ X μ 1 ( x ) d [ F n , 1 , w − F n ] ( x ) − ∫ X μ 0 ( x ) d [ F n , 0 , w − F n ] ( x )

(4) − ∫ X [ μ 1 ( x ) − μ 0 ( x ) ] d [ F − F n ] ( x )

(5) + 1 n 1 ∑ i = 1 n w i ε i A i − 1 n 0 ∑ i = 1 n w i ε i ( 1 − A i ) ,

where ε i ≡ Y i ( A i ) − μ A i ( X i ) , F n ( x ) = ∑ i = 1 n I ( X i ≤ x ) ∕ n is the empirical CDF (ECDF) of the combined sample { X i } i = 1 n , and F n , a , w ( x ) = ∑ i = 1 n w i I ( X i ≤ x , A i = a ) ∕ n a is the weighted ECDF for treatment level a ∈ { 0 , 1 } . In observational studies, F n is not impacted by the weights, so the error term (4) is irreducible. However, this term goes to 0 as long as the sample is representative of the desired population. Since (5) always has mean 0, the bias of τ ^ w in essence depends on the properties of (3): the difference of integrals with respect to F n , a , w − F n . Thus, the systematic source of error from the weighted estimator τ ^ w can be completely controlled by reducing the imbalance between the weighted ECDFs F n , a , w and the ECDF F n . Unlike the decomposition in Zhao [16], the decomposition of τ ^ w − τ above holds even if the treatment effect is not constant over x . Further, none of the methodology or results of this article require such a constant treatment effect to justify the validity of our methods. We note that the systematic source of bias in equation (3) can be zero without F n , a , w being balanced to F n , depending on μ 0 ( x ) and μ 1 ( x ) . However, knowledge of such an event a priori is impossible without knowledge of the mean potential outcome functions; as such, a measure of distributional imbalance is critical to characterize the degree of bias for a given dataset. The terms in (5) drives the variability of τ ˆ w and can be straightforwardly mitigated with a measure of dispersion of the weights [33]; we discuss this more at the end of Section 3; however, this is not the focus of this article.

The notion that the covariate distributions should be balanced to obtain a good estimate of τ is not new (see Imai and Ratkovic [12] and Li et al. [20], among many others). In fact, it is well understood that the weights resulting from correctly specified propensity score models asymptotically balance covariate distributions. However, slight model misspecifications of the propensity score may result in poor performance of (2) [11]. Further, two units with the same propensity score do not necessarily have the same covariate values, so in finite samples, propensity score weights may not be optimal. Other approaches that aim to estimate weights by balancing prespecified moments of the covariates [12,14] tend to be more robust than directly modeling π ( x ) . Yet, the term (3) makes it explicit that bias is directly related to distributional imbalance and that, holding μ a fixed, weights that yield less distributional imbalance will in general result in less bias. A natural conclusion is that w should be estimated to directly balance each F n , a , w to F n . In the following, we introduce a new distance metric between F n , a , w and F n , which enables one to characterize how well a set of weights re-balances the weighted distribution F n , a , w to F n .

2.3 Weighted energy distance

We introduce next a new measure of the distributional balance induced by a set of weights. This measure is based on the energy distance, which is a metric on distributions [34]. Due to the link between estimation bias and distributional imbalance, our measure can be used to evaluate the degree of bias one expects from a given set of weights and a given dataset. We will later leverage this measure to construct distributional balance weights that minimize this metric for a given dataset.

The energy distance (as surveyed in Székely and Rizzo [34]) is defined as follows. Let G and H be two finite-mean distribution functions on X , and let Z , Z ′ ∼ i . i . d G and V , V ′ ∼ i . i . d H . The energy distance between distributions G and H is defined as follows:

(6) ℰ ( G , H ) ≡ 2 E ‖ Z − V ‖ 2 − E ‖ Z − Z ′ ‖ 2 − E ‖ V − V ′ ‖ 2 ,

where ‖ ⋅ ‖ 2 is the Euclidean norm. When both G and H are ECDFs, i.e., G n is the ECDF of { Z i } i = 1 n ⊆ X and H m is the ECDF of { V i } i = 1 m ⊆ X , the energy distance ℰ ( G n , H m ) can be expressed as follows:

(7) 2 n m ∑ i = 1 n ∑ j = 1 m ‖ Z i − V j ‖ 2 − 1 n 2 ∑ i = 1 n ∑ j = 1 n ‖ Z i − Z j ‖ 2 − 1 m 2 ∑ i = 1 m ∑ j = 1 m ‖ V i − V j ‖ 2 .

The energy distance has been used within a wide variety of statistical methods, e.g., for testing equivalence of distributions, for testing statistical independence [35], and for generating samples from a target distribution [36]. The energy distance is simple to compute as it only involves sums of Euclidean distances, it is nonnegative, and it can be shown to be equivalent to a distance between the characteristic functions of G and H , indicating that a value of 0 occurs if and only if G and H are the same distribution and a positive value indicates how similar they are. The name “energy” comes from the fact that the population energy distance (or limit of the empirical energy distance) has an intricate connection with gravitational potential energy (which depends on the distance between two bodies of mass). Two key attractive properties of the energy distance are that it tends to perform well in measuring the distance between two distributions in moderately high-dimensional settings, and it does not require the choice of a kernel or any tuning parameters.

We propose a weighted modification of this energy distance, which measures the distance between a weighted distribution, i.e., the weighted covariate ECDFs F n , 0 , w and F n , 1 , w for the control and treated, and a target distribution, i.e., the combined covariate ECDF F n . The weighted energy distance between F n , a , w and F n is defined as follows:

ℰ ( F n , a , w , F n ) ≡ 2 n a n ∑ i = 1 n ∑ j = 1 n w i I ( A i = a ) ‖ X i − X j ‖ 2 − 1 n a 2 ∑ i = 1 n ∑ j = 1 n w i w j I ( A i = A j = a ) ‖ X i − X j ‖ 2 − 1 n 2 ∑ i = 1 n ∑ j = 1 n ‖ X i − X j ‖ 2 .

In other words, ℰ ( F n , a , w , F n ) is the energy distance between the ECDF of a sample { X i } i = 1 n and a weighted ECDF of a subsample { X i } i : A i = a .

We show this new weighted energy distance is indeed a distance between F n , a , w and F n . Let ⟨ t , s ⟩ be the inner product of vectors t and s and define the empirical characteristic function (ECHF) of { X i } i = 1 n and the weighted ECHFs of { X i } i : A i = a as follows:

φ n ( t ) ≡ 1 n ∑ i = 1 n exp { i ⟨ t , X i ⟩ } and φ n , a , w ( t ) ≡ 1 n a ∑ i = 1 n w i I ( A i = a ) exp { i ⟨ t , X i ⟩ } ,

respectively. The following proposition establishes the distance property of the weighted energy distance.

Proposition 2.1

Let w be a vector of weights such that ∑ i = 1 n w i I ( A i = a ) = n a for a ∈ { 0 , 1 } and w i > 0 for i = 1 , … , n . Then

(8) ℰ ( F n , a , w , F n ) = ∫ R p ∣ φ n ( t ) − φ n , a , w ( t ) ∣ 2 ω ( t ) d t for a ∈ { 0 , 1 } ,

where ω ( t ) = 1 ∕ ( C p ‖ t ‖ 2 ∣ 1 + p ) , C p = π ( 1 + p ) ∕ 2 ∕ Γ ( ( 1 + p ) ∕ 2 ) is a constant, and Γ ( ⋅ ) is the complete gamma function. Thus, ℰ ( F n , a , w , F n ) ≥ 0 with equality to zero iff φ n , a , w ( t ) = φ n ( t ) for all t .

Thus, the weighted energy distance is a distance between the weighted distribution of interest and the target distribution, and is thus a bona fide measure of distributional balance of covariates induced by a set of weights. This proportion extends the duality results of Proposition 1 from Székely and Rizzo [34] and Theorem 1 from Székely et al. [35] for the weighted energy distance at hand. A subtle point is that Proposition 2.1, combined with the decomposition presented in Section 2.2, makes clear that ℰ ( F n , a , w , F n ) more closely aligns with evaluating how well a set of weights estimates the sample average treatment effect [32] than the population ATE τ that is our main focus. Yet, as long as F n is representative of F , the weighted energy distance still aligns well with the population ATE.

We now show that the weighted energy distance converges to the energy distance when the weights yield a well-defined limiting distribution.

Theorem 2.2

Assume E ( ‖ X ‖ 2 ∣ A = a ) < ∞ and E ‖ X ‖ 2 < ∞ . Further assume, for a sequence of weights { w n } n = 1 ∞ with ∑ i = 1 n w i I ( A i = a ) = n a for a ∈ { 0 , 1 } and w i > 0 for i = 1 , … , n for each n (so that F n , a , w n are well-defined as CDFs), that lim n → ∞ φ n , a , w n ( t ) → φ ˜ a ( t ) almost surely for every t ∈ R p , where φ ˜ a is some integrable characteristic function with associated CDF F ˜ a ( x ) . Then almost surely we have

(9) lim n → ∞ ℰ ( F n , a , w n , F n ) = ℰ ( F ˜ a , F ) .

Theorem 2.2 shows that the weighted energy distance converges to the limiting energy distance. If the limiting distribution implied by a set of weights is F ˜ 0 = F ˜ 1 = F , then ℰ ( F ˜ 0 , F ) + ℰ ( F ˜ 1 , F ) = 0 . Proposition 2.1 and Theorem 2.2 together imply that weights with smaller values of the sum ℰ ( F n , 1 , w , F n ) + ℰ ( F n , 0 , w , F n ) yield better distributional balance of covariates. Due to the link between imbalance and bias (see the following section), this also implies that better balance will yield estimates with smaller values for the term (3).

2.4 Bias and distributional imbalance

We now demonstrate this connection between bias and distributional imbalance (as measured by the weighted energy distance) using two illustrative examples. A more formal presentation is provided later in Section 3.2 when proving asymptotic properties of EBWs.

In the first illustrative example, we generate a one-dimensional covariate of sample size 250, which impacts treatment assignment via a logistic model under each of three scenarios: (1) logit ( π ( X ) ) = − 1 + X , (2) logit ( π ( X ) ) = − 1 + X + 2 X 2 ∕ 3 , and (3) logit ( π ( X ) ) = − 1 + X + 2 X 2 ∕ 3 − X 3 ∕ 3 . In each scenario, the response is generated as Y = X + X 3 − 1 ∕ ( 0.1 + 0.1 X 2 ) + ε , where ε ∼ N ( 0 , 2 ) . For each scenario, we construct IPWs based on three logistic regression models, which consider only a linear term in X (denoted as “IPW (1)”), a linear plus quadratic term (“IPW (2)”), and up to the cubic term (“IPW (3)”), respectively. Thus, for Scenarios 2 and 3, at least one of the fitted models is misspecified. For each set of weights w , we compute ℰ ( F n , 0 , w , F n ) + ℰ ( F n , 1 , w , F n ) , i.e., the sum of the energy distances between each treatment group and the combined sample, and compute the error τ ^ w − τ of (2) for τ .

Figure 1(a) displays the energy distances and biases over 1,000 replications of the experiment. We see that the energy distances are the largest in all scenarios for the unweighted estimator ((2) with all weights equal to 1). For scenarios where the weights are estimated using a misspecified model (IPW (1) in Scenario 2 and IPW (1) and (2) in Scenario 3), the energy distances are much larger than for weights based on correctly (or over-specified) models. Correspondingly, the bias is pronounced for the misspecified models. Thus, the weighted energy distance can be a useful tool to compare between different models, as weights with smaller weighted energy distances tend to yield estimates with smaller error.

Figure 1

(a, left) Energy distances and biases for IPW estimates based on weights from the three fitted logistic regression models; (b, right) Boxplots of the biases for IPW estimates versus weighted energy distance based on weights estimated by several methods, each with different combinations of moments included for balancing or estimation.

In the second example, we consider a two-dimensional example where the true assignment mechanism depends on first and second moments of the covariates. We consider several methods for estimate weights: logistic regression, the method of Imai and Ratkovic [12], and the method of Chan et al. [14], each with different moments included for balancing or estimation. We then compare their weighted energy distances and absolute errors of (2) over 1,000 replications. Figure 1(b) displays the distances and errors for each dataset and method. We see that, in general, weights with lower energy distance have a much smaller magnitude of bias in estimating the ATE.

2.5 Other measures of distributional imbalance

There are, of course, many ways of measuring distance between distributions in the literature, including the Kolmogorov-Smirnov statistic, f -divergences (e.g., the Kullback-Leibler divergence and the Hellinger distance), the Wasserstein distance, and the maximum mean discrepancy (MMD). The energy distance has several advantages for characterizing distributional imbalance. First, while in general there is no uniformly most powerful nonparametric test for the difference between two distributions, the energy distance is often sensitive to differences in distributions, unlike the Kolmogorov-Smirnov statistic. Second, unlike the energy distance, f -divergences such as the Kullback–Leibler divergence and Hellinger distance do not metrize weak convergence; this property ensures that the distance remains stable under perturbations of the support of the distributions being measured [37]. Third, the energy distance is easy and efficient to compute, unlike the Wasserstein distance. It also works reasonably well in moderately high dimensions [36], whereas the Wasserstein distance suffers from the curse of dimensionality, in the sense that empirical Wassterstein distances converge to the true Wasserstein distance at rate O ( n − 1 ∕ p ) [37,38]. Finally, while there is a direct link between MMD and the energy distance [39], the energy distance does not require careful tuning of hyperparameters and tends to work well across a wide variety of scenarios.

3 Energy balancing weights

3.1 Definition

We will now use the proposed weighted energy distance to estimate weights which (i) match the distribution of covariates of the treated group to the distribution of covariates of the full population, and (ii) match the distribution of covariates of the control group to the distribution of covariates of the full population.

To achieve this, we define the EBWs to be

(10) w n e ∈ argmin w = ( w 1 , … , w n ) { ℰ ( F n , 1 , w , F n ) + ℰ ( F n , 0 , w , F n ) } s.t. ∑ i = 1 n w i A i = n 1 , ∑ i = 1 n w i ( 1 − A i ) = n 0 , w i ≥ 0 for i = 1 , … , n .

Thus, the EBWs w n e minimize the statistical energy between the treatment and control groups to the full population. Due to the duality result of Proposition 2.1, minimizing statistical energy is directly equivalent to balancing covariate distributions. The constraints ∑ i = 1 n w i I ( A i = a ) = n a serve several purposes: they (i) preserve the sample size of the weighted pseudo-population to that of the study population, (ii) stabilize the estimated weights, and (iii) ensure that F n , 0 , w and F n , 1 , w are valid distribution functions. Also note that, due to the bias decomposition in (3) and the duality result in Proposition (2.1), the weights w n e are explicitly designed to minimize the key component of the finite-sample bias of τ ^ w n e , in a manner agnostic to the functional forms of μ 0 ( x ) and μ 1 ( x ) . If the functional forms of μ 0 ( x ) and μ 1 ( x ) are known, it is certainly possible to construct a different set of weights to better reduce bias, by emphasizing balance in regions of X where either μ 0 ( X ) or μ 1 ( X ) are pronounced. However, this information is rarely (if ever) known, and hence this is difficult to achieve in practice except by good fortune.

To illustrate the effectiveness of EBWs for distributional balance, we consider data generated under Scenario 3 of the toy example in Figure 1(a). Figure 2 shows the difference between the weighted ECDF using EBWs of the covariate in the treatment group and the ECDF of the combined (i.e., treated and untreated) sample, for varying sample sizes n . This is compared with the weighted ECDF using weights from the true data-generating propensity score, the estimated propensity score under the correct model, and the unweighted ECDF of the treatment group. As sample size increases, the difference vanishes for EBWs, but much more slowly for both the true and estimated propensity score weights. This demonstrates the improved distributional balance provided by the proposed EBWs, which should then translate to a greater reduction of bias.

Figure 2

Displayed are the absolute differences between the ECDF of the combined sample and the weighted ECDF of the covariate in the treatment group based on energy weights. Also displayed the same for the unweighted ECDF of the treated group and the weighted ECDF based on the true and estimated propensity scores.

EBWs can be naturally extended to handle a wide variety of scenarios, such as for treatments with more than two levels, estimation of the ATT, and for the estimation of optimal ITRs. In the Supplementary Material, we show how these extensions are manifested and empirically demonstrate the benefit of using EBWs for ITR estimation.

A few key distinguishing features of our proposal from the works of Wong and Chan [18] and Kallus [19] are (1) its direct focus on distributional balance rather than moment balance. Despite the connection between balancing moments of an infinite dimensional class of functions and distributional balance, this feature helps alleviate modeling decisions about what moments to balance of what space of moments to balance and has advantages in terms of interpretability and (2) by focusing on a measure that does not require tuning parameters to characterize distributional differences, our approach can be applied broadly by practitioners of varying degrees of statistical sophistication.

3.2 Asymptotic properties

Next, we show two desirable properties of the proposed EBWs. We first show that the weighted ECDFs based on EBWs indeed converge to the population CDF F of X .

Theorem 3.1

Assume that E ( ‖ X ‖ 2 ∣ A = a ) < ∞ and E ‖ X ‖ 2 < ∞ and that the assumptions presented in Section 2.1 hold. Let w n e be as defined in (10). Then, for a ∈ { 0 , 1 } ,

(11) lim n → ∞ F n , a , w n e ( x ) ≡ lim n → ∞ 1 n a ∑ i = 1 n w i e I ( X i ≤ x , A i = a ) = F ( x )

almost surely for every continuity point x ∈ X . Furthermore,

lim n → ∞ ℰ ( F n , a , w n e , F n ) = 0

holds almost surely.

Thus, EBWs result in the almost sure convergence of the weighted ECDFs of the treated (and untreated) group to the underlying covariate distribution F .

The consistency of τ ^ w n e follows immediately from Theorem 3.1:

Corollary 3.2

Suppose the conditions of Theorem 3.1 hold, and that the treatment and control mean response functions μ 0 ( x ) and μ 1 ( x ) are bounded and continuous on X . Then τ ^ w n e is a consistent estimator of τ .

Next, we show that the ATE estimator τ ^ w n e is root- n consistent. To do so, we make use of the following lemma, which provides a connection between bias and distributional balance:

Lemma 3.3

Let ℋ be the native space induced by the radial kernel Φ ( ⋅ ) = − ‖ ⋅ ‖ 2 on X , and suppose μ a ( ⋅ ) ∈ ℋ for a ∈ { 0 , 1 } . Then, for any weights w satisfying ∑ i = 1 n w i A i = n 1 , ∑ i = 1 n w i ( 1 − A i ) = n 0 , w i ≥ 0 , we have for a ∈ { 0 , 1 } :

∫ μ a ( x ) d [ F n , a , w − F n ] ( x ) 2 ≤ C a ℰ ( F n , a , w , F n ) ,

where C a ≥ 0 is a constant depending on only μ a .

Lemma 3.3 provides a connection between the systematic bias in (3) and the weighted energy distance ℰ ( F n , a , w , F n ) . Under the mild condition that the conditional mean function μ a ( x ) is found in ℋ (this is discussed further at the end of the section), this lemma shows that the weighted energy distance is a key component in an upper bound on the systematic bias in (3). We note that Lemma 3.3 applies to any set of auto-normalized weights, justifying the use of the weighted energy distance to compare between different sets of weights. The proposed EBWs w n e , which minimize ℰ ( F n , a , w , F n ) , may therefore yield lower estimation bias via this upper bound, which is in line with the empirical observations in Section 2.4. With this, we now state the result on root- n consistency:

Theorem 3.4

Assume the same conditions in Theorem 3.1. Let ℋ be the native space induced by the radial kernel Φ ( ⋅ ) = − ‖ ⋅ ‖ 2 on X . Suppose the following mild conditions hold:

μ 0 ( ⋅ ) ∈ ℋ and μ 1 ( ⋅ ) ∈ ℋ ,
Var [ μ 0 ( X ) ] < ∞ and Var [ μ 1 ( X ) ] < ∞ ,
σ 0 2 ( x ) and σ 1 2 ( x ) are bounded over x ∈ X ,
E [ h 0 2 ( X , X ′ , X ″ , X ′ ′ ′ ) ] < ∞ and E [ h 1 2 ( X , X ′ , X ″ , X ′ ′ ′ ) ] < ∞ , where X , X ′ , X ″ , X ′ ′ ′ ∼ i . i . d . F and, with π 0 ( x ) ≔ 1 − π ( x ) and π 1 ( x ) ≔ π ( x ) , the kernel h a is defined for a = 0 , 1 as follows:
(12) h a ( x , y , z , a ) = 1 π a ( x ) ∣ ∣ x − z ∣ ∣ 2 + 1 π a ( y ) ∣ ∣ y − a ∣ ∣ 2 − 1 π a ( x ) π a ( y ) ∣ ∣ x − y ∣ ∣ 2 − ∣ ∣ a − z ∣ ∣ ,
The EBWs w n e = ( w i , n e ) i = 1 n in (10) satisfy w i , n e ≤ C n 1 ∕ 3 for some constant C > 0 independent of n .

Then the proposed EBW estimator τ ^ w n e is root-n consistent, i.e.,

(13) E X , A , Y [ ( τ ^ w n e − τ ) 2 ] = O ( n − 1 ∕ 2 ) .

We give a brief discussion of Assumptions (A1)–(A5). Assumption (A1) concerns the regularity of the conditional mean functions μ 0 and μ 1 . It can be shown that the Sobolev space W ⌈ ( p + 1 ) ∕ 2 ⌉ , 2 ( X ) – the space of functions with square-integrable r < ⌈ ( p + 1 ) ∕ 2 ⌉ -th differentials – is contained within the native space ℋ ( X ) (Theorem 10.42 of [40]), so (A1) can be viewed as a smoothness assumption on the conditional ATE μ 1 − μ 0 (a similar assumption is made in [18]). Assumptions (A2) and (A3) require the conditional mean functions to have finite variance and the conditional variance function to be bounded, respectively. Assumption (A4) require the kernels h 0 and h 1 to have finite second moments; these kernels can be seen as a “modified” Euclidean kernel weighted by the true propensity scores. Assumption (A5) assumes all EBWs to be bounded above by C n 1 ∕ 3 for some positive constant C > 0 ; this assumption has been used in several weighting-based covariate balancing methods [18,25]. In practice, (A5) can always be checked after optimizing for the EBWs in (10). One can change the optimization procedure (10) to explicitly enforce (A5), but we have never encountered “exploding weights,” which violate (A5) in practice; EBWs in our simulations and data analysis typically satisfy (A5) with C = 1 . As we discuss in Section 3.4, if one is willing to add a penalty on the dispersion of the weights, the explicit assumption on the maximum of the weights is unnecessary and root- n consistency still holds, as shown in the Supplement.

While Theorem 3.4 proves the desired root- n consistency of the proposed EBW estimator τ ^ w n e , it unfortunately does not shed light on its asymptotic variance, which is useful for constructing confidence intervals on τ . We recommend the use of bootstrapped confidence intervals for the EBW estimator. We explore a connection to importance sampling in the Supplementary Material.

3.3 Optimization

The optimization problem (10) for computing EBWs (and the later optimization problem (16) for obtaining three-way EBWs) can be viewed as quadratic programs with linear (in)equality constraints. There has been much work on efficient algorithms for solving such programs [41], including interior point methods [42], augmented Lagrangian techniques [43], and extensions of the simplex algorithm [44]. A recent development is the operator splitting approach in Stellato et al. [45], which provides a reliable alternative for nonpositive definite quadratic programs.

In our implementation, we made use of well-maintained interior-point cone programming solvers in the R package cccp [46] for optimizing the EBW formulations (10) and (16). Such solvers follow a two-stage procedure: finding an initial feasible solution for w , then iteratively refining this solution by traversing the interior of the feasible region. Detailed algorithmic steps can be found in Wright et al. [47]. Interior-point algorithms enjoy empirical and theoretical advantages for solving large-scale quadratic programs (see further discussion in Gondzio [48] on its scalability and complexity), and we have observed excellent performance of such algorithms for EBW optimization. In practice, we recommend that the covariates should be normalized so that each covariate has mean zero and unit variance, before being used as inputs for the optimization problem. An efficient implementation of these methods for EBW optimization is provided in our R package ebw, which will be released on comprehensive R archive network in the near future.

3.4 Controlling weight variability

In our experience, the weights resulting from the optimization criterion (10) rarely, if ever, result in large weights; however, in the literature there has been an emphasis on methods that afford explicit control on the variability of weights [33]. To allow for such within our framework, one can simply add a penalty λ n − 2 ∑ i = 1 n w i 2 to the criterion (10) for some fixed λ > 0 . This penalty can be thought of as a penalty on the inverse of the effective sample size of the weights [33]. In our experience, since the energy distance criterion is not nearly as variable as using a kernelized distance using a universal kernel such as the Gaussian kernel, careful tuning of λ is not at all critical and we advocate for a simple choice such as λ = 1 or not using any penalty at all. On the other hand, if one were to replace the energy distance with a kernelized distance using a universal kernel, the choice of λ becomes critical. In the Supplement, we show analogs of Theorems 3.1 and 3.4 hold when using a penalized energy distance criterion, although the conditions regarding the magnitude of the weights in the latter theorem can be relaxed.

3.5 Three-way EBWs

The EBWs in (10) are designed to balance the distributions of covariates between each treatment group to that of the combined sample { X } i = 1 n . As such, the treatment group should be asymptotically balanced to the control group. Yet for finite samples, EBWs do not necessarily guarantee good distributional balance between the treatment and control arms, which can be important in practice. Consider the following re-expression of the two terms in (3):

(14) ∫ x ∈ X [ μ 1 ( x ) + μ 0 ( x ) ] d [ F n , 1 , w − F n , 0 , w ] ( x )

(15) − ∫ x ∈ X μ 1 ( x ) d [ F n − F n , 0 , w ] ( x ) + ∫ x ∈ X μ 0 ( x ) d [ F n − F n , 1 , w ] ( x ) .

Terms (14) and (15) shed light on how the choice of w impacts estimation error for the ATE. In particular, this error term depends not only on (i) how close the weighted ECDFs F n , 0 , w and F n , 1 , w are to that of the combined sample F n but also (ii) how close the weighted ECDF for the control group F n , 0 , w is to the weighted ECDF of the treated group F n , 1 , w . The EBWs in Section 3 take care of the imbalance in (i), but not the imbalance in (ii) between treatment and control. With finite samples, imbalance in (ii) can result in (14) dominating (15), depending on the properties of μ 0 ( x ) , μ 1 ( x ) , and μ 0 ( x ) + μ 1 ( x ) . Indeed, the importance of this three-way balance is recognized in the literature, and is emphasized in Chan et al. [14] as an important component in constructing globally efficient estimators based on moment balancing. In our case, the target is the three-way distributional balance, i.e., balance in (i) and (ii).

We propose an extension of EBWs, the improved EBWs (iEBWs), to help improve covariate balance between the treatment and control groups. These improved weights are defined as follows:

(16) w n e i ∈ argmin w { ℰ ( F n , 1 , w , F n ) + ℰ ( F n , 0 , w , F n ) + ℰ ( F n , 0 , w , F n , 1 , w ) } s.t. ∑ i = 1 n w i A i = n 1 , ∑ i = 1 n w i ( 1 − A i ) = n 0 , w i ≥ 0 for i = 1 , … , n ,

where

ℰ ( F n , 1 , w , F n , 0 , w ) ≡ 2 n 1 n 0 ∑ i = 1 n ∑ j = 1 n w i w j A i ( 1 − A j ) ‖ X i − X j ‖ 2 − 1 n 1 2 ∑ i = 1 n ∑ j = 1 n w i w j A i A j ‖ X i − X j ‖ 2 − 1 n 0 2 ∑ i = 1 n ∑ j = 1 n w i w j ( 1 − A i ) ( 1 − A j ) ‖ X i − X j ‖ 2

is the energy distance between the weighted ECDFs for treated and control. Thus, the iEBWs w n e i aim to minimize imbalance not only between treatment arms and the full population but also between the treatment arm and the control arm. Note that ℰ ( F n , 1 , w , F n , 0 , w ) still retains the properties of a weighted energy distance, in the sense that ℰ ( F n , 0 , w , F n , 1 , w ) = ∫ R p ∣ φ n , 1 , w ( t ) − φ n , 0 , w ( t ) ∣ 2 ω ( t ) d t , as Proposition 2.1 can be trivially extended to the case where both arguments of the energy distance are weighted.

4 Extensions and applications of EBWs

4.1 Estimation of the ATT

A common target of estimation is the (population) ATT, τ ( 1 ) ≡ E ( Y ( 1 ) − Y ( 0 ) ∣ A = 1 ) , which is the mean difference in potential outcomes among those who are actually treated. Due to the unconfoundedness assumption, we can write

(17) τ ( 1 ) = ∫ x ∈ X [ μ 1 ( x ) − μ 0 ( x ) ] d F 1 ( x ) ,

which suggests that a plug-in estimator can be obtained by replacing F 1 ( x ) with a suitable energy-weighted ECDF. To do so, we define the new weights w n e 1 = ( w 1 e 1 , … , w n e 1 ) , where w i e 1 = 1 for { i : A i = 1 } and

(18) { w i e 1 } i : A i = 0 ∈ argmin w ℰ ( F n , 0 , w , F n , 1 ) s.t. ∑ i = 1 n w i ( 1 − A i ) = n 0 , w i ≥ 0 for i = 1 , … , n ,

where F n , 1 ( x ) = 1 n 1 ∑ i = 1 n A i I ( X i ≤ x ) . Thus, w n e 1 balances the covariate distribution for the control group to the covariate distribution of the treated group. Similar to Theorem 3.1, we have lim n → ∞ F n , 0 , w n e 1 ( x ) = F 1 ( x ) a.s. for every continuity point x ∈ X . With the new weights w n e 1 , the natural plug-in estimate for the ATT τ ( 1 ) is τ ˆ w n e 1 (i.e., the weighted estimator in (2) with w = w n e 1 ).

4.2 Multi-category treatments

EBWs can also be constructed for multi-category treatments. When the treatment A i takes multiple values, i.e. A i ∈ A = { a 1 , … , a K } . Denote n a = ∑ i = 1 n I ( A i = a ) , for a ∈ A . Each unit has K potential outcomes ( Y i ( a 1 ) , … , Y i ( a K ) ) , one for each level. The unconfoundedness assumption in this setting becomes { Y i ( a 1 ) , … , Y i ( a K ) } ⊥ ⊥ A ∣ X . Following Lopez and Gutman [49], the positivity assumption required is now 0 < π ( a , x ) = P ( A = a ∣ X = x ) < 1 for all a ∈ A and all possible x ∈ X . The standard IPW estimator for E [ Y ( a ) ] involves inverse weighting each sample i by π ( A i , X i ) . Instead, we propose to estimate E [ Y ( a ) ] or any causal contrast τ ( a − a ′ ) ≡ E [ Y ( a ) − Y ( a ′ ) ] for a , a ′ ∈ A with EBWs.

We define the EBWs for the multiple treatment case as follows:

w n e m ∈ argmin w = ( w 1 , … , w n ) ∑ a ∈ A ℰ ( F n , a , w , F n ) s.t. ∑ i = 1 n w i I ( A i = a ) = n a for all a ∈ A and w i ≥ 0 for i = 1 , … , n .

Improved EBWs, which encourage covariate balance between all pairs of treatment options, can be defined similarly as (16), where an additional weighted energy distance between each pair of treatment options is added to the objective. Given any two treatment options a , a ′ ∈ A , we can then estimate the causal contrast τ ( a − a ′ ) with

τ ^ w n e m ( a − a ′ ) = 1 n a ∑ i = 1 n w i Y i I ( A i = a ) − 1 n a ′ ∑ i = 1 n w i Y i I ( A i = a ′ ) .

4.3 Estimation of ITRs

As many treatments exhibit heterogeneous effects for different patients, there is great interest in tailoring treatment decisions to patients. A main line of work in this area is the development of statistical methods aimed at finding an optimal ITR, which maps patient characteristics to treatment decisions. Thus, the immediate goal is to estimate a mapping d : X ↦ { 0 , 1 } which optimizes the expected potential outcomes under the distribution induced by d . Following Qian and Murphy [23], and Zhao et al. [24] and assuming that larger values of the outcome are preferred, the optimal ITR is defined as

(19) d * ∈ argmax d E [ Y ( d ) ] = argmax d E Y I ( A = d ( X ) ) π ( A , X ) = argmin d E Y I ( A ≠ d ( X ) ) π ( A , X ) ,

where π ( a , X ) = P ( A = a ∣ X ) and the second equality holds due to the causal assumptions outlined in Section 2.1. The optimal ITR d * has the property that d * ( x ) = a ⇒ μ a ( x ) > μ 1 − a ( x ) . The last term in (19) appears as a weighted classification task due to the weighted 0–1 loss. With observed data, the objective becomes to minimize

(20) 1 n ∑ i = 1 n Y i π ( A i , X i ) I ( A i ≠ d ( X i ) ) .

Due to the nonconvexity of (20), in practice I ( A i ≠ d ( X i ) ) is replaced with a surrogate, such as the hinge function ( 1 − ( 2 A i − 1 ) d ( X i ) ) + (see, e.g., the outcome weighted learning (OWL) method of Zhao et al. [24]). Yet, this objective still requires estimation of the propensity score π ( X ) and involves plugging in this estimate to (20), which subjects it to the same issues of propensity score weighted estimates of τ . To see how EBWs can be used in place of π ( A , X ) , we can express (19) as argmin d ∫ x ∈ X ∑ a ∈ { 0 , 1 } μ a ( x ) I ( d ( x ) ≠ a ) d F ( x ) . Thus, d * ( x ) is a functional of F ( x ) . This suggests a plug-in estimator by replacing F ( x ) with energy-weighted ECDFs (either F n , a , w n e ( x ) or F n , a , w n e i ( x ) ). Thus, we propose to estimate the optimal ITR as follows:

(21) d ˆ * ∈ argmin d 1 n ∑ i = 1 n Y i w i e ϕ ( A i , d ( X i ) ) ,

where w i e could be replaced with improved weights w i e i and ϕ ( a , d ) is a convex surrogate for I ( a ≠ d ) , such as the hinge function or logistic loss. As in Zhao et al. [24], to prevent overfitting, a penalty λ n ‖ d ‖ 2 can be added to (21) to control complexity of the estimated ITR.

To demonstrate the effectiveness of using EBWs in optimal ITR estimation, we provide an illustrative example under two data-generating scenarios. For both scenarios, we generate outcomes as Y = g ( X ) + A ˜ Δ ( X ) ∕ 2 + ε , where g ( X ) are the main effects of X , A ˜ = 2 A − 1 , and Δ ( X ) = μ 1 ( X ) − μ 0 ( X ) is the treatment-covariate interaction, ε ∼ N ( 0 , 1 ) , and R 10 ∋ X ∼ i . i . d . Unif ( − 1 , 1 ) . Both scenarios are motivated by the simulation studies of Zhao et al. [24], but generate A from a logistic regression model with terms depending on up to third-order polynomials in a subset of the predictors, and g ( X ) contains nonlinear terms in the predictors. Full details of the experiment are given in the Supplementary Material. We utilize the OWL method to obtain estimates d ˆ , which uses inverse weighting via the propensity score and adds a penalty λ n ‖ d ‖ 2 to the objective. For OWL, the propensity score is misspecified to only include linear terms in the covariates. We also estimate d * by minimizing (21) plus λ n ‖ d ‖ 2 . We denote this as OWL (EBW) for weights given by (10) and OWL (iEBW) for weights given by (16). We simulate 1,000 independent datasets and compute the average value function E ^ [ Y ( d ˆ ) ] evaluated on a large independent dataset in addition to the missclassification rate in estimating d * ( X ) on the independent dataset. The results are given in Table 1, which indicates that EBWs can yield better performing ITRs.

Table 1

Displayed are the value functions and misclassification rates for the optimal ITR estimation example averaged over 1,000 independent simulated datasets

	Scenario 1			Scenario 2
Method	Value (SD)		Misclass	Value (SD)		Misclass
OWL (EBW)	3.168	(0.253)	0.283	2.921	(0.166)	0.228
OWL (iEBW)	3.198	(0.204)	0.277	2.931	(0.165)	0.223
OWL (PS)	2.671	(0.327)	0.344	2.706	(0.196)	0.295

In Scenario 1, the optimal value is 4.74 with 56% of units with optimal assignments of A = 1 ; in Scenario 2, the optimal value is 3.19 with 50% of units with optimal assignments of A = 1 . The bold values indicate the best performance across all methods for a given setting.

5 Simulation studies

To evaluate the finite sample performance and operating characteristics of our proposed estimators, we conducted a large-scale simulation study across a wide variety of data-generating scenarios. Since existing techniques, such as empirical calibration balancing and the covariate balancing propensity score, work exceedingly well when the correct moments are specified to be balanced, we primarily consider simulation settings where the relationships between covariates and both the treatment status and outcome regression model are nonlinear. We consider a wide range of scenarios for the propensity score and outcome regression models, several of which are examples taken from the existing works. Outcome models B and E are taken from [18], outcome model D is taken from Wong and Chan [50], and outcome model A is a slight modification of an outcome model from Kang and Schafer [11]. Outcome model C is designed to be linear in X for the untreated group, but nonlinear and with interaction terms for the treated group. Propensity model III is from Kang and Schafer [11] but with the main effects having twice the dimension as by the Kang and Schafer [11] example. The remainder of the propensity models have interactions, smooth nonlinear terms, or nonsmooth nonlinear terms. We generate a p -dimensional covariate vector Z = ( Z i , … , Z p ) from a p -dimensional mean zero multivariate Gaussian distribution with Cov ( Z i , Z j ) = − 0.7 5 ∣ i − j ∣ , which results in positive and negative correlations between predictors. In covariate setup 1, we let the observed covariates be X = Z ; however, similar to the widely-tested setup by Kang and Schafer [11], in covariate setup 2, we define X = ( X 1 , … , X p ) , where X 1 = exp ( Z 1 ∕ 2 ) , X 2 = Z 2 ∕ ( 1 + exp ( Z 1 ) ) + 10 , X 3 = ( Z 1 Z 3 ∕ 25 + 0.6 ) 3 , X 4 = 20 + ( Z 2 + Z 4 ) 2 , X 5 = exp ( Z 5 ∕ 2 ) , X 6 = Z 6 ∕ ( 1 + exp ( Z 5 ) ) + 10 , X 7 = ( Z 1 Z 7 ∕ 25 + 0.6 ) 3 , and X 8 = 5 + ( Z 6 + Z 8 ) 2 . To align with the setup of Kang and Schafer [11], we utilize covariate setup 2 for any setting with propensity model III. To align with the setup of Wong and Chan [18], we utilize covariate setup 2 for any setting with outcome model B or E. All other settings use covariate setup 1.

We compare our proposed EBW and the iEBWs (16), both with no penalty on the squares of the weights, with several widely used alternatives. First, we compare with inverse propensity score weights (denoted “IPW”). In addition, we compare with the covariate balancing propensity score weights (denoted “CBPS”), the empirical calibration balancing weights (denoted “Cal”) with exponential tilting weights. For all methods that require specification of a model for the treatment assignment or moments to balance, only first-order terms in X were used, as in Wong and Chan [18]. This reflects a setting where the analyst misspecifies that only first moments should be balanced. We also utilize the kernel-based functional covariate balancing approach of Wong and Chan [18], denoted as “KCB” for kernel covariate balancing with the second-order Sobolev kernel. As a baseline for comparison, we investigate a naïve unweighted estimator that simply compares the means between treated groups and denote this as “Unweighted.” For all estimators, we use weights normalized by treatment group. Thus, the IPW approach is the Hájek estimator as opposed to the more unstable, nonnormalized Horvitz-Thompson estimator. For each setting, we generate 1,000 independent datasets and evaluate each method based on the square root of the mean-square error (RMSE) and bias in estimating the ATE. Simulations for IPW, Cal, and CBPS are conducted using the WeightIt R package [51] and that of KCB using the ATE.ncb package.

For the sake of brevity of presentation, we present a summary of the results across all outcome models. More detailed results are presented in the Supplementary Material. Table 2 contains a summary of the results averaged across outcome models (A–E) and dimension settings ( p ∈ { 10 , 25 } ). Each entry in the table is the average rank of each method in terms of RMSE and bias for each combination of outcome model and dimension; i.e. the method with the smallest RMSE for a particular setting receives a “1” and the method with the largest RMSE receives a “7.” The Supplementary Material contains a similar summary, but averaged over propensity models and dimensions. In general, iEBW tends to yield the best rank in terms of performance across the settings, with EBW and KCB yielding similarly low ranks.

Table 2

Displayed are the ranks among all methods tested of each method in terms of RMSE and bias averaged over all response models (I–VI) for n = 250 and over the dimension settings p = 10 and p = 25 . The bold values indicate the best performance across all methods for a given setting

Propensity model:	I		II		III		IV		V		VI
	Mean rank		Mean rank		Mean rank		Mean rank		Mean rank		Mean rank
Method	RMSE	Bias	RMSE	Bias	RMSE	Bias	RMSE	Bias	RMSE	Bias	RMSE	Bias
Unweighted	4.7	3.7	6.5	4.9	5.0	4.5	4.8	4.1	6.2	5.3	3.9	3.1
EBW	2.5	2.8	2.4	2.7	3.2	4.2	2.5	3.6	1.9	2.6	2.5	3.8
iEBW	2.2	2.5	1.5	1.7	3.0	3.9	1.7	3.5	1.9	2.0	2.0	3.1
KCB	3.0	3.4	2.5	2.5	3.2	4.3	2.7	2.5	3.0	3.6	2.5	2.3
IPW	5.5	5.0	6.4	5.9	5.0	2.9	6.3	5.4	5.8	5.1	6.8	6.0
CBPS	5.1	4.8	5.1	5.3	4.7	3.9	5.1	3.7	5.6	5.3	4.9	4.2
Cal	5.0	5.8	3.6	5.0	3.9	4.3	4.9	5.2	3.6	4.1	5.4	5.5

6 RHC data

6.1 Description of data

A study by Connors et al. [2] was conducted to investigate the effectiveness of RHC, a diagnostic procedure for critically ill patients in ICUs. Since RHC is more relevant for certain forms of intensive care than others, there is substantial imbalance in patient characteristics in those treated with RHC and those who did not receive RHC. The original analysis was based on propensity score matching, and the data have been subsequently re-analyzed in many other works [7,20,52,53]. The study consists of data on 5,735 individuals, 2,184 of whom received RHC, and the remaining 3,551 did not receive RHC. The outcome is an indicator of survival at 30 days after admission. A panel of experts convened to discuss which variables contribute to a decision to use RHC, resulting in a large set of covariates to study (72 in total, 21 of which are continuous, 25 binary, and 26 dummy variables originating from 6 categorical covariates). The dataset is publicly available at: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/rhc.html. There are substantial empirical differences in the distributions of many of these covariates between treatment groups (RHC vs. no RHC). In Section 6.2, we study the effect of RHC on 30-day survival. However, since there is no ground truth available, in Section 6.3, we use the RHC data to conduct a realistic simulation that demonstrates the effectiveness of EBWs.

6.2 Analysis of RHC data

We used 65 of the available covariates, as in the analysis of the same dataset in Rosenbaum [53], leaving out date-related covariates. Using the 65 covariates, we applied the weighting methods used in Section 5 (except the method of Wong and Chan [18] as the code returned constant weights of 1 regardless of the tuning parameters used) to estimate weights to balance the treated groups. To first investigate how well each method balances the marginal means of each covariate, we evaluate the absolute standardized mean differences for each covariate and p -values for the difference in weighted means between treated and control groups, which are displayed in Figure 3. Empirical calibration balancing is designed to balance all specified moments exactly, and since we balance the first moments, these have exact balance. Both EBWs still result in tight moment balance despite the fact that this was not an explicit goal. In Figure 4, we investigate how well each method balances the distributions of all 1- and two-dimensional projections of the covariates. To do this, we compare the weighted ECDFs for each projection by evaluating the square root of the integrated mean-square error (RIMSE) of the weighted ECDFs between the two treatment groups. Both EBWs yield the smallest RIMSEs on average. However, it is important to note that lower-order projections of the distribution are not the explicit focus of EBWs and thus, the EBWs also likely balance other aspects of the distributions of covariates between the treatment groups. We display further measures of discrepancy of the covariates between the treatment groups in Table 3. In addition to displaying the average RIMSE values and absolute SMDs, in this table, we display the weighted energy values for each of the weights, including the unweighted energy distances. By definition, EBWs have the minimum weighted energy distances; however, it can be seen that the empirical calibration balancing weights have a relatively small weighted energy distance, indicating its use is likely sensible for the RHC data. The estimates of the ATE of RHC and its standard errors are displayed in Table 3. Standard errors are computed using the nonparametric bootstrap with 1,000 replications. The EBW-based estimates have the smallest standard errors, and, perhaps more interestingly, yield an estimate of the effect of RHC that is slightly less deleterious than estimates in the literature, despite it still being a highly significant effect.

Figure 3

Displayed are p -values for tests of the difference in weighted means between treatment groups marginally for each covariate (left) and for each pairwise interaction of covariates (right). For continuous covariates, weighted t -tests are used and weighted Chi-squared tests are used for discrete covariates. For an RCT, the sorted p -values are expected to roughly fall along the diagonal – lines above the diagonal indicate an improvement in moment balance over random assignment.

$Figure 4 This figure illustrates the differences in the marginal univariate weighted ECDFs between the treated and control populations. In particular, let x j ∈ X j {x}_{j}\in {{\mathcal{X}}}_{j} denote the j j component of the covariate vector x {\bf{x}} and let F n , a , j ( x ) {F}_{n,a,j}\left(x) denote its empirical CDF on treatment arm a a . Similarly denote the weighted versions of this quantity. Here, we are displaying how well each method balances the marginal empirical CDFs for the treated and control arms. We do so by evaluating an estimate of ∫ x j ∈ X j [ F n , 1 , j , w ‒ F n , 0 , j , w ] ( x j ) d x j 1 ∕ 2 {\left\{{\int }_{{x}_{j}\in {{\mathcal{X}}}_{j}}{[}{F}_{n,1,j,{\boldsymbol{w}}}‒{F}_{n,0,j,{\boldsymbol{w}}}]\left({x}_{j}){\rm{d}}{x}_{j}\right\}}^{1/2} obtained by integration over a grid of values for all j = 1 , … , 65 j=1,\ldots ,65 . The results across all covariates are displayed in the left two plots above. The rightmost plots similarly display the RIMSEs for all possible 65 choose 2 bivariate CDFs.$

Figure 4

This figure illustrates the differences in the marginal univariate weighted ECDFs between the treated and control populations. In particular, let x j ∈ X j denote the j component of the covariate vector x and let F n , a , j ( x ) denote its empirical CDF on treatment arm a . Similarly denote the weighted versions of this quantity. Here, we are displaying how well each method balances the marginal empirical CDFs for the treated and control arms. We do so by evaluating an estimate of ∫ x j ∈ X j [ F n , 1 , j , w ‒ F n , 0 , j , w ] ( x j ) d x j 1 ∕ 2 obtained by integration over a grid of values for all j = 1 , … , 65 . The results across all covariates are displayed in the left two plots above. The rightmost plots similarly display the RIMSEs for all possible 65 choose 2 bivariate CDFs.

Table 3

Estimates of the ATE and standard errors for the RHC data. Standard errors were computed for all methods using the nonparametric bootstrap with 1,000 replications. Also displayed are various measures of discrepancy between the distributions of covariates for the RHC and non-RHC groups. In addition to weighted energy distances, we display mean and max “ ∣ SMD ∣ ,” which are the average and maximum, respectively, absolute standardized marginal mean difference of covariates

	Unwtd	CBPS	IPW	Cal	EBW	iEBW
τ ^ w	0.0736	0.0576	0.0528	0.0547	0.0499	0.0470
SE ( τ ^ w )	0.0132	0.0142	0.0155	0.0145	0.0120	0.0117
Energy dist (10)	8.1996	0.8572	0.7804	0.4250	0.3236	0.3377
Energy dist (16)	23.7172	1.8560	1.5756	1.0045	0.7536	0.7270
Mean ∣ SMD ∣	0.1648	0.0274	0.0182	0.0000	0.0063	0.0043
Max ∣ SMD ∣	0.5826	0.1139	0.0778	0.0000	0.0239	0.0168
SD ∣ SMD ∣	0.1276	0.0206	0.0147	0.0000	0.0050	0.0036

6.3 RHC data simulation

In this section, we fix the covariates and treatment assignments from the RHC data, and simulate responses with confounding. This produces a realistic and highly challenging simulation scenario with high dimension. We use the key functional form from outcome model D from Section 5. Since this outcome model involves only 7 covariates, we use the functional form from this outcome model, apply it over 8 separate groups of 7 covariates in the RHC data, and take the sum of all groups of 7 covariates as the mean of the outcome. We then simulate 1,000 independent datasets using this procedure and each time estimate the ATE and record each method’s RMSE in estimating the true ATE. Since under this model the ordering of the covariates changes the nature of the confounding, we randomly permute the columns 100 times and repeat the simulation 1,000 times for each permutation and each time record the RMSE over the 1,000 replications. Further details are provided in the Supplementary Material. The RMSEs for each of the 100 column permutations are displayed in Figure 5. Both EBW and iEBW consistently yield among the smallest RMSE across the permutation settings. Since the outcome model in this simulation involves a constant treatment effect, and thus is not impacted by potential issues of covariate overlap, and in the Supplementary Material, we additionally consider an outcome model with a treatment effect that depends on X ; the pattern of the results is consistent with the findings using a constant treatment effect, albeit the advantage of EBWs is slightly more pronounced with the heterogeneous treatment effect scenario.

Figure 5

RMSEs for each method across 100 outcome models using the RHC data.

7 Analysis of the MIMIC-III critical care database

We analyze the effectiveness of three treatments based on separate subpopulations of the MIMIC-III v1.4 critical care database [3]: a study of the effect of indwelling arterial catheters (IACs) on mortality in patients with respiratory failure [54], a study of the effect of transthoracic echocardiography (echo) on mortality in sepsis patients [55], and a study of mechanical power of ventilation (MPV) on mortality in critically ill patients [56]. Each study is based on the existing studies utilizing MIMIC-III. The degree of confounding in each study varies, with the IAC and MPV studies exhibiting a great degree of confounding and the echo study with minimal confounding. For all studies, missing values were imputed using missForest [57]. We present the IAC study in this section and the remaining two studies in the Supplementary Material. For each study, we present treatment effect estimates and balance statistics using each method used in the main text. For all studies, we also use the covariates and treatment assignments of the observed data to conduct simulation studies, using the same approach used for the simulation based on the RHC data. In essence, with these simulation studies we preserve the treatment assignment mechanism of the observed data and simulate outcomes that involve a high likelihood of confounding under this real-world treatment assignment mechanism. The simulation studies investigate scenarios with a constant treatment effect over X and with a treatment effect that varies with X but results in a (population and sample) ATE of 0. The outcome models used are the same as described in the Supplementary Material. Each dataset has a differing number of confounders; however, the outcome model across all datasets involves 63 covariates impacting the response for any given scenario. As in the setup for the RHC data simulation, the columns are permuted 100 times, resulting in 100 separate outcome models with different covariates impacting the response in different ways.

7.1 IAC data

In this section, we replicate a study originally conducted Hsu et al. [54] based on the MIMIC-III critical care database to study the effect of indwelling arterial catherization on 28 day mortality. The data are based on the queries provided in https://github.com/MIT-LCP/mimic-codeand contains information on 2,522 mechanically-ventilated patients, 1,298 of whom received the treatment, IAC. The outcome is an indicator of 28 day mortality from time of admission. Pre-treatment covariates likely to be confounders include demographics, lab values, calculated risk scores, missingness indicators, and more, totaling to a design matrix with p = 81 . We leveraged the same approach used in the RHC study to control for the confounders via various weighting approaches to estimate the ATE of IAC on 28-day mortality. The KCB approach yielded constant weights of 1 regardless of the tuning parameter. Table 4 presents information on the estimated effect of IAC on mortality based on each weighting method in addition to various balance statistics for the 81 covariates, including marginal mean balance, univariate and bivariate distributional balance, and weighted energy statistics. EBW and iEBW yield the best univariate and bivariate distributional balance, while Cal and CBPS result in the best marginal mean balance. Note that Cal does not achieve exact marginal mean balance due to numerical issues. All approaches except inverse weighting by propensity score (IPW) yield 95% confidence intervals that contain zero, albeit with point estimates that suggest a potential benefit of IAC. IPW results in a significant estimated benefit of IAC on 28 day mortality, with a point estimate of the benefit that is substantially larger than for other methods.

Table 4

Estimates of the ATE and standard errors for the IAC data. Standard errors were computed for all methods using the nonparametric bootstrap with 1,000 replications. Also displayed are various measures of discrepancy between the distributions of covariates for the IAC and control groups. We also display the mean and max RIMSE statistic for marginal univariate and bivariate CDF differences, as in Figure 4. In addition, we display summary statistics of SMDs for marginal means and SMDs for all polynomials up to order 5 and pairwise interactions (denoted SMD(2)). The bold values indicate the best performance across all methods for a given setting

	Unweighted	CBPS	IPW	Cal	EBW	iEBW
τ ^ w	0.0012	‒ 0.0343	‒ 0.0744	‒ 0.0150	‒ 0.0150	‒ 0.0130
SE ( τ ^ w )	0.015	0.0217	0.0446	0.0165	0.0118	0.0114
Energy dist (10)	4.5625	0.7829	31.336	0.5612	0.3869	0.4028
Energy dist (16)	13.6796	1.6922	62.6704	1.3854	0.9614	0.9304
Mean RIMSE, 1d	0.0362	0.0130	0.0670	0.0126	0.0113	0.0111
Max RIMSE, 1d	0.0863	0.0277	0.1646	0.0256	0.0235	0.0245
Mean RIMSE, 2d	0.0416	0.0075	0.0619	0.0071	0.0067	0.0062
Max RIMSE, 2d	0.3295	0.0283	0.2028	0.0284	0.0248	0.0189
Mean ∣ SMD ∣	0.0732	0.0023	0.0990	0.0001	0.0060	0.0045
Max ∣ SMD ∣	0.2996	0.0203	1.2906	0.0028	0.0289	0.0212
Mean ∣ SMD(2) ∣	0.0788	0.0081	0.0932	0.0075	0.0093	0.0073
Max ∣ SMD ( 2 ) ∣	0.6801	0.1782	1.2906	0.1412	0.1502	0.0998

In addition, we conduct a simulation study based on the IAC data with precisely the same response-generating mechanism as for the simulation on the RHC data in Section 6.3. Although the dimension of the IAC data is higher, the number of dimensions that impact the response for each data-generating setting is 63, the same as in Section 6.3. The results in terms of RMSE in estimating the ATE across the 100 data-generating scenarios for both the constant and heterogeneous treatment effect settings, each averaged over 1,000 replications, are displayed in Table 5. The iEBW approach results in the smallest mean, median, and worst-case RMSE, followed by EBW, which is closely followed by Cal and CBPS. IPW in this case results in nearly worse performance than no weighting. We also conducted the same simulation study but with a treatment effect that varies in X .

Table 5

Displayed are the median, mean, standard deviation, and maximum RMSEs for each method across the 100 simulation settings using the IAC data. The bold values indicate the best performance across all methods for a given setting

	Unweighted	CBPS	IPW	Cal	EBW	iEBW
	Constant treatment effect
Median RMSE	8.0151	3.2899	7.9435	3.4895	3.0276	1.8319
Mean RMSE	9.1296	3.8542	9.5545	3.7609	3.5113	2.2087
SD RMSE	6.7091	2.5588	7.5263	2.5293	2.5028	1.5785
Max RMSE	32.3715	11.0479	39.1095	12.0521	12.4231	7.2066
	Heterogeneous treatment effect
Median RMSE	11.9037	3.8587	15.1933	3.7128	3.0732	1.7355
Mean RMSE	13.5594	4.4330	18.4590	4.1689	3.9079	1.8688
SD RMSE	9.9665	3.1225	14.6490	2.8963	2.8718	1.3415
Max RMSE	48.0820	12.3672	75.3899	12.8563	13.8571	5.6715

8 Discussion

We have introduced a new metric, the weighted energy distance, which measures the distributional balance induced by a set of weights and thus can be used to determine which set of weights is likely to result in low bias when estimating a causal quantity. Building on the weighted energy distance, we have introduced the EBWs that minimize this distance to achieve distributional balance. The energy balancing weights are robust and reliable across many functional forms of confounding and further rarely result in large weights. Due to the distributional balancing of the energy balancing weights, they can be utilized to estimate a wide variety of causal estimands which can be represented as a statistical functional of the population distribution function of the covariates. While we focused entirely on the weighted energy distance, the connection between the energy distance and distances between embeddings of probability measures into reproducing kernel Hilbert spaces [39] opens up the possibility of more effective distributional balancing weights if more is known about the functional form of confounding. In particular, if the analyst believes lower order projections of the distribution should be balanced with priority over higher order aspects of the distribution, the use of a kernel which emphasizes these projections, such as the sparsity-inducing kernel in Mak and Joseph [58], could be used.

Acknowledgements

The authors would like to thank the anonymous referees for their helpful and constructive feedback. Dr. Mak was funded by NSF CSSI Frameworks 2004571, NSF DMS 2316012, and NSF DMS 2316012.

Conflict of interest: Authors state no conflict of interest.

References

[1] Dasgupta T, Pillai NS, Rubin DB. Causal inference from 2K factorial designs by using potential outcomes. J R Stat Soc Ser B (Stat Methodol). 2015;77(4):727–53. 10.1111/rssb.12085Suche in Google Scholar

[2] Connors AF, Speroff T, Dawson NV, Thomas C, Harrell FE, Wagner D, et al. The effectiveness of right heart catheterization in the initial care of critically Ill patients. J Amer Med Assoc. 1996;276(11):889–97. 10.1001/jama.276.11.889Suche in Google Scholar PubMed

[3] Johnson AE, Pollard TJ, Shen L, Li-wei HL, Feng M, Ghassemi M, et al. MIMIC-III, a freely accessible critical care database. Scientific Data. 2016;3:160035. 10.1038/sdata.2016.35Suche in Google Scholar PubMed PubMed Central

[4] Robins JM, Rotnitzky A. Semiparametric efficiency in multivariate regression models with missing data. J Amer Stat Assoc. 1995;90(429):122–9. 10.1080/01621459.1995.10476494Suche in Google Scholar

[5] Hahn J. On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica. 1998;66:315–31. 10.2307/2998560Suche in Google Scholar

[6] Robins JM, Hernan MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000;11:550–60. 10.1097/00001648-200009000-00011Suche in Google Scholar PubMed

[7] Hirano K, Imbens GW. Estimation of causal effects using propensity score weighting: An application to data on right heart catheterization. Health Services Outcomes Res Meth. 2001;2(3–4):259–78. Suche in Google Scholar

[8] Hirano K, Imbens GW, Ridder G. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica. 2003;71(4):1161–89. 10.1111/1468-0262.00442Suche in Google Scholar

[9] Imbens GW. Nonparametric estimation of average treatment effects under exogeneity: A review. Rev Econom Stat. 2004;86(1):4–29. 10.1162/003465304323023651Suche in Google Scholar

[10] Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70(1):41–55. 10.1093/biomet/70.1.41Suche in Google Scholar

[11] Kang JD, Schafer JL. Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Stat Sci. 2007;22(4):523–39. 10.1214/07-STS227Suche in Google Scholar PubMed PubMed Central

[12] Imai K, Ratkovic M. Covariate balancing propensity score. J R Stat Soc Ser B (Stat Methodol). 2014;76(1):243–63. 10.1111/rssb.12027Suche in Google Scholar

[13] Hainmueller J. Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies. Political Analysis. 2012;20(1):25–46. 10.1093/pan/mpr025Suche in Google Scholar

[14] Chan KCG, Yam SCP, Zhang Z. Globally efficient non-parametric inference of average treatment effects by empirical balancing calibration weighting. J R Stat Soc Ser B (Stat Methodol). 2016;78(3):673–700. 10.1111/rssb.12129Suche in Google Scholar PubMed PubMed Central

[15] Zubizarreta JR. Stable weights that balance covariates for estimation with incomplete outcome data. J Amer Stat Assoc. 2015;110(511):910–22. 10.1080/01621459.2015.1023805Suche in Google Scholar

[16] Zhao Q. Covariate balancing propensity score by tailored loss functions. Ann Stat. 2019 Apr;47(2):965–93. 10.1214/18-AOS1698Suche in Google Scholar

[17] Deming WE, Stephan FF. On a least squares adjustment of a sampled frequency table when the expected marginal totals are known. Ann Math Stat. 1940;11(4):427–44. 10.1214/aoms/1177731829Suche in Google Scholar

[18] Wong RK, Chan KCG. Kernel-based covariate functional balancing for observational studies. Biometrika. 2017;105(1):199–213. 10.1093/biomet/asx069Suche in Google Scholar PubMed PubMed Central

[19] Kallus N. Generalized optimal matching methods for causal inference. J Machine Learn Res. 2020;21(1):2300–53. Suche in Google Scholar

[20] Li F, Morgan KL, Zaslavsky AM. Balancing covariates via propensity score weighting. J Amer Stat Assoc. 2018;113(521):390–400. 10.1080/01621459.2016.1260466Suche in Google Scholar

[21] Hirshberg DA, Maleki A, Zubizarreta JR. Minimax linear estimation of the retargeted mean. 2019. arXiv: http://arXiv.org/abs/arXiv:190110296. Suche in Google Scholar

[22] Székely GJ, Rizzo ML. Testing for equal distributions in high dimension. InterStat. 2004;5(1–6):1249–72. Suche in Google Scholar

[23] Qian M, Murphy SA. Performance guarantees for individualized treatment rules. Ann Stat. 2011;39(2):1180. 10.1214/10-AOS864Suche in Google Scholar PubMed PubMed Central

[24] Zhao Y, Zeng D, Rush AJ, Kosorok MR. Estimating individualized treatment rules using outcome weighted learning. J Amer Stat Assoc. 2012;107(499):1106–18. 10.1080/01621459.2012.695674Suche in Google Scholar PubMed PubMed Central

[25] Athey S, Imbens GW, Wager S. Approximate residual balancing: debiased inference of average treatment effects in high dimensions. J R Stat Soc Ser B (Stat Methodol). 2018;80(4):597–623. 10.1111/rssb.12268Suche in Google Scholar

[26] Neyman J. On the application of probability theory to agricultural experiments. Essay on principles. Section 9. translated in Statistical Science. vol. 1923; 1990. p. 465–472. Suche in Google Scholar

[27] Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol. 1974;66(5):688. 10.1037/h0037350Suche in Google Scholar

[28] Rubin DB. Bayesian inference for causal effects: The role of randomization. Ann Stat. 1978;6(1):34–58. 10.1214/aos/1176344064Suche in Google Scholar

[29] Hernan MA, Robins JM. Causal inference. Boca Raton, FL: CRC; 2019. Suche in Google Scholar

[30] Dawid AP. Some misleading arguments involving conditional independence. J R Stat Soc Ser B (Methodological). 1979;41(2):249–52. 10.1111/j.2517-6161.1979.tb01079.xSuche in Google Scholar

[31] Hájek J. Comment on a paper by D. Basu. Foundations of Statistical Inference; 1971. p. 236. Suche in Google Scholar

[32] Ding P, Li F. Causal Inference: A Missing Data Perspective. Stat Sci. 2018 May;33(2):214–37. 10.1214/18-STS645Suche in Google Scholar

[33] Chattopadhyay A, Hase CH, Zubizarreta JR. Balancing vs modeling approaches to weighting in practice. Stat Med. 2020;39(24):3227–54. 10.1002/sim.8659Suche in Google Scholar PubMed

[34] Székely GJ, Rizzo ML. Energy statistics: A class of statistics based on distances. J Stat Plan Inference. 2013;143(8):1249–72. 10.1016/j.jspi.2013.03.018Suche in Google Scholar

[35] Székely GJ, Rizzo ML, Bakirov NK. Measuring and testing dependence by correlation of distances. Ann Stat. 2007;35(6):2769–94. 10.1214/009053607000000505Suche in Google Scholar

[36] Mak S, Joseph VR. Support points. Ann Stat. 2018;46(6A):2562–92. 10.1214/17-AOS1629Suche in Google Scholar

[37] Genevay A. Entropy-regularized optimal transport for machine learning. Université Paris Dauphine and Ecole Normale Supérieure; 2019. Suche in Google Scholar

[38] Weed J, Bach F. Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. Bernoulli. 2019;25(4A):2620–48. 10.3150/18-BEJ1065Suche in Google Scholar

[39] Sejdinovic D, Sriperumbudur B, Gretton A, Fukumizu K. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Ann Stat. 2013 Oct;41(5):2263–91. 10.1214/13-AOS1140Suche in Google Scholar

[40] Wendland H. Scattered data approximation. vol. 17. United Kingdom: Cambridge University Press; 2004. 10.1017/CBO9780511617539Suche in Google Scholar

[41] Nocedal J, Wright SJ. Numerical optimization. New York, NY: Springer; 1999. 10.1007/b98874Suche in Google Scholar

[42] Andersen M, Dahl J, Liu Z, Vandenberghe L, Sra S, Nowozin S, et al. Interior-point methods for large-scale cone programming. Optim Machine Learn. 2011;5583. 10.7551/mitpress/8996.003.0005Suche in Google Scholar

[43] Delbos F, Gilbert JC. Global linear convergence of an augmented Lagrangian algorithm for solving convex quadratic optimization problems. J Convex Anal. 2003;12:45–69. Suche in Google Scholar

[44] Murty KG, Yu FT. Linear complementarity, linear and nonlinear programming. vol. 3. Berlin, Germany: Citeseer; 1988. Suche in Google Scholar

[45] Stellato B, Banjac G, Goulart P, Bemporad A, Boyd S. OSQP: An operator splitting solver for quadratic programs. Math Program Comput. 2020;12:637–72. 10.1007/s12532-020-00179-2Suche in Google Scholar

[46] Pfaff B. cccp: Cone Constrained Convex Problems; 2015. R package version 0.2-4. 10.32614/CRAN.package.cccpSuche in Google Scholar

[47] Wright SJ. Primal-dual interior-point methods. United States of America: SIAM; 1997. 10.1137/1.9781611971453Suche in Google Scholar

[48] Gondzio J. Interior point methods 25 years later. Europ J Operat Res. 2012;218(3):587–601. 10.1016/j.ejor.2011.09.017Suche in Google Scholar

[49] Lopez MJ, Gutman R. Estimation of causal effects with multiple treatments: a review and new ideas. Stat Sci. 2017;32(3):432–54. 10.1214/17-STS612Suche in Google Scholar

[50] Cannas M, Arpino B. A comparison of machine learning algorithms and covariate balance measures for propensity score matching and weighting. Biometric J. 2019;61:1049–72. 10.1002/bimj.201800132Suche in Google Scholar PubMed

[51] Greifer N. WeightIt: Weighting for Covariate Balance in Observational Studies; 2020. R package version 0.9.0. Suche in Google Scholar

[52] Crump RK, Hotz VJ, Imbens GW, Mitnik OA. Dealing with limited overlap in estimation of average treatment effects. Biometrika. 2009;96(1):187–99. 10.1093/biomet/asn055Suche in Google Scholar

[53] Rosenbaum PR. Optimal matching of an optimally chosen subset in observational studies. J Comput Graph Stat. 2012;21(1):57–71. 10.1198/jcgs.2011.09219Suche in Google Scholar

[54] Hsu DJ, Feng M, Kothari R, Zhou H, Chen KP, Celi LA. The association between indwelling arterial catheters and mortality in hemodynamically stable patients with respiratory failure: a propensity score analysis. Chest. 2015;148(6):1470–6. 10.1378/chest.15-0516Suche in Google Scholar PubMed PubMed Central

[55] Feng M, McSparron JI, Kien DT, Stone DJ, Roberts DH, Schwartzstein RM, et al. Transthoracic echocardiography and mortality in sepsis: analysis of the MIMIC-III database. Intensive Care Med. 2018;44(6):884–92. 10.1007/s00134-018-5208-7Suche in Google Scholar PubMed

[56] Neto AS, Deliberato RO, Johnson AE, Bos LD, Amorim P, Pereira SM, et al. Mechanical power of ventilation is associated with mortality in critically ill patients: an analysis of patients in two observational cohorts. Intensive Care Med. 2018;44(11):1914–22. 10.1007/s00134-018-5375-6Suche in Google Scholar PubMed

[57] Stekhoven DJ, Bühlmann P. MissForest non-parametric missing value imputation for mixed-type data. Bioinformatics. 2012;28(1):112–8. 10.1093/bioinformatics/btr597Suche in Google Scholar PubMed

[58] Mak S, Joseph VR. Projected support points: a new method for high-dimensional data reduction. 2017, arXiv: http://arXiv.org/abs/arXiv:170806897. Suche in Google Scholar

Received: 2022-04-26

Revised: 2023-06-05

Accepted: 2023-10-03

Published Online: 2024-04-24

This work is licensed under the Creative Commons Attribution 4.0 International License.

Supplementary material

Artikel in diesem Heft

https://doi.org/10.1515/jci-2022-0029

Schlagwörter für diesen Artikel

energy distance; causal inference; covariate balance; observational data; weighting

Creative Commons

BY 4.0

Energy balancing of covariate distributions

Artikel

Abstract

1 Introduction

2 Distributional balance and weighted energy distance

2.1 Setup

2.2 Weighted average estimates and distributional balance

2.3 Weighted energy distance

Proposition 2.1

Theorem 2.2

2.4 Bias and distributional imbalance

2.5 Other measures of distributional imbalance

3 Energy balancing weights

3.1 Definition

3.2 Asymptotic properties

Theorem 3.1

Corollary 3.2

Lemma 3.3

Theorem 3.4

3.3 Optimization

3.4 Controlling weight variability

3.5 Three-way EBWs

4 Extensions and applications of EBWs

4.1 Estimation of the ATT

4.2 Multi-category treatments

4.3 Estimation of ITRs

5 Simulation studies

6 RHC data

6.1 Description of data

6.2 Analysis of RHC data

6.3 RHC data simulation

7 Analysis of the MIMIC-III critical care database

7.1 IAC data

8 Discussion

Acknowledgements

References

Zusatzmaterial

Artikel in diesem Heft

Artikel in diesem Heft

Artikel in diesem Heft