
Evaluating Different Covariate Balancing Methods: A Monte Carlo Simulation

Hideki Fukui
Published/Copyright: July 31, 2023

Abstract

We investigate the effectiveness of five weighting and matching techniques, including propensity score matching (PSM), in improving covariate balance and reducing bias when estimating treatment effects in finite-sample situations through Monte Carlo simulations. King and Nielsen (2019. “Why Propensity Scores Should Not Be Used for Matching.” Political Analysis 27 (4): 435–54) argue that pruning observations based on PSM with 1-to-1 greedy matching can worsen rather than improve covariate balance and can increase the bias in the estimates of treatment effects. In our simulations, we observed this phenomenon not only in PSM with 1-to-1 greedy matching but also in the covariate balancing techniques that King and Nielsen (2019) recommend as better matching methods, i.e. Mahalanobis distance matching (MDM) and coarsened exact matching (CEM). Regardless of the weighting/matching techniques and the data generation processes in this study, our findings indicate that matching and weighting under extreme caliper or cut-point settings do not improve covariate balance. In addition, once a substantially improved covariate balance is achieved in a given sample, the estimated bias tends to worsen slightly as the covariate balance continues to improve. Moreover, our simulation results suggest that OLS with proper covariates reduces selection bias as effectively as other weighting and matching methods. The results suggest that when analyzing observational data, it is important to avoid looking for a one-size-fits-all estimator and to identify the appropriate nonexperimental estimator carefully for the sample by thoroughly investigating the available data’s characteristics.

1 Introduction

This paper examines the effectiveness of five weighting and matching methods[1] in improving covariate balance and correcting for bias in treatment effect estimation using Monte Carlo simulations (hereafter referred to as simulations).

Matching is a nonexperimental method that aims to reduce bias in treatment effect estimates by making the effects of factors other than treatment more homogeneous between the treatment and comparison groups. One of the most commonly used methods is propensity score matching (PSM). Needless to say, randomized experiments have an advantage over nonexperimental methods in balancing the comparison and treatment groups with regard to observed and unobserved confounders (e.g. Arceneaux, Gerber, and Green 2006, 2010). Unfortunately, experimentation is not always feasible due to financial, practical, or ethical constraints. Therefore, researchers have assessed the performance of nonexperimental identification strategies.

Literature regarding this issue can be broadly classified into two groups (see Calónico and Smith 2017): (1) studies examining a substantive question, i.e. the plausibility of a particular identifying assumption, such as conditional independence, in certain substantive contexts with specific sets of available conditioning variables (e.g. Dehejia and Wahba 1999, 2002) and (2) studies examining an applied econometric or statistical question, i.e. the performance of a particular estimator that presumes the same basic identifying assumptions (e.g. conditional independence) in contexts in which the assumptions hold true. In the latter group of studies, researchers examined the asymptotic and finite-sample properties of matching and weighting estimators (e.g. Abadie and Imbens 2006; Busso, DiNardo, and McCrary 2014; Frölich 2004; Heckman, Ichimura, and Todd 1998; Hirano, Imbens, and Ridder 2003; Huber, Lechner, and Wunsch 2013). We aim to help explain the finite-sample properties of matching and weighting estimators.

Frölich (2004), Huber, Lechner, and Wunsch (2013), and Busso, DiNardo, and McCrary (2014) made recent major contributions to studies on the finite-sample properties of matching and weighting estimators. Frölich (2004) investigated the finite-sample properties of several matching and weighting estimators and concluded that the weighting estimator was “very unreliable and has a larger MSE [mean squared error] than pair matching in all simulations.” This finding prompted Busso, DiNardo, and McCrary (2014) to reexamine the finite-sample properties of matching and weighting estimators because Frölich’s (2004) conclusion conflicts with the findings from studies on the large sample properties of matching and weighting estimators. Busso, DiNardo, and McCrary (2014) concluded that weighting is a substantially more effective technique for estimating the average treatment effects than Frölich (2004) suggested and that neither weighting nor matching prevails uniformly across data-generating processes. Huber, Lechner, and Wunsch (2013) also investigated the finite-sample properties of matching and weighting estimators with a more complex and thorough simulation design referred to as the Empirical Monte Carlo Study. This study simulated realistic placebo treatments among the untreated using empirical data. After conducting a thorough set of simulation studies, Huber, Lechner, and Wunsch (2013) concluded that trimming observations with excessive weight is crucial and that certain radius-matching estimators generally perform best. However, they also found that “no estimator is superior in all designs and for all outcomes.”

After considering various possibilities, the majority of researchers suggest that there is not a single, universally applicable estimator because in small samples, the selection of an estimator greatly depends on the specific circumstances. As sample sizes grow, all matching methods move toward comparing exact matches, which asymptotically produce the same results. However, in small samples, a trade-off typically occurs between bias and variance, suggesting that the choice of matching method is crucial because the performances of matching estimators vary depending on the data structure at hand (Caliendo and Kopeinig 2008; Heckman, Ichimura, and Todd 1997; Smith 2000).

Against this background, King and Nielsen (2019), who do not deny the effectiveness of matching methods per se, have attracted attention by harshly criticizing PSM. According to King and Nielsen’s (2019) simulation analysis, PSM can worsen rather than improve covariate balance and can increase the bias in the estimates of treatment effects (“the PSM paradox”). Therefore, they suggest that researchers should use other “better matching methods,” such as Mahalanobis distance matching (MDM) and coarsened exact matching (CEM), to avoid the consequences. On the other hand, Guo, Fraser, and Chen (2020) argue that the results of their simulation analyses do not allow them to conclude that MDM and CEM more effectively reduce bias in estimated treatment effects than PSM. Furthermore, some researchers suggest that in some cases, traditional econometric methods may effectively reduce selection bias without resorting to matching methods (Angrist and Pischke 2009; Smith and Todd 2005).

Interestingly, none of these studies show how the bias in the estimates of treatment effects changes as the covariate balance improves, even though they emphasize the importance of checking covariate balance. King and Nielsen (2019) only show the relationship between the number of observations removed by 1-to-1 greedy matching without replacement and the maximum estimated biases of the treatment effect, as well as the relationship between the number of removed observations and the covariate balance. Furthermore, Guo, Fraser, and Chen (2020) only show the relationship between the number of observations after matching and the mean values of the estimated biases (real values) of the treatment effect.

Lee (2013) is a notable exception, conducting a simulation analysis of the relationship between bias in the estimates of treatment effects and covariate balance. Although Lee’s (2013) analysis was developed in a different context, it is valuable as research examining the effectiveness of post-matching balancing tests. Lee (2013) examined several permutation versions of the balancing tests and argued that the one Dehejia and Wahba (2002) described was less likely to reject the correct propensity score specification falsely, resulting in superior size properties. The results suggest that the balancing test effectively increases the probability of detecting a possible functional misspecification in propensity score calculations. However, Lee (2013) could not identify a connection between the balance or imbalance in covariates and the resulting bias in the estimates of the average treatment effect. Indeed, Lee (2013) found that many of his simulation results showed larger biases in the estimates of average treatment effect when using the propensity score specification that passed the balancing test compared to when no balancing test was conducted.

In addition to the above issue, King and Nielsen (2019) do not analyze the impact of unobserved confounders that applied researchers often worry about in empirical analyses. However, researchers, such as Caliendo, Mahlstedt, and Mitnik (2017) and Groenwold and Klungel (2015), suggest that when observable variables sufficiently correlate with the unobservable variables, the unobservable variables can be indirectly controlled by controlling for the observable variables. Therefore, it is worthwhile to explore whether this suggestion regarding unobservables is valid.

Therefore, we investigate weighting and matching techniques’ effectiveness in improving covariate balance and reducing bias when estimating treatment effects in finite-sample situations through simulation analyses. We examine five covariate balancing methods: PSM, MDM, CEM, inverse probability weighting (IPW), and entropy balancing (EB; Hainmueller 2012; Hainmueller and Xu 2013). We chose this set of estimators for the following four reasons. First, King and Nielsen (2019) express sharp criticism of PSM with 1-to-1 greedy matching. Second, King and Nielsen (2019) argue that PSM’s weakness stems from the specific way propensity scores interact with matching and that other applications of propensity scores, such as IPW, are not necessarily implicated in the weakness although they do not examine alternative uses of propensity scores. Third, King and Nielsen (2019) suggest that MDM and CEM are alternatives to PSM as “better matching approaches.” Fourth, although EB has grown to become one of the most popular techniques for correcting covariate imbalance, at least in political science, King and Nielsen (2019) do not examine it.

Our simulation results suggest that all of the above methods effectively improve covariate balance. At the same time, in our simulations we observed the phenomenon King and Nielsen (2019) note, namely an increase in covariate imbalance driven by removing observations, not just in PSM but also in IPW, MDM, and CEM. Our simulation results suggest that, regardless of the matching or weighting techniques used, implementing covariate balancing to the maximum extent does not always lead to the optimal covariate balance or a lower bias in the estimated treatment effects in our generated data.

Unfortunately, our simulation analysis could not establish the optimal range of covariate balances to minimize the bias in the estimated treatment effects. Additionally, with our simulation results, we could not determine which of the covariate balancing techniques under consideration would be optimal for each data set. The results suggest that while analyzing observational data, we need to investigate the properties of the available data carefully to select the appropriate nonexperimental estimator for the sample, including conventional ones, such as OLS.

We structure this paper as follows. Section 2 provides this study’s background and summarizes King and Nielsen’s (2019) criticism of PSM with 1-to-1 greedy matching. Section 3 reviews the responses from other researchers, such as Guo, Fraser, and Chen (2020). In Section 4, we present simulation analyses and evaluate various matching techniques’ effectiveness. Finally, Section 5 presents the conclusions and limitations of this study and identifies some future research directions.

2 Criticism and Debate Over PSM With 1-to-1 Greedy Matching

2.1 Randomized Controlled Trials, Exact Matching, and Propensity Score Matching

In many scientific fields, randomized controlled trials (RCTs) are regarded as the most reliable method of estimating causal effects and are frequently incorporated into real-world policy-making processes (Banerjee and Duflo 2011; Gneezy and List 2014; Leigh 2018).

However, conducting RCTs on every policy is not feasible due to various factors, such as budgetary constraints and resistance arising from individual or organizational interests. According to Holland (1986), some “treatments” of interest cannot be randomly assigned because doing so would be unethical. Moreover, experiments only offer a limited range of outcome distributions, making it challenging to identify parameters (e.g. the distribution of treatment effects) that depend on the joint distribution. Therefore, we cannot dispense with observational data for policy analysis and evaluation.

Nonetheless, in observational data, the treatment and comparison groups are likely to be affected differently by factors (covariates) other than the treatment, unlike those in experimental data. Consequently, these factors (covariates) may impact the outcome difference between the treatment and comparison groups when one uses unadjusted observational data.

Matching is one technique for dealing with sample selection biases caused by non-randomized data that are inherent in observational data analysis. We create pairs by selecting observation units from the treatment and comparison groups with similar values of covariates other than the treatment. As a result of this matching, the effects of covariates other than the treatment are, on average, similar in the two groups. For example, in exact matching (EM), we create pairs by taking observation units with the same values for all covariates from the treatment and comparison groups, which can reduce the bias caused by the covariate imbalance between the treatment and comparison groups to zero. This allows us to estimate the effect of the treatment on the outcome more accurately. The important point is that matching adjusts only for bias due to observed covariates and does not correct for bias derived from unobserved covariates.

However, simple matching, such as EM, becomes infeasible as the covariates increase. Indeed, EM is infeasible with even one continuous covariate. PSM can effectively deal with this problem, referred to as the curse of dimensionality, which occurs as more covariates are taken into account in the covariate matrix X, causing a sharp decline in the probability of achieving exact matches for all units. In other words, inexact matching, rather than EM, is required when there are many discrete covariates and/or at least one continuous covariate. To perform inexact matching, some distance metric must be used to measure the similarity of the treated and untreated units concerning the values of the observable variables. Propensity scores and Mahalanobis distance are two examples of such distance metrics. In PSM, we estimate and quantify the “propensity” of each observation unit to receive the treatment by parametric estimators of binary response models, such as the probit and logit[2] regression, using observed covariates. The predicted value from the regression is the propensity score, i.e. the conditional probability that observation unit i (i = 1, …, N) is assigned a treatment (W_i = 1; W_i = 0 for non-treatment) given observed covariates x_i (x_i ∈ X_i). Put simply, the propensity score is a single value for each unit of observation that summarizes the effect of multiple covariates on treatment assignment.

e(x_i) = \mathrm{pr}(W_i = 1 \mid X_i = x_i)
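To make this step concrete, the following minimal Python sketch (our own illustration, assuming the statsmodels library is available; the covariate matrix X and treatment indicator W are hypothetical stand-ins) fits a logit model and recovers one propensity score per unit:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))        # hypothetical observed covariates x_i
W = rng.integers(0, 2, size=200)     # hypothetical treatment indicator W_i

# Fit a logit model of treatment on the covariates; the fitted
# probability pr(W_i = 1 | X_i = x_i) is the propensity score e(x_i).
logit = sm.Logit(W, sm.add_constant(X)).fit(disp=0)
e_hat = logit.predict(sm.add_constant(X))   # one score per unit
```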

There are several PSM methods. However, because King and Nielsen (2019) criticize PSM with 1-to-1 greedy matching, we limit our discussion to this type of PSM. In nearest-neighbor caliper matching, we create pairs for a given observation unit i (i ∈ I_1) in the treatment group by taking the observation unit j (j ∈ I_0) in the comparison group, given that the difference between the propensity scores of i and j is less than or equal to a predetermined acceptable value (caliper) ε. Suppose C is the set of observation units of the comparison group; P_i and P_j are the propensity scores of the observation units in the treatment and comparison groups, respectively; and C(i) is the set of observation units j in the comparison group matched to the observation unit i of the treatment group, whose propensity score is P_i. For observation unit i in the treatment group, observation unit j in the comparison group is selected if the absolute difference in propensity scores between i and j satisfies the following conditions:

|P_i - P_j| < \varepsilon, \quad j \in I_0
C(i) = \min_j |P_i - P_j|, \quad j \in I_0.
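A minimal sketch of this greedy algorithm follows (a hypothetical helper of our own, not the authors' code): pairing proceeds without replacement, and treated units with no comparison unit inside the caliper remain unmatched.

```python
import numpy as np

def greedy_caliper_match(ps_treated, ps_control, caliper):
    """1-to-1 greedy nearest-neighbor matching without replacement:
    for each treated unit i, pick the still-available comparison unit j
    minimizing |P_i - P_j|, subject to |P_i - P_j| < caliper."""
    available = set(range(len(ps_control)))
    pairs = []
    for i in range(len(ps_treated)):
        if not available:
            break
        j = min(available, key=lambda k: abs(ps_treated[i] - ps_control[k]))
        if abs(ps_treated[i] - ps_control[j]) < caliper:
            pairs.append((i, j))        # j cannot be matched again
            available.remove(j)
    return pairs
```

Because the matching is greedy, the order in which treated units are processed matters; in the simulations reported below, 1-to-1 matching without replacement processes treated units in descending order of the propensity score, while the sketch keeps the input order for brevity.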

As a result of such matching, the overall effect of covariates is approximately homogenized between the treatment and matched comparison groups, approximating randomization, although individual covariates may differ considerably (for details and variations of PSM methods, see Rosenbaum and Rubin (1983), Dehejia and Wahba (1999, 2002), Caliendo and Kopeinig (2008), Rosenbaum (2010), Guo and Fraser (2014), and Imbens and Rubin (2015), among others). Although various matching algorithms have been developed, PSM is one of the most extensively used. As of September 4, 2021, 28,330 academic papers referenced PSM in the Scopus literature search. In contrast, only 426 and 428 academic papers mentioned MDM and CEM, respectively.

2.2 Weaknesses of PSM With 1-to-1 Greedy Matching

King and Nielsen (2019) contend that PSM with 1-to-1 greedy matching has serious problems. According to them, the weakness of PSM is that it is modeled after a fully randomized controlled trial, in which treatment assignment is determined solely by the probability that each observation unit receives the treatment, independent of the covariates. In other words, when treatment is assigned via randomization, there is no relationship between covariates and treatment status in the population. As a result, the observation unit pairs are not necessarily similar in terms of their covariates even though the difference in propensity scores of the paired observation units is less than or equal to a specific predetermined acceptance value and is minimal. PSM effectively addresses the curse of dimensionality by allowing researchers to match observation units that are different in terms of individual covariates. However, because of this feature, PSM with 1-to-1 greedy matching worsens the covariate balance at a certain point and increases bias in estimating treatment effects.

2.3 Advantages of MDM and CEM

In a fully blocked RCT, treatments are randomly assigned after observation units with similar attributes are grouped or stratified based on the value of covariates that may affect the outcome. Thus, if ideally implemented, a fully blocked RCT will balance the covariates in the treatment and comparison groups. In other words, a fully blocked randomized experimental design will block the treated and comparison groups on the observed covariates exactly at the start. Therefore, the imbalance in these experiments is always zero by design without having to remove any observation units from the sample. In contrast, in a completely randomized experimental design, treatment assignment is random in regard to the observed covariates because it only depends on the scalar probability of treatment for all units. Randomness does not mean there is zero imbalance in any one sample (except by rare coincidence or in asymptotic samples or when modified to create randomized block designs). PSM approximates a fully randomized experiment whereas CEM and MDM approximate a fully blocked experiment. King and Nielsen (2019) argue that this means that CEM and MDM can achieve lower degrees of imbalance, model dependence, and bias than PSM (see also Iacus, King, and Porro 2011).

MDM calculates distance d(i, j), i.e. Mahalanobis distance, between any observation unit in the treatment group and all observation units in the comparison group, which is defined as follows:

d(i, j) = (u - v)^T \Sigma^{-1} (u - v),

where u and v are the values of the covariates used for matching observation unit i in the treatment group and observation unit j in the comparison group in the case of single-nearest-neighbor matching, and Σ is the sample variance-covariance matrix of the covariates used for matching. The observation unit j of the comparison group with the smallest distance d(i, j) is selected and matched to the observation unit i of the treatment group. The Mahalanobis distance increases with the degree of covariate difference. Therefore, each pair will have similar covariate values when comparison units that are close to the treated units in Mahalanobis distance are selected. In contrast, PSM matches units with similar propensity scores. Because the covariate distribution is reduced to a single dimension by propensity scores, the unit pairs in PSM are not necessarily similar regarding their covariates.
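For illustration, a short numpy sketch of this distance calculation (hypothetical data; the displayed formula omits the square root sometimes applied, which does not change which comparison unit is nearest):

```python
import numpy as np

def mahalanobis_sq(u, v, cov):
    """Mahalanobis distance d(i, j) = (u - v)^T Sigma^{-1} (u - v),
    as displayed above."""
    diff = u - v
    return float(diff @ np.linalg.solve(cov, diff))

rng = np.random.default_rng(0)
X_treated = rng.normal(size=(5, 3))    # hypothetical treated covariates
X_control = rng.normal(size=(50, 3))   # hypothetical comparison covariates

# Sigma: sample variance-covariance matrix of the matching covariates.
cov = np.cov(np.vstack([X_treated, X_control]), rowvar=False)

# Single-nearest-neighbor MDM for the first treated unit.
d = [mahalanobis_sq(X_treated[0], v, cov) for v in X_control]
best_j = int(np.argmin(d))             # comparison unit with smallest d(i, j)
```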

As its name implies, CEM is a matching method in the EM family. EM can effectively remove errors introduced by covariate imbalance between treatment and comparison groups with a small number of discrete covariates. However, EM suffers from the curse of dimensionality and tends to reduce the number of observation units to be matched because real-world data usually contain multiple covariates and continuous variables. CEM attempts to overcome this curse of dimensionality differently than PSM. In CEM’s algorithm, each covariate is grouped by a coarse index, which matches the observation units in the treatment and comparison groups. The original covariates used to create the coarse indices are retained in the matched data and used in subsequent analyses (e.g. a second-step analysis that estimates a parametric model using weights provided by CEM; Blackwell et al. 2009; Iacus, King, and Porro 2011, 2012).

CEM’s algorithm allows for matching on substantively meaningful coarse indices, such as levels of the school system. For example, the covariate of education level is represented by a continuous variable of years of education. However, in the US institutional context, years of education can also be grouped into five coarse indices: 0–6 years for grade school students, 7–8 years for middle school students, 9–12 years for high school students, 13–16 years for college students, and 17 years or more for graduate school students.
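As a sketch of the coarsening step only (our own illustration with hypothetical data, using pandas; the cut points mirror the five US schooling strata above):

```python
import numpy as np
import pandas as pd

years = pd.Series([5, 8, 11, 14, 18, 12, 16])   # hypothetical years of education

# Coarsen the continuous covariate into the five strata; CEM then matches
# exactly on this coarse index, while the original 'years' values are
# retained for the second-step analysis.
bins = [0, 6, 8, 12, 16, np.inf]
labels = ["grade school", "middle school", "high school",
          "college", "graduate school"]
coarse = pd.cut(years, bins=bins, labels=labels, include_lowest=True)
```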

3 Review of Literature

3.1 Responses From other Researchers

The results of the simulation analyses by King and Nielsen (2019) suggest that MDM and CEM, which approximate a fully blocked randomized experiment, achieve better covariate balance than PSM, which approximates a fully randomized controlled experiment. King and Nielsen (2019) also suggest that the causal effect estimates from MDM and CEM are less biased than the estimates from PSM.

Inspired by the preprint version of their study, Ripollone et al. (2018) investigated whether King and Nielsen’s (2019) findings can be replicated when PSM is applied to real observational data rather than simulation data. They used US insurance claims data from the Pharmaceutical Assistance Contract for the Elderly and the Medicaid Analytic eXtract databases to test whether further pruning of matched sets would lead to a rise in covariate imbalance. Their analysis results suggest that covariate imbalance worsened when matched sets were gradually pruned in descending order of propensity score distance calipers (0.01–0.05). However, they also found that the imbalance in the matched data set improved after it was applied with commonly used propensity score calipers to define an acceptable match. Indeed, when they used standard propensity score calipers (e.g. 0.025), the data pruning stopped in the lowest region of the imbalance trend. Ripollone et al. (2018) concluded that when standard propensity score calipers are used on the aforementioned observational data, PSM does not increase covariate imbalance.

Guo, Fraser, and Chen (2020) agree with King and Nielsen (2019) that PSM aims to approximate a fully randomized controlled experiment and that, in many cases, the matching methods that approximate fully blocked randomized experiments are more desirable for studies using observational data. However, in empirical research, priority is typically given to ensuring that the matched sample remains representative of the research population of interest. In this context, losing a substantive proportion of observations can be problematic. Guo, Fraser, and Chen (2020) argue that PSM is not necessarily inferior to MDM and CEM in maintaining observation units in the sample.

Furthermore, Guo, Fraser, and Chen (2020) contend that blocking can lead to incomplete matching. Indeed, appropriate blocking measures are necessary to implement a matching design approximating a fully blocked randomized experiment. When appropriate blocking measures are not evident, blocked matching can result in imprecise estimates. After conducting simulation analysis with two conditions of data generation, i.e. selection on the observables and selection on the unobservables, Guo, Fraser, and Chen (2020) state that they cannot conclude that the alternative methods King and Nielsen (2019) recommend more effectively reduce finite-sample bias than PSM.

3.2 Critical Review

There are several concerns regarding the analyses by King and Nielsen (2019) and their critics. King and Nielsen (2019) draw 100 comparison units and 100 treated units from uniform distributions with limits of 0 and 5 (Uniform(0, 5)) and 1 and 6 (Uniform(1, 6)), respectively, for each of the two covariates, X_{i1} and X_{i2}, in the equations below. The observations within the overlapping [1, 5] × [1, 5] square are considered the data from a completely random experiment. In contrast, the observations outside this square introduce an imbalance in the data. The following equation generates the outcome variable Y_i:

Y_i = 2 T_i + X_{i1} + X_{i2} + \varepsilon_i,

where ε_i ∼ N(0, 1), i.e. ε_i follows a standard normal distribution. T_i is the treatment variable.
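A sketch of this data-generating process (our own numpy illustration; the seed and variable names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# 100 comparison units from Uniform(0, 5) and 100 treated units from
# Uniform(1, 6) on each of the two covariates; the overlapping
# [1, 5] x [1, 5] square mimics data from a completely random experiment.
X_control = rng.uniform(0, 5, size=(n, 2))
X_treated = rng.uniform(1, 6, size=(n, 2))
X = np.vstack([X_control, X_treated])
T = np.repeat([0, 1], n)

# Outcome with a true treatment effect of 2 and N(0, 1) noise.
Y = 2 * T + X[:, 0] + X[:, 1] + rng.standard_normal(2 * n)
```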

King and Nielsen (2019) generated 100 simulated data sets. They estimated propensity scores using a logistic regression of T_i on X_{i1} and X_{i2} and performed 1-to-1 greedy matchings without replacement. They also performed 1-to-1 matching using the Mahalanobis distance. For each data set and each of the 85 calipers, they then estimated 512 models, formed from all possible combinations of X_{i1}, X_{i2}, their three quadratic terms, and their four cubic terms. Thus, the simulation is performed in a setting with no unobserved confounders (the assumption of “unconfoundedness” holds), but the correct functional form (specification of the estimating equation) is not known.

King and Nielsen (2019) aim to illustrate how the combination of model dependence, analyst discretion, and PSM can cause biases in this somewhat extreme setup. In observational data analysis, discovering how the data were generated is the goal of the research process; using one model as if the data-generating process were known makes little sense when our understanding of it is limited. The analyst therefore faces model dependence, whereby two or more models that fit the data essentially equally well generate different causal estimates. The analyst chooses one or more results to publish in response to this ambiguity, but this qualitative decision, made without any constraint from the range of possible estimates, is probably biased. King and Nielsen (2019) intended to demonstrate how the bias can be worsened in this situation by deleting the worst-matched observation pairs according to the absolute propensity score distance.

King and Nielsen (2019) calculated the variances of the estimated coefficients of T (true value is 2) obtained from this simulation for each number of observations pruned by single-nearest-neighbor matching with a caliper and plotted them in a figure. The larger the variance, the greater the variation in the estimated coefficient of T, meaning that the estimated coefficient depends heavily on the specification of the estimating equation for the outcome variable Y_i. King and Nielsen’s (2019) analysis shows that in MDM matching, the variance of the estimated coefficient of T decreases as the number of removed observations increases whereas in PSM matching, the variance of the estimated coefficient of T continues to increase as the number of removed observations increases beyond a certain point.

To illustrate the model dependence and bias in the causal effect estimates that can result from the analyst selecting the specification of the estimating equation arbitrarily, King and Nielsen (2019) also present a figure plotting the estimated coefficient of T for each number of removed observations. Their assumption in this figure is that researchers choose the largest value among the estimated coefficients from the various models, hoping the causal effect is large. Again, the simulation results show that in MDM matching, the bias of the estimated coefficient of T decreases as the number of removed observations increases whereas in PSM matching, the bias of the estimated coefficient of T continues to increase as the number of removed observations increases beyond a certain point.

King and Nielsen (2019) finally attempted to replicate their results on the real observational data used in published papers. The results again show that the covariate balance worsens as the number of removed observations increases in PSM matching. In contrast, the covariate balance improves as the number of removed observations increases in MDM and CEM.

King and Nielsen’s (2019) findings nevertheless raise three concerns. The first concern relates to the assumption of unconfoundedness. Indeed, in many studies, researchers using observational data make the unconfoundedness assumption, that is, the assumption that potential outcomes are conditionally independent of treatment assignment. However, in reality, it is difficult to assume that unobserved confounding is zero. Heckman et al. (1996, 1998) decomposed the conventional measure of selection bias in observational studies into three components: those corresponding to (1) failure of the common support condition (bias arising from regions of nonoverlapping support), (2) failure to weight the data appropriately (bias due to different distributions of observed characteristics), and (3) selection bias precisely defined (residual selection on unobserved characteristics). Matching and weighting estimators can help eliminate the first and second components, but they do not necessarily eliminate the third. In light of the above, it would be more practical and interesting to investigate how and which matching techniques could or could not help reduce the estimated bias of causal effects in a circumstance in which the assumption of unconfoundedness is violated (i.e. in a setting where unobserved confounding factors exist).

The second concern relates to the presentation of the bias in the estimated coefficients of T. In practice, it is highly uncommon, although it may be possible, for estimation to be performed using all possible combinations of quadratic and cubic terms of the covariates. Instead, the estimation will often be conducted after initial selection of a relatively limited number of specifications of the estimating equations following theories and previous studies. As a result, the resulting estimated coefficients are not necessarily the largest among all possible specifications of the estimating equations. Given this point, it would probably be more realistic and practical to plot the average estimated coefficients rather than to plot the maximum estimated coefficients obtained from various models for each number of removed observations.

The third concern relates to their methodological approach. King and Nielsen (2019) investigate whether the covariate balance and bias in causal effect estimates improve or worsen as more observations are removed through matching. However, in empirical research, it is usually a top priority to make sure that the matched sample continues to reflect the research population of interest. The loss of a sizeable number of observations can be problematic in this situation. Therefore, researchers who apply matching to observational data often attempt to improve the covariate balance while maintaining the number of observations as much as feasible. They then seek to reduce the bias in the causal effect estimates due to the improved covariate balance. From the perspective of applied research, examining how the bias in the causal effect estimates changes as the covariate balance varies would be more beneficial.

Regarding the first concern (i.e. the assumption of unconfoundedness), Guo, Fraser, and Chen (2020) conduct simulations in a more realistic setting. The settings Guo, Fraser, and Chen (2020) use are particularly beneficial because they perform simulations with and without observed/unobserved confounding. (We describe the detailed simulation setup of Guo, Fraser, and Chen (2020) in the next section.)

For the second concern (i.e. the presentation of the bias in the estimated coefficients of the treatment variable [T]), we also have concerns about the method Guo, Fraser, and Chen (2020) use. Indeed, the mean bias in Guo, Fraser, and Chen (2020) is the difference in real values between the true value of the treatment variable’s coefficient and the estimated coefficients. The upward and downward biases cancel each other out in this calculation; therefore, the true magnitude of the bias is unknown. The mean bias should be calculated using the absolute difference between the true value of the coefficient of the treatment variable and the estimated coefficients to determine the actual size of the bias more accurately. Ripollone et al. (2018) use real observational data for their simulation. Therefore, they could not examine whether PSM would increase bias in estimating treatment effects as PSM successively pruned observations because the true value of the treatment variable’s coefficient was unknown to them.

Finally, for the third concern, Guo, Fraser, and Chen (2020) have limitations similar to those of King and Nielsen (2019). Guo, Fraser, and Chen (2020) only present the average number of remaining observations after the simulations, the average estimated coefficients of the treatment variable, and their biases. In addition, they perform matching with only one caliper width. Guo, Fraser, and Chen (2020) did not examine the relationship between changes in covariate balance and changes in biases in the estimates of causal effects.

4 Method: Monte Carlo Simulations

We address the three concerns raised above and evaluate the performance of the weighting and matching methods described below. Regarding the first concern (i.e. the assumption of unconfoundedness), we model our simulations on the setting Guo, Fraser, and Chen (2020) use because Guo, Fraser, and Chen (2020) perform their simulations in more realistic settings than King and Nielsen (2019), including simulations with and without observed/unobserved confounding. However, following King and Nielsen’s (2019) setting, we run our simulations repeatedly with different widths of a given type of caliper for PSM and MDM and different cut points for CEM. Regarding the second concern (i.e. the presentation of the bias in the estimated coefficients of the treatment variable [T]), we calculate the mean bias as the absolute value of the difference between the true value of the treatment variable’s coefficient and the estimated coefficients when investigating the bias in the estimated coefficients of the treatment variable. Finally, regarding the third concern, we analyze the relationship between changes in the covariate balance and changes in the estimated bias by plotting them on figures.

The simulation settings of Guo, Fraser, and Chen (2020) follow those of Guo and Fraser (2014). As a result, we replicate Guo and Fraser’s (2014) data generation settings along with several extensions, such as the inclusion of caliper variation, additional covariate balancing techniques (EB, IPW, MDM, and CEM), and analyses of the relationship between the covariate balance and the bias of the estimates.

EB involves recalculating the sample units’ weight function to account for covariate balance directly. The researcher first sets various balancing requirements to ensure that the covariate distributions of the treatment and comparison groups in the processed data match precisely on all preset moments. The level of covariate balance that the researcher wants to achieve is predetermined. Then, to retain valuable information in the preprocessed data, EB seeks out the set of weights that fulfills the balance criteria while remaining as close as possible (in an entropy sense) to a set of uniform base weights. Finally, the unit weights are recalculated to adjust for systematic and random inequalities in representation with respect to the first, second, and possibly higher moments of the covariate distributions. According to Hainmueller (2012), the advantage of the EB procedure is that it directly incorporates the auxiliary information about the known sample moments and modifies the weights such that the user achieves exact covariate balance for all moments included in the reweighting procedure (Hainmueller 2012; Hainmueller and Xu 2013).
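As an illustration of this reweighting idea (our own sketch, not Hainmueller's ebalance implementation), the code below solves the convex dual of the entropy balancing problem for first-moment balance with uniform base weights; appending squared covariates to the moment matrix would add second-moment constraints:

```python
import numpy as np
from scipy.optimize import minimize

def entropy_balance(X_treated, X_control):
    """Entropy balancing for the ATET: find comparison-group weights,
    as close as possible to uniform in the entropy sense, whose
    weighted covariate means equal the treated-group means
    (first moments only, for brevity)."""
    m = X_treated.mean(axis=0)         # target (treated) moments
    C = X_control - m                  # centered comparison moments

    # Convex log-sum-exp dual; its gradient is the weighted imbalance,
    # so the minimizer yields exactly balanced first moments.
    def dual(lam):
        return np.log(np.exp(C @ lam).sum())

    lam = minimize(dual, np.zeros(C.shape[1]), method="BFGS").x
    w = np.exp(C @ lam)
    return w / w.sum()                 # weights sum to one

# Hypothetical check: weighted comparison means reproduce treated means.
rng = np.random.default_rng(0)
Xt = rng.normal(1.0, 1.0, size=(100, 2))
Xc = rng.normal(0.0, 1.0, size=(400, 2))
w = entropy_balance(Xt, Xc)
print(np.abs(w @ Xc - Xt.mean(axis=0)).max())   # imbalance after weighting
```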

When estimating the average treatment effect for the treated (ATET), IPW balances observed characteristics in the treatment and comparison groups by weighting each unit in the comparison group by the inverse probability of receiving the treatment. The following steps are involved in IPW. The probability, or propensity, of being treated is calculated given each unit’s characteristics using methods such as logistic regression. The inverse of the propensity score is then used to compute weights. These weights produce data with observed confounders distributed more evenly between the treatment and comparison groups (e.g. Guo and Fraser 2014). However, IPW may perform poorly when some units have extreme propensity scores approaching 1 or 0, i.e. when some units almost always or almost never receive treatment. In these situations, IPW assigns very large weights to the units with extreme propensity scores, producing biased estimates and large variances (e.g. Stuart 2010). To address these problems, we used the stabilized IPWs (normalized to a mean of 1) that Cole and Hernán (2008) propose (see also Busso, DiNardo, and McCrary 2014).
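A minimal sketch of ATET weighting with within-group normalization follows (our own simplified stand-in for the stabilized weights of Cole and Hernán (2008); function and variable names are assumptions):

```python
import numpy as np
import statsmodels.api as sm

def atet_ipw_weights(X, W):
    """ATET weights: treated units get weight 1; comparison units are
    weighted by the odds e(x) / (1 - e(x)) of receiving treatment and
    then normalized to a mean of one, so no single unit dominates."""
    exog = sm.add_constant(X)
    e = sm.Logit(W, exog).fit(disp=0).predict(exog)
    w = np.where(W == 1, 1.0, e / (1 - e))
    w[W == 0] /= w[W == 0].mean()      # normalize comparison-group weights
    return w

# ATET as a weighted difference in means (Y, W, X hypothetical):
# atet = Y[W == 1].mean() - np.average(Y[W == 0], weights=w[W == 0])
```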

These methods, i.e. EB and IPW, are similar to the weighting procedures frequently employed to analyze data derived from complex sampling designs. Unlike 1-to-1 matching, weighting does not discard observations; social scientists have therefore used weighting methods extensively.

Figures 1–3 show the data-generation settings. Figure 1 presents Setting 1 with no unobserved confounders. First, three covariates (x1, x2, x3) affect the outcome variable Y. Next, the covariate Z affects the treatment variable W. The covariate x3 also affects the treatment variable W. The following equations generate the outcome variable Y, the covariates (x1, x2, x3, Z), and the treatment variable W:

Y = 100 + 0.5 x_1 + 0.2 x_2 - 0.05 x_3 + 0.5 W + u
W^* = 0.5 Z + 0.1 x_3 + v,

where W* is the latent continuous variable, Z is the observed exogenous variable affecting treatment assignment, and v is the unobserved factor, or error term, affecting the treatment assignment, which follows a standard normal distribution N(0, 1). The treatment variable W is set to 1 if W* in the treatment assignment equation is greater than its median and to 0 if it is less than the median. The variables x1, x2, x3, Z, and u are random and follow a normal distribution with mean vector (3 2 10 5 0), standard deviation vector (0.5 0.6 9.5 2 1), and the following correlation matrix.

r(x_1, x_2, x_3, Z, u) =
\begin{bmatrix}
1 & 0.2 & 0.3 & 0 & 0 \\
  & 1   & 0   & 0 & 0 \\
  &     & 1   & 0 & 0 \\
  &     &     & 1 & 0 \\
  &     &     &   & 1
\end{bmatrix}

In Setting 1, the correlations between Z and u and between u and v are 0. Therefore, Setting 1 includes no latent confounders and no selection bias.
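A sketch of one Setting 1 draw (our own numpy illustration; the covariance matrix is built from the stated standard deviations and the correlation matrix as reconstructed above):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2500

# (x1, x2, x3, Z, u): stated means, standard deviations, and correlations.
mean = np.array([3.0, 2.0, 10.0, 5.0, 0.0])
sd = np.array([0.5, 0.6, 9.5, 2.0, 1.0])
corr = np.array([[1.0, 0.2, 0.3, 0.0, 0.0],
                 [0.2, 1.0, 0.0, 0.0, 0.0],
                 [0.3, 0.0, 1.0, 0.0, 0.0],
                 [0.0, 0.0, 0.0, 1.0, 0.0],
                 [0.0, 0.0, 0.0, 0.0, 1.0]])
cov = corr * np.outer(sd, sd)
x1, x2, x3, Z, u = rng.multivariate_normal(mean, cov, size=n).T

# Treatment: W = 1 when the latent index W* exceeds its median,
# giving 1250 treated and 1250 comparison units.
v = rng.standard_normal(n)
W_star = 0.5 * Z + 0.1 * x3 + v
W = (W_star > np.median(W_star)).astype(int)

Y = 100 + 0.5 * x1 + 0.2 * x2 - 0.05 * x3 + 0.5 * W + u
```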

Figure 1: Setting 1: No unobserved confounders and no selection bias.

Figure 2: Setting 2: Selection bias due to observable confounders.

Figure 3: Setting 3: Selection bias due to unobservable confounders.

In Setting 2, depicted in Figure 2, the correlation between u and v remains 0, but the correlation between Z and u is 0.4. In this setting, due to the correlation between Z and u, stochastic dependence between W and u occurs, leading to the problem of selection bias. Because Z is an observable confounder, Heckman and Robb (1985, 1986) call the dependence between W and u “selection on observables.”

r(x_1, x_2, x_3, Z, u) =
\begin{bmatrix}
1 & 0.2 & 0.3 & 0 & 0 \\
  & 1   & 0   & 0 & 0 \\
  &     & 1   & 0 & 0 \\
  &     &     & 1 & 0.4 \\
  &     &     &   & 1
\end{bmatrix}

Figure 3 shows Setting 3 with an unobserved confounder. In Setting 3, the following equation generates data:

Y = 100 + 0.5 x_1 + 0.2 x_2 - 0.05 x_3 + 0.5 W + u
W^* = 0.5 Z + 0.1 x_3 + v
v = \delta + 0.15 \varepsilon,

where the variables x1, x2, x3, Z, u, and ε are random and follow a normal distribution with mean vector (3 2 10 5 0 0), standard deviation vector (0.5 0.6 9.5 2 1 1), and the following correlation matrix.

r(x_1, x_2, x_3, Z, u, \varepsilon) =
\begin{bmatrix}
1 & 0.2 & 0.3 & 0 & 0 & 0 \\
  & 1   & 0   & 0 & 0 & 0 \\
  &     & 1   & 0 & 0 & 0 \\
  &     &     & 1 & 0 & 0 \\
  &     &     &   & 1 & 0.7 \\
  &     &     &   &   & 1
\end{bmatrix}

The treatment variable W is set to 1 if W* in the treatment assignment equation is greater than its median and to 0 if it is less than the median, as in Setting 1. δ is a random variable, which follows a standard normal distribution N(0, 1). In Setting 3, the correlation between Z and u is 0, but the correlation between u and v is 0.1. Because of the correlation between u and v, stochastic dependence between W and u again occurs, leading to the problem of selection bias. Because v is an unobservable confounder, Setting 3 is selection on the unobservables.

In Setting 2, selection is on observables. Therefore, conditioning on Z controls the dependence between W and u and mitigates the selection bias. However, in Setting 3, it is impossible to condition on v and mitigate the selection bias because selection is on unobservables.

We ran Monte Carlo simulations using the following procedure with the three settings mentioned above.

  1. Generate a random sample of 2500 observations (1250 for the treatment and comparison groups).

  2. Check for covariate balance between the treatment and comparison groups before matching using the mean absolute standardized difference (MASD) (we explain the formula in the next section).

  3. Generate propensity scores by logit regression using all covariates except Z (x1, x2, x3): Bhattacharya and Vogt (2007) and Wooldridge (2016) show that when the treatment variable W is correlated with u, as Figure 2 shows, including the instrumental variable Z as a predictor in the propensity score model yields greater bias than using the simple difference in means between treatment and comparison groups. Bhattacharya and Vogt (2007) also find that as the instrument’s strength increases, the propensity score method becomes more inconsistent than the naive estimator (i.e. the simple difference in means between treatment and comparison groups). However, for comparison, we also generate propensity scores by using all covariates including Z (x1, x2, x3, Z) and compare the bias in the ATET from the propensity score model with and without Z.

  4. Estimate the coefficients of the treatment variable W using a simple difference in means without any controls, IPW/PSM estimates, and OLS estimates with all covariates (x1, x2, x3, Z) for samples 4.1–4.6 below, generated in Settings 1–3. The last two-step procedure, which combines weighting and matching with parametric regression, is called “doubly robust estimation” (a minimal sketch of this second step appears after this list). The two-step procedure is doubly robust because if either the weighting/matching or the parametric model is misspecified but one of them is correctly specified, causal estimates will still be consistent. A simple difference in means does not have this double robustness (Bang and Robins 2005; Ho et al. 2007).

    \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3 + \hat{\beta}_4 Z + \hat{\tau} W.
    1. An unmatched and unweighted sample.

    2. A sample weighted using EB.

    3. A sample weighted using the stabilized IPWs (which were normalized to a mean of 1), which Cole and Hernán (2008) propose (see also Busso, DiNardo, and McCrary 2014).

    4. A sample matched using PSM, i.e. 1-to-1 nearest neighbor matching with and without caliper (without replacement in descending order) and 1-to-5 nearest neighbor matching (with replacement) with and without caliper using propensity scores, which were generated by the propensity score model with and without Z. Calipers were varied by multiplying the standard deviation of the propensity score by five multipliers (0.05, 0.1, 0.15, 0.2, 0.25). The calipers were selected to achieve the MASD in the matched units below approximately 10 % (when including the instrumental variable Z as a predictor in the propensity score model) because MDM achieved less than 10 % MASD even when using wider calipers (see below for MDM).

    5. A sample matched using 1-to-1 MDM. We used five calipers (0.3, 0.975, 1.65, 2.325, 3), which were different from those used in 4.4 because matching was not achieved in some simulation runs with the same caliper settings as in 4.4. MDM with the five calipers achieved less than 10 % MASD in the matched units. We also tried calipers wider than three, but the resulting MASDs were unaffected.

    6. A sample matched and weighted using CEM. We used five sets of cut points for grouping (21, 22, 23, 24, 25) for all covariates, which we selected to achieve the MASD in the matched units below 10 %. In addition, we used 13 cut points for grouping, which we selected using the automatic binning algorithm implemented by the CEM command that Blackwell et al. (2009) provide. We performed matching first by using maximal information based on an automatic algorithm implemented with the CEM command, which resulted in strata including different numbers of treated and comparison units (j-to-k match). According to Blackwell et al. (2009), the CEM command calculates weights to compensate for the differential strata sizes. We used the CEM weights in subsequent simulations. Second, we obtained a sample including the same number of treatment and comparison units. We generated the second sample by removing units from each stratum by random matching until the strata contained the same number of treated and comparison units (k-to-k match), which is also based on an automatic algorithm implemented using the CEM command.

  5. Perform repeated runs of Steps 1–4 above. The number of runs is 10,000 for 4.1–4.3 and 50,000 for 4.4–4.6 (10,000 runs for each of the five calipers or sets of cut points).

  6. After matching, plot the relationship between the MASD for all covariates and the number of removed observations in samples.

  7. Examine the relationship between the mean absolute error (MAE: here, the mean absolute difference between the true value of 0.5 and the estimated coefficients) and the root mean square error (RMSE) of the estimated coefficient of the treatment variable W and the MASD for all covariates.
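Below is a minimal sketch of the two-step (doubly robust) second stage referenced in Step 4: a weighted regression of the outcome on all covariates and W, using whatever weights the balancing stage produced (matched-only samples correspond to weights of 0 and 1). Function and variable names are our own:

```python
import numpy as np
import statsmodels.api as sm

def doubly_robust_atet(Y, W, x1, x2, x3, Z, weights):
    """Second step of the two-step procedure: weighted least squares of
    Y on (x1, x2, x3, Z, W); the coefficient on W is the ATET estimate
    tau-hat, compared against the true value of 0.5."""
    exog = sm.add_constant(np.column_stack([x1, x2, x3, Z, W]))
    fit = sm.WLS(Y, exog, weights=weights).fit()
    return fit.params[-1]
```

Collecting tau-hat across runs yields the MAE and RMSE examined in Step 7.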

4.1 Relationship Between the Covariate Balance and the Number of Removed Observations

In the case of continuous covariates, the standardized difference is the difference in means between two groups divided by the estimated standard deviation of the same covariate (e.g. Austin 2009). Therefore, standardized differences are comparable for covariates measured on different scales. The following equation defines the absolute standardized difference, d, in the case of continuous covariates:

d_{\mathrm{before}} = \frac{|\bar{x}_T - \bar{x}_C|}{\sqrt{\frac{s_T^2 + s_C^2}{2}}}, \qquad d_{\mathrm{after}} = \frac{|\bar{x}_{TA} - \bar{x}_{CA}|}{\sqrt{\frac{s_{TA}^2 + s_{CA}^2}{2}}},

where \bar{x}_T and \bar{x}_C are the sample means of the covariate X for the treatment (T) and comparison (C) groups before adjustment (weighting or matching), \bar{x}_{TA} and \bar{x}_{CA} are the sample means of the covariate X for the treatment and comparison groups after adjustment, and s_T^2, s_C^2, s_{TA}^2, and s_{CA}^2 are the corresponding sample variances of the covariate X for both groups.

We derive the absolute standardized difference using the following formula when the covariate is a binary variable:

d_{\mathrm{before}} = \frac{|P_T - P_C|}{\sqrt{\frac{P_T(1 - P_T) + P_C(1 - P_C)}{2}}}, \qquad d_{\mathrm{after}} = \frac{|P_{TA} - P_{CA}|}{\sqrt{\frac{P_{TA}(1 - P_{TA}) + P_{CA}(1 - P_{CA})}{2}}},

where P_T, P_{TA} and P_C, P_{CA} are the proportions of the binary variable in the treatment and comparison groups before and after adjustment, respectively.
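A direct implementation of these formulas (our own hypothetical helper; the MASD used below averages the per-covariate absolute standardized differences):

```python
import numpy as np

def abs_std_diff(x_t, x_c, binary=False):
    """Absolute standardized difference for one covariate; pass the
    treated and comparison samples before or after adjustment."""
    if binary:
        p_t, p_c = x_t.mean(), x_c.mean()
        denom = np.sqrt((p_t * (1 - p_t) + p_c * (1 - p_c)) / 2)
    else:
        denom = np.sqrt((x_t.var(ddof=1) + x_c.var(ddof=1)) / 2)
    return abs(x_t.mean() - x_c.mean()) / denom

def masd(X_t, X_c):
    """Mean absolute standardized difference over all covariates."""
    return np.mean([abs_std_diff(X_t[:, k], X_c[:, k])
                    for k in range(X_t.shape[1])])
```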

First, let us examine whether the simulations in this paper replicate the consequence King and Nielsen (2019) mention, according to which PSM with 1-to-1 greedy matching without replacement worsens the covariate balance beyond a certain point.

In Figure 4, we plot the overall MASD of all covariates after 1-to-1 matching without replacement for each number of observations removed by matching, using the sample generated in Setting 1 with no unobserved confounders and no selection bias. Regarding PSM, we generated propensity scores by logit regression using all covariates except for Z.

Figure 4: Relationship between MASD and the number of removed observations (1).

Figure 4 suggests that MDM (1-to-1) and CEM (k-to-k) effectively improve covariate balance: both achieved less than 10 % MASD in the matched units. MDM achieved less than 10 % MASD with a mean of 2105 matched units. CEM improves covariate balance to the same extent as MDM but leads to a far greater reduction in units. Indeed, CEM requires pruning of more than 2000 units to achieve an MASD of less than 10 %. In contrast, PSM (1-to-1) only achieves an MASD in the 30 % range with a mean of 1649 matched units. This result presumably occurs because the instrumental variable Z is not used in the calculation of the propensity scores whereas Z is included in the calculation of the overall MASD of all covariates after matching. Consistent with the conjecture, Figure 5 shows that the overall MASD of all covariates after PSM also becomes less than 10 % with a mean of 1067 matched units when the instrumental variable Z is included in the propensity score model.

Figure 5: Relationship between MASD and the number of removed observations (2).

The phenomenon King and Nielsen (2019) mention is present not just in PSM but also in MDM and CEM. Indeed, as Figures 4 and 5 show, the covariate balance worsens slightly in all matching methods as the removal of observations by matching advances. Figures 6–9 show that the same phenomenon occurs even when we use 1-to-5 matching for PSM, with and without Z in the propensity score model, and j-to-k matching for CEM, with and without auto cut points. These results suggest that matching and weighting under extreme caliper settings do not improve covariate balance, regardless of the matching and weighting methods.

Figure 6: Relationship between MASD and the number of removed observations (3).

Figure 7: Relationship between MASD and the number of removed observations (4).

Figure 8: Relationship between MASD and the number of removed observations (5).

Figure 9: Relationship between MASD and the number of removed observations (6).

Iacus, King, and Porro (2012), who developed CEM, recommend checking covariate balance with the multivariate imbalance measure (MIM), which they also developed. Therefore, we calculated MIMs after matching and plotted them for each number of removed observations.
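For reference, the following sketch implements the L1 multivariate imbalance measure as Iacus, King, and Porro define it (our own illustration with a fixed number of equal-width bins; the cem software selects cut points automatically):

```python
import numpy as np
from collections import Counter

def l1_imbalance(X_treated, X_control, bins=5):
    """L1 multivariate imbalance measure: coarsen each covariate on
    common cut points, cross-tabulate the multivariate cells, and take
    half the sum of absolute differences between the two groups'
    relative frequencies (0 = identical distributions, 1 = disjoint)."""
    edges = [np.histogram_bin_edges(np.concatenate([t, c]), bins=bins)[1:-1]
             for t, c in zip(X_treated.T, X_control.T)]

    def cell_freqs(X):
        keys = zip(*(np.digitize(x, e) for x, e in zip(X.T, edges)))
        return {k: v / len(X) for k, v in Counter(keys).items()}

    f, g = cell_freqs(X_treated), cell_freqs(X_control)
    return 0.5 * sum(abs(f.get(k, 0) - g.get(k, 0)) for k in set(f) | set(g))
```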

Figures 10–15 show that when evaluated using the MIM, there is no substantive difference in the covariate balancing performance of PSM and MDM. In contrast, CEM’s performance is better than those of PSM and MDM. However, the phenomenon King and Nielsen (2019) observe still seems present in the data sets generated by Setting 1, albeit slightly, for all matching methods, even when tested with the MIM. The phenomenon can also be observed in the data sets generated by Settings 2 (i.e. selection on observables) and 3 (i.e. selection on unobservables); see Appendix A (Figures A1-1 to A1-12 and A2-1 to A2-12).

Figure 10: Relationship between MIM and the number of removed observations (1).

Figure 11: Relationship between MIM and the number of removed observations (2).

Figure 12: Relationship between MIM and the number of removed observations (3).

Figure 13: Relationship between MIM and the number of removed observations (4).

Figure 14: Relationship between MIM and the number of removed observations (5).

Figure 15: Relationship between MIM and the number of removed observations (6).

The relationship between the covariate balance measures (the MASD and MIM) and the number of removed units after matching seems simply to suggest that implementing matching to the extreme does not result in the most improved covariate balance, regardless of matching methods.

4.2 Relationship Between the Bias in the Treatment Effect Estimate and the Covariate Balance

4.2.1 Setting 1

Next, we examine the relationship between the MAE and RMSE of the estimated coefficient of the treatment variable W and the overall MASDs of the covariates. We examined 34 estimators in our simulations. Table 1 provides a summary of the examined estimators. Tables 2–4 show the simulation results from Settings 1–3, presented in ascending order of RMSE. Each error’s effect on RMSE is proportional to the squared error; therefore, RMSE detects large errors more readily than MAE, in which each error contributes in proportion to its absolute value. (Tables B1–B3 in Appendix B present the mean, min, max, and the 5th, 25th, 50th, 75th, and 95th percentile values of the MAE of treatment effects.)
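Concretely, with R simulation runs, estimated coefficients \hat{\tau}_r of W, and the true effect of 0.5, the two criteria are:

\mathrm{MAE} = \frac{1}{R} \sum_{r=1}^{R} |\hat{\tau}_r - 0.5|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{R} \sum_{r=1}^{R} (\hat{\tau}_r - 0.5)^2}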

Table 1:

Summary of the examined estimators.

| ID | Estimation | Estimates | Propensity score (PS) calculation | Matching | Calipers/cut-points |
|---|---|---|---|---|---|
| 1 | OLS | Simple difference in means | | | |
| 2 | OLS | Two-step (doubly robust estimation) | | | |
| 3 | EB | Simple difference in means | | | |
| 4 | EB | Two-step (doubly robust estimation) | | | |
| 5 | IPW | IPW estimates | Without Z (instrumental variable) | | |
| 6 | IPW | Two-step (doubly robust estimation) | Without Z (instrumental variable) | | |
| 7 | IPW | IPW estimates | With Z (instrumental variable) | | |
| 8 | IPW | Two-step (doubly robust estimation) | With Z (instrumental variable) | | |
| 9 | PSM | PSM estimates | Without Z (instrumental variable) | 1-to-1 | No caliper |
| 10 | PSM | PSM estimates | With Z (instrumental variable) | 1-to-1 | No caliper |
| 11 | PSM | PSM estimates | Without Z (instrumental variable) | 1-to-5 | No caliper |
| 12 | PSM | PSM estimates | With Z (instrumental variable) | 1-to-5 | No caliper |
| 13 | PSM | Two-step (doubly robust estimation) | Without Z (instrumental variable) | 1-to-1 | 0.05, 0.1, 0.15, 0.2, 0.25 |
| 14 | PSM | Two-step (doubly robust estimation) | With Z (instrumental variable) | 1-to-1 | 0.05, 0.1, 0.15, 0.2, 0.25 |
| 15 | PSM | Two-step (doubly robust estimation) | Without Z (instrumental variable) | 1-to-5 | 0.05, 0.1, 0.15, 0.2, 0.25 |
| 16 | PSM | Two-step (doubly robust estimation) | With Z (instrumental variable) | 1-to-5 | 0.05, 0.1, 0.15, 0.2, 0.25 |
| 17 | MDM | Simple difference in means | | 1-to-1 | 0.3, 0.975, 1.65, 2.325, 3 |
| 18 | MDM | Two-step (doubly robust estimation) | | 1-to-1 | 0.3, 0.975, 1.65, 2.325, 3 |
| 19 | CEM | Simple difference in means | | k-to-k | 21–25 |
| 20 | CEM | Two-step (doubly robust estimation) | | k-to-k | 21–25 |
| 21 | CEM | Simple difference in means | | j-to-k | 21–25 |
| 22 | CEM | Two-step (doubly robust estimation) | | j-to-k | 21–25 |
| 23 | CEM | Simple difference in means | | k-to-k | Auto (13) |
| 24 | CEM | Two-step (doubly robust estimation) | | k-to-k | Auto (13) |
| 25 | CEM | Simple difference in means | | j-to-k | Auto (13) |
| 26 | CEM | Two-step (doubly robust estimation) | | j-to-k | Auto (13) |
| 27 | CEM (weighting) | Simple difference in means | | k-to-k | 21–25 |
| 28 | CEM (weighting) | Two-step (doubly robust estimation) | | k-to-k | 21–25 |
| 29 | CEM (weighting) | Simple difference in means | | j-to-k | 21–25 |
| 30 | CEM (weighting) | Two-step (doubly robust estimation) | | j-to-k | 21–25 |
| 31 | CEM (weighting) | Simple difference in means | | k-to-k | Auto (13) |
| 32 | CEM (weighting) | Two-step (doubly robust estimation) | | k-to-k | Auto (13) |
| 33 | CEM (weighting) | Simple difference in means | | j-to-k | Auto (13) |
| 34 | CEM (weighting) | Two-step (doubly robust estimation) | | j-to-k | Auto (13) |
Table 2:

Simulation results from setting 1.

ID Estimation Estimates PS calculation Matching Calipers/cut-points Mean number of matched units (treated/untreated) [*] Mean abs. std. diff. (MASD) after adj. Mean multivariate imbalance measure (MIM) after adj. Mean abs. error (MAE) of treatment effects Root mean squared error (RMSE) of treatment effects Rank (MASD / MIM / MAE / RMSE)
5 IPW IPW estimates Without Z 0.4625 0.0323 0.0405 30 1 1
11 PSM PSM estimates Without Z 1-to-5 No caliper 0.3563 0.0422 0.0527 26 2 2
16 PSM Two-step With Z 1-to-5 0.05, 0.1, 0.15, 0.2, 0.25 1995 (1240/755) 0.4086 0.9758 0.0437 0.0548 29 12 3 3
2 OLS Parametric 0.5883 0.0446 0.0557 33 4 4
15 PSM Two-step Without Z 1-to-5 0.05, 0.1, 0.15, 0.2, 0.25 2293 (1241/1052) 0.5398 0.9753 0.0450 0.0563 32 11 5 5
6 IPW Two-step Without Z 0.4625 0.0474 0.0593 30 6 6
14 PSM Two-step With Z 1-to-1 0.05, 0.1, 0.15, 0.2, 0.25 1067 (533/533) 0.0548 0.9388 0.0490 0.0614 9 9 7 7
13 PSM Two-step Without Z 1-to-1 0.05, 0.1, 0.15, 0.2, 0.25 1649 (825/825) 0.3776 0.9671 0.0497 0.0623 28 10 8 8
9 PSM PSM estimates Without Z 1-to-1 No caliper 0.3619 0.0566 0.0713 27 9 9
26 CEM Two-step j-to-k Auto (13) 763 (382/381) 0.1638 0.9243 0.0592 0.0744 14 7 10 10
8 IPW Two-step With Z 0.3382 0.0625 0.0790 24 11 11
7 IPW IPW estimates With Z 0.3382 0.0644 0.0814 24 12 12
24 CEM Two-step k-to-k Auto (13) 544 (272/272) 0.0373 0.8927 0.0689 0.0859 5 5 14 13
32 CEM (weighting) Two-step k-to-k Auto (13) 0.3045 0.0689 0.0859 22 14 13
34 CEM (weighting) Two-step j-to-k Auto (13) 0.2549 0.0688 0.0863 16 13 15
17 MDM Diff. in means 1-to-1 0.3, 0.975, 1.65, 2.325, 3 2105 (1053/1053) 0.0612 0.9794 0.0692 0.0868 10 13 16 16
18 MDM Two-step 1-to-1 0.3, 0.975, 1.65, 2.325, 3 2105 (1053/1053) 0.0612 0.9794 0.0700 0.0879 10 13 17 17
33 CEM (weighting) Diff. in means j-to-k Auto (13) 0.2549 0.0729 0.0912 16 18 18
23 CEM Diff. in means k-to-k Auto (13) 544 (272/272) 0.0373 0.8927 0.0731 0.0913 5 5 19 19
31 CEM (weighting) Diff. in means k-to-k Auto (13) 0.3045 0.0731 0.0913 22 19 19
25 CEM Diff. in means j-to-k Auto (13) 763 (382/381) 0.1638 0.9243 0.1101 0.1277 14 7 24 21
12 PSM PSM estimates With Z 1-to-5 No caliper 0.0818 0.1056 0.1327 12 23 22
3 EB Diff. in means 0.0000 0.1055 0.1332 1 21 23
4 EB Two-step 0.0000 0.1055 0.1332 1 22 24
10 PSM PSM estimates With Z 1-to-1 No caliper 0.1133 0.1503 0.1900 13 25 25
21 CEM Diff. in means j-to-k 21–25 115 (58/57) 0.0511 0.5384 0.1578 0.1988 7 3 27 26
22 CEM Two-step j-to-k 21–25 115 (58/57) 0.0511 0.5384 0.1577 0.1989 7 3 26 27
19 CEM Diff. in means k-to-k 21–25 108 (54/54) 0.0248 0.5113 0.1609 0.2027 3 1 30 28
27 CEM (weighting) Diff. in means k-to-k 21–25 0.3000 0.1609 0.2027 20 30 28
20 CEM Two-step k-to-k 21–25 108 (54/54) 0.0248 0.5113 0.1608 0.2027 3 1 28 30
28 CEM (weighting) Two-step k-to-k 21–25 0.3000 0.1608 0.2027 20 28 30
29 CEM (weighting) Diff. in means j-to-k 21–25 0.2936 0.1626 0.2053 18 32 32
30 CEM (weighting) Two-step j-to-k 21–25 0.2936 0.1626 0.2054 18 33 33
1 OLS Diff. in means 0.5883 0.3704 0.3718 33 34 34
[*] Due to rounding, the sum of treated and untreated units does not necessarily correspond to the number of matched units.

Table 3:

Simulation results from setting 2.

ID Estimation Estimates PS calculation Matching Calipers/cut-points Mean number of matched units (treated/untreated) [*] Mean abs. std. diff. (MASD) after adj. Mean multivariate imbalance measure (MIM) after adj. Mean abs. error (MAE) of treatment effects Root mean squared error (RMSE) of treatment effects Rank (MASD / MIM / MAE / RMSE)
1 OLS Diff. in means 0.5883 0.0264 0.0329 33 1 1
16 PSM Two-step With Z 1-to-5 0.05, 0.1, 0.15, 0.2, 0.25 1995 (1240/755) 0.4086 0.9758 0.0401 0.0502 29 12 2 2
2 OLS Parametric 0.5883 0.0409 0.0511 33 3 3
15 PSM Two-step Without Z 1-to-5 0.05, 0.1, 0.15, 0.2, 0.25 2293 (1241/1052) 0.5398 0.9753 0.0413 0.0516 32 11 4 4
6 IPW Two-step Without Z 0.4625 0.0435 0.0543 30 5 5
14 PSM Two-step With Z 1-to-1 0.05, 0.1, 0.15, 0.2, 0.25 1067 (533/533) 0.0548 0.9388 0.0449 0.0563 9 9 6 6
13 PSM Two-step Without Z 1-to-1 0.05, 0.1, 0.15, 0.2, 0.25 1649 (825/825) 0.3776 0.9671 0.0455 0.0571 28 10 7 7
26 CEM Two-step j-to-k Auto (13) 763 (382/381) 0.1638 0.9243 0.0543 0.0682 14 7 8 8
8 IPW Two-step With Z 0.3382 0.0573 0.0724 24 9 9
25 CEM Diff. in means j-to-k Auto (13) 763 (382/381) 0.1638 0.9243 0.0591 0.0737 14 7 10 10
7 IPW IPW estimates With Z 0.3382 0.0596 0.0752 24 11 11
17 MDM Diff. in means 1-to-1 0.3, 0.975, 1.65, 2.325, 3 2105 (1053/1053) 0.0612 0.9794 0.0604 0.0758 10 13 12 12
24 CEM Two-step k-to-k Auto (13) 544 (272/272) 0.0373 0.8927 0.0631 0.0787 5 5 14 13
32 CEM (weighting) Two-step k-to-k Auto (13) 0.3045 0.0631 0.0787 22 14 13
34 CEM (weighting) Two-step j-to-k Auto (13) 0.2549 0.0631 0.0791 16 13 15
33 CEM (weighting) Diff. in means j-to-k Auto (13) 0.2549 0.0640 0.0802 16 16 16
23 CEM Diff. in means k-to-k Auto (13) 544 (272/272) 0.0373 0.8927 0.0642 0.0802 5 5 18 17
31 CEM (weighting) Diff. in means k-to-k Auto (13) 0.3045 0.0642 0.0802 22 18 17
18 MDM Two-step 1-to-1 0.3, 0.975, 1.65, 2.325, 3 2105 (1053/1053) 0.0612 0.9794 0.0641 0.0806 10 13 17 19
3 EB Diff. in means 0.0000 0.0967 0.1221 1 20 20
4 EB Two-step 0.0000 0.0967 0.1221 1 21 21
12 PSM PSM estimates With Z 1-to-5 No caliper 0.0818 0.1061 0.1328 12 22 22
22 CEM Two-step j-to-k 21–25 115 (58/57) 0.0511 0.5384 0.1445 0.1823 7 3 23 23
21 CEM Diff. in means j-to-k 21–25 115 (58/57) 0.0511 0.5384 0.1452 0.1831 7 3 24 24
20 CEM Two-step k-to-k 21–25 108 (54/54) 0.0248 0.5113 0.1474 0.1858 3 1 25 25
28 CEM (weighting) Two-step k-to-k 21–25 0.3000 0.1474 0.1858 20 25 25
19 CEM Diff. in means k-to-k 21–25 108 (54/54) 0.0248 0.5113 0.1476 0.1860 3 1 27 27
27 CEM (weighting) Diff. in means k-to-k 21–25 0.3000 0.1476 0.1860 20 27 27
30 CEM (weighting) Two-step j-to-k 21–25 0.2936 0.1490 0.1882 18 29 29
29 CEM (weighting) Diff. in means j-to-k 21–25 0.2936 0.1492 0.1885 18 30 30
10 PSM PSM estimates With Z 1-to-1 No caliper 0.1133 0.1571 0.1984 13 31 31
5 IPW IPW estimates Without Z 0.4625 0.4692 0.4707 30 32 32
9 PSM PSM estimates Without Z 1-to-1 No caliper 0.3619 0.4772 0.4824 27 33 33
11 PSM PSM estimates Without Z 1-to-5 No caliper 0.3563 0.4807 0.4834 26 34 34
[*] Due to rounding, the sum of treated and untreated units does not necessarily correspond to the number of matched units.

Table 4:

Simulation results from setting 3.

ID Estimation Estimates PS calculation Matching Calipers/cut-points Mean number of matched units (treated/untreated) [*] Mean abs. std. diff. (MASD) after adj. Mean multivariate imbalance measure (MIM) after adj. Mean abs. error (MAE) of treatment effects Root mean squared error (RMSE) of treatment effects Rank (MASD / MIM / MAE / RMSE)
25 CEM Diff. in means j-to-k Auto (13) 768 (382/386) 0.1660 0.9242 0.0810 0.0988 14 7 1 1
5 IPW IPW estimates Without Z 0.4599 0.1202 0.1265 30 2 2
11 PSM PSM estimates Without Z 1-to-5 No caliper 0.3539 0.1253 0.1350 26 3 3
9 PSM PSM estimates Without Z 1-to-1 No caliper 0.3600 0.1253 0.1408 27 4 4
2 OLS Parametric 0.5883 0.1469 0.1551 33 5 5
15 PSM Two-step Without Z 1-to-5 0.05, 0.1, 0.15, 0.2, 0.25 2293 (1241/1052) 0.5360 0.9756 0.1516 0.1602 32 11 6 6
6 IPW Two-step Without Z 0.4599 0.1610 0.1701 30 12 7
13 PSM Two-step Without Z 1-to-1 0.05, 0.1, 0.15, 0.2, 0.25 1652 (826/826) 0.3740 0.9680 0.1601 0.1705 28 10 11 8
16 PSM Two-step With Z 1-to-5 0.05, 0.1, 0.15, 0.2, 0.25 2004 (1240/763) 0.4082 0.9762 0.1622 0.1705 29 12 13 9
23 CEM Diff. in means k-to-k Auto (13) 549 (275/275) 0.0361 0.8918 0.1541 0.1741 5 5 7 10
31 CEM (weighting) Diff. in means k-to-k Auto (13) 0.3098 0.1541 0.1741 22 7 10
33 CEM (weighting) Diff. in means j-to-k Auto (13) 0.2592 0.1554 0.1753 16 9 12
17 MDM Diff. in means 1-to-1 0.3, 0.975, 1.65, 2.325, 3 2109 (1054/1054) 0.0608 0.9790 0.1590 0.1772 10 13 10 13
26 CEM Two-step j-to-k Auto (13) 768 (382/386) 0.1660 0.9242 0.1768 0.1911 14 7 14 14
14 PSM Two-step With Z 1-to-1 0.05, 0.1, 0.15, 0.2, 0.25 1079 (540/540) 0.0543 0.9399 0.1890 0.1984 9 9 18 15
34 CEM (weighting) Two-step j-to-k Auto (13) 0.2592 0.1811 0.1995 16 15 16
24 CEM Two-step k-to-k Auto (13) 549 (275/275) 0.0361 0.8918 0.1816 0.1998 5 5 16 17
32 CEM (weighting) Two-step k-to-k Auto (13) 0.3098 0.1816 0.1998 22 16 17
18 MDM Two-step 1-to-1 0.3, 0.975, 1.65, 2.325, 3 2109 (1054/1054) 0.0608 0.9790 0.1999 0.2170 10 13 21 19
8 IPW Two-step With Z 0.3373 0.2050 0.2185 24 25 20
7 IPW IPW estimates With Z 0.3373 0.2047 0.2191 24 24 21
12 PSM PSM estimates With Z 1-to-5 No caliper 0.0813 0.1939 0.2251 12 19 22
3 EB Diff. in means 0.0000 0.2040 0.2339 1 22 23
4 EB Two-step 0.0000 0.2040 0.2339 1 23 24
21 CEM Diff. in means j-to-k 21–25 113 (57/56) 0.0519 0.5259 0.1955 0.2419 7 3 20 25
19 CEM Diff. in means k-to-k 21–25 106 (53/53) 0.0246 0.5030 0.2064 0.2542 3 1 26 26
27 CEM (weighting) Diff. in means k-to-k 21–25 0.3092 0.2064 0.2542 20 26 26
22 CEM Two-step j-to-k 21–25 113 (57/56) 0.0519 0.5259 0.2115 0.2584 7 3 29 28
29 CEM (weighting) Diff. in means j-to-k 21–25 0.3006 0.2105 0.2593 18 28 29
20 CEM Two-step k-to-k 21–25 106 (53/53) 0.0246 0.5030 0.2126 0.2607 3 1 30 30
28 CEM (weighting) Two-step k-to-k 21–25 0.3092 0.2126 0.2607 20 30 30
30 CEM (weighting) Two-step j-to-k 21–25 0.3006 0.2166 0.2656 18 32 32
1 OLS Diff. in means 0.5883 0.2705 0.2724 33 34 33
10 PSM PSM estimates With Z 1-to-1 No caliper 0.1101 0.2327 0.2742 13 33 34
[*] Due to rounding, the sum of treated and untreated units does not necessarily correspond to the number of matched units.

Table 2 presents the simulation results obtained from Setting 1, where the assumption of unconfoundedness is satisfied. Overall, there is no substantive difference between the rankings based on MAE and those based on RMSE. IPW and PSM generally achieve lower MAE and RMSE than CEM, MDM, and EB. IPW estimates (without Z) (ID 5) achieved the lowest bias, with an average bias of about 6 % in MAE; PSM estimates (without Z) (ID 11) attained the next lowest, at about 8 % in MAE. However, the two-step procedure, i.e. doubly robust estimation, is not necessarily less effective at reducing bias than the IPW/PSM estimates. In fact, Table 2 shows that half of the top 10 in the RMSE ranking are two-step estimates (IDs 16, 15, 6, 14, and 13), the remainder including the IPW estimates in first place (ID 5), the PSM estimates in second and ninth places (IDs 11 and 9), and the OLS estimates in fourth place (ID 2). In Setting 1, where unconfoundedness holds, the OLS estimation result (ID 2) from unmatched/unweighted samples is equivalent to those obtained from PSM-adjusted samples, and its estimated bias in terms of RMSE is lower than the estimated bias from samples adjusted with MDM and CEM.
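
To make the two-step procedure concrete, here is a minimal sketch under illustrative assumptions: a logistic propensity model estimated without the instrument, followed by an IPW-weighted regression that also includes the covariates. The data-generating values below are arbitrary, and the code does not reproduce our exact specifications.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000
X = rng.normal(size=(n, 2))                     # observed covariates
p = 1 / (1 + np.exp(-(0.5 * X[:, 0] + 0.5 * X[:, 1])))
W = rng.binomial(1, p)                          # treatment indicator
y = 1.0 * W + X @ np.array([1.0, -0.5]) + rng.normal(size=n)  # true effect = 1

# Step 1: propensity score from a logit of W on the covariates (no instrument).
ps = sm.Logit(W, sm.add_constant(X)).fit(disp=0).predict()

# Step 2: weighted regression of y on W and the covariates.
# ATT-style IPW weights: 1 for treated units, ps/(1 - ps) for comparisons.
w_ipw = np.where(W == 1, 1.0, ps / (1 - ps))
fit = sm.WLS(y, sm.add_constant(np.column_stack([W, X])), weights=w_ipw).fit()
print(fit.params[1])  # treatment-effect estimate; consistent if either
                      # the propensity model or the outcome model is right
```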

Regarding the instrumental variable Z, Table 2 suggests that not including Z in the propensity score calculation (e.g. IDs 5, 11, 15, and 6) is, on average, more likely to result in lower bias than including Z in the propensity score calculation (e.g. IDs 16 and 14) in terms of MAE and RMSE. The results are consistent with those of Bhattacharya and Vogt (2007) and Wooldridge (2016). For 1-to-1/k-to-k matching and 1-to-many/j-to-k matching, the latter (e.g. IDs 11, 16, and 15) appears to achieve lower bias in terms of MAE and RMSE than the former (e.g. IDs 14, 13, and 9).

EB (IDs 3 and 4) performs very well in improving covariate balance: after covariate distribution adjustment using EB, the MASD between the treatment and comparison groups is nearly zero. However, despite this covariate-balancing ability, EB is not very effective at reducing bias in terms of MAE and RMSE; estimation with an EB-weighted sample shows a mean bias of about 20 % in MAE for both the simple difference in means and the two-step procedure.
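
A minimal sketch of how EB achieves this near-zero MASD, assuming exact first-moment matching and using the convex dual of Hainmueller's entropy balancing problem; the function entropy_balance and all data values are illustrative assumptions, not our implementation.

```python
import numpy as np
from scipy.optimize import minimize

def entropy_balance(X_ctrl, target_means):
    """Solve for comparison-group weights whose weighted covariate means
    equal the treated means (dual of the entropy balancing problem)."""
    Z = X_ctrl - target_means                  # center on the treated means
    def dual(lam):
        return np.log(np.exp(Z @ lam).sum())   # convex dual objective
    lam = minimize(dual, np.zeros(Z.shape[1]), method="BFGS").x
    w = np.exp(Z @ lam)
    return w / w.sum()

rng = np.random.default_rng(2)
X_t = rng.normal(0.4, 1.0, size=(300, 2))      # toy treated covariates
X_c = rng.normal(0.0, 1.2, size=(700, 2))      # toy comparison covariates
w = entropy_balance(X_c, X_t.mean(axis=0))

# After weighting, the standardized differences in means are near zero,
# mirroring the near-zero MASD that EB attains in our simulations.
std_diff = (X_t.mean(axis=0) - (w[:, None] * X_c).sum(axis=0)) / X_t.std(axis=0)
print(np.abs(std_diff))
```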

MDM performs as well as other matching/weighting methods in improving covariate balance. However, MDM is less effective than IPW, PSM, and CEM regarding bias reduction. Indeed, estimates from samples obtained by MDM (IDs 17 and 18), which have less than 10 % MASD between the treatment and comparison groups, show an average bias of about 14 % when measured by MAE.
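
For concreteness, a sketch of 1-to-1 greedy Mahalanobis matching without replacement under a caliper; the greedy ordering and the helper name mdm_1to1 are illustrative choices rather than our implementation (the caliper 1.65 is one of the five values listed in Table 1).

```python
import numpy as np

def mdm_1to1(X_treat, X_ctrl, caliper=np.inf):
    """Greedy 1-to-1 Mahalanobis distance matching without replacement."""
    VI = np.linalg.inv(np.cov(np.vstack([X_treat, X_ctrl]).T))  # pooled cov^-1
    diffs = X_treat[:, None, :] - X_ctrl[None, :, :]
    d = np.sqrt(np.einsum("tcj,jk,tck->tc", diffs, VI, diffs))  # Mahalanobis
    pairs, used = [], set()
    for t in np.argsort(d.min(axis=1)):      # match easiest treated unit first
        for c in np.argsort(d[t]):
            if c not in used and d[t, c] <= caliper:
                pairs.append((t, c)); used.add(c)
                break                         # unmatched treated units are pruned
    return pairs

rng = np.random.default_rng(3)
X_t = rng.normal(0.3, 1.0, size=(50, 2))
X_c = rng.normal(0.0, 1.0, size=(200, 2))
print(len(mdm_1to1(X_t, X_c, caliper=1.65)))  # matched pairs under one caliper
```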

CEM (matching) with j-to-k matching and automatic cut-points (13) ranks just behind PSM in bias reduction, with a mean bias of about 12 % in MAE (ID 26). The simulation results from Setting 1 also suggest that, for CEM, weighting is slightly inferior to matching in terms of bias reduction, yielding a mean bias of about 14 % in MAE (e.g. IDs 32 and 34).
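
A rough sketch of the CEM logic referred to here: coarsen each covariate, form strata from the joint bin signature, prune strata lacking either group, and, for j-to-k matching, reweight the comparisons within each retained stratum. The fixed-bin coarsening and the helper name cem_weights are illustrative assumptions.

```python
import numpy as np
from collections import defaultdict

def cem_weights(X, W, n_bins=5):
    """Coarsened exact matching, j-to-k: returns unit weights
    (0 = pruned, 1 = matched treated, CEM weight = matched comparison)."""
    # Coarsen: assign each unit a tuple of bin indices (its stratum signature)
    bins = [np.digitize(X[:, j], np.histogram_bin_edges(X[:, j], n_bins)[1:-1])
            for j in range(X.shape[1])]
    strata = list(zip(*bins))
    counts = defaultdict(lambda: [0, 0])      # [comparison, treated] per stratum
    for s, w in zip(strata, W):
        counts[s][w] += 1
    matched = {s for s, (c0, c1) in counts.items() if c0 > 0 and c1 > 0}
    mT = sum(counts[s][1] for s in matched)   # matched treated total
    mC = sum(counts[s][0] for s in matched)   # matched comparison total
    wts = np.zeros(len(W))
    for i, (s, w) in enumerate(zip(strata, W)):
        if s in matched:
            wts[i] = 1.0 if w == 1 else (counts[s][1] / counts[s][0]) * (mC / mT)
    return wts

rng = np.random.default_rng(4)
X = rng.normal(size=(2000, 2))
W = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
wts = cem_weights(X, W)
print((wts > 0).sum(), "units retained after matching")
```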

Overall, comparing RMSE and MASD rankings in Table 2, it appears that the improvement in covariate balance measured by MASD does not necessarily lead to an improvement in bias as measured by RMSE. A similar trend emerges even when we compare RMSE and MIM rankings, though only with post-matching data because the CEM command does not accept weights calculated with methods other than cem.

Then, what is the relationship between RMSE and MASD in each sample obtained by weighting or matching? Does the improvement in covariate balance measured by MASD again fail to translate into an improvement in bias measured by RMSE in each sample? We examine the relationship between changes in covariate balance as measured by MASD and changes in bias as measured by RMSE by plotting them. For simplicity, the figures compare the bias of the simple difference in means and the parametric OLS estimator on unweighted, unmatched samples with the bias of estimations on samples adjusted with EB, IPW, PSM, MDM, and CEM. We created figures for all 32 estimators (all except the simple difference in means and the parametric OLS estimator on raw data), i.e. 96 figures across the three settings. We therefore present the figures with the lowest mean bias for each of EB, IPW, PSM, MDM, and CEM in the main text and the remaining figures in Appendix C (Figures C1-1–C1-32; C2-1–C2-32; C3-1–C3-32).

In Setting 1, the biases of a simple difference in means and OLS estimation on raw data are generally stable regardless of the magnitude of covariate imbalance (see Figures 16 and 20). In contrast, the covariate imbalance is stable at almost zero in the sample obtained with EB weighting, but the bias in terms of RMSE is highly variable and large. The same trend emerges for the two-step procedure, with almost the same size of bias (Appendix C, Figure C1-2).

Figure 16: The bias in the treatment effect estimate and the covariate balance: ID 1 versus ID 3.

In contrast, Figure 17 shows that in the sample obtained from IPW, the bias in terms of RMSE is generally constant regardless of covariate balance, as in the case of a simple difference in means and OLS estimation on raw data. For IPW, we observe a trend in which the bias increases slightly when the covariate balance is achieved to nearly the fullest extent. The same tendency is observed for the two-step procedure, and the size of the bias is slightly larger (Appendix C, Figure C1-4).

Figure 17: The bias in the treatment effect estimate and the covariate balance: ID 1 versus ID 5.

Figure 18 suggests that a similar pattern emerges for PSM. Again, the bias in terms of RMSE is generally constant regardless of the covariate balance in the sample obtained from PSM, but the bias increases slightly when the maximum covariate balance is achieved. The same tendency is observed in the other PSM estimates (Appendix C, Figures C1-7–C1-14), but we observe no substantive increase in bias with the improvement in covariate balance for IDs 10 and 12 (Appendix C, Figures C1-8 and C1-10). Although the biases of the average PSM estimates for IDs 10 and 12, obtained when Z (the instrumental variable) was included in the propensity score calculation, decrease as the covariate balance improves, they are almost twice as large as those of the other PSM estimates.

Figure 18: The bias in the treatment effect estimate and the covariate balance: ID 1 versus ID 11.

In contrast, Figure 19 shows that when MDM is applied, biases in the estimates of the coefficients of W improve in conjunction with the improvement in covariate balance. However, the biases in the estimated coefficients of W obtained from samples applied with MDM are generally higher than those of the estimates obtained from IPW and PSM. The same result is observed for the two-step procedure, with almost the same size of bias (Appendix C, Figure C1-16).

Figure 19: The bias in the treatment effect estimate and the covariate balance: ID 1 versus ID 17.

Finally, Figure 20 shows that a similar pattern is observed for CEM. Here again, the bias in terms of RMSE is generally constant regardless of the covariate balance of the samples obtained from CEM, and the bias shows a slight tendency to increase when the covariate balance is maximized. It should be noted, however, that this trend is not always observed for CEM, depending on the estimation setup: in k-to-k matching using five sets of cut-points (Figures C1-17 and C1-18; IDs 19 and 20) and in weighting after k-to-k matching using automatic cut-points (Figures C1-29 and C1-30; IDs 31 and 32), there is almost no tendency for the mean bias to increase even when the covariate balance is maximized. Unfortunately, these setups did not achieve the lowest bias attainable with CEM.

Figure 20: The bias in the treatment effect estimate and the covariate balance: ID 2 versus ID 26.

Our examination of the relationship between changes in covariate balance as measured by MASD and changes in bias as measured by RMSE suggests that beyond a certain point, the improvement in covariate balance can no longer help reduce bias in the estimated effects of treatment regardless of the weighting/matching method.

4.2.2 Setting 2

Even when unconfoundedness could be reasonably assumed, there is always the risk of confounding bias in estimates of causal effects when using observational data. In Setting 2, selection is on observables, in which stochastic dependence between W and u occurs because of the correlation between Z and u, leading to selection bias. Because Z is an observable confounder, conditioning on Z controls the dependence between W and u and mitigates the selection bias.
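
A stylized sketch of such a selection-on-observables process, using the two values reported later in this subsection (a coefficient of 0.5 on Z in the assignment equation and a correlation of 0.4 between Z and u); everything else, including the logistic assignment noise, is an illustrative assumption rather than our exact DGP.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 3000

# Z and the outcome error u are correlated (rho = 0.4, as stated in the text),
# and Z enters treatment assignment with coefficient 0.5.
rho = 0.4
Z, u = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n).T
X = rng.normal(size=n)                               # additional covariate
W = (0.5 * Z + X + rng.logistic(size=n) > 0).astype(int)
y = 1.0 * W + X + u                                  # true effect = 1

# Because Z is observed, conditioning on it removes the W-u dependence:
exog = sm.add_constant(np.column_stack([W, X, Z]))
print(sm.OLS(y, exog).fit().params[1])   # close to 1 once Z is controlled for
```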

Table 3 presents the simulation results obtained from Setting 2. Again, there is no substantive difference between the rankings based on MAE and those based on RMSE. Overall, IPW and PSM generally achieve lower MAE and RMSE than CEM, MDM, and EB. The lowest bias is achieved by a simple difference in means from raw data (ID 1), with an average bias of about 5 % in MAE. The OLS estimation results (ID 2) from unmatched/unweighted samples are again equivalent to or better than those obtained from samples adjusted with EB, IPW, PSM, MDM, and CEM; biases in the OLS coefficient estimates from unmatched/unweighted samples are lower than the estimated bias from samples adjusted with EB, IPW, MDM, and CEM.

Here again, the two-step procedure, i.e. doubly robust estimation, is not necessarily less effective in reducing bias than IPW/PSM estimates: we attained the next-lowest bias using PSM estimates (with Z) from the two-step procedure (ID 16), with an average bias of about 8 % in MAE. In addition, Table 3 shows that 7 of the top 10 in the RMSE ranking are two-step estimates (IDs 16, 15, 6, 14, 13, 26, 8). As mentioned above, in Setting 2, selection is on observables: Z is an observable confounder, and conditioning on Z controls the dependence between W and u and mitigates selection bias. Put differently, the unobservable confounding factor is indirectly controlled by controlling for the observable confounders because of the correlation between observable and unobservable confounders (Z and u; Caliendo, Mahlstedt, and Mitnik 2017; Groenwold and Klungel 2015). Thanks to the indirect control of the unobservable confounding factor through conditioning on Z, the two-step procedure in Setting 2 seems no less effective than the IPW/PSM estimates in reducing bias.

Table 3 suggests, in contrast to Table 2, that including Z in the propensity score calculation (e.g. IDs 16, 14, and 8) does not necessarily result in higher bias in terms of MAE and RMSE than excluding it (e.g. IDs 15, 6, and 13). At first glance, this appears inconsistent with Bhattacharya and Vogt (2007) and Wooldridge (2016). However, Bhattacharya and Vogt (2007) suggest that the propensity score method becomes more inconsistent as the instrument's strength increases, i.e. as the instrument becomes a better predictor of treatment assignment. In Setting 2, the coefficient of Z is 0.5 and the correlation between Z and u is 0.4, so Z is neither a very strong nor a very weak instrument; Bhattacharya and Vogt (2007) set the instrument's coefficient to 0.2 and 0.8 for their weak and strong instruments, respectively, making our Z moderately weak by comparison. Given this moderate instrument strength and the moderately weak correlation between Z and u, including Z in the propensity score calculation plausibly need not produce greater bias in terms of MAE and RMSE than excluding it. For 1-to-1/k-to-k matching versus 1-to-many/j-to-k matching, again, the latter (e.g. IDs 16 and 15) appears to achieve lower bias in terms of MAE and RMSE than the former (e.g. IDs 14 and 13).

When we compare the RMSE and MASD/MIM rankings in Table 3, we again observe that an improvement in covariate balance measured by MASD/MIM does not necessarily lead to an improvement in bias as measured by RMSE. Also, by plotting the relationship between changes in MASD and changes in RMSE for each of EB, IPW, PSM, MDM, and CEM, we find the following results, which are quite similar to those from Setting 1 and are consistent with the rankings comparison in Table 3 (see Figures 21–25 for the figures with the lowest mean bias for each estimator and Figures C2-1–C2-32 in Appendix C for all figures):

  1. the biases of OLS estimators on raw data are generally stable regardless of the magnitude of covariate imbalance in the data (Figures 21 and 22);

  2. the covariate imbalance is stable at almost zero in the sample obtained with EB weighting, but the bias in terms of RMSE is highly variable and large (Figure 21);

  3. the biases in terms of RMSE are generally constant for IPW, PSM, and CEM regardless of covariate balance in the sample (Figures 22, 23, and 25);

  4. all estimators show that the bias tends to increase slightly when the covariate balance is achieved to nearly the fullest extent (Figures 21–25), except for several estimates from EB (Figures C2-1 and C2-2), IPW (Figure C2-3), PSM (Figures C2-8 and C2-10), and CEM (Figures C2-17 and C2-18);

  5. even when maximizing the covariate balance leads to little or no increase in bias, it unfortunately does not achieve the lowest bias attainable with each estimator (Figures C2-3, C2-8, C2-10, C2-17, and C2-18).

Figure 21: The bias in the treatment effect estimate and the covariate balance: ID 1 versus ID 3.

Figure 22: The bias in the treatment effect estimate and the covariate balance: ID 2 versus ID 6.

Figure 23: The bias in the treatment effect estimate and the covariate balance: ID 2 versus ID 16.

Figure 24: The bias in the treatment effect estimate and the covariate balance: ID 1 versus ID 17.

Figure 25: The bias in the treatment effect estimate and the covariate balance: ID 2 versus ID 26.

4.2.3 Setting 3

Setting 3 has the issue of selection bias because of the stochastic dependence between W and u caused by the correlation between u and v, an unobservable confounder. Therefore, in Setting 3 (i.e. selection on unobservables), it is impossible to condition on v and mitigate the selection bias. As expected, including unobservable confounding factors causes an increase in the mean absolute bias in the coefficient estimates of W.
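
A stylized sketch of such a selection-on-unobservables process: the selection-equation error v is correlated with the outcome error u but never observed, so conditioning on the observed covariate cannot remove the W-u dependence (the correlation value and functional forms are illustrative assumptions, not our exact DGP).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 3000

# The selection-equation error v is correlated with the outcome error u,
# and v is never observed by the analyst.
rho = 0.4                                    # illustrative correlation
u, v = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n).T
X = rng.normal(size=n)
W = (X + v > 0).astype(int)                  # v drives selection, unobserved
y = 1.0 * W + X + u                          # true effect = 1

exog = sm.add_constant(np.column_stack([W, X]))
print(sm.OLS(y, exog).fit().params[1])       # biased away from 1: selection
                                             # on unobservables is uncorrected
```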

Table 4 presents the simulation results obtained from Setting 3. Here, the lowest bias is achieved by a simple difference in means from samples adjusted with CEM (ID 25), albeit with a larger average bias (about 16 % in MAE) than in Settings 1 and 2, owing to selection on unobservables. Interestingly, the rankings based on MAE and RMSE show a notable difference: in the MAE ranking, the simple difference in means based on samples obtained from CEM and MDM (IDs 23, 31, 33, and 17) occupies the 7th to 10th places, but in the RMSE ranking these drop to the 10th to 13th places. Instead, doubly robust estimators based on IPW and PSM samples (IDs 6, 13, and 16) move up to the 7th to 9th places in the RMSE ranking. The results in Table 4 suggest that in Setting 3, estimation with samples obtained by CEM and MDM tends to produce larger outlier errors.

The OLS estimation results (ID 2) from unmatched/unweighted samples are again equivalent to those obtained from PSM-adjusted samples. Biases in the OLS coefficient estimates from unmatched/unweighted samples are lower than the estimated bias from samples adjusted with EB and MDM.

Regarding the two-step procedure, doubly robust estimators are not among the top four in the RMSE ranking. In fact, in Setting 3, the two-step procedure does not seem as effective as a difference in means after CEM (ID 25) or IPW/PSM estimates (IDs 5, 11, and 9) in reducing bias. These results are consistent with the fact that the two-step procedure cannot be doubly robust under Setting 3's selection on unobservables: both the propensity score model and the parametric outcome model are misspecified.

Table 4 suggests that not including Z in the propensity score calculation (e.g. IDs 5, 11, 9, 15, 6, and 13) results in lower bias than including Z in the propensity score calculation (e.g. ID 16) in terms of MAE and RMSE, which is consistent with Bhattacharya and Vogt (2007) and Wooldridge (2016). Furthermore, for 1-to-1/k-to-k matching and 1-to-many/j-to-k matching, again, the latter (e.g. IDs 25 and 11) seems to achieve lower bias in terms of MAE and RMSE than the former (e.g. IDs 9 and 13).

Comparing RMSE and MASD/MIM rankings in Table 4, here again, we find that the improvement in covariate balance measured using MASD/MIM does not necessarily lead to an improvement in bias as measured using RMSE.

The relationship between changes in MASD and changes in RMSE for each of EB, IPW, PSM, MDM, and CEM demonstrates the following results (see Figures 26–30 for the figures with the lowest mean bias for each estimator and Figures C3-1–C3-32 in Appendix C for all figures):

  1. the biases of OLS estimators on raw data are generally stable regardless of the magnitude of covariate imbalance in the data (Figures 26 and C3-2);

  2. the covariate imbalance is stable at almost zero in the sample obtained with EB weighting, but the bias in terms of RMSE is highly variable and large (Figures 26 and C3-2);

  3. the biases in terms of RMSE are generally constant for IPW, PSM, and CEM regardless of covariate balance in the sample (Figures 27, 28, and 30);

  4. a simple difference in means after MDM (Figure 29) suggests that the bias in terms of RMSE increases as the covariate balance improves whereas the bias in the estimates from the two-step procedure (Figure C3-16) is generally constant;

  5. all estimators show that the bias tends to increase slightly when the covariate balance is achieved to nearly the fullest extent (Figures 26–30), except for several estimates from EB (Figures C3-1 and C3-2), PSM (Figure C3-10), MDM (Figure C3-16), and CEM (Figures C3-22, C3-24, C3-29, and C3-30);

  6. however, again, even when maximizing the covariate balance leads to little or no increase in the ATET bias, it does not achieve the lowest bias that can be attained from each estimator.

Figure 26: The bias in the treatment effect estimate and the covariate balance: ID 1 versus ID 3.

Figure 27: The bias in the treatment effect estimate and the covariate balance: ID 1 versus ID 5.

Figure 28: The bias in the treatment effect estimate and the covariate balance: ID 1 versus ID 11.

Figure 29: The bias in the treatment effect estimate and the covariate balance: ID 1 versus ID 17.

Figure 30: The bias in the treatment effect estimate and the covariate balance: ID 1 versus ID 25.

These findings are consistent with those from Settings 1 and 2 and MASD/MIM rankings in Table 4.

4.2.4 Summary

In the analysis so far, OLS (ID 2), IPW (ID 6), and PSM (IDs 13, 15, and 16) performed well, remaining in the top 10 of the bias-reduction ranking in all three settings. All of these estimates come from the doubly robust approach, and, except for ID 16, the instrumental variable Z is not included in the propensity score calculation. Regarding 1-to-1 versus 1-to-many matching, the latter (IDs 15 and 16) achieved lower bias in terms of MAE and RMSE than the former (ID 13). These results suggest that the two-step procedure with IPW or PSM (1-to-many matching) without instruments may yield relatively low-bias estimates under both selection on observables and selection on unobservables in finite-sample situations, provided the propensity score calculation is not functionally misspecified. However, it does not necessarily yield the estimates with the lowest bias.

At the same time, our simulation results, especially the results for ID 2, seem to align with Angrist and Pischke’s (2009) remark that correctly specified regression works as well as matching techniques, including PSM, in reducing estimation bias. According to Angrist and Pischke (2009), “Regression control for the right covariates does a reasonably good job of eliminating selection bias.” In light of this rule of thumb, matching and weighting estimators can and perhaps should be used as a sensitivity check on conventional parametric practice.

However, the following points should be noted: (1) at least in our data-generating processes, improving covariate balance no longer tends to reduce the bias in the estimated treatment effects once the covariate balance is achieved to nearly the fullest extent, regardless of the weighting/matching method; (2) our simulation analysis did not reveal an appropriate range of covariate balance that minimizes the bias in the estimated treatment effects; (3) our simulation results did not uniquely reveal which of the examined covariate balancing methods would be appropriate for which data sets.

5 Conclusions and Limitations

We examined the effectiveness of weighting and matching methods in improving covariate balance and correcting for bias in the causal effect estimation through simulations in finite-sample situations. The results can be summarized in the following points:

  1. EB, IPW, PSM, MDM, and CEM all effectively improve covariate balance.

  2. In our simulations, the consequence King and Nielsen (2019) mention, i.e. an increase in covariate imbalance associated with removing observations, occurs not only in PSM but also in IPW, MDM, and CEM. No matter the approach used for matching or weighting, pruning observations to the utmost extent does not necessarily produce the best covariate balance in our generated data.

  3. In our generated data, a reduction in the covariate imbalance does not necessarily lead to a lower bias in the estimated effects of treatment. Rather, intervals in which the mean bias is constant and scarcely varies regardless of the covariate balance are predominant in the data generated in this study.

  4. Regardless of the weighting/matching techniques and the data generation processes in this study, once a substantially improved covariate balance is achieved with the given techniques for a given sample, the estimated bias tends to worsen slightly as the covariate balance continues to improve. The exceptions are IDs 31 and 32 (CEM weighting with k-to-k matching and automatic cut-points), but they do not reduce bias well.

  5. In general, our simulations support Angrist and Pischke’s (2009) remark that regression with proper covariates reduces selection bias as well as other weighting and matching methods. This rule of thumb suggests that matching and weighting estimators can be used as a sensitivity check on conventional parametric estimation.

  6. Regarding ATET bias reduction, IPW, PSM, and CEM also performed well in our generated data sets. Unfortunately, EB did not perform well in terms of bias reduction in any of the settings. MDM also did not perform as well. However, we could not clearly determine which of these covariate balancing methods would be appropriate for which data sets.

Overall, our simulation results suggest that when we analyze observational data, it would be important, as Smith and Todd (2005) also point out, not to look for “a magic bullet estimator.” Instead, we should carefully explore or develop the appropriate nonexperimental estimator, including traditional ones, such as OLS, for the sample by thoroughly investigating the characteristics of the available data.

Unfortunately, our simulation results do not indicate to what extent covariate balance should be improved to minimize estimation bias. Conversely, quite a few data sets generated in this study show no substantive increase in the average estimated bias even with standardized differences in the 30–50 % range between the treatment and comparison groups. The relationship between covariate balance and ATET bias should be further investigated in future research with different data generation processes and other covariate balancing methods. For example, we did not thoroughly examine the effect of functional misspecification of the propensity score on covariate balance or on bias in ATET estimation. Nor were we able to perform simulations based on empirical data, as Huber, Lechner, and Wunsch (2013) did. These limitations require further investigation in future research.


Corresponding author: Hideki Fukui, Faculty of Law and Letters, Ehime University, 3 Bunkyo-Cho, Matsuyama, Ehime 790-8577, Japan, E-mail:
Research interest: Policy Evaluation; Empirical Industrial Organization.

Acknowledgments

I would like to thank an anonymous reviewer, the Editor-in-chief Uwe Wagschal, and participants at the 2022 APSA Annual Meeting for their valuable and insightful comments which greatly improved the manuscript. I also acknowledge financial support from the Japan Society for the Promotion of Science (20K01448; 21KK0028) and FY2023 Faculty of Law and Letters, Ehime University, Grant-in-Aid for Publication of Scientific Research Results. All remaining errors are the responsibility of the author.

Appendix A

 

Figure A1-1: Relationship between MASD and the number of removed observations (7).

Figure A1-2: Relationship between MASD and the number of removed observations (8).

Figure A1-3: Relationship between MASD and the number of removed observations (9).

Figure A1-4: Relationship between MASD and the number of removed observations (10).

Figure A1-5: Relationship between MASD and the number of removed observations (11).

Figure A1-6: Relationship between MASD and the number of removed observations (12).

Figure A1-7: Relationship between MIM and the number of removed observations (7).

Figure A1-8: Relationship between MIM and the number of removed observations (8).

Figure A1-9: Relationship between MIM and the number of removed observations (9).

Figure A1-10: Relationship between MIM and the number of removed observations (10).

Figure A1-11: Relationship between MIM and the number of removed observations (11).

Figure A1-12: Relationship between MIM and the number of removed observations (12).

Figure A2-1: Relationship between MASD and the number of removed observations (13).

Figure A2-2: Relationship between MASD and the number of removed observations (14).

Figure A2-3: Relationship between MASD and the number of removed observations (15).

Figure A2-4: Relationship between MASD and the number of removed observations (16).

Figure A2-5: Relationship between MASD and the number of removed observations (17).

Figure A2-6: Relationship between MASD and the number of removed observations (18).

Figure A2-7: Relationship between MIM and the number of removed observations (13).

Figure A2-8: Relationship between MIM and the number of removed observations (14).

Figure A2-9: Relationship between MIM and the number of removed observations (15).

Figure A2-10: Relationship between MIM and the number of removed observations (16).

Figure A2-11: Relationship between MIM and the number of removed observations (17).

Figure A2-12: Relationship between MIM and the number of removed observations (18).

Appendix B

 

Table B1:

Mean absolute error (MAE) of treatment effects from Setting 1.

ID Estimation Estimates PS calculation Matching Calipers/cut-points Mean number of matched units Mean abs. std. diff. (MASD) after adj. Mean abs. error (MAE) of treatment effects (Mean / Min / Max / 5th / 25th / 50th / 75th / 95th percentile)
1 OLS Diff. in means 0.5883 0.3704 0.2539 0.4909 0.3175 0.3482 0.3708 0.3921 0.4232
2 OLS Parametric 0.5883 0.0446 0.0000 0.1949 0.0037 0.0180 0.0379 0.0640 0.1106
3 EB Diff. in means 0.0000 0.1055 0.0000 0.5639 0.0082 0.0425 0.0879 0.1508 0.2634
4 EB Two-step 0.0000 0.1055 0.0000 0.5640 0.0082 0.0425 0.0879 0.1508 0.2634
5 IPW IPW estimates Without Z 0.4625 0.0323 0.0000 0.1607 0.0024 0.0127 0.0274 0.0468 0.0794
6 IPW Two-step Without Z 0.4625 0.0474 0.0000 0.2409 0.0037 0.0187 0.0407 0.0686 0.1156
7 IPW IPW estimates With Z 0.3382 0.0644 0.0000 0.4819 0.0050 0.0254 0.0541 0.0921 0.1606
8 IPW Two-step With Z 0.3382 0.0625 0.0000 0.4613 0.0051 0.0246 0.0523 0.0893 0.1553
9 PSM PSM estimates Without Z 1-to-1 No caliper 0.3619 0.0566 0.0000 0.2911 0.0045 0.0222 0.0468 0.0820 0.1410
10 PSM PSM estimates With Z 1-to-1 No caliper 0.1133 0.1503 0.0000 0.8497 0.0116 0.0575 0.1267 0.2148 0.3701
11 PSM PSM estimates Without Z 1-to-5 No caliper 0.3563 0.0422 0.0000 0.1802 0.0032 0.0167 0.0360 0.0608 0.1034
12 PSM PSM estimates With Z 1-to-5 No caliper 0.0818 0.1056 0.0000 0.4948 0.0081 0.0417 0.0894 0.1522 0.2591
13 PSM Two-step Without Z 1-to-1 0.05, 0.1, 0.15, 0.2, 0.25 1649 0.3776 0.0497 0.0000 0.2545 0.0039 0.0197 0.0421 0.0718 0.1219
14 PSM Two-step With Z 1-to-1 0.05, 0.1, 0.15, 0.2, 0.25 1067 0.0548 0.0490 0.0000 0.2558 0.0039 0.0196 0.0414 0.0706 0.1201
15 PSM Two-step Without Z 1-to-5 0.05, 0.1, 0.15, 0.2, 0.25 2293 0.5398 0.0450 0.0000 0.2501 0.0036 0.0182 0.0381 0.0650 0.1103
16 PSM Two-step With Z 1-to-5 0.05, 0.1, 0.15, 0.2, 0.25 1995 0.4086 0.0437 0.0000 0.2412 0.0035 0.0176 0.0368 0.0632 0.1076
17 MDM Diff. in means 1-to-1 0.3, 0.975, 1.65, 2.325, 3 2105 0.0612 0.0692 0.0000 0.4098 0.0055 0.0275 0.0586 0.0996 0.1693
18 MDM Two-step 1-to-1 0.3, 0.975, 1.65, 2.325, 3 2105 0.0612 0.0700 0.0000 0.4118 0.0055 0.0278 0.0588 0.1006 0.1727
19 CEM Diff. in means k-to-k 21–25 108 0.0248 0.1609 0.0000 0.9897 0.0124 0.0636 0.1354 0.2301 0.4008
20 CEM Two-step k-to-k 21–25 108 0.0248 0.1608 0.0000 0.9800 0.0124 0.0635 0.1351 0.2302 0.4005
21 CEM Diff. in means j-to-k 21–25 115 0.0511 0.1578 0.0000 0.8830 0.0125 0.0628 0.1317 0.2267 0.3904
22 CEM Two-step j-to-k 21–25 115 0.0511 0.1577 0.0000 0.9074 0.0124 0.0618 0.1318 0.2261 0.3902
23 CEM Diff. in means k-to-k Auto (13) 544 0.0373 0.0731 0.0000 0.3425 0.0059 0.0298 0.0624 0.1056 0.1780
24 CEM Two-step k-to-k Auto (13) 544 0.0373 0.0689 0.0000 0.3321 0.0056 0.0282 0.0588 0.0988 0.1681
25 CEM Diff. in means j-to-k Auto (13) 763 0.1638 0.1101 0.0000 0.3724 0.0137 0.0593 0.1064 0.1544 0.2238
26 CEM Two-step j-to-k Auto (13) 763 0.1638 0.0592 0.0000 0.2792 0.0043 0.0233 0.0505 0.0852 0.1457
27 CEM (weighting) Diff. in means k-to-k 21–25 0.3000 0.1609 0.0000 0.9897 0.0124 0.0636 0.1354 0.2301 0.4008
28 CEM (weighting) Two-step k-to-k 21–25 0.3000 0.1608 0.0000 0.9800 0.0124 0.0635 0.1351 0.2302 0.4005
29 CEM (weighting) Diff. in means j-to-k 21–25 0.2936 0.1626 0.0000 0.9147 0.0126 0.0640 0.1360 0.2341 0.4042
30 CEM (weighting) Two-step j-to-k 21–25 0.2936 0.1626 0.0000 0.9260 0.0125 0.0639 0.1359 0.2339 0.4056
31 CEM (weighting) Diff. in means k-to-k Auto (13) 0.3045 0.0731 0.0000 0.3425 0.0059 0.0298 0.0624 0.1056 0.1780
32 CEM (weighting) Two-step k-to-k Auto (13) 0.3045 0.0689 0.0000 0.3321 0.0056 0.0282 0.0588 0.0988 0.1681
33 CEM (weighting) Diff. in means j-to-k Auto (13) 0.2549 0.0729 0.0000 0.3223 0.0063 0.0295 0.0616 0.1051 0.1768
34 CEM (weighting) Two-step j-to-k Auto (13) 0.2549 0.0688 0.0000 0.3631 0.0055 0.0269 0.0586 0.0993 0.1689
Table B2:

Mean absolute error (MAE) of treatment effects from Setting 2.

ID Estimation Estimates PS calculation Matching Calipers/cut-points Mean number of matched units Mean abs. std. diff. (MASD) after adj. Mean abs. error (MAE) of treatment effects (Mean / Min / Max / 5th / 25th / 50th / 75th / 95th percentile)
1 OLS Diff. in means 0.5883 0.0264 0.0000 0.1235 0.0022 0.0107 0.0227 0.0381 0.0635
2 OLS Parametric 0.5883 0.0409 0.0000 0.1786 0.0034 0.0165 0.0347 0.0586 0.1013
3 EB Diff. in means 0.0000 0.0967 0.0000 0.5169 0.0075 0.0390 0.0806 0.1382 0.2414
4 EB Two-step 0.0000 0.0967 0.0000 0.5169 0.0075 0.0390 0.0806 0.1382 0.2415
5 IPW IPW estimates Without Z 0.4625 0.4692 0.3105 0.5951 0.4054 0.4436 0.4691 0.4951 0.5323
6 IPW Two-step Without Z 0.4625 0.0435 0.0000 0.2208 0.0034 0.0171 0.0373 0.0629 0.1060
7 IPW IPW estimates With Z 0.3382 0.0596 0.0000 0.4463 0.0047 0.0235 0.0501 0.0852 0.1486
8 IPW Two-step With Z 0.3382 0.0573 0.0000 0.4227 0.0046 0.0226 0.0479 0.0818 0.1424
9 PSM PSM estimates Without Z 1-to-1 No caliper 0.3619 0.4772 0.2186 0.7529 0.3593 0.4300 0.4778 0.5247 0.5929
10 PSM PSM estimates With Z 1-to-1 No caliper 0.1133 0.1571 0.0000 0.8569 0.0113 0.0634 0.1310 0.2243 0.3912
11 PSM PSM estimates Without Z 1-to-5 No caliper 0.3563 0.4807 0.3067 0.6590 0.3949 0.4468 0.4810 0.5154 0.5639
12 PSM PSM estimates With Z 1-to-5 No caliper 0.0818 0.1061 0.0000 0.4890 0.0086 0.0426 0.0897 0.1527 0.2613
13 PSM Two-step Without Z 1-to-1 0.05, 0.1, 0.15, 0.2, 0.25 1649 0.3776 0.0455 0.0000 0.2333 0.0036 0.0180 0.0386 0.0658 0.1117
14 PSM Two-step With Z 1-to-1 0.05, 0.1, 0.15, 0.2, 0.25 1067 0.0548 0.0449 0.0000 0.2345 0.0036 0.0180 0.0380 0.0647 0.1101
15 PSM Two-step Without Z 1-to-5 0.05, 0.1, 0.15, 0.2, 0.25 2293 0.5398 0.0413 0.0000 0.2292 0.0033 0.0167 0.0349 0.0596 0.1011
16 PSM Two-step With Z 1-to-5 0.05, 0.1, 0.15, 0.2, 0.25 1995 0.4086 0.0401 0.0000 0.2211 0.0032 0.0162 0.0337 0.0580 0.0986
17 MDM Diff. in means 1-to-1 0.3, 0.975, 1.65, 2.325, 3 2105 0.0612 0.0604 0.0000 0.3438 0.0047 0.0242 0.0508 0.0869 0.1488
18 MDM Two-step 1-to-1 0.3, 0.975, 1.65, 2.325, 3 2105 0.0612 0.0641 0.0000 0.3774 0.0050 0.0255 0.0539 0.0922 0.1583
19 CEM Diff. in means k-to-k 21–25 108 0.0248 0.1476 0.0000 0.9258 0.0114 0.0585 0.1241 0.2110 0.3676
20 CEM Two-step k-to-k 21–25 108 0.0248 0.1474 0.0000 0.8982 0.0113 0.0582 0.1238 0.2110 0.3671
21 CEM Diff. in means j-to-k 21–25 115 0.0511 0.1452 0.0000 0.8177 0.0109 0.0572 0.1214 0.2086 0.3591
22 CEM Two-step j-to-k 21–25 115 0.0511 0.1445 0.0000 0.8316 0.0114 0.0567 0.1208 0.2072 0.3576
23 CEM Diff. in means k-to-k Auto (13) 544 0.0373 0.0642 0.0000 0.2994 0.0053 0.0255 0.0546 0.0926 0.1583
24 CEM Two-step k-to-k Auto (13) 544 0.0373 0.0631 0.0000 0.3043 0.0052 0.0258 0.0539 0.0906 0.1540
25 CEM Diff. in means j-to-k Auto (13) 763 0.1638 0.0591 0.0000 0.2883 0.0048 0.0242 0.0503 0.0852 0.1438
26 CEM Two-step j-to-k Auto (13) 763 0.1638 0.0543 0.0000 0.2559 0.0040 0.0214 0.0463 0.0781 0.1335
27 CEM (weighting) Diff. in means k-to-k 21–25 0.3000 0.1476 0.0000 0.9258 0.0114 0.0585 0.1241 0.2110 0.3676
28 CEM (weighting) Two-step k-to-k 21–25 0.3000 0.1474 0.0000 0.8982 0.0113 0.0582 0.1238 0.2110 0.3671
29 CEM (weighting) Diff. in means j-to-k 21–25 0.2936 0.1492 0.0000 0.8505 0.0114 0.0586 0.1245 0.2153 0.3713
30 CEM (weighting) Two-step j-to-k 21–25 0.2936 0.1490 0.0000 0.8487 0.0115 0.0586 0.1246 0.2144 0.3717
31 CEM (weighting) Diff. in means k-to-k Auto (13) 0.3045 0.0642 0.0000 0.2994 0.0053 0.0255 0.0546 0.0926 0.1583
32 CEM (weighting) Two-step k-to-k Auto (13) 0.3045 0.0631 0.0000 0.3043 0.0052 0.0258 0.0539 0.0906 0.1540
33 CEM (weighting) Diff. in means j-to-k Auto (13) 0.2549 0.0640 0.0000 0.3053 0.0049 0.0257 0.0544 0.0920 0.1570
34 CEM (weighting) Two-step j-to-k Auto (13) 0.2549 0.0631 0.0000 0.3327 0.0050 0.0246 0.0537 0.0910 0.1548
Table B3:

Mean absolute error (MAE) of treatment effects from Setting 3.

ID Estimation Estimates PS calculation Matching Calipers/cut-points Mean number of matched units Mean abs. std. diff. (MASD) after adj. Mean abs. error (MAE) of treatment effects (Mean / Min / Max / 5th / 25th / 50th / 75th / 95th percentile)
1 OLS Diff. in means 0.5883 0.2705 0.1586 0.3832 0.2172 0.2487 0.2702 0.2923 0.3240
2 OLS Parametric 0.5883 0.1469 0.0000 0.3349 0.0623 0.1136 0.1475 0.1809 0.2276
3 EB Diff. in means 0.0000 0.2040 0.0002 0.7247 0.0287 0.1155 0.1979 0.2838 0.3990
4 EB Two-step 0.0000 0.2040 0.0002 0.7247 0.0287 0.1155 0.1979 0.2838 0.3990
5 IPW IPW estimates Without Z 0.4599 0.1202 0.0007 0.2751 0.0542 0.0936 0.1205 0.1470 0.1847
6 IPW Two-step Without Z 0.4599 0.1610 0.0008 0.3884 0.0702 0.1247 0.1614 0.1985 0.2504
7 IPW IPW estimates With Z 0.3373 0.2047 0.0002 0.5402 0.0736 0.1522 0.2060 0.2576 0.3313
8 IPW Two-step With Z 0.3373 0.2050 0.0003 0.4818 0.0785 0.1546 0.2059 0.2567 0.3262
9 PSM PSM estimates Without Z 1-to-1 No caliper 0.3600 0.1253 0.0000 0.3640 0.0217 0.0777 0.1238 0.1690 0.2349
10 PSM PSM estimates With Z 1-to-1 No caliper 0.1101 0.2327 0.0001 0.9237 0.0249 0.1166 0.2201 0.3334 0.4899
11 PSM PSM estimates Without Z 1-to-5 No caliper 0.3539 0.1253 0.0002 0.3166 0.0401 0.0905 0.1254 0.1596 0.2086
12 PSM PSM estimates With Z 1-to-5 No caliper 0.0813 0.1939 0.0001 0.6743 0.0224 0.1038 0.1875 0.2734 0.3945
13 PSM Two-step Without Z 1-to-1 0.05, 0.1, 0.15, 0.2, 0.25 1652 0.3740 0.1601 0.0000 0.4107 0.0625 0.1204 0.1603 0.1998 0.2566
14 PSM Two-step With Z 1-to-1 0.05, 0.1, 0.15, 0.2, 0.25 1079 0.0543 0.1890 0.0001 0.4237 0.0881 0.1486 0.1894 0.2298 0.2886
15 PSM Two-step Without Z 1-to-5 0.05, 0.1, 0.15, 0.2, 0.25 2293 0.5360 0.1516 0.0001 0.3772 0.0655 0.1171 0.1518 0.1866 0.2363
16 PSM Two-step With Z 1-to-5 0.05, 0.1, 0.15, 0.2, 0.25 2004 0.4082 0.1622 0.0004 0.3878 0.0748 0.1272 0.1624 0.1981 0.2482
17 MDM Diff. in means 1-to-1 0.3, 0.975, 1.65, 2.325, 3 2109 0.0608 0.1590 0.0000 0.5536 0.0299 0.1023 0.1577 0.2130 0.2910
18 MDM Two-step 1-to-1 0.3, 0.975, 1.65, 2.325, 3 2109 0.0608 0.1999 0.0000 0.6458 0.0588 0.1414 0.1997 0.2577 0.3404
19 CEM Diff. in means k-to-k 21–25 106 0.0246 0.2064 0.0000 1.1202 0.0178 0.0867 0.1812 0.2987 0.4877
20 CEM Two-step k-to-k 21–25 106 0.0246 0.2126 0.0000 1.1252 0.0178 0.0907 0.1879 0.3070 0.4970
21 CEM Diff. in means j-to-k 21–25 113 0.0519 0.1955 0.0000 1.0180 0.0165 0.0820 0.1689 0.2816 0.4668
22 CEM Two-step j-to-k 21–25 113 0.0519 0.2115 0.0000 1.0337 0.0183 0.0920 0.1873 0.3046 0.4907
23 CEM Diff. in means k-to-k Auto (13) 549 0.0361 0.1541 0.0000 0.4736 0.0266 0.0941 0.1514 0.2093 0.2934
24 CEM Two-step k-to-k Auto (13) 549 0.0361 0.1816 0.0002 0.5047 0.0449 0.1227 0.1804 0.2384 0.3216
25 CEM Diff. in means j-to-k Auto (13) 768 0.1660 0.0810 0.0000 0.3498 0.0069 0.0351 0.0721 0.1164 0.1860
26 CEM Two-step j-to-k Auto (13) 768 0.1660 0.1768 0.0000 0.4659 0.0565 0.1275 0.1765 0.2254 0.2968
27 CEM (weighting) Diff. in means k-to-k 21–25 0.3092 0.2064 0.0000 1.1202 0.0178 0.0867 0.1812 0.2987 0.4877
28 CEM (weighting) Two-step k-to-k 21–25 0.3092 0.2126 0.0000 1.1252 0.0178 0.0907 0.1879 0.3070 0.4970
29 CEM (weighting) Diff. in means j-to-k 21–25 0.3006 0.2105 0.0000 1.1524 0.0179 0.0890 0.1837 0.3034 0.4981
30 CEM (weighting) Two-step j-to-k 21–25 0.3006 0.2166 0.0000 1.1637 0.0181 0.0927 0.1912 0.3120 0.5084
31 CEM (weighting) Diff. in means k-to-k Auto (13) 0.3098 0.1541 0.0000 0.4736 0.0266 0.0941 0.1514 0.2093 0.2934
32 CEM (weighting) Two-step k-to-k Auto (13) 0.3098 0.1816 0.0002 0.5047 0.0449 0.1227 0.1804 0.2384 0.3216
33 CEM (weighting) Diff. in means j-to-k Auto (13) 0.2592 0.1554 0.0000 0.4834 0.0255 0.0948 0.1533 0.2113 0.2933
34 CEM (weighting) Two-step j-to-k Auto (13) 0.2592 0.1811 0.0000 0.5244 0.0424 0.1213 0.1799 0.2386 0.3207
Appendix C

 

Figure C1-1: Setting 1 – ID 1 versus ID 3.

Figure C1-2: Setting 1 – ID 2 versus ID 4.

Figure C1-3: Setting 1 – ID 1 versus ID 5.

Figure C1-4: Setting 1 – ID 2 versus ID 6.

Figure C1-5: Setting 1 – ID 1 versus ID 7.

Figure C1-6: Setting 1 – ID 2 versus ID 8.

Figure C1-7: Setting 1 – ID 1 versus ID 9.

Figure C1-8: Setting 1 – ID 1 versus ID 10.

Figure C1-9: Setting 1 – ID 1 versus ID 11.

Figure C1-10: Setting 1 – ID 1 versus ID 12.

Figure C1-11: Setting 1 – ID 2 versus ID 13.

Figure C1-12: Setting 1 – ID 2 versus ID 14.

Figure C1-13: Setting 1 – ID 2 versus ID 15.

Figure C1-14: Setting 1 – ID 2 versus ID 16.

Figure C1-15: Setting 1 – ID 1 versus ID 17.

Figure C1-16: Setting 1 – ID 2 versus ID 18.

Figure C1-17: Setting 1 – ID 1 versus ID 19.

Figure C1-18: Setting 1 – ID 2 versus ID 20.

Figure C1-19: Setting 1 – ID 1 versus ID 21.

Figure C1-20: Setting 1 – ID 2 versus ID 22.

Figure C1-21: Setting 1 – ID 1 versus ID 23.

Figure C1-22: Setting 1 – ID 2 versus ID 24.

Figure C1-23: Setting 1 – ID 1 versus ID 25.

Figure C1-24: Setting 1 – ID 2 versus ID 26.

Figure C1-25: Setting 1 – ID 1 versus ID 27.

Figure C1-26: Setting 1 – ID 2 versus ID 28.

Figure C1-27: Setting 1 – ID 1 versus ID 29.

Figure C1-28: Setting 1 – ID 2 versus ID 30.

Figure C1-29: Setting 1 – ID 1 versus ID 31.

Figure C1-30: Setting 1 – ID 2 versus ID 32.

Figure C1-31: Setting 1 – ID 1 versus ID 33.

Figure C1-32: Setting 1 – ID 2 versus ID 34.

Figure C2-1: Setting 2 – ID 1 versus ID 3.

Figure C2-2: Setting 2 – ID 2 versus ID 4.

Figure C2-3: Setting 2 – ID 1 versus ID 5.

Figure C2-4: Setting 2 – ID 2 versus ID 6.

Figure C2-5: Setting 2 – ID 1 versus ID 7.

Figure C2-6: Setting 2 – ID 2 versus ID 8.

Figure C2-7: Setting 2 – ID 1 versus ID 9.

Figure C2-8: Setting 2 – ID 1 versus ID 10.

Figure C2-9: Setting 2 – ID 1 versus ID 11.

Figure C2-10: Setting 2 – ID 1 versus ID 12.

Figure C2-11: Setting 2 – ID 2 versus ID 13.

Figure C2-12: Setting 2 – ID 2 versus ID 14.

Figure C2-13: Setting 2 – ID 2 versus ID 15.

Figure C2-14: Setting 2 – ID 2 versus ID 16.

Figure C2-15: Setting 2 – ID 1 versus ID 17.

Figure C2-16: Setting 2 – ID 2 versus ID 18.

Figure C2-17: Setting 2 – ID 1 versus ID 19.

Figure C2-18: Setting 2 – ID 2 versus ID 20.

Figure C2-19: Setting 2 – ID 1 versus ID 21.

Figure C2-20: Setting 2 – ID 2 versus ID 22.

Figure C2-21: Setting 2 – ID 1 versus ID 23.

Figure C2-22: Setting 2 – ID 2 versus ID 24.

Figure C2-23: Setting 2 – ID 1 versus ID 25.

Figure C2-24: Setting 2 – ID 2 versus ID 26.

Figure C2-25: Setting 2 – ID 1 versus ID 27.

Figure C2-26: Setting 2 – ID 2 versus ID 28.

Figure C2-27: Setting 2 – ID 1 versus ID 29.

Figure C2-28: Setting 2 – ID 2 versus ID 30.

Figure C2-29: Setting 2 – ID 1 versus ID 31.

Figure C2-30: Setting 2 – ID 2 versus ID 32.

Figure C2-31: Setting 2 – ID 1 versus ID 33.

Figure C2-32: Setting 2 – ID 2 versus ID 34.

Figure C3-1: Setting 3 – ID 1 versus ID 3.

Figure C3-2: Setting 3 – ID 2 versus ID 4.

Figure C3-3: Setting 3 – ID 1 versus ID 5.

Figure C3-4: Setting 3 – ID 2 versus ID 6.

Figure C3-5: Setting 3 – ID 1 versus ID 7.

Figure C3-6: Setting 3 – ID 2 versus ID 8.

Figure C3-7: Setting 3 – ID 1 versus ID 9.

Figure C3-8: Setting 3 – ID 1 versus ID 10.

Figure C3-9: Setting 3 – ID 1 versus ID 11.

Figure C3-10: Setting 3 – ID 1 versus ID 12.

Figure C3-11: Setting 3 – ID 2 versus ID 13.

Figure C3-12: Setting 3 – ID 2 versus ID 14.

Figure C3-13: Setting 3 – ID 2 versus ID 15.

Figure C3-14: Setting 3 – ID 2 versus ID 16.

Figure C3-15: Setting 3 – ID 1 versus ID 17.

Figure C3-16: Setting 3 – ID 2 versus ID 18.

Figure C3-17: 
Setting 3 – ID 1 versus ID 19.
Figure C3-17:

Setting 3 – ID 1 versus ID 19.

Figure C3-18: 
Setting 3 – ID 2 versus ID 20.
Figure C3-18:

Setting 3 – ID 2 versus ID 20.

Figure C3-19: 
Setting 3 – ID 1 versus ID 21.
Figure C3-19:

Setting 3 – ID 1 versus ID 21.

Figure C3-20: 
Setting 3 – ID 2 versus ID 22.
Figure C3-20:

Setting 3 – ID 2 versus ID 22.

Figure C3-21: 
Setting 3 – ID 1 versus ID 23.
Figure C3-21:

Setting 3 – ID 1 versus ID 23.

Figure C3-22: 
Setting 3 – ID 2 versus ID 24.
Figure C3-22:

Setting 3 – ID 2 versus ID 24.

Figure C3-23: 
Setting 3 – ID 1 versus ID 25.
Figure C3-23:

Setting 3 – ID 1 versus ID 25.

Figure C3-24: 
Setting 3 – ID 2 versus ID 26.
Figure C3-24:

Setting 3 – ID 2 versus ID 26.

Figure C3-25: 
Setting 3 – ID 1 versus ID 27.
Figure C3-25:

Setting 3 – ID 1 versus ID 27.

Figure C3-26: 
Setting 3 – ID 2 versus ID 28.
Figure C3-26:

Setting 3 – ID 2 versus ID 28.

Figure C3-27: 
Setting 3 – ID 1 versus ID 29.
Figure C3-27:

Setting 3 – ID 1 versus ID 29.

Figure C3-28: 
Setting 3 – ID 2 versus ID 30.
Figure C3-28:

Setting 3 – ID 2 versus ID 30.

Figure C3-29: 
Setting 3 – ID 1 versus ID 31.
Figure C3-29:

Setting 3 – ID 1 versus ID 31.

Figure C3-30: 
Setting 3 – ID 2 versus ID 32.
Figure C3-30:

Setting 3 – ID 2 versus ID 32.

Figure C3-31: 
Setting 3 – ID 1 versus ID 33.
Figure C3-31:

Setting 3 – ID 1 versus ID 33.

Figure C3-32: 
Setting 3 – ID 2 versus ID 34.
Figure C3-32:

Setting 3 – ID 2 versus ID 34.

References

Abadie, A., and G. W. Imbens. 2006. “Large Sample Properties of Matching Estimators for Average Treatment Effects.” Econometrica 74 (1): 235–67. https://doi.org/10.1111/j.1468-0262.2006.00655.x.

Angrist, J. D., and J.-S. Pischke. 2009. Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton and Oxford: Princeton University Press. https://doi.org/10.1515/9781400829828.

Arceneaux, K., A. S. Gerber, and D. P. Green. 2006. “Comparing Experimental and Matching Methods Using a Large-Scale Voter Mobilization Experiment.” Political Analysis 14 (1): 37–62. https://doi.org/10.1093/pan/mpj001.

Arceneaux, K., A. S. Gerber, and D. P. Green. 2010. “A Cautionary Note on the Use of Matching to Estimate Causal Effects: An Empirical Example Comparing Matching Estimates to an Experimental Benchmark.” Sociological Methods & Research 39 (2): 256–82. https://doi.org/10.1177/0049124110378098.

Austin, P. C. 2009. “Using the Standardized Difference to Compare the Prevalence of a Binary Variable Between Two Groups in Observational Research.” Communications in Statistics – Simulation and Computation 38 (6): 1228–34. https://doi.org/10.1080/03610910902859574.

Banerjee, A., and E. Duflo. 2011. Poor Economics. New York: PublicAffairs.

Bang, H., and J. M. Robins. 2005. “Doubly Robust Estimation in Missing Data and Causal Inference Models.” Biometrics 61 (4): 962–73. https://doi.org/10.1111/j.1541-0420.2005.00377.x.

Bhattacharya, J., and W. B. Vogt. 2007. Do Instrumental Variables Belong in Propensity Scores? NBER Technical Working Paper 343. Cambridge: National Bureau of Economic Research (NBER). https://doi.org/10.3386/t0343.

Blackwell, M., S. Iacus, G. King, and G. Porro. 2009. “cem: Coarsened Exact Matching in Stata.” The Stata Journal 9 (4): 524–46. https://doi.org/10.1177/1536867x0900900402.

Busso, M., J. DiNardo, and J. McCrary. 2014. “New Evidence on the Finite Sample Properties of Propensity Score Reweighting and Matching Estimators.” The Review of Economics and Statistics 96 (5): 885–97. https://doi.org/10.1162/rest_a_00431.

Caliendo, M., and S. Kopeinig. 2008. “Some Practical Guidance for the Implementation of Propensity Score Matching.” Journal of Economic Surveys 22 (1): 31–72. https://doi.org/10.1111/j.1467-6419.2007.00527.x.

Caliendo, M., R. Mahlstedt, and O. A. Mitnik. 2017. “Unobservable, but Unimportant? The Relevance of Usually Unobserved Variables for the Evaluation of Labor Market Policies.” Labour Economics 46: 14–25. https://doi.org/10.1016/j.labeco.2017.02.001.

Calónico, S., and J. Smith. 2017. “The Women of the National Supported Work Demonstration.” Journal of Labor Economics 35 (S1): S65–97. https://doi.org/10.1086/692397.

Cole, S. R., and M. A. Hernán. 2008. “Constructing Inverse Probability Weights for Marginal Structural Models.” American Journal of Epidemiology 168 (6): 656–64. https://doi.org/10.1093/aje/kwn164.

Dehejia, R. H., and S. Wahba. 1999. “Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs.” Journal of the American Statistical Association 94 (448): 1053–62. https://doi.org/10.1080/01621459.1999.10473858.

Dehejia, R. H., and S. Wahba. 2002. “Propensity Score-Matching Methods for Nonexperimental Causal Studies.” The Review of Economics and Statistics 84 (1): 151–61. https://doi.org/10.1162/003465302317331982.

Frölich, M. 2004. “Finite-Sample Properties of Propensity-Score Matching and Weighting Estimators.” The Review of Economics and Statistics 86 (1): 77–90. https://doi.org/10.1162/003465304323023697.

Gneezy, U., and J. List. 2014. The Why Axis: Hidden Motives and the Undiscovered Economics of Everyday Life. London: Random House.

Groenwold, R. H. H., and O. H. Klungel. 2015. “Unobserved Confounding in Propensity Score Analysis.” In Propensity Score Analysis: Fundamentals and Developments, edited by W. Pan, and H. Bai, 296–319. New York: Guilford.

Guo, S., and M. W. Fraser. 2014. Propensity Score Analysis: Statistical Methods and Applications, 2nd ed. Los Angeles: SAGE Publications.

Guo, S., M. W. Fraser, and Q. Chen. 2020. “Propensity Score Analysis: Recent Debate and Discussion.” Journal of the Society for Social Work and Research 11 (3): 463–82. https://doi.org/10.1086/711393.

Hainmueller, J. 2012. “Entropy Balancing for Causal Effects: A Multivariate Reweighting Method to Produce Balanced Samples in Observational Studies.” Political Analysis 20 (1): 25–46. https://doi.org/10.1093/pan/mpr025.

Hainmueller, J., and Y. Xu. 2013. “ebalance: A Stata Package for Entropy Balancing.” Journal of Statistical Software 54 (7): 1–18. https://doi.org/10.18637/jss.v054.i07.

Heckman, J. J., H. Ichimura, J. Smith, and P. Todd. 1996. “Sources of Selection Bias in Evaluating Social Programs: An Interpretation of Conventional Measures and Evidence on the Effectiveness of Matching as a Program Evaluation Method.” Proceedings of the National Academy of Sciences 93 (23): 13416–20. https://doi.org/10.1073/pnas.93.23.13416.

Heckman, J. J., H. Ichimura, and P. E. Todd. 1997. “Matching as an Econometric Evaluation Estimator: Evidence From Evaluating a Job Training Programme.” The Review of Economic Studies 64 (4): 605–54. https://doi.org/10.2307/2971733.

Heckman, J. J., H. Ichimura, and P. Todd. 1998. “Matching as an Econometric Evaluation Estimator.” The Review of Economic Studies 65 (2): 261–94. https://doi.org/10.1111/1467-937x.00044.

Heckman, J. J., H. Ichimura, J. A. Smith, and P. E. Todd. 1998. Characterizing Selection Bias Using Experimental Data. NBER Working Paper 6699. Cambridge: National Bureau of Economic Research (NBER). https://doi.org/10.3386/w6699.

Heckman, J. J., and R. Robb Jr. 1985. “Alternative Methods for Evaluating the Impact of Interventions.” In Longitudinal Analysis of Labor Market Data, Vol. 10, edited by J. Heckman, and B. S. Singer, 156–245. New York: Cambridge University Press. https://doi.org/10.1017/CCOL0521304539.004.

Heckman, J. J., and R. Robb Jr. 1986. “Alternative Methods for Solving the Problem of Selection Bias in Evaluating the Impact of Treatments on Outcomes.” In Drawing Inferences From Self-Selected Samples, edited by H. Wainer, 63–107. New York: Springer-Verlag. https://doi.org/10.1007/978-1-4612-4976-4_7.

Hirano, K., G. W. Imbens, and G. Ridder. 2003. “Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score.” Econometrica 71 (4): 1161–89. https://doi.org/10.1111/1468-0262.00442.

Ho, D. E., K. Imai, G. King, and E. A. Stuart. 2007. “Matching as Nonparametric Preprocessing for Reducing Model Dependence in Parametric Causal Inference.” Political Analysis 15 (3): 199–236. https://doi.org/10.1093/pan/mpl013.

Holland, P. W. 1986. “Statistics and Causal Inference.” Journal of the American Statistical Association 81 (396): 945–60. https://doi.org/10.1080/01621459.1986.10478354.

Huber, M., M. Lechner, and C. Wunsch. 2013. “The Performance of Estimators Based on the Propensity Score.” Journal of Econometrics 175 (1): 1–21. https://doi.org/10.1016/j.jeconom.2012.11.006.

Iacus, S. M., G. King, and G. Porro. 2011. “Multivariate Matching Methods That Are Monotonic Imbalance Bounding.” Journal of the American Statistical Association 106 (493): 345–61. https://doi.org/10.1198/jasa.2011.tm09599.

Iacus, S. M., G. King, and G. Porro. 2012. “Causal Inference Without Balance Checking: Coarsened Exact Matching.” Political Analysis 20 (1): 1–24. https://doi.org/10.1093/pan/mpr013.

Imbens, G. W., and D. B. Rubin. 2015. Causal Inference in Statistics, Social, and Biomedical Sciences. New York: Cambridge University Press. https://doi.org/10.1017/CBO9781139025751.

King, G., and R. Nielsen. 2019. “Why Propensity Scores Should Not Be Used for Matching.” Political Analysis 27 (4): 435–54. https://doi.org/10.1017/pan.2019.11.

Lee, W. S. 2013. “Propensity Score Matching and Variations on the Balancing Test.” Empirical Economics 44: 47–80. https://doi.org/10.1007/s00181-011-0481-0.

Lehrer, S. F., and G. Kordas. 2013. “Matching Using Semiparametric Propensity Scores.” Empirical Economics 44: 13–45. https://doi.org/10.1007/s00181-012-0591-3.

Leigh, A. 2018. Randomistas. New Haven: Yale University Press.

Ripollone, J. E., K. F. Huybrechts, K. J. Rothman, R. E. Ferguson, and J. M. Franklin. 2018. “Implications of the Propensity Score Matching Paradox in Pharmacoepidemiology.” American Journal of Epidemiology 187 (9): 1951–61. https://doi.org/10.1093/aje/kwy078.

Rosenbaum, P. R. 2010. Design of Observational Studies. New York: Springer. https://doi.org/10.1007/978-1-4419-1213-8.

Rosenbaum, P. R., and D. B. Rubin. 1983. “The Central Role of the Propensity Score in Observational Studies for Causal Effects.” Biometrika 70 (1): 41–55. https://doi.org/10.1093/biomet/70.1.41.

Smith, J. 2000. “A Critical Survey of Empirical Methods for Evaluating Active Labor Market Policies.” Swiss Journal of Economics and Statistics 136 (3): 247–67.

Smith, J. A., and P. E. Todd. 2005. “Does Matching Overcome LaLonde’s Critique of Nonexperimental Estimators?” Journal of Econometrics 125 (1–2): 305–53. https://doi.org/10.1016/j.jeconom.2004.04.011.

Stuart, E. A. 2010. “Matching Methods for Causal Inference: A Review and a Look Forward.” Statistical Science 25 (1): 1–21. https://doi.org/10.1214/09-sts313.

Todd, P. E. 2009. “Matching Estimators.” In Microeconometrics, edited by S. N. Durlauf, and L. E. Blume, 108–21. New York: Palgrave Macmillan. https://doi.org/10.1057/9780230280816_15.

Wooldridge, J. M. 2016. “Should Instrumental Variables Be Used as Matching Variables?” Research in Economics 70 (2): 232–7. https://doi.org/10.1016/j.rie.2016.01.001.

Received: 2022-10-25
Accepted: 2023-06-30
Published Online: 2023-07-31
Published in Print: 2023-06-27

© 2023 the author(s), published by De Gruyter, Berlin/Boston

This work is licensed under the Creative Commons Attribution 4.0 International License.
