Abstract
Combining observational and experimental data for causal inference can improve treatment effect estimation. However, many observational datasets cannot be released due to data privacy considerations, so one researcher may not have access to both experimental and observational data. Nonetheless, a small amount of risk of disclosing sensitive information might be tolerable to organizations that house confidential data. In these cases, organizations can employ data privacy techniques, which decrease disclosure risk, potentially at the expense of data utility. In this study, we explore disclosure limiting transformations of observational data, which can be combined with experimental data to estimate the sample and population average treatment effects. We consider leveraging observational data to improve generalizability of treatment effect estimates, when a randomized controlled trial (RCT) is not representative of the population of interest, and to increase precision of treatment effect estimates. Through simulation studies, we illustrate the trade-off between privacy and utility when employing different disclosure limiting transformations. We find that leveraging transformed observational data in treatment effect estimation can still improve estimation over only using data from an RCT.
1 Introduction
A growing literature is developing ways to combine experimental and observational studies for causal inference [1]. While treatment effect estimates from randomized controlled trials (RCTs) can be free from confounding bias, observational studies generally provide a richer source of information on a population of interest. The internet has given rise to more observational datasets that may be used to address the same questions as randomized experiments. Therefore, there should be more opportunities to leverage observational and RCT data together for improved treatment effect estimation. However, in practice, a researcher does not always have access to both sources of data, due to data privacy [2].
Many government agencies have useful data that they cannot release to the public in order to preserve data privacy. Data privacy refers to the right of individuals whom the data describe to control what information about themselves is shared [3,4]. Typically, sensitive data that are released to the public are sanitized in various ways, which potentially render the released data less useful. For example, only aggregate statistics or a sample of the data may be released, or small values are censored. There is a trade-off to consider, between data privacy, a right to be upheld, and the amount of information researchers can access to answer societal questions.
Balancing data privacy and releasing useful information is ultimately a policy decision. There are currently no overarching policies in the US regulating data privacy or confidentiality [5,6]. Rather, data privacy is legislated on a sector-by-sector basis [5]. For example, patient, student, financial, and additional data from US governmental agencies are protected by separate acts: the Health Insurance Portability and Accountability Act; the Family Educational Rights and Privacy Act; the Fair Credit Reporting Act; and the Confidential Information Protection and Statistical Efficiency Act [5,7]. On the other hand, members of the European Union and other countries throughout the world have adopted data privacy legislation with broad scopes across sectors [5]. We do not address the legality of different approaches to data privacy. Rather, we raise these issues to establish that data stewards operate in different legal contexts, which may or may not provide specific procedures for protecting data privacy.
We consider the setting in which analysts of an RCT could potentially use auxiliary observational data to improve treatment effect estimation through data integration. However, the relevant auxiliary data cannot be released in its raw form. We aim to address the primary research question: Can privacy-preserving releases of confidential observational data be used to improve causal estimation when integrated with the RCT data?
There are two broad ways that integrating observational data into experimental treatment effect estimates can improve causal estimation: (1) estimating treatment effects for a population of interest when the RCT sample is not representative of that population and (2) increasing precision of RCT estimates. We consider two previously developed data integration methods for causal inference, each of which addresses one of these aims. These methods are well-suited for our investigation because they only require summaries of the auxiliary observational data. Then, we consider ways that stewards of such observational data could transform and release the data to preserve data privacy, as discussed in the data privacy literature. We focus on transformations that can still be used in data integration methods for treatment effect estimation. Finally, through simulation studies, we illustrate the trade-off between privacy and utility when integrating different releases of the private, observational data in RCT treatment effect estimation, both to generalize to populations of interest and to improve experimental precision.
This study differs from previous work on releasing privacy-protected causal estimates [8–10] in two ways. First, we do not consider releasing a private causal estimate, but rather how a private data release itself could be used in causal estimation with data integration. Second, the studies in [8–10] rely on a causal framework different from the one used here.
We aim for this work to inform data stewards who want to release data that are as useful as possible while balancing data privacy. We also aim to encourage conversation between the literature that combines experimental and observational data for causal inference and the data privacy literature.
This work is organized as follows. Section 2 establishes notation and the causal estimands and estimators of interest. Section 3 provides background on data privacy techniques and presents disclosure limiting transformations of auxiliary data. Section 4 describes simulation studies that evaluate the utility of the proposed transformations when transformed auxiliary data are integrated with RCT data for treatment effect estimation, and discusses the results. Section 5 concludes with a discussion.
2 Leveraging observational and experimental data in causal inference
The presentation in this section primarily follows the research by Colnet et al. [1]. Consider a randomized experiment with
Following Neyman [11] and Rubin [12], each subject has two potential outcomes, one which would be observed if treated,
Researchers may be interested in a number of causal estimands. This work explores integrating experimental and observational data, under data privacy considerations, to estimate the population average treatment effect (PATE), as well as the RCT sample average treatment effect (SATE). The PATE is the expected treatment effect for a population of interest:
2.1 Generalizing to populations of interest
Treatment effects estimated only with an RCT do not necessarily generalize to a population of interest. Consider estimating the PATE for the population from which subjects of an RCT were sampled. If the subjects in the RCT are systematically different from this population (i.e., there is some selection bias to inclusion in the RCT), and the treatment effect depends on subject characteristics, then an estimate of the PATE with only the RCT data is biased. Further, it will be unclear how “far off” the estimate is. In this case, we say that the RCT estimate does not generalize to the population. Observational data are often more representative of the population from which they were sampled. Therefore, to estimate the PATE, it is useful to leverage information from auxiliary observational data.
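The selection-bias argument above can be illustrated with a small simulation. The sketch below is not the simulation design used in this study: the population, the effect function `effect`, and the enrollment rule `enroll_prob` are all hypothetical choices made only to show how an RCT-only difference in means can track the SATE while missing the PATE.

```python
import random

random.seed(0)

# Hypothetical population: one covariate X ~ N(0, 1) that modifies the effect.
N = 200_000
population = [random.gauss(0.0, 1.0) for _ in range(N)]


def effect(x):
    """Assumed heterogeneous treatment effect; its population mean is 1."""
    return 1.0 + x


def enroll_prob(x):
    """Assumed selection rule: subjects with X >= 0 enroll more often."""
    return 0.05 if x < 0 else 0.25


pate = sum(effect(x) for x in population) / N

# Select the RCT sample, then randomize treatment within it.
rct = [x for x in population if random.random() < enroll_prob(x)]
treated, control = [], []
for x in rct:
    y0 = x + random.gauss(0.0, 1.0)      # control potential outcome
    if random.random() < 0.5:
        treated.append(y0 + effect(x))   # treated potential outcome
    else:
        control.append(y0)

sate = sum(effect(x) for x in rct) / len(rct)
diff_in_means = sum(treated) / len(treated) - sum(control) / len(control)

print(f"PATE {pate:.2f}  SATE {sate:.2f}  RCT estimate {diff_in_means:.2f}")
```

Under these assumed settings the RCT estimate sits close to the SATE but well above the PATE, because enrolled subjects have systematically larger values of the effect modifier.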
The literature proposes a number of estimators for the PATE, which integrate experimental and observational data [14–18]. Refer to the study by Colnet et al. [1] for a full review. We focus on the calibration weighted (CW) estimator proposed by Lee et al. [18], for two primary reasons. First, a statistical summary of auxiliary data is sufficient for the CW estimator. As will be discussed in detail in Section 3, statistical summaries of data already do some of the work to limit disclosure of confidential data. Thus, the CW estimator is better suited to take private releases of auxiliary data as an input than other estimators, which rely on the full auxiliary data at the subject level. Second, the results of the review in the study by Colnet et al. [1] indicate that the CW estimator outperforms other options across different settings.
The CW estimator [18] is a variation of the inverse probability weighted (IPW) estimator [19,20],
The IPW estimator only uses data from the RCT (
The CW estimator is defined as
Each subject
Equation (1) is the key restriction on
2.2 Improving experimental precision
We next consider a context in which the treatment effect for only an RCT sample is of interest. Even though there may no longer be a larger population of interest, incorporating information from observational data can still prove useful, in this case by improving the precision of an estimate of the SATE.
RCTs often have small sample sizes, so RCT estimates of the SATE may lack precision. A common approach to improve precision is covariate adjustment. Covariate adjustment accounts for variance in
Because observational studies typically have much larger sample sizes than randomized experiments (
The first step of the super-covariate data integration approach is to fit a model of the outcome of interest (Y) with the auxiliary data only. The only goal of this model is to predict the outcome of interest as well as possible in the RCT sample. The only requirements for the auxiliary prediction model are that (1) it uses only covariates that would be considered pre-treatment or baseline in the RCT and (2) predictions can be generated for the RCT sample (i.e., the same baseline covariates need to be available in the RCT and auxiliary data). Let
Especially when there are a large number of covariates, a model fit on (large) auxiliary data can be more informative of the outcome of interest than a model fit on (small) RCT data. For this reason, we expect
We focus on this data integration approach for the SATE because it is highly flexible, relies on few assumptions, and is highly efficacious. All that is required from the auxiliary data is a model of the outcome of interest. Thus, as with the CW estimator, only a statistical summary of the auxiliary data is required. Additionally, this method relies on very few assumptions with regards to the auxiliary and RCT data. The auxiliary model need not be “correct” in any sense. The only requirement for the approach to improve precision is that the auxiliary model predicts the outcomes in the RCT setting well. Finally, there is a body of previous literature supporting the efficacy of this approach in general [25,27–29], so we are interested in whether that efficacy can be maintained even if using privacy-preserving transformations of auxiliary data, rather than the original data itself.
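The two steps of the super-covariate approach can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: the linear outcome model `BETA`, the constant effect `TAU`, the sample sizes, and the helper names are hypothetical, and the adjusted estimator is a regression of the outcome on treatment and the auxiliary prediction without interactions, which may differ from the exact specification used here.

```python
import random
import statistics

random.seed(1)
BETA = [1.0, -2.0, 0.5]   # assumed coefficients of a linear outcome model
TAU = 0.5                 # assumed constant treatment effect in the RCT


def draw(n):
    """Simulate covariates and control outcomes from the assumed model."""
    X = [[random.gauss(0, 1) for _ in BETA] for _ in range(n)]
    y = [sum(b * v for b, v in zip(BETA, r)) + random.gauss(0, 1) for r in X]
    return X, y


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z


def ols(X, y):
    """OLS with an intercept, computed from the normal equations."""
    Z = [[1.0] + row for row in X]
    p = len(Z[0])
    G = [[sum(r[i] * r[j] for r in Z) for j in range(p)] for i in range(p)]
    g = [sum(r[i] * yi for r, yi in zip(Z, y)) for i in range(p)]
    return solve(G, g)


def diff_in_means(y, t):
    y1 = [v for v, ti in zip(y, t) if ti]
    y0 = [v for v, ti in zip(y, t) if not ti]
    return statistics.fmean(y1) - statistics.fmean(y0)


def adjusted(y, t, x):
    """Coefficient on t from OLS of y on (1, t, x), via the pooled within-arm slope."""
    sxy = sxx = 0.0
    for arm in (0, 1):
        xs = [a for a, ti in zip(x, t) if ti == arm]
        ys = [b for b, ti in zip(y, t) if ti == arm]
        mx, my = statistics.fmean(xs), statistics.fmean(ys)
        sxy += sum((a - mx) * (b - my) for a, b in zip(xs, ys))
        sxx += sum((a - mx) ** 2 for a in xs)
    return diff_in_means(y, t) - (sxy / sxx) * diff_in_means(x, t)


# Step 1: fit the outcome model on the (large) auxiliary data only.
X_aux, y_aux = draw(5000)
coef = ols(X_aux, y_aux)

# Step 2: predict outcomes for the (small) RCT and use them as one covariate.
X_rct, y0_rct = draw(60)
y_hat = [coef[0] + sum(c * v for c, v in zip(coef[1:], r)) for r in X_rct]

# Compare estimators over repeated randomizations with outcomes held fixed.
plain, super_cov = [], []
assign = [1] * 30 + [0] * 30
for _ in range(1000):
    random.shuffle(assign)
    y_obs = [y0 + TAU * t for y0, t in zip(y0_rct, assign)]
    plain.append(diff_in_means(y_obs, assign))
    super_cov.append(adjusted(y_obs, assign, y_hat))

var_plain = statistics.variance(plain)
var_super = statistics.variance(super_cov)
print(f"variance: difference in means {var_plain:.3f}, super-covariate {var_super:.3f}")
```

Only `coef`, a summary of the auxiliary data, crosses from the auxiliary side to the RCT side in this sketch, which is the property that makes the approach compatible with private data releases.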
3 Disclosure limiting transformations of auxiliary data
In Section 2, we discussed two estimators that combine experimental and observational data. In practice, one entity may not have access to both types of data. We assume that analysts with access to a randomized experiment are interested in incorporating information from a relevant auxiliary observational study. However, the data stewards of the auxiliary study cannot release the data (
3.1 Data privacy overview
With the rise of the internet and an explosion of data availability, computer scientists and statisticians have been considering issues of data privacy for the past four decades [see refs 30–35 for reviews]. Two distinct but related concepts are data privacy and data confidentiality. Data privacy is defined as the right of individuals to control information about themselves [3]. Data confidentiality is the agreement between individuals and data stewards regarding the extent to which others can access any private or sensitive information provided [3,4]. Disclosure risk refers to the risk that an attacker could access sensitive information from the released data. Historically, the focus was on microdata, that is, data that contain information at the individual level, including information that could be used to identify an individual; however, more recent work considers the risk of disclosure for statistical summaries [35]. The goal of data privacy methods is to reduce disclosure risk.
We can view data privacy techniques as falling into two primary frameworks: statistical disclosure control (SDC) and differential privacy (DP) (refer to [4,35] for detailed discussions). SDC aims to uphold data privacy by maintaining data confidentiality. SDC techniques include, for example, synthetic data, cell suppression, data swapping, matrix masking, and noise addition to limit the risk of disclosing individual identities and attributes [33]. There is no single measure of disclosure risk in the SDC framework. On the other hand, the DP framework provides a mathematical definition for disclosure risk, although for a specific type of risk (discussed in Section 3.2). We consider techniques under both frameworks in this study, and use the term disclosure limitation to mean limiting the risk of disclosing sensitive information, not specifically referring to SDC.
With any disclosure limiting procedure, there is a trade-off between privacy and utility [33,35,36]. To maximize utility, analysts would want the original data, and thus, there would be no data privacy. On the other hand, for there to be no disclosure risk, the data could not be released, so there would be no utility. Data stewards therefore must decide on a tolerable disclosure risk and a measure of data utility in order to balance the two competing forces. This is not a trivial task. As discussed in Section 1, depending on the context, there are not necessarily formal guidelines for determining a tolerable disclosure risk. Additionally, anticipating all of the desired uses of a dataset is not feasible. In this study, we focus on specific uses for the data releases, which are associated with specific metrics to assess utility. We do not specify a tolerable disclosure risk. Rather, the simulation studies in Section 4 explore the privacy-utility trade-off when employing different disclosure limiting transformations.
3.2 DP
Dwork et al. [37] and Dwork [38] introduced DP in the early 2000s. DP is considered the first rigorous mathematical quantification of disclosure risk for privacy-preserving algorithms. DP algorithms limit the difference in distribution between outputs of the algorithm generated from datasets that differ by only one observation [39].
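For reference, the standard approximate DP condition from the DP literature can be written as below. The symbols are assumptions made for this illustration and may not match the notation of this study: $\mathcal{M}$ is a randomized algorithm, $D$ and $D'$ are any two datasets that differ in one observation, $S$ is any measurable set of outputs, and $\varepsilon \ge 0$ and $\delta \in [0, 1)$ are the privacy parameters.

```latex
\Pr\left[\mathcal{M}(D) \in S\right]
  \;\le\; e^{\varepsilon}\,\Pr\left[\mathcal{M}(D') \in S\right] + \delta
```

Smaller values of $\varepsilon$ and $\delta$ force the two output distributions closer together, so an attacker learns less about whether any single observation is in the data.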
Consider a private dataset
for all (
Algorithms that achieve
In other words,
The researcher chooses
DP algorithms maintain a couple of benefits. First, the definition makes no assumptions about the information that an attacker has. DP algorithms are also robust to post processing, so any transformation of a differentially private output is still differentially private. DP algorithms also have useful composition properties [40]. When multiple statistics are calculated from the same data, the amount of privacy budget used for each statistic is simply added to calculate the total privacy budget used, under Sequential Composition [42]. The composition of a
There is a clear trade-off between the privacy budget
Organizations have started adopting DP for disclosure limitation in recent years [7]. Notably, the US Census Bureau implemented a differentially private algorithm for releases of the 2020 Census redistricting data with
3.3 Disclosure limiting transformations
We will compare two major approaches for releasing confidential data while limiting disclosure risk – synthetic data (3.3.1) and a differentially private Gram matrix (3.3.2) – as described in the following.
3.3.1 Synthetic data
Synthetic data, introduced by Rubin [46], provide a synthetic version of a dataset that can be released when the raw microdata cannot. Rubin [46] and Little [47] viewed data synthesis as a missing data problem, in which the sensitive information is treated as missing and imputed with multiple imputation. In general, synthetic data replace sensitive information in the original data with values generated from statistical summaries of the original data [4,48]. The core idea, typically, is to generate synthetic data by sampling from the empirical joint and marginal distributions of the columns in the confidential data. For example, a popular nonparametric method for data synthesis generates variables sequentially, generating each new variable with predictions from classification and regression trees fit on the already synthesized variables [49]. We use this synthesis method in the simulations in this study.
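To make the sequential idea concrete, the sketch below synthesizes two columns in order. It is a deliberately simplified stand-in for tree-based synthesis rather than the software used in this study: the tree has a single split, the data are hypothetical, the first column is resampled from its observed values, and each synthetic value of the second column is drawn from the observed values in the matching leaf. All function and variable names are illustrative.

```python
import random

random.seed(2)

# Hypothetical confidential data: x2 depends on the sign of x1.
n = 2000
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [(2.0 if a > 0 else -2.0) + random.gauss(0, 0.5) for a in x1]


def best_split(x, y):
    """Threshold on x minimizing within-leaf squared error of y (depth-one tree)."""
    pairs = sorted(zip(x, y))
    ys = [b for _, b in pairs]
    total, total_sq = sum(ys), sum(v * v for v in ys)
    best, best_sse = None, float("inf")
    left_sum = left_sq = 0.0
    for i in range(1, len(pairs)):
        left_sum += ys[i - 1]
        left_sq += ys[i - 1] ** 2
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between tied x values
        right_n = len(pairs) - i
        sse = (left_sq - left_sum ** 2 / i) + (
            (total_sq - left_sq) - (total - left_sum) ** 2 / right_n
        )
        if sse < best_sse:
            best_sse = sse
            best = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best


# Step 1: synthesize the first column by resampling its observed values.
syn_x1 = random.choices(x1, k=n)

# Step 2: fit the tree for x2 given x1 on the confidential data, then draw each
# synthetic x2 from the observed x2 values in the leaf its synthetic x1 falls in.
cut = best_split(x1, x2)
leaves = {
    True: [b for a, b in zip(x1, x2) if a <= cut],
    False: [b for a, b in zip(x1, x2) if a > cut],
}
syn_x2 = [random.choice(leaves[a <= cut]) for a in syn_x1]

synthetic = list(zip(syn_x1, syn_x2))
```

With more columns, the same step would be repeated, each time conditioning on all previously synthesized columns and using deeper trees.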
Synthetic data are appealing as a disclosure limiting transformation of confidential data because they can technically be analyzed with the same methods as the original data. However, they have two major drawbacks. First, even though no observations from the confidential data are released in such synthetic data, there is still a risk of information disclosure. Recent work has discussed the risk of leaking information with typical synthetic data generation [6]. More recently, methods have been developed to generate differentially private synthetic data [50,51], in which case the disclosure risk is clear, via the choice of privacy budget. The second drawback of releasing synthetic data is that making valid inferences with synthetic data requires clear communication from the data steward to the public of how the data can be analyzed.
3.3.2 Noise-infused Gram matrix
The major benefit of synthetic data is that it is at the same observation level as the confidential data that cannot be released. However, the estimators discussed in Section 2 do not require the individual level auxiliary data. Estimating the PATE with
We define the data matrix
Therefore, the gram matrix of the auxiliary data
There is a body of literature that considers perturbed sufficient statistics for linear regression (OLS), penalized regression (such as ridge and LASSO), and principal component analysis (PCA), which are all contained in
Instead, we divide the upper triangle of

Illustration of how the gram matrix
We construct
Dividing the Gram matrix into these elements allows adding noise of smaller magnitude while still achieving DP. First, separating the matrix into the different elements accounts for the different sensitivities of first and second moments; second, applying the Gaussian mechanism element-wise within each element accounts for scale differences between columns. Refer to Appendix A for details of the sensitivity calculations that are used in the algorithm.
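The role of column scale in the noise magnitude can be seen in a short sketch of the Gaussian mechanism applied to column means. This is an illustration under assumptions, not the algorithm of this study: it uses the classical noise calibration, which is only valid for privacy budgets between 0 and 1, it assumes values are clipped to known bounds, and it assumes neighboring datasets differ by replacing one record, so the sensitivity used here may differ from the calculations in Appendix A. The function names are hypothetical.

```python
import math
import random


def gaussian_sigma(sensitivity, epsilon, delta):
    """Noise scale of the classical Gaussian mechanism (requires 0 < epsilon < 1)."""
    if not 0 < epsilon < 1:
        raise ValueError("classical calibration assumes 0 < epsilon < 1")
    return sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon


def private_mean(values, lo, hi, epsilon, delta, rng):
    """Release a noisy mean of values clipped to [lo, hi]."""
    clipped = [min(max(v, lo), hi) for v in values]
    n = len(clipped)
    sensitivity = (hi - lo) / n   # replacing one record moves the mean by at most this
    sigma = gaussian_sigma(sensitivity, epsilon, delta)
    return sum(clipped) / n + rng.gauss(0.0, sigma), sigma


rng = random.Random(0)
small_scale = [rng.uniform(0, 1) for _ in range(1000)]      # e.g. a proportion
large_scale = [rng.uniform(0, 100) for _ in range(1000)]    # e.g. a test score

mean_small, sigma_small = private_mean(small_scale, 0, 1, 0.5, 1e-5, rng)
mean_large, sigma_large = private_mean(large_scale, 0, 100, 0.5, 1e-5, rng)
print(f"noise sd: {sigma_small:.4f} (range 1) vs {sigma_large:.4f} (range 100)")
```

Because the noise scale is proportional to each column's range, calibrating noise element by element avoids adding noise sized for the widest column to every entry.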
There is no guarantee that the resulting DP matrix will be positive definite – an important property of gram matrices. Therefore, we post-process
4 Simulation studies
We conduct two simulation studies to compare the utility of different auxiliary data releases in estimators of the PATE and SATE, respectively. As baselines, we use the difference-in-means estimator
4.1 Generalizing to populations of interest
We consider a setting where there is an RCT and a related observational study which were both sampled from a population of interest. The aim is to estimate the PATE for this population of interest. The auxiliary study is representative of this population. However, the RCT sample is not representative of the population of interest due to selection bias. We consider a modified version of the simulations in the studies by Colnet et al. [63] and Lee et al. [18], which assumes a heterogeneous treatment effect.
4.1.1 Data generation
We emulate a hypothetical randomized experiment and auxiliary study, assuming that the SATE in the RCT does not equal the desired PATE due to selection bias. We generate the
4.1.2 Simulation procedure
For
We then generate the treatment assignment
4.1.3 Utility metrics
We evaluate the utility of each private data release for estimating the PATE with two metrics. First, we look at the mean squared error (MSE) of the estimator. Second, we consider the coverage probability for a 95% confidence interval. We additionally consider utility metrics for estimates of the column means themselves in the auxiliary data. For this, we compare the square root of the MSE (RMSE) of the column means for each disclosure limiting transformation to the RMSE of the confidential auxiliary data (
4.1.4 Results
Figure 2 shows the empirical MSE for each estimator, estimated across the 1,000 simulations. We note that we do not show results for the DP release

MSE of estimators of the PATE calculated across 1,000 iterations for different numbers of covariates (
Table 1 shows the empirical coverage probability of the true PATE (0.5) for 95% confidence intervals, across the 1,000 simulations. The regression estimator has very poor coverage for the PATE, as does the difference-in-means estimator, which only has higher coverage because the variance is larger. The coverage of bootstrap 95% confidence intervals for the CW estimator is close to 0.95 across the different disclosure limiting transformations of the auxiliary data, with the exception of
Estimated coverage probability for 95% confidence intervals for estimators of the PATE for different numbers of covariates (
| Estimate of the PATE | | Coverage | |
|---|---|---|---|
| RCT data only | | | |
| | — | 0.51 (0.010) | 0.51 (0.013) |
| | — | 0.08 (0.009) | 0.13 (0.010) |
| Includes auxiliary data | | | |
| | — | 0.95 (0.007) | 0.96 (0.005) |
| | 1 | 0.94 (0.007) | — |
| | 3 | 0.95 (0.007) | 0.92 (0.012) |
| | 6 | 0.95 (0.006) | 0.95 (0.008) |
| | 15 | 0.96 (0.005) | 0.95 (0.007) |
| | 30 | 0.95 (0.005) | 0.95 (0.005) |
| | — | 0.95 (0.006) | 0.95 (0.006) |
Note: Simulation standard errors in parentheses.
Table 2 shows the RMSE of the column means for the different releases of the confidential auxiliary data. This gives some insight into the utility of releasing
RMSE of column means for releases of confidential auxiliary (observational) data, varying the number of covariates
| Data release | | RMSE of column means | |
|---|---|---|---|
| | — | 0.98 (0.007) | 0.99 (0.005) |
| | — | 1.44 (0.009) | 1.51 (0.008) |
| | 1 | 8.58 (0.089) | — |
| | 3 | 2.02 (0.026) | 8.16 (0.059) |
| | 6 | 1.09 (0.009) | 3.80 (0.030) |
| | 15 | 0.97 (0.007) | 1.29 (0.009) |
| | 30 | 0.97 (0.007) | 1.00 (0.005) |
Note: Simulation standard errors in parentheses.
To summarize, when the PATE is the estimand of interest, leveraging auxiliary data in the analysis of an RCT with
4.2 Improving experimental precision
In the second simulation study, the goal is to estimate the average treatment effect for an RCT sample. We assume that there is an RCT evaluating a specific treatment, and a related observational study which includes the outcome of interest. We do not necessarily observe the treatment in the auxiliary observational study. However, given that observational studies typically have large sample sizes, we can expect to get good estimates of model parameters for a model of the outcome on observed covariates, resulting in good predictions of the outcome. Further, if there are a large number of covariates, an observational study will likely better estimate model parameters for all covariates than a small RCT. As discussed in Section 2.2, predictions of the outcome based on a model trained on the auxiliary data can therefore be a very powerful covariate to include in RCT covariate adjustment. For simplicity, we consider a setting where the RCT and observational samples arise from the same, linear, data generating model (there is no covariate shift) and where the regression models match this data generating model. There are other covariate adjustment methods incorporating auxiliary data that account for covariate shift (Appendix B).
4.2.1 Data generation
We emulate a hypothetical RCT with
4.2.2 Simulation procedure
For each value of
First, we generate an RCT and auxiliary dataset as described above. We then calculate each disclosure limiting transformation of the auxiliary data as described in Section 4.1.2. With each transformation of the auxiliary data, we calculate the OLS estimated coefficients for a model of the outcome on the covariates (auxiliary model). With each of these coefficient vectors, we predict the outcome for subjects in the RCT (
Since we are interested in the average treatment effect for an RCT sample, we treat the outcomes and covariates as fixed. We then generate 1,000 treatment assignment vectors
4.2.3 Utility metrics
Because we use unbiased estimators for the SATE, we evaluate the utility of the data integration approach, with different auxiliary data releases, using the variance of the given point estimator. We calculate the empirical variance of the estimates across the 1,000 treatment assignment vectors for each data generation for this comparison. To more easily compare the variances, we consider the relative efficiency of each point estimator as compared to
We additionally consider utility for statistical summaries of the auxiliary data releases, which are the intermediate step to the data integration method. First, we look at RMSE of the coefficient vector
4.2.4 Results
Figure 3 shows the relative efficiency of each estimator compared to

Simulated relative efficiency of
Using synthetic data (
We find that the utility of using the super-covariate data integration is lost for the DP transformations of auxiliary data, when there are a large number of covariates. However, the noise necessary to achieve DP additionally depends on the sensitivity of the statistics to removing one observation from the data. Thus, the sensitivity decreases as

Simulated variance for
As noted previously, these results are based on correctly specified regression models and on RCT and auxiliary samples that arise from the same data generating process. In practice, neither of these assumptions may hold, and design-based estimators may be preferable. Refer to Appendix B for a discussion and simulations, which show that these results hold with a design-based estimator that is robust to model mis-specification and covariate shift.
Table 3 presents additional utility metrics for the disclosure limiting releases. As with Table 2, this can give us an idea of the utility of the releases for uses of the Gram matrix aside from the super-covariate data integration approach. First, in terms of the OLS estimated coefficients for a regression model with the auxiliary sample (
Additional utility metrics for disclosure limiting transformations of confidential auxiliary data
| Release | | RMSE | Frobenius norm (vs non-private Gram matrix) | Spectral norm (vs non-private Gram matrix) |
|---|---|---|---|---|
| | | | | |
| | — | 0.01 | 0.00 | 0.00 |
| | — | 0.02 | 0.17 | 0.10 |
| | 1 | 0.24 | 5.69 | 3.05 |
| | 3 | 0.16 | 2.18 | 1.02 |
| | 6 | 0.08 | 0.80 | 0.39 |
| | 15 | 0.03 | 0.25 | 0.14 |
| | 30 | 0.02 | 0.14 | 0.08 |
| | 50 | 0.01 | 0.11 | 0.06 |
| | | | | |
| | — | 0.01 | 0.00 | 0.00 |
| | — | 0.03 | 0.34 | 0.19 |
| | 1 | 0.22 | 28.20 | 12.03 |
| | 3 | 0.18 | 10.30 | 4.36 |
| | 6 | 0.15 | 6.30 | 2.48 |
| | 15 | 0.11 | 2.33 | 0.84 |
| | 30 | 0.05 | 0.72 | 0.30 |
| | 50 | 0.03 | 0.44 | 0.19 |
| | | | | |
| | — | 0.01 | 0.00 | 0.00 |
| | — | 0.03 | 0.76 | 0.32 |
| | 1 | 0.15 | 373.32 | 105.51 |
| | 3 | 0.14 | 125.09 | 35.37 |
| | 6 | 0.14 | 63.09 | 17.93 |
| | 15 | 0.13 | 26.15 | 7.42 |
| | 30 | 0.12 | 14.18 | 3.97 |
| | 50 | 0.10 | 9.79 | 2.60 |
In summary, the utility of the super-covariate data integration approach is more affected by the type of disclosure limiting transformation applied to the auxiliary data than is that of the CW estimator. This makes sense because this data integration approach requires the full Gram matrix, with
5 Discussion
In this study, we presented disclosure limiting transformations of observational data that can be combined with experimental data to estimate the PATE and SATE. These disclosure limiting transformations included a differentially private Gram matrix and synthetic data. We found that leveraging these transformed versions of observational data to estimate the PATE greatly improved the MSE (and eliminated bias) as compared to methods using only the RCT data. We also found that transformed versions of auxiliary data could improve precision when estimating the SATE, beyond the precision achieved through covariate adjustment with RCT covariates alone, if there are few covariates or a large privacy budget is used for the DP Gram matrix release.
There is no broad discussion of using the Gram matrix of a data matrix as a disclosure limiting transformation in the literature, to the best of our knowledge. The Gram matrix is a reasonable place to start for disclosure limiting data releases since it provides useful summary statistics of the data and some protection against disclosure, which can be augmented with additional random noise. The Gram matrix also supports some flexible model fitting – any outcome and covariates can be chosen out of the columns in the data matrix. As has been discussed in the DP literature, a DP covariance matrix can be used for PCA, ridge, and LASSO regularized regression in addition to standard OLS.
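The flexibility noted above follows from the fact that OLS coefficients depend on the data only through blocks of the Gram matrix. The sketch below illustrates this with a non-private Gram matrix; the data, the inclusion of a column of ones in the data matrix, and the helper names are assumptions made for this example.

```python
import random

random.seed(3)


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z


# Hypothetical data matrix with columns [1, x1, x2, y]; y is exactly linear
# in x1 and x2 so that the recovered coefficients are easy to check.
rows = []
for _ in range(500):
    x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
    rows.append([1.0, x1, x2, 2.0 + 3.0 * x1 - 1.0 * x2])

k = len(rows[0])
gram = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]


def ols_from_gram(G, predictors, outcome):
    """OLS coefficients using only the Gram matrix entries for the chosen columns."""
    A = [[G[i][j] for j in predictors] for i in predictors]
    b = [G[i][outcome] for i in predictors]
    return solve(A, b)


beta_y = ols_from_gram(gram, predictors=[0, 1, 2], outcome=3)   # y on (1, x1, x2)
beta_x2 = ols_from_gram(gram, predictors=[0, 1], outcome=2)     # x2 on (1, x1)
print([round(b, 6) for b in beta_y])
```

The same released matrix serves both regressions, and an analyst never needs the individual rows once `gram` is available.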
In addition to disclosure risk and utility, there are practical considerations that could be taken into account when choosing a disclosure limiting technique. An important consideration is that the additional uncertainty introduced by privacy-transformations may be more challenging to communicate to data users for some techniques than for others. In the case of the DP Gram matrix release,
We focus on a small set of disclosure limiting techniques and causal estimators for discussion’s sake. Further evaluation of disclosure limiting transformations and causal estimators would be valuable. For example, the inverse propensity score weighted (IPSW) estimator is an estimator of the PATE, which combines experimental and observational data [63]. The IPSW estimator requires pooling experimental and observational data, and therefore, requires subject-level data. Future work could evaluate the utility of different methods for synthetic data generation of observational data for use in the IPSW estimator. The results also point to more work to be done to generate differentially private Gram matrices with higher utility for OLS estimation and inference and to be used in the data integration approach.
Integrating observational and experimental data in treatment effect estimation is a powerful and exciting direction in causal inference. There is a lost opportunity when RCT analysts cannot access relevant observational data. On the other hand, individuals who provide their data have the right to control their information and expect that sensitive information will not be released to the public. In practice, choosing a tolerable disclosure risk, balanced with data utility, is a policy decision. In this work, we present information to inform such a decision, illustrating the privacy-utility trade-off for different data privacy techniques when integrating private auxiliary data with experimental data for causal inference.
Acknowledgements
The authors thank the reviewers and editors for their thoughtful feedback and suggestions, which improved the manuscript. We also thank Jeremy Seeman for his valuable feedback.
Funding information: The research reported here was supported by the Institute of Education Sciences, US Department of Education, through Grant R305D210031 to the University of Michigan. The opinions expressed are those of the authors and do not represent views of the Institute or the US Department of Education nor other funders. C.Z.M. was additionally supported by the National Science Foundation RTG grant DMS-1646108.
Author contributions: All authors discussed and reviewed the manuscript and simulation results and have accepted responsibility for the entire content of this manuscript and consented to its submission. J.A.G.B. conceived the project. C.Z.M. and J.A.G.B. designed simulation studies. C.Z.M. conducted literature reviews, conducted all simulations, and drafted the manuscript. All authors discussed and provided feedback on the manuscript.
Conflict of interest: The authors state no conflicts of interest.
Data availability statement: All code to generate the data and replicate the simulations is available at https://github.com/manncz/exp-obs-priv.
Appendix A Sensitivity calculations for differentially private Gram matrix algorithm
Take an
where
A.1 Mean sensitivity
We will look at the sensitivity for the empirical mean of one column. We can also think of this as the sensitivity to the sum of the values of one column, removing one observation. Assume that
So for the empirical mean,
A.2 Second (non-central) moment sensitivity
We will look at the sensitivity for the empirical expectation of the product of two columns.
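As a rough numerical companion to the two sensitivity calculations in this appendix, the sketch below computes worst-case changes in a column mean and in the empirical mean of a product of two columns. It is not the derivation given here: it assumes that each column is bounded in a known interval and that neighboring datasets differ by replacing one of n rows, whereas the text above also considers removing an observation, so constants may differ. The function names, bounds, and simulated column are hypothetical.

```python
import itertools
import numpy as np

def mean_sensitivity(lo, hi, n):
    """Bound on the change in a column mean when one of n rows is replaced,
    assuming every entry of the column lies in [lo, hi]."""
    return (hi - lo) / n

def product_mean_sensitivity(bounds_j, bounds_k, n):
    """Bound on the change in (1/n) * sum(x_j * x_k) when one row is replaced,
    assuming x_j lies in bounds_j = (lo_j, hi_j) and x_k in bounds_k."""
    # The product of two bounded values attains its extremes at the corners.
    corners = [a * b for a, b in itertools.product(bounds_j, bounds_k)]
    return (max(corners) - min(corners)) / n

def brute_force_mean_change(x, lo, hi):
    """Largest observed change in the mean of x when a single entry is
    replaced by either endpoint of [lo, hi]."""
    x = np.asarray(x, dtype=float)
    base = x.mean()
    worst = 0.0
    for i in range(len(x)):
        for new in (lo, hi):
            x_alt = x.copy()
            x_alt[i] = new
            worst = max(worst, abs(x_alt.mean() - base))
    return worst

# Illustrative check on a simulated column bounded in [0, 1].
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=50)
print("bound:", mean_sensitivity(0.0, 1.0, len(x)))
print("observed worst case:", brute_force_mean_change(x, 0.0, 1.0))
```

The observed worst case for a particular dataset can never exceed the bound, which is what a noise-calibrating mechanism would rely on; tighter bounds on the data give smaller sensitivity and therefore less added noise.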
B Design-based covariate adjustment with auxiliary data
Using a prediction of the outcome as a covariate in the regression estimator is effective for reducing variance in experimental estimates when the model is correctly specified and there is no covariate shift between the RCT and auxiliary study. In practice, neither assumption may hold. Design-based methods for covariate adjustment do not require modeling assumptions beyond those typically employed in design-based analysis of randomized experiments. The covariate-adjusted estimator proposed by Gagnon-Bartsch et al. [28] is robust to model misspecification and accounts for possible covariate shift between the RCT and auxiliary study.
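To make the idea of design-based adjustment concrete, the following is a minimal sketch and not the estimator of Gagnon-Bartsch et al. [28] itself. It assumes a Bernoulli-randomized experiment with a known assignment probability p and, for each RCT unit, an outcome prediction m produced by a model fit only on auxiliary data, so that the predictions do not depend on the treatment assignments. The prediction is subtracted from each outcome before inverse probability weighting. The function name, argument names, and simulated data are illustrative assumptions.

```python
import numpy as np

def residualized_ipw(y, t, p, m):
    """IPW estimate of the sample ATE after subtracting predictions m from y.

    y: observed outcomes; t: 0/1 treatment indicators;
    p: known probability of treatment; m: predictions that do not depend on t.
    Setting m to zeros gives the unadjusted IPW estimate.
    """
    y, t, m = (np.asarray(a, dtype=float) for a in (y, t, m))
    resid = y - m
    return float(np.mean(t * resid / p - (1 - t) * resid / (1 - p)))

# Illustrative simulation (all values hypothetical).
rng = np.random.default_rng(0)
n, p, tau = 200, 0.5, 1.0
x = rng.normal(size=n)
y0 = 2.0 * x + rng.normal(scale=0.5, size=n)  # control potential outcomes
y1 = y0 + tau                                  # treated potential outcomes
m = 2.0 * x + 0.5 * tau                        # stand-in for an auxiliary-model prediction

def simulate(m_vec, reps=2000):
    """Re-randomize treatment reps times, holding potential outcomes fixed."""
    est = np.empty(reps)
    for r in range(reps):
        t = rng.binomial(1, p, size=n)
        y = np.where(t == 1, y1, y0)
        est[r] = residualized_ipw(y, t, p, m_vec)
    return est

plain = simulate(np.zeros(n))
adjusted = simulate(m)
print("mean (plain, adjusted):", plain.mean(), adjusted.mean())
print("variance (plain, adjusted):", plain.var(), adjusted.var())
```

Because the predictions are fixed with respect to the randomization, both versions are unbiased over repeated assignments. A poor prediction costs precision rather than validity, which is the sense in which this style of adjustment does not rely on the outcome model being correct.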
The estimator of Gagnon-Bartsch et al. [28] belongs to a literature that proposes residualizing the outcomes in the IPW estimator with some function of the covariates, as in the augmented IPW estimator [20,23,25,65–67]. In general, such adjusted estimators take the form
Authors have proposed different functions
Using auxiliary observational data to estimate
Like the regression estimator discussed in Section 2.2, the approach of Gagnon-Bartsch et al. [28] requires only a model of the outcome of interest fit on the auxiliary data. We denote the specific variant of the adjusted IPW estimator in the study by Gagnon-Bartsch et al. [28] as

Simulated relative efficiency of
References
[1] Colnet B, Mayer I, Chen G, Dieng A, Li R, Varoquaux G, et al. Causal inference methods for combining randomized trials and observational studies: a review. Stat Sci. 2024 Feb;39(1):165–91. https://projecteuclid-org.proxy.lib.umich.edu/journals/statistical-science/volume-39/issue-1/Causal-Inference-Methods-for-Combining-Randomized-Trials-and-Observational-Studies/10.1214/23-STS889.full. 10.1214/23-STS889.
[2] Snoke J, Bowen CM. How statisticians should grapple with privacy in a changing data landscape. CHANCE. 2020 Oct;33(4):6–13. 10.1080/09332480.2020.1847947.
[3] Fellegi IP. On the question of statistical confidentiality. J Amer Stat Assoc. 1972;67(337):7–18. https://www.jstor.org/stable/2284695. 10.1080/01621459.1972.10481199.
[4] Raghunathan TE. Synthetic data. Ann Rev Stat Appl. 2021;8(1):129–40. 10.1146/annurev-statistics-040720-031848.
[5] Schwartz PM. The EU-US privacy collision: a turn to institutions and procedures. Harvard Law Rev. 2013 May;126(7):1966–2009.
[6] Bellovin SM, Dutta PK, Reitinger N. Privacy and synthetic datasets [SSRN Scholarly Paper]. Rochester, NY: Stanford Technology Law Review; 2018. https://papers.ssrn.com/abstract=3255766. 10.2139/ssrn.3255766.
[7] Wood A, Altman M, Bembenek A, Bun M, Gaboardi M, Honaker J, et al. Differential privacy: a primer for a non-technical audience. Vand. J. Ent. & Tech. L. 2018;21:209. 10.2139/ssrn.3338027.
[8] Kusner M, Sun Y, Sridharan K, Weinberger K. Private causal inference. In: 19th International Conference on Artificial Intelligence and Statistics. vol. 51. Cadiz, Spain: JMLR; 2016.
[9] Wang L, Pang Q, Song D. Towards practical differentially private causal graph discovery. In: 34th Conference on Neural Information Processing Systems. Vancouver, Canada; 2020.
[10] Ma P, Ji Z, Pang Q, Wang S. NoLeaks: differentially private causal discovery under functional causal model. IEEE Trans Inform Forensics Security. 2022;17:2324–38. https://ieeexplore.ieee.org/document/9798874/. 10.1109/TIFS.2022.3184263.
[11] Neyman J, Iwaskiewicz K, Kolodziejczyk S. Statistical problems in agricultural experimentation (with discussion). Suppl J R Stat Soc. 1935;2:107–80. 10.2307/2983637.
[12] Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educat Psychol. 1974;66(5):688–701. http://content.apa.org/journals/edu/66/5/688. 10.1037/h0037350.
[13] Angrist JD, Imbens GW, Rubin DB. Identification of causal effects using instrumental variables. J Amer Stat Assoc. 1996;91(434):444–55. 10.1080/01621459.1996.10476902.
[14] Cole SR, Stuart EA. Generalizing evidence from randomized clinical trials to target populations: The ACTG 320 trial. Am J Epidemiol. 2010 Jul;172(1):107–15. 10.1093/aje/kwq084.
[15] Stuart EA, Cole SR, Bradshaw CP, Leaf PJ. The use of propensity scores to assess the generalizability of results from randomized trials. J R Stat Soc Ser A (Stat Soc). 2011;174(2):369–86. https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-985X.2010.00673.x. 10.1111/j.1467-985X.2010.00673.x.
[16] Lesko CR, Buchanan AL, Westreich D, Edwards JK, Hudgens MG, Cole SR. Generalizing study results: a potential outcomes perspective. Epidemiology (Cambridge, Mass). 2017 Jul;28(4):553–61. 10.1097/EDE.0000000000000664.
[17] Dahabreh IJ, Robertson SE, Tchetgen EJ, Stuart EA, Hernán MA. Generalizing causal inferences from individuals in randomized trials to all trial-eligible individuals. Biometrics. 2019 Jun;75(2):685–94. 10.1111/biom.13009.
[18] Lee D, Yang S, Dong L, Wang X, Zeng D, Cai J. Improving trial generalizability using observational studies. Biometrics. 2021;79(2):1–13. http://onlinelibrary.wiley.com/doi/abs/10.1111/biom.13609. 10.1111/biom.13609.
[19] Horvitz DG, Thompson DJ. A generalization of sampling without replacement from a finite universe. J Amer Stat Assoc. 1952;47(260):663–85. http://www.jstor.org/stable/2280784. 10.1080/01621459.1952.10483446.
[20] Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. J Amer Stat Assoc. 1994 Sep;89(427):846–66. 10.1080/01621459.1994.10476818.
[21] Pocock SJ. The combination of randomized and historical controls in clinical trials. J Chronic Diseases. 1976 Mar;29(3):175–88. 10.1016/0021-9681(76)90044-8.
[22] Viele K, Berry S, Neuenschwander B, Amzal B, Chen F, Enas N, et al. Use of historical control data for assessing treatment effects in clinical trials. Pharm Stat. 2014;13(1):41–54. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3951812/. 10.1002/pst.1589.
[23] Aronow PM, Middleton JA. A class of unbiased estimators of the average treatment effect in randomized experiments. J Causal Infer. 2013 May;1(1):135–54. http://www.degruyter.com/document/doi/10.1515/jci-2012-0009/html. 10.1515/jci-2012-0009.
[24] Deng A, Xu Y, Kohavi R, Walker T. Improving the sensitivity of online controlled experiments by utilizing pre-experiment data. In: Proceedings of the Sixth ACM International Conference on Web Search and Data Mining - WSDM ’13. Rome, Italy: ACM Press; 2013. p. 123. http://dl.acm.org/citation.cfm?doid=2433396.2433413. 10.1145/2433396.2433413.
[25] Sales AC, Hansen BB, Rowan B. Rebar: Reinforcing a matching estimator with predictions from high-dimensional covariates. J Educ Behav Stat. 2018 Feb;43(1):3–31. 10.3102/1076998617731518.
[26] Gui G. Combining observational and experimental data using first-stage covariates. 2020 Dec. http://arxiv.org/abs/2010.05117. 10.2139/ssrn.3662061.
[27] Opper IM. Improving average treatment effect estimates in small-scale randomized controlled trials. Annenberg Institute at Brown University; 2021. Publication Title: EdWorkingPapers.com. https://www.edworkingpapers.com/ai21-344. 10.7249/WRA1004-1.
[28] Gagnon-Bartsch JA, Sales AC, Wu E, Botelho AF, Erickson JA, Miratrix LW, et al. Precise unbiased estimation in randomized experiments using auxiliary observational data. J Causal Inference. 2023 Aug;11(1):20220011. https://www.degruyter.com/document/doi/10.1515/jci-2022-0011/html.
[29] Sales AC, Prihar E, Gagnon-Bartsch J, Gurung A, Heffernan NT. More powerful A/B testing using auxiliary data and deep learning. In: Rodrigo MM, Matsuda N, Cristea AI, Dimitrova V, editors. Artificial intelligence in education. Lecture Notes in Computer Science. Cham: Springer International Publishing; 2022. p. 524–7. 10.1007/978-3-031-11647-6_107.
[30] Duncan GT, Pearson RW. Enhancing access to microdata while protecting confidentiality: prospects for the future. Stat Sci. 1991;6(3):219–32. http://www.jstor.org/stable/2245411. 10.1214/ss/1177011681.
[31] Fienberg SE. Invited paper-confidentiality and data protection through disclosure limitation: evolving principles and technical advances. Philippine Stat. 2000;49(1–4):1–12.
[32] Aggarwal CC, Yu PS. Privacy-preserving data mining: models and algorithms. New York, NY, United States: Springer; 2008. http://ebookcentral.proquest.com/lib/umichigan/detail.action?docID=367484. 10.1007/978-0-387-70992-5_2.
[33] Matthews GJ, Harel O. Data confidentiality: A review of methods for statistical disclosure limitation and methods for assessing privacy. Stat Surveys. 2011 Jan;5:1–29. http://projecteuclid.org/journals/statistics-surveys/volume-5/issue-none/Data-confidentiality-A-review-of-methods-for-statistical-disclosure/10.1214/11-SS074.full. 10.1214/11-SS074.
[34] Salas J, Domingo-Ferrer J. Some basics on privacy techniques, anonymization and their big data challenges. Math Comput Sci. 2018 Sep;12(3):263–74. 10.1007/s11786-018-0344-6.
[35] Slavkovic A, Seeman J. Statistical data privacy: a song of privacy and utility. 2022. http://arxiv.org/abs/2205.03336.
[36] Bowen CM. Personal privacy and the public good: balancing data privacy and data utility. Urban Institute Research Report. 2021 Aug. https://www.urban.org/research/publication/personal-privacy-and-public-good-balancing-data-privacy-and-data-utility.
[37] Dwork C, McSherry F, Nissim K, Smith A. Calibrating noise to sensitivity in private data analysis. In: Halevi S, Rabin T, editors. Theory of cryptography. Berlin, Heidelberg: Springer Berlin Heidelberg; 2006. p. 265–84. 10.1007/11681878_14.
[38] Dwork C. Differential privacy. In: Bugliesi M, Preneel B, Sassone V, Wegener I, editors. Automata, Languages and Programming. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer; 2006. p. 1–12. 10.1007/11787006_1.
[39] Bowen CM, Garfinkel S. The philosophy of differential privacy. Notices of the American Mathematical Society. 2021 Nov;68(10):1. https://www.ams.org/notices/202110/rnoti-p1727.pdf. 10.1090/noti2363.
[40] Barrientos AF, Williams AR, Snoke J, Bowen CM. A feasibility study of differentially private summary statistics and regression analyses with evaluations on administrative and survey data. J Amer Stat Assoc. 2023;0(ja):1–24. 10.1080/01621459.2023.2270795.
[41] Dwork C, Smith A. Differential privacy for statistics: what we know and what we want to learn. J Privacy Confidentiality. 2010 Apr;1(2):2. https://journalprivacyconfidentiality.org/index.php/jpc/article/view/570. 10.29012/jpc.v1i2.570.
[42] Dwork C, Roth A. The algorithmic foundations of differential privacy. Found Trends Theoret Comput Sci. 2013;9(3–4):211–407. http://www.nowpublishers.com/articles/foundations-and-trends-in-theoretical-computer-science/TCS-042. 10.1561/0400000042.
[43] McSherry FD. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. Providence, Rhode Island; 2009. 10.1145/1559845.1559850.
[44] Jarmin RS. Disclosure avoidance for the 2020 census: an introduction. US Census Bureau; 2021. https://www.census.gov/library/publications/2021/decennial/2020-census-disclosure-avoidance-handbook.html.
[45] Rogers R, Subramaniam S, Peng S, Durfee D, Lee S, Kancha SK, et al. LinkedIn’s audience engagements API: a privacy preserving data analytics system at scale. 2020. http://arxiv.org/abs/2002.05839. 10.29012/jpc.782.
[46] Rubin DB. Discussion: statistical disclosure limitation. J Official Stat. 1993;9(2):461–8. https://www.scb.se/contentassets/ca21efb41fee47d293bbee5bf7be7fb3/discussion-statistical-disclosure-limitation2.pdf.
[47] Little R. Statistical analysis of masked data. J Official Stat. 1993;9(2):407–26. https://www.scb.se/contentassets/ca21efb41fee47d293bbee5bf7be7fb3/statistical-analysis-of-masked-data.pdf.
[48] Snoke J, Raab GM, Nowok B, Dibben C, Slavkovic A. General and specific utility measures for synthetic data. J R Stat Soc Ser A (Stat Soc). 2018;181(3):663–88. https://onlinelibrary.wiley.com/doi/abs/10.1111/rssa.12358.
[49] Nowok B, Raab GM, Dibben C. Synthpop: Bespoke creation of synthetic data in R. J Stat Softw. 2016;74(11):1–26. https://www.jstatsoft.org/index.php/jss/article/view/v074i11. 10.18637/jss.v074.i11.
[50] Wilchek M, Wang Y. Synthetic differential privacy data generation for revealing bias modelling risks. In: 2021 IEEE Intl Conf on Parallel and Distributed Processing with Applications, Big Data and Cloud Computing, Sustainable Computing and Communications, Social Computing and Networking (ISPA/BDCloud/SocialCom/SustainCom); 2021. p. 1574–80. 10.1109/ISPA-BDCloud-SocialCom-SustainCom52081.2021.00211.
[51] Boedihardjo M, Strohmer T, Vershynin R. Private measures, random walks, and synthetic data. http://arxiv.org/abs/2204.09167.
[52] Kamath G, Li J, Singhal V, Ullman J. Privately learning high-dimensional distributions. In: Proceedings of the Thirty-Second Conference on Learning Theory. PMLR; 2019. p. 1853–902. ISSN: 2640-3498. https://proceedings.mlr.press/v99/kamath19a.html.
[53] Balle B, Wang YX. Improving the Gaussian mechanism for differential privacy: Analytical calibration and optimal denoising. In: Proceedings of the 35th Annual International Conference on Machine Learning. PMLR; 2018. p. 394–403.
[54] Wang YX. Revisiting differentially private linear regression: optimal and adaptive prediction & estimation in unbounded domain. 2018. http://arxiv.org/abs/1803.02596.
[55] Dwork C, Talwar K, Thakurta A, Zhang L. Analyze Gauss: optimal bounds for privacy-preserving principal component analysis. In: Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing. STOC ’14. New York, NY, USA: Association for Computing Machinery; 2014. p. 11–20. 10.1145/2591796.2591883.
[56] Chanyaswad T, Dytso A, Poor HV, Mittal P. MVG mechanism: differential privacy under matrix-valued query. In: Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security; 2018. p. 230–46. http://arxiv.org/abs/1801.00823. 10.1145/3243734.3243750.
[57] Ferrando C, Wang S, Sheldon D. Parametric bootstrap for differentially private confidence intervals. 2021. http://arxiv.org/abs/2006.07749.
[58] Jiang W, Xie C, Zhang Z. Wishart mechanism for differentially private principal components analysis. Proceedings of the AAAI Conference on Artificial Intelligence. 2016 Feb;30(1):1. https://ojs.aaai.org/index.php/AAAI/article/view/10185. 10.1609/aaai.v30i1.10185.
[59] Foulds J, Geumlek J, Welling M, Chaudhuri K. On the theory and practice of privacy-preserving Bayesian data analysis. In: Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence. UAI’16. Arlington, Virginia, USA: AUAI Press; 2016. p. 192–201.
[60] Vu D, Slavkovic A. Differential privacy for clinical trial data: preliminary evaluations. In: 2009 IEEE International Conference on Data Mining Workshops; 2009. p. 138–43. ISSN: 2375-9259. https://ieeexplore.ieee.org/document/5360513. 10.1109/ICDMW.2009.52.
[61] Alabi D, McMillan A, Sarathy J, Smith A, Vadhan S. Differentially private simple linear regression. 2020. http://arxiv.org/abs/2007.05157.
[62] Sheffet O. Old techniques in differentially private linear regression. In: Proceedings of the 30th International Conference on Algorithmic Learning Theory. PMLR; 2019. p. 789–827. ISSN: 2640-3498. https://proceedings.mlr.press/v98/sheffet19a.html.
[63] Colnet B, Mayer I, Chen G, Dieng A, Li R, Varoquaux G, et al. Causal inference methods for combining randomized trials and observational studies: a review. 2021 Jul. http://arxiv.org/abs/2011.08047.
[64] Yang S. genRCT; 2021. Original-date: 2021-08-03T03:51:46Z. https://github.com/idasomm/genRCT.
[65] Scharfstein DO, Rotnitzky A, Robins JM. Rejoinder. J Amer Stat Assoc. 1999;94(448):1135–46. 10.1080/01621459.1999.10473869.
[66] Robins JM. Robust estimation in sequentially ignorable missing data and causal inference models. In: Proceedings of the American Statistical Association. vol. 1999. Indianapolis, IN; 2000. p. 6–10. https://cdn1.sph.harvard.edu/wp-content/uploads/sites/343/2013/03/jsaprocpat1.pdf.
[67] Wu E, Gagnon-Bartsch JA. The LOOP estimator: Adjusting for covariates in randomized experiments. Evaluat Rev. 2018 Aug;42(4):458–88. 10.1177/0193841X18808003.
[68] Wu E, Gagnon-Bartsch J, Sales A. loop.estimator; 2022. Original-date: 2018-12-14T18:45:15Z. https://github.com/adamSales/rebarLoop.
[69] Wu E, Sales A, Mann CZ, Gagnon-Bartsch J. dRCT; 2023. https://github.com/manncz/dRCT.
© 2025 the author(s), published by De Gruyter
This work is licensed under the Creative Commons Attribution 4.0 International License.