
Combining observational and experimental data for causal inference considering data privacy

  • Charlotte Z. Mann, Adam C. Sales and Johann A. Gagnon-Bartsch
Published/Copyright: March 11, 2025

Abstract

Combining observational and experimental data for causal inference can improve treatment effect estimation. However, many observational datasets cannot be released due to data privacy considerations, so one researcher may not have access to both experimental and observational data. Nonetheless, a small amount of risk of disclosing sensitive information might be tolerable to organizations that house confidential data. In these cases, organizations can employ data privacy techniques, which decrease disclosure risk, potentially at the expense of data utility. In this study, we explore disclosure limiting transformations of observational data, which can be combined with experimental data to estimate the sample and population average treatment effects. We consider leveraging observational data to improve generalizability of treatment effect estimates, when a randomized controlled trial (RCT) is not representative of the population of interest, and to increase precision of treatment effect estimates. Through simulation studies, we illustrate the trade-off between privacy and utility when employing different disclosure limiting transformations. We find that leveraging transformed observational data in treatment effect estimation can still improve estimation over only using data from an RCT.

MSC 2010: 62D20; 91A90; 68P27; 62P99

1 Introduction

A growing literature is developing ways to combine experimental and observational studies for causal inferences [1]. While treatment effect estimates from randomized controlled trials (RCTs) can be free from confounding bias, observational studies generally provide a richer source of information on a population of interest. The internet has given rise to more observational datasets that may be used to address the same questions as randomized experiments. Therefore, there should be more opportunities to leverage observational and RCT data together for improved treatment effect estimation. However, in practice, it is not always the case that a researcher has access to both sources of data due to data privacy [2].

Many government agencies have useful data that they cannot release to the public in order to preserve data privacy. Data privacy refers to the right of individuals whom the data describe to control what information about themselves is shared [3,4]. Typically, sensitive data that are released to the public are sanitized in various ways, which potentially render the released data less useful. For example, only aggregate statistics or a sample of the data may be released, or small values are censored. There is a trade-off to consider, between data privacy, a right to be upheld, and the amount of information researchers can access to answer societal questions.

Balancing data privacy and releasing useful information is ultimately a policy decision. There are currently no overarching policies in the US regulating data privacy or confidentiality [5,6]. Rather, data privacy is legislated on a sector-by-sector basis [5]. For example, patient, student, financial, and additional data from US governmental agencies are protected by separate acts: the Health Insurance Portability and Accountability Act; the Family Educational Rights and Privacy Act; the Fair Credit Reporting Act; and the Confidential Information Protection and Statistical Efficiency Act [5,7]. On the other hand, members of the European Union and other countries throughout the world have adopted data privacy legislation with broad scopes across sectors [5]. We do not address the legality of different approaches to data privacy. Rather, we raise these issues to establish that data stewards operate in different legal contexts, which may or may not provide specific procedures for protecting data privacy.

We consider the setting in which analysts of an RCT could potentially use auxiliary observational data to improve treatment effect estimation through data integration. However, the relevant auxiliary data cannot be released in its raw form. We aim to address the primary research question: Can privacy-preserving releases of confidential observational data be used to improve causal estimation when integrated with the RCT data?

There are two broad ways that integrating observational data into experimental treatment effect estimates can improve causal estimation, (1) estimating treatment effects for a population of interest when the RCT sample is not representative of that population and (2) increasing precision of RCT estimates. We consider two previously developed data integration methods for causal inference, each of which addresses one of these aims. These methods are well-suited for our investigation because they only require summaries of the auxiliary observational data. Then, we consider ways that stewards of such observational data could transform and release the data to preserve data privacy, as discussed in the data privacy literature. We focus on transformations that can still be used in data integration methods for treatment effect estimation. Finally, through simulation studies, we illustrate the trade-off between privacy and utility when integrating different releases of the private, observational data in RCT treatment effect estimation, both to generalize to populations of interest and to improve experimental precision.

The work in this study is distinct from previous work on releasing privacy-protected causal estimates, such as in previous studies [8–10], in a couple of ways. First, we do not consider releasing a private causal estimate, but rather how a private data release itself could be used in causal estimation with data integration. Second, previous studies [8–10] rely on a causal framework different from the one we use in this study.

We aim for this work to inform data stewards who want to release data that are as useful as possible while balancing data privacy. We also aim to encourage conversation between the literature that combines experimental and observational data for causal inference and the data privacy literature.

This work is organized as follows. Section 2 establishes notation and the causal estimands and estimators of interest. Section 3 provides background on data privacy techniques and presents disclosure limiting transformations of auxiliary data. Section 4 describes simulation studies to evaluate the utility of the proposed transformations in treatment effect estimation that integrates transformed auxiliary data and RCT data and discusses the results. Section 5 gives the conclusion of this work.

2 Leveraging observational and experimental data in causal inference

The presentation in this section primarily follows the research by Colnet et al. [1]. Consider a randomized experiment with $n$ subjects and related auxiliary observational data with $m$ subjects, both sampled from the same population. For example, alongside a medical trial conducted to evaluate a treatment for sleep apnea, there may also be electronic health records for thousands of patients with sleep apnea across a country. We denote $\mathcal{R} = \{i : 1, \ldots, n\}$ as the index set for subjects in the randomized experiment and $\mathcal{O} = \{i : n + 1, \ldots, m + n\}$ as the index set for subjects in the auxiliary (observational) data. Let $S_i$ be an indicator of whether subject $i$ is in the RCT sample, and $\pi_i^S = P(S_i = 1)$ be the probability of selection into the RCT. The probability of selection into the RCT may depend on subject $i$'s observable characteristics, so the RCT sample is not necessarily representative of the population. Let $T$ be a binary treatment indicator, so $T_i = 1$ if subject $i$ is assigned to treatment and $T_i = 0$ if subject $i$ is assigned to control. We assume the probability that a subject in the RCT is assigned treatment, $\pi_i = P(T_i = 1)$, $0 < \pi_i < 1$, is known. In the setting we consider, we do not require the treatment to be observed in the observational data. For each subject in the RCT and observational samples, we observe a vector of covariates $\mathbf{X}_i$. Throughout this study, the following conventions are used: fixed quantities are lowercase, random quantities are uppercase, scalars are non-bolded, and vectors and matrices are bolded.

Following Neyman [11] and Rubin [12], each subject has two potential outcomes: one that would be observed if treated, $Y_i^t$, and the other under control, $Y_i^c$. The observed outcome is a function of the potential outcomes and the treatment assignment: $Y_i = T_i Y_i^t + (1 - T_i) Y_i^c$. We assume no interference, i.e., $Y_i \perp \{T_j : j \neq i\}$ [13].

Researchers may be interested in a number of causal estimands. This work explores integrating experimental and observational data, under data privacy considerations, to estimate the population average treatment effect (PATE), as well as the RCT sample average treatment effect (SATE). The PATE is the expected treatment effect for a population of interest: $\tau_{\text{PATE}} = E[Y^t - Y^c]$. The SATE is the expected treatment effect for an RCT sample: $\tau_{\text{SATE}} = E[Y^t - Y^c \mid S = 1]$. Combining experimental and observational data can improve estimation of both the PATE and the SATE in different ways, as discussed further in this work.

2.1 Generalizing to populations of interest

Treatment effects estimated only with an RCT do not necessarily generalize to a population of interest. Consider estimating the PATE for the population from which subjects of an RCT were sampled. If the subjects in the RCT are systematically different from this population (i.e., there is some selection bias to inclusion in the RCT), and the treatment effect depends on subject characteristics, then an estimate of the PATE with only the RCT data is biased. Further, it will be unclear how “far off” the estimate is. In this case, we say that the RCT estimate does not generalize to the population. Observational data are often more representative of the population from which they were sampled. Therefore, to estimate the PATE, it is useful to leverage information from auxiliary observational data.

The literature proposes a number of estimators for the PATE that integrate experimental and observational data [14–18]; refer to Colnet et al. [1] for a full review. We focus on the calibration weighted (CW) estimator proposed by Lee et al. [18] for two primary reasons. First, a statistical summary of the auxiliary data is sufficient for the CW estimator. As will be discussed in detail in Section 3, statistical summaries of data already do some of the work to limit disclosure of confidential data. Thus, the CW estimator is better suited to take private releases of auxiliary data as an input than other estimators, which rely on the full auxiliary data at the subject level. Second, the results of the review in the study by Colnet et al. [1] indicate that the CW estimator outperforms other options across different settings.

The CW estimator [18] is a variation of the inverse probability weighted (IPW) estimator [19,20],

$$\hat{\tau}^{\text{IPW}} = \frac{1}{n} \sum_{i \in \mathcal{R}} \left( \frac{T_i Y_i}{\pi_i} - \frac{(1 - T_i) Y_i}{1 - \pi_i} \right).$$

The IPW estimator only uses data from the RCT ($i \in \mathcal{R}$), so it may be biased for the PATE if there is selection bias. The CW estimator re-weights the RCT subjects in the IPW estimator. The goal of the weighting is to align the empirical covariate distribution of the RCT subjects with that of the auxiliary subjects ($i \in \mathcal{O}$), who are assumed to represent the population of interest. In addition to the assumptions discussed previously, the CW estimator relies on the following assumptions to identify the PATE: (1) positivity of selection into the RCT ($0 < \pi_i^S < 1$), and (2) that the conditional average treatment effect (CATE) for the RCT sample is equivalent to the population CATE ($E[Y^t - Y^c \mid \mathbf{X}, S = 1] = E[Y^t - Y^c \mid \mathbf{X}]$).

The CW estimator is defined as

$$\hat{\tau}^{\text{CW}} = \sum_{i \in \mathcal{R}} \hat{w}(\mathbf{X}_i) \left( \frac{T_i Y_i}{\pi_i} - \frac{(1 - T_i) Y_i}{1 - \pi_i} \right).$$

Each subject $i \in \mathcal{R}$ is assigned a weight $\hat{w}(\cdot)$, which is estimated using covariates from the auxiliary study and the RCT by solving the optimization problem $\min_{w_1, \ldots, w_n} \sum_{i=1}^{n} w_i \log w_i$, subject to $w_i \geq 0$, $\sum_{i=1}^{n} w_i = 1$, and

(1) $\sum_{i \in \mathcal{R}} w_i g(\mathbf{X}_i) = \frac{1}{m} \sum_{i \in \mathcal{O}} g(\mathbf{X}_i)$.

Equation (1) is the key restriction on $w_i$ for generalizability: the goal is for the weighted sum of $g(\mathbf{X}_i)$ in the RCT sample to equal the simple mean of $g(\mathbf{X}_i)$ in the auxiliary sample. Two reasonable choices for $g(\cdot)$ would be $g(\mathbf{z}) = \mathbf{z}$ and $g(\mathbf{z}) = \mathbf{z}^\top \mathbf{z}$. With these choices, the first or second empirical moments of the RCT sample covariates are calibrated to those moments of the auxiliary sample covariates, since $\frac{1}{m} \sum_{i \in \mathcal{O}} g(\mathbf{X}_i)$ is a consistent estimator of $E[g(\mathbf{X})]$ for the auxiliary study. Lee et al. [18] noted that the calibration weighting estimator is a consistent estimator for the PATE if either (1) the probability of RCT participation can be modeled as $\exp\{\eta_0^\top g(\mathbf{X})\}$ for some $\eta_0$ or (2) the CATE is a linear function of $g(\mathbf{X})$.
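To make the optimization concrete, the following is a minimal sketch in R of estimating the calibration weights for the choice $g(\mathbf{z}) = \mathbf{z}$ via the convex dual of the entropy objective. The names `X_rct` and `mu_aux` are hypothetical, and this is an illustration rather than the implementation of Lee et al. [18], who solve the problem with Lagrange multipliers and Newton's method (Section 4.1.2).

```r
# Entropy-balancing calibration weights for g(x) = x (first moments).
# The solution has the form w_i proportional to exp(lambda' X_i); lambda is
# found by minimizing the convex dual objective. X_rct: n x p RCT covariates;
# mu_aux: auxiliary-sample column means (e.g., read off a released Gram matrix).
calib_weights <- function(X_rct, mu_aux) {
  Z <- sweep(X_rct, 2, mu_aux)           # center columns at the auxiliary means
  dual <- function(lambda) log(sum(exp(Z %*% lambda)))  # convex dual objective
  lambda_hat <- optim(rep(0, ncol(Z)), dual, method = "BFGS")$par
  w <- exp(Z %*% lambda_hat)
  as.vector(w / sum(w))                  # weights are nonnegative and sum to 1
}

# The CW point estimate then plugs the weights into the weighted IPW form,
# with T, Y, and pi the RCT treatment, outcome, and assignment probability:
# tau_cw <- sum(w * (T * Y / pi - (1 - T) * Y / (1 - pi)))
```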

2.2 Improving experimental precision

We next consider a context in which the treatment effect for only an RCT sample is of interest. Even though there may no longer be a larger population of interest, incorporating information from observational data can still prove useful, in this case by improving the precision of an estimate of the SATE.

RCTs often have small sample sizes, so RCT estimates of the SATE may lack precision. A common approach to improve precision is covariate adjustment. Covariate adjustment accounts for variance in $Y_i$ that is due not to the treatment but to observed covariates. A popular adjusted estimator for the SATE is the estimated coefficient on the treatment assignment in a regression model. One might consider a linear model $Y_i = \alpha + \tau T_i + \boldsymbol{\beta} \mathbf{X}_i + \varepsilon_i$ with the standard ordinary least squares (OLS) regression assumptions. We denote the estimate of the coefficient on $T_i$ in this model, using OLS and the RCT sample only, as $\hat{\tau}^{\text{OLS}}(\mathbf{X}_{\{i \in \mathcal{R}\}})$. The increase in precision achieved by covariate adjustment depends on how much of the variance in $Y_i$ can be explained by the observed covariates.

Because observational studies typically have much larger sample sizes than randomized experiments ($m \gg n$), incorporating information from auxiliary observational data can improve precision more than covariate adjustment with the RCT sample alone. There are various ways that auxiliary information can be integrated into RCT analysis to improve precision [21–28]. For example, one part of the literature leverages historical controls from observational studies or previously run RCTs to improve precision in RCT estimates by pooling the different sources of data [21,22]. We focus on an approach that uses auxiliary observational data to construct a highly predictive covariate for the outcome of interest in the RCT, as in previous studies [23,25,27,28]. We will call the general method the "super-covariate data integration approach."

The first step of the super-covariate data integration approach is to fit a model of the outcome of interest ($Y$) with the auxiliary data only. The only goal of this model is to predict the outcome of interest as well as possible in the RCT sample. The only requirements for the auxiliary prediction model are that (1) it uses only covariates that would be considered pre-treatment or baseline in the RCT and (2) predictions can be generated for the RCT sample (i.e., the same baseline covariates need to be available in the RCT and auxiliary data). Let $\hat{y}^{\mathcal{O}}(\cdot)$ denote such an auxiliary model. The second step of the approach is to generate outcome predictions for the RCT sample. Let $\hat{Y}_i = \hat{y}^{\mathcal{O}}(\mathbf{X}_i)$ denote a prediction for subject $i$ from the model fit on the auxiliary data, and let $\hat{\mathbf{Y}}^{\mathcal{O}}$ denote the $n \times 1$ vector of such predictions for the RCT subjects ($i \in \mathcal{R}$). Then, $\hat{\mathbf{Y}}^{\mathcal{O}}$ can be considered a baseline covariate because it is independent of the treatment assignment in the RCT sample. As the final step of the super-covariate approach, we replace the RCT covariates $\mathbf{X}_{\{i \in \mathcal{R}\}}$ with $\hat{\mathbf{Y}}^{\mathcal{O}}$ in any covariate-adjusted causal estimator. In this study, we replace the covariates in the regression estimator (i.e., we estimate $\tau$ with OLS in the model $Y_i = \alpha + \tau T_i + \beta \hat{Y}_i + \varepsilon_i$). We denote this regression estimator, using the auxiliary prediction as a covariate, as $\hat{\tau}^{\text{OLS}}(\hat{\mathbf{Y}}^{\mathcal{O}})$.
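As an illustration, here is a minimal sketch of the three steps in R, assuming the auxiliary model is a linear model and that `aux` and `rct` are hypothetical data frames, with `aux` containing only the outcome `y` and the shared baseline covariates (the approach itself allows any prediction model).

```r
# Step 1: fit an outcome model on the auxiliary data only.
aux_fit <- lm(y ~ ., data = aux)
# Step 2: generate predictions Y_hat for the RCT sample from the auxiliary model.
rct$y_hat <- predict(aux_fit, newdata = rct)
# Step 3: use the prediction as the adjustment covariate in the regression
# estimator of the SATE; `treat` is the RCT treatment indicator.
tau_hat <- coef(lm(y ~ treat + y_hat, data = rct))["treat"]
```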

Especially when there are a large number of covariates, a model fit on (large) auxiliary data can be more informative about the outcome of interest than a model fit on (small) RCT data. For this reason, we expect $\hat{\mathbf{Y}}^{\mathcal{O}}$ to be a powerful covariate for adjusting for the variability in the RCT outcome that is not explained by the treatment assignment (i.e., a "super-covariate"). Thus, adjusting with the auxiliary predictions can improve precision beyond covariate adjustment with the RCT covariates [25,27,28].

We focus on this data integration approach for the SATE because it is highly flexible, relies on few assumptions, and is highly efficacious. All that is required from the auxiliary data is a model of the outcome of interest. Thus, as with the CW estimator, only a statistical summary of the auxiliary data is required. Additionally, this method relies on very few assumptions with regard to the auxiliary and RCT data. The auxiliary model need not be "correct" in any sense. The only requirement for the approach to improve precision is that the auxiliary model predicts the outcomes in the RCT setting well. Finally, there is a body of previous literature supporting the efficacy of this approach in general [25,27–29], so we are interested in whether that efficacy can be maintained even when using privacy-preserving transformations of the auxiliary data, rather than the original data itself.

3 Disclosure limiting transformations of auxiliary data

In Section 2, we discussed two estimators that combine experimental and observational data. In practice, one entity may not have access to both types of data. We assume that analysts with access to a randomized experiment are interested in incorporating information from a relevant auxiliary observational study. However, the data stewards of the auxiliary study cannot release the data ($(Y_i, \mathbf{X}_i)$, $i \in \mathcal{O}$) due to data privacy. Therefore, we consider transformations of the restricted auxiliary data that limit disclosure risk and can be used in the estimators discussed in Section 2 ($\hat{\tau}^{\text{CW}}$ and $\hat{\tau}^{\text{OLS}}(\hat{\mathbf{Y}}^{\mathcal{O}})$). In this section, we introduce data privacy frameworks (Section 3.1), discuss the properties of differentially private algorithms (Section 3.2), and define the disclosure limiting transformations that we compare in this work (Section 3.3).

3.1 Data privacy overview

With the rise of the internet and an explosion of data availability, computer scientists and statisticians have been considering issues of data privacy for the past four decades (refer to refs [30–35] for reviews). Two distinct but related concepts are data privacy and data confidentiality. Data privacy is defined as the right of individuals to control information about themselves [3]. Data confidentiality is the agreement between individuals and data stewards regarding the extent to which others can access any private or sensitive information provided [3,4]. Disclosure risk refers to the risk that an attacker could access sensitive information from the released data. Historically, the focus was on microdata, data at the individual level that include information that could be used to identify an individual; however, more recent work considers the risk of disclosure for statistical summaries [35]. The goal of data privacy methods is to reduce disclosure risk.

We can view data privacy techniques as falling into two primary frameworks: statistical disclosure control (SDC) and differential privacy (DP) (refer to [4,35] for detailed discussions). SDC aims to uphold data privacy by maintaining data confidentiality. SDC techniques include, for example, synthetic data, cell suppression, data swapping, matrix masking, and noise addition to limit the risk of disclosure of individual identities and attributes [33]. There is not one measure of disclosure risk in the SDC framework. On the other hand, the DP framework provides a mathematical definition of disclosure risk, although for a specific type of risk (discussed in Section 3.2). We consider techniques under both frameworks in this study, and use the term disclosure limitation to mean limiting the risk of disclosing sensitive information, not specifically referring to SDC.

With any disclosure limiting procedure, there is a trade-off between privacy and utility [33,35,36]. To maximize utility, analysts would want the original data, and thus, there would be no data privacy. On the other hand, for there to be no disclosure risk, the data could not be released, so there would be no utility. Data stewards therefore must decide on a tolerable disclosure risk and a measure of data utility in order to balance the two competing forces. This is not a trivial task. As discussed in Section 1, depending on the context, there are not necessarily formal guidelines for determining a tolerable disclosure risk. Additionally, anticipating all of the desired uses of a dataset is not feasible. In this study, we focus on specific uses for the data releases, which are associated with specific metrics to assess utility. We do not specify a tolerable disclosure risk. Rather, the simulation studies in Section 4 explore the privacy-utility trade-off when employing different disclosure limiting transformations.

3.2 DP

Dwork et al. [37] and Dwork [38] introduced DP in the early 2000s. DP is considered the first rigorous mathematical quantification of disclosure risk for privacy-preserving algorithms. DP algorithms limit the difference in distribution between outputs of the algorithm generated from datasets that differ by only one observation [39].

Consider a private dataset $d$ and a dataset $d'$ which is a subset of $d$, differing by one observation: $d(d, d') = 1$.[1] Formally, a random algorithm $K$ achieves $\varepsilon$-DP if

$$\frac{P(K(d) \in R)}{P(K(d') \in R)} \leq \exp(\varepsilon),$$

for all $(d, d')$ and all $R \subseteq \text{image}(K)$. This is the typical representation of the bound, which is agnostic to whether $d$ or $d'$ is in the denominator. The parameter $\varepsilon > 0$ is chosen by the researcher and can be thought of as a "privacy budget." The smaller the privacy budget $\varepsilon$, the lower the risk of disclosure.

Algorithms that achieve $\varepsilon$-DP can be impractical, as the outputs may not resemble the restricted statistics closely enough to be useful. Therefore, multiple relaxations have been developed, including $(\varepsilon, \delta)$-DP [41], which guarantees that

$$P(K(d) \in R) \leq P(K(d') \in R) \exp(\varepsilon) + \delta.$$

In other words, $K$ achieves $\varepsilon$-DP with probability $1 - \delta$. We will focus on $(\varepsilon, \delta)$-DP in this work. A common algorithm that achieves $(\varepsilon, \delta)$-DP is the Gaussian mechanism [42], which adds random Gaussian noise to a statistic (called a query in the literature) calculated from the restricted data. Let $f(d) \in \mathbb{R}^k$, $k \geq 1$, be a statistic; then $K(d, \gamma) = f(d) + N_k(0, \gamma^2 \mathbf{I})$. $K(d, \gamma)$ achieves $(\varepsilon, \delta)$-DP when

$$\gamma = \frac{\Delta f \sqrt{2 \log(1.25 / \delta)}}{\varepsilon}, \qquad \Delta f = \max_{d, d'} \| f(d) - f(d') \|_2.$$

The researcher chooses $\varepsilon$ and $\delta$, which together make up the privacy budget. $\Delta f$ is called the global sensitivity of the statistic, or query, and measures how much $f(\cdot)$ could possibly change between any $(d, d')$. To calculate the global sensitivity for many typical statistics, such as the empirical mean or empirical variance, one must know or assume bounds on the data $d$. Refer to previous studies [39,40] for discussions of calculating global sensitivities.
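For instance, here is a minimal sketch of the Gaussian mechanism in R for a single empirical mean, assuming the data are bounded in $[lo, hi]$ so that the global sensitivity of the mean can be taken as $(hi - lo)/m$ (an illustrative assumption; general sensitivity calculations are discussed in refs [39,40]).

```r
# Gaussian mechanism for the empirical mean of m bounded values.
gaussian_mechanism_mean <- function(x, lo, hi, eps, delta) {
  m <- length(x)
  x <- pmin(pmax(x, lo), hi)       # enforce the assumed bounds on the data
  sens <- (hi - lo) / m            # assumed global (L2) sensitivity of the mean
  gamma <- sens * sqrt(2 * log(1.25 / delta)) / eps
  mean(x) + rnorm(1, mean = 0, sd = gamma)
}
```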

DP algorithms maintain a couple of benefits. First, the definition makes no assumptions about the information that an attacker has. DP algorithms are also robust to post-processing, so any transformation of a differentially private output is still differentially private. DP algorithms also have useful composition properties [40]. When multiple statistics are calculated from the same data, the amount of privacy budget used for each statistic is simply added to calculate the total privacy budget used, under Sequential Composition [42]. The composition of an $(\varepsilon_1, \delta_1)$-DP algorithm and an $(\varepsilon_2, \delta_2)$-DP algorithm applied to the same data is an $(\varepsilon_1 + \varepsilon_2, \delta_1 + \delta_2)$-DP algorithm. Therefore, to maintain $(\varepsilon, \delta)$-DP across $q$ different statistics $f(d)$ from the same data, an $(\varepsilon/q, \delta/q)$-DP algorithm is applied to each statistic. If an $(\varepsilon, \delta)$-DP algorithm is repeatedly applied to disjoint subsets of a dataset (e.g., histogram bins), then, under Parallel Composition, the result is an $(\varepsilon, \delta)$-DP algorithm (so the budget does not need to be split) [43].
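As a small worked example of sequential composition:

```r
# Releasing q statistics from the same data under a total budget (eps, delta)
# means running each query with an (eps / q, delta / q)-DP mechanism.
eps <- 6; delta <- 1e-5; q <- 3
c(eps_per_query = eps / q, delta_per_query = delta / q)  # (2, ~3.33e-06)
```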

There is a clear trade-off between the privacy budget $(\varepsilon, \delta)$ and the magnitude of the noise added to the statistic ($\gamma$). The magnitude of the noise additionally depends on the sensitivity of the statistic to changing one observation in the confidential data (and therefore on the largest outlier) and increases with the number of statistics calculated from the data. Therefore, if the privacy budget is small, the magnitude of the noise added can be so large that the transformed data are no longer useful.

Organizations have started adopting DP for disclosure limitation in recent years [7]. Notably, the US Census Bureau implemented a differentially private algorithm for releases of the 2020 Census redistricting data, with $\varepsilon = 17.14$ and $\delta = 10^{-10}$ [44]. There are no guidelines for choosing the privacy budget $\varepsilon$ and $\delta$, which is ultimately a policy choice [39]. In general, organizations currently use large privacy budgets because they make many queries from the same dataset. For example, in 2020, Google had a monthly privacy budget of $\varepsilon \approx 80$ for Mobility Reports [45].

3.3 Disclosure limiting transformations

We will compare two major approaches for releasing confidential data while limiting disclosure risk – synthetic data (3.3.1) and a differentially private Gram matrix (3.3.2) – as described in the following.

3.3.1 Synthetic data

Synthetic data, introduced by Rubin [46], are a fabricated version of a dataset, released when the raw microdata cannot be. Rubin [46] and Little [47] viewed data synthesis as a missing data problem, where the sensitive information is missing and can be imputed with multiple imputation. In general, synthetic data replace sensitive information in the original data with values generated from statistical summaries of the original data [4,48]. The general idea is to generate synthetic data by sampling from the empirical joint and marginal distributions of the columns in the confidential data. For example, a popular method for data synthesis generates variables sequentially, generating the next variable with predictions from classification and regression trees (CART) fit on the already synthesized variables [49]. We use this synthesis method in the simulations in this study.
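For instance, here is a minimal sketch of generating a synthetic release with the synthpop package in R, assuming `aux` is the confidential data frame.

```r
library(synthpop)
# Sequentially synthesize each variable using CART models fit on the
# previously synthesized variables; seed fixed for reproducibility.
syn_out <- syn(aux, method = "cart", seed = 1)
aux_syn <- syn_out$syn   # the synthetic dataset that would be released
```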

Synthetic data are appealing as a disclosure limiting transformation of confidential data because they can technically be analyzed with the same methods as the original data. However, they have two major drawbacks. First, even though no observations from the confidential data are released in such synthetic data, there is still a risk of information disclosure. Recent work has discussed the risk of leaking information with typical synthetic data generation [6]. More recently, methods have been developed to generate differentially private synthetic data [50,51], in which case the disclosure risk is clear, via the choice of privacy budget. The second drawback of releasing synthetic data is that making valid inferences with synthetic data requires clear communication from the data steward to the public of how the data can be analyzed.

3.3.2 Noise-infused Gram matrix

The major benefit of synthetic data is that they are at the same observation level as the confidential data that cannot be released. However, the estimators discussed in Section 2 do not require the individual-level auxiliary data. Estimating the PATE with $\hat{\tau}^{\text{CW}}(\cdot)$ only requires the first or second empirical moments of the auxiliary data. Estimating the SATE with $\hat{\tau}^{\text{OLS}}(\hat{\mathbf{Y}}^{\mathcal{O}})$ only requires predictions from a model of the outcome of interest fit on the covariates in the auxiliary data. Therefore, we consider statistical summaries of the auxiliary data to limit disclosure risk.

We define the data matrix $\mathbf{D} \in \mathbb{R}^{m \times (p + 2)}$ with rows $\mathbf{D}_i = (1, Y_i, \mathbf{X}_i)$. Let $\mathbf{Y}$ denote the $m \times 1$ vector of observed outcomes, $\mathbf{X}$ denote the $m \times p$ matrix of covariates, and $\mathbf{1}$ denote an $m \times 1$ vector of 1's. Then, the Gram matrix of the data matrix, $\mathbf{G} = \mathbf{D}^\top \mathbf{D} / m$, includes $\mathbf{X}^\top \mathbf{X} / m$, $\mathbf{X}^\top \mathbf{Y} / m$, $\hat{\mu}_Y = \mathbf{1}^\top \mathbf{Y} / m$, and $\hat{\boldsymbol{\mu}}_x = \mathbf{1}^\top \mathbf{X} / m$. The coefficients $\boldsymbol{\beta}$ for an OLS model are estimated as $\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y}$, so $\mathbf{G}$ is sufficient to estimate $\hat{\boldsymbol{\beta}}$ for the auxiliary study to leverage in $\hat{\tau}^{\text{OLS}}(\hat{\mathbf{Y}}^{\mathcal{O}})$. $\mathbf{G}$ also includes the first and second empirical moments of $\mathbf{D}$, to leverage in $\hat{\tau}^{\text{CW}}(\cdot)$.
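As an illustration, here is a minimal sketch in R of recovering the needed summaries from $\mathbf{G}$, with columns ordered as in $\mathbf{D}_i = (1, Y_i, \mathbf{X}_i)$ (column 1 is the intercept, column 2 is $Y$, and columns 3 to $p + 2$ are $\mathbf{X}$).

```r
# G is the (p + 2) x (p + 2) Gram matrix t(D) %*% D / m.
p   <- ncol(G) - 2
mu  <- G[1, -1]                            # first moments (means) of (Y, X)
idx <- c(1, 3:(p + 2))                     # columns of the OLS design (1, X)
beta_hat <- solve(G[idx, idx], G[idx, 2])  # = (X'X)^{-1} X'Y; the 1/m factors cancel
```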

Therefore, the Gram matrix of the auxiliary data, $\mathbf{G}$, seems like a sensible place to start for a disclosure limiting transformation of the data. In fact, $\mathbf{G} = \mathbf{D}^\top \mathbf{D} / m$ is a special version of matrix masking, an SDC technique [30]. However, this type of matrix masking does not meet current standards for disclosure limitation, so we propose releasing a noisy version of the Gram matrix of the auxiliary data. Specifically, we consider an $(\varepsilon, \delta)$-DP algorithm for releasing a noise-infused Gram matrix.

There is a body of literature that considers perturbed sufficient statistics for linear regression (OLS), penalized regression (such as ridge and LASSO), and principal component analysis (PCA), all of which are contained in $\mathbf{G}$; refer to Barrientos et al. [40] for a recent review. Kamath et al. [52], Balle and Wang [53], and Wang [54], building off of the work of Dwork et al. [55], proposed methods for adding Gaussian noise to a covariance matrix to achieve $(\varepsilon, \delta)$-DP. Chanyaswad et al. [56] considered matrix-valued DP releases in general and illustrated a method adding multivariate Gaussian noise to the Gram matrix. Another line of work proposes adding Laplace noise to the covariance matrix to achieve $\varepsilon$-DP [57–61]. Sheffet [62] proposed a mechanism for adding Wishart noise to achieve $(\varepsilon, \delta)$-DP. Most of these approaches do not address the issue that variables in real datasets can have largely different scales. Therefore, we take an approach similar to that of Dwork et al. [55], but adapt it to account for such differences in scale.

$\mathbf{G}$ is a symmetric matrix of statistics calculated from the data $\mathbf{D}$. Therefore, we can think of the upper triangle of $\mathbf{G}$ as a vector of statistics with length $k = \frac{1}{2}(p^2 + 5p + 2)$ and could apply the Gaussian mechanism described in Section 3.2 to this vector with some privacy budget $(\varepsilon, \delta)$, as in the study of Dwork et al. [55]. However, the magnitude of the Gaussian noise added to each element of the vector depends on the largest difference in the upper triangle of $\mathbf{G}$ when one observation is removed. The columns of $\mathbf{D}$ might be on vastly different scales and have different variances. This is therefore not the best approach for attaining $(\varepsilon, \delta)$-DP while adding as little noise as possible.

Instead, we divide the upper triangle of $\mathbf{G}$, including the diagonal, into different elements, as illustrated in Figure 1, and divide the privacy budget among those elements. First, we assume that the sample size $m$ is not confidential. Define the vector of column means, $\hat{\boldsymbol{\mu}} = \mathbf{1}^\top \mathbf{D} / m$; the matrix of cross-multiplied empirical moments of the columns, $\hat{\mathbf{C}}_{(p+1) \times (p+1)} = (\mathbf{Y}, \mathbf{X})^\top (\mathbf{Y}, \mathbf{X}) / m$; and the vector of empirical second moments, $\hat{\boldsymbol{\mu}}_2 = \text{diag}(\hat{\mathbf{C}})$. Then, $\mathbf{G}$ can be reconstructed as illustrated in Figure 1 with $A = \hat{\boldsymbol{\mu}}$, $B = \hat{\boldsymbol{\mu}}_2$, and $C$ the upper triangle of $\hat{\mathbf{C}}$ (excluding the diagonal).

Figure 1

Illustration of how the Gram matrix $\mathbf{G}$ can be partitioned into separate parts, which correspond to the column means (A), empirical second moments (B), and empirical moments of the columns cross-multiplied (C).

We construct the $(\varepsilon, \delta)$-DP $\mathbf{G}^*$ as follows. First, we divide the privacy budget between $\hat{\boldsymbol{\mu}}$, $\hat{\boldsymbol{\mu}}_2$, and the upper triangle of $\hat{\mathbf{C}}$ proportionally to the number of elements in each. Therefore, $\hat{\boldsymbol{\mu}}$ and $\hat{\boldsymbol{\mu}}_2$ are allocated $\frac{2(p+1)}{(p+2)(p+3)/2}$ and the upper triangle of $\hat{\mathbf{C}}$ is allocated $\frac{p(p+1)}{(p+2)(p+3)/2}$ of the privacy budget. We apply the Gaussian mechanism to each element of $\hat{\boldsymbol{\mu}}$, $\hat{\boldsymbol{\mu}}_2$, and the upper triangle of $\hat{\mathbf{C}}$ separately. Therefore, we split the privacy budget for the upper triangle of $\hat{\mathbf{C}}$ across its $p(p+1)/2$ elements to construct a $\left(\frac{p(p+1)}{(p+2)(p+3)/2} \varepsilon, \frac{p(p+1)}{(p+2)(p+3)/2} \delta\right)$-DP release of the upper triangle of $\hat{\mathbf{C}}$. By parallel composition, we do not need to further split the privacy budget to construct $\left(\frac{2(p+1)}{(p+2)(p+3)/2} \varepsilon, \frac{2(p+1)}{(p+2)(p+3)/2} \delta\right)$-DP releases of $\hat{\boldsymbol{\mu}}$ and $\hat{\boldsymbol{\mu}}_2$. Finally, we reconstruct a DP Gram matrix, $\mathbf{G}^*$, from the DP releases of $\hat{\boldsymbol{\mu}}$ and $\hat{\mathbf{C}}$. By sequential composition, the resulting noisy Gram matrix $\mathbf{G}^*$ is $(\varepsilon, \delta)$-DP.

Dividing the Gram matrix into these elements allows adding a smaller magnitude of noise while still achieving DP. First, separating the matrix into the different elements accounts for the different sensitivities of first and second moments, and second, applying the Gaussian mechanism element-wise within each element accounts for scale differences between columns. Refer to Appendix A for details of the sensitivity calculations used in the algorithm.
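The following is a minimal sketch in R of the release just described. The per-element sensitivities `sens_mu`, `sens_mu2`, and `sens_C` are placeholders standing in for the Appendix A calculations (which require bounds on the data), so this illustrates the budget-splitting logic rather than providing a complete implementation.

```r
# Noise-infused Gram matrix release; D = cbind(1, Y, X), m x (p + 2).
dp_gram_release <- function(D, eps, delta, sens_mu, sens_mu2, sens_C) {
  m <- nrow(D); p <- ncol(D) - 2
  mu  <- colMeans(D)[-1]                 # means of (Y, X)
  C   <- crossprod(D[, -1]) / m          # cross-multiplied moments of (Y, X)
  mu2 <- diag(C)
  total  <- (p + 2) * (p + 3) / 2        # elements in the upper triangle of G
  f_mean <- 2 * (p + 1) / total          # budget share for mu and mu2
  f_C    <- p * (p + 1) / total          # budget share for upper triangle of C
  n_C    <- p * (p + 1) / 2              # number of off-diagonal elements of C
  gauss  <- function(x, s, e, d) x + rnorm(length(x), 0, s * sqrt(2 * log(1.25 / d)) / e)
  mu_dp  <- gauss(mu,  sens_mu,  f_mean * eps, f_mean * delta)  # no further split
  mu2_dp <- gauss(mu2, sens_mu2, f_mean * eps, f_mean * delta)  # (parallel comp.)
  C_dp <- C                              # sequential composition across elements
  C_dp[upper.tri(C_dp)] <- gauss(C[upper.tri(C)], sens_C,
                                 f_C * eps / n_C, f_C * delta / n_C)
  C_dp[lower.tri(C_dp)] <- t(C_dp)[lower.tri(C_dp)]  # symmetrize
  diag(C_dp) <- mu2_dp
  rbind(c(1, mu_dp), cbind(mu_dp, C_dp)) # reassemble G* as in Figure 1
}
```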

There is no guarantee that the resulting DP matrix will be positive definite – an important property of Gram matrices. Therefore, we post-process $\mathbf{G}^*$ to ensure that it is positive definite, in a manner similar to that reported in the study by Barrientos et al. [40]. Namely, we set any negative eigenvalues to zero, add the median positive eigenvalue to all of the eigenvalues, and then reconstruct the matrix with the original eigenvectors and the transformed eigenvalues.
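A minimal sketch of this post-processing step in R (post-processing a DP release preserves DP):

```r
make_pd <- function(G_star) {
  eig  <- eigen(G_star, symmetric = TRUE)
  vals <- pmax(eig$values, 0)             # set negative eigenvalues to zero
  vals <- vals + median(vals[vals > 0])   # add the median positive eigenvalue
  eig$vectors %*% diag(vals) %*% t(eig$vectors)  # rebuild with original eigenvectors
}
```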

4 Simulation studies

We conduct two simulation studies to compare the utility of different auxiliary data releases in estimators of the PATE and SATE, respectively. As baselines, we use the difference-in-means estimator $\hat{\tau}^{\text{DM}}$ and the regression estimator $\hat{\tau}^{\text{OLS}}(\mathbf{X}_{\{i \in \mathcal{R}\}})$, which rely only on RCT data. The difference-in-means estimator is the difference in mean outcomes between the subjects assigned treatment and those assigned control. These baselines are compared to the CW estimator and the super-covariate data integration approach using regression, applied to the auxiliary data releases described in Sections 3.3.1 and 3.3.2.

4.1 Generalizing to populations of interest

We consider a setting where there is an RCT and a related observational study which were both sampled from a population of interest. The aim is to estimate the PATE for this population of interest. The auxiliary study is representative of this population. However, the RCT sample is not representative of the population of interest due to selection bias. We consider a modified version of the simulations in the studies by Colnet et al. [63] and Lee et al. [18], which assumes a heterogeneous treatment effect.

4.1.1 Data generation

We emulate a hypothetical randomized experiment and auxiliary study, assuming that the SATE in the RCT does not equal the desired PATE due to selection bias. We generate the $1 \times p$ covariate vector $\mathbf{X}_i$ from independent and identically distributed (i.i.d.) $N(1, \mathbf{I}_p)$ distributions. We additionally generate a covariate $X_i^S \sim N(1, 1)$, which impacts both selection into the RCT and subject $i$'s treatment effect. Let $S_i$ be an indicator of whether subject $i$ is selected into the RCT and $\pi_i^S = P(S_i = 1)$. We model the probability of selection into the RCT with a logistic regression model

$$\text{logit}\{\pi_i^S\} = -2 + \boldsymbol{\beta}_S \mathbf{X}_i + 0.5 X_i^S.$$

$\boldsymbol{\beta}_S$ is generated for each value of $p$, then fixed, with 50% of the elements of $\boldsymbol{\beta}_S$ set to $-\frac{1}{0.5p}$ and the others set to 0. Then, $S_i$ is generated from a Bernoulli distribution with probability $\pi_i^S$. If $S_i = 1$, subject $i$ is included in the RCT sample. We generate $\mathbf{X}_i$, $X_i^S$, and $S_i$ 1,300 times so that approximately 100 subjects are selected into the RCT ($n \approx 100$). The auxiliary study sample ($m = 10{,}000$) is then generated directly from the population, with $\mathbf{X}_i \sim N(1, \mathbf{I}_p)$ and $X_i^S \sim N(1, 1)$. The control potential outcomes are generated as $Y_i^c = 0.5 + \boldsymbol{\beta} \mathbf{X}_i + \varepsilon_i$, $\varepsilon_i \sim N(0, 0.3)$. $\boldsymbol{\beta}$ is fixed for each value of $p$, with 60% of the elements randomly selected to be $\sqrt{\frac{0.7}{0.6p}}$ and the others 0. Therefore, some covariates contribute to both $\pi_i^S$ and the outcome, one, or neither, and the covariates explain 70% of the variance in $Y_i^c$. We let $Y_i^t = Y_i^c + 0.5 X_i^S$. Since $E[X^S] = 1$, $\tau_{\text{PATE}} = 0.5$; based on the selection model, higher values of $X_i^S$ are favored for selection into the RCT, so the SATE will be larger than the PATE.
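For concreteness, here is a minimal sketch in R of one draw of the selection step. The nonzero entries of $\boldsymbol{\beta}_S$ and the intercept are the reconstructed values shown above, labeled here as assumptions.

```r
p <- 10; N <- 1300
beta_S <- sample(c(rep(-1 / (0.5 * p), 0.5 * p), rep(0, 0.5 * p)))  # assumed values
X  <- matrix(rnorm(N * p, mean = 1), N, p)        # X ~ N(1, I_p)
XS <- rnorm(N, mean = 1)                          # selection/effect covariate
pi_S <- plogis(-2 + drop(X %*% beta_S) + 0.5 * XS)
S  <- rbinom(N, 1, pi_S)                          # roughly 100 subjects selected
```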

4.1.2 Simulation procedure

For $p = 10$ and $p = 20$, we repeat the following procedure 1,000 times. First, we generate an observational and experimental dataset as described in Section 4.1.1. We then calculate each disclosure limiting transformation of the auxiliary data as described in Section 3.3, and the mean vector for the $p + 1$ covariates $\{\mathbf{X}, X^S\}$ from each transformation. Denote the mean vector $\hat{\boldsymbol{\mu}}(\cdot)$, so $\hat{\boldsymbol{\mu}}(\mathbf{G})$ is the (true) mean vector calculated from $\mathbf{G}$. We implement the differentially private algorithm (Section 3.3.2) to generate $\mathbf{G}^*$ with a range of $\varepsilon$ between 1 and 30 and $\delta = 10^{-5}$. We additionally generate a synthetic dataset ($\tilde{\mathbf{D}}$) using the synthpop package in R [49], which implements sequential synthesis.

We then generate the treatment assignments $T_i \sim \text{Bern}(0.5)$ for the RCT sample, calculate the observed outcomes $Y_i = T_i Y_i^t + (1 - T_i) Y_i^c$, and calculate the estimators. To compare the difference-in-means and regression estimators to a calibration weighted estimator with more comparable variance, we calculate the augmented CW estimator $\hat{\tau}^{\text{ACW-t}}(\cdot)$ as described in the study by Lee et al. [18]. The weights are estimated to calibrate the empirical mean of the RCT to the empirical mean of the auxiliary study $\hat{\boldsymbol{\mu}}(\cdot)$; i.e., in the optimization problem (equation (1)), $\frac{1}{m} \sum_{i \in \mathcal{O}} g(\mathbf{X}_i) = \hat{\boldsymbol{\mu}}(\cdot)$. The weights are estimated by solving the optimization problem with Lagrange multipliers and Newton's method (refer to Lee et al. [18] for more details). We adapt the genRCT package [64] code to calculate $\hat{\tau}^{\text{ACW-t}}(\cdot)$ in R. We additionally estimate a 95% confidence interval for each estimator. We use the standard variance estimators for the difference-in-means and regression estimators, and the bootstrap variance estimator suggested by Lee et al. [18], with 100 bootstrap samples, for the $\hat{\tau}^{\text{ACW-t}}(\cdot)$ estimator.

4.1.3 Utility metrics

We evaluate the utility of each private data release for estimating the PATE with two metrics. First, we look at the mean squared error (MSE) of the estimator. Second, we consider the coverage probability of a 95% confidence interval. We additionally consider utility metrics for estimates of the column means themselves in the auxiliary data. For this, we compare the square root of the MSE (RMSE) of the column means for each disclosure limiting transformation to the RMSE of the confidential auxiliary data ($\mathbf{G}$).

4.1.4 Results

Figure 2 shows the empirical MSE for each estimator, estimated across the 1,000 simulations. We note that we do not show results for the DP release $\mathbf{G}^*$ with $\varepsilon = 1$ and $p = 20$ because the algorithm could not find a solution for the calibration weights in a large portion of the simulations. Blue represents the variance component of the MSE and orange represents the squared bias component. The difference-in-means and regression estimators, which rely only on data from the RCT, have large estimated MSEs, primarily due to large squared biases, as compared to the little or no bias of $\hat{\tau}^{\text{ACW-t}}(\cdot)$. Due to the shift in the distribution of $X^S$ in the RCT sample, the average SATE across simulations is 0.86, which is larger than the true PATE of 0.5. The difference-in-means and regression estimators are unbiased for the SATE, so they are upwardly biased for the PATE in this case. $\hat{\tau}^{\text{ACW-t}}(\cdot)$ performs similarly across the different disclosure limiting transformations of the auxiliary data. The CW estimator using the DP Gram matrix releases performs similarly to using the original auxiliary data when $p = 10$, even with a small privacy budget (corresponding to a high level of privacy). The results are similar when $p = 20$, although the variances of all estimators increase.

Figure 2

MSE of estimators of the PATE calculated across 1,000 iterations for different numbers of covariates ($p$). Error bars represent two simulation standard errors. The MSE is decomposed into the squared bias and the variance of the estimator. The first two estimators do not use data integration: $\hat{\tau}^{\text{DM}}$ is the difference-in-means estimator and $\hat{\tau}^{\text{OLS}}(\mathbf{X}_{\{i \in \mathcal{R}\}})$ is the regression estimator with only RCT covariates. We do not show results for $\hat{\tau}^{\text{ACW-t}}$ with $\mathbf{G}^*$, $\varepsilon = 1$, and $p = 20$ due to computational infeasibility.

Table 1 shows the empirical coverage probability of the true PATE (0.5) for 95% confidence intervals, across the 1,000 simulations. The regression estimator has very poor coverage for the PATE, as does the difference-in-means estimator, which only has higher coverage because its variance is larger. The coverage of the bootstrap 95% confidence intervals for the CW estimator is close to 0.95 across the different disclosure limiting transformations of the auxiliary data, with the exception of $\mathbf{G}^*$ (the DP Gram matrix) when $\varepsilon = 1$, and when $\varepsilon = 3$ and $p = 20$.

Table 1

Estimated coverage probability for 95% confidence intervals for estimators of the PATE for different numbers of covariates ( p )

Estimate of the PATE                                            ε     Coverage
                                                                      p = 10          p = 20
RCT data only
$\hat{\tau}^{\text{DM}}$                                              0.51 (0.010)    0.51 (0.013)
$\hat{\tau}^{\text{OLS}}(\mathbf{X}_{\{i \in \mathcal{R}\}})$         0.08 (0.009)    0.13 (0.010)
Includes auxiliary data
$\hat{\tau}^{\text{ACW-t}}(\mathbf{G})$                               0.95 (0.007)    0.96 (0.005)
$\hat{\tau}^{\text{ACW-t}}(\mathbf{G}^*)$                       1     0.94 (0.007)    –
$\hat{\tau}^{\text{ACW-t}}(\mathbf{G}^*)$                       3     0.95 (0.007)    0.92 (0.012)
$\hat{\tau}^{\text{ACW-t}}(\mathbf{G}^*)$                       6     0.95 (0.006)    0.95 (0.008)
$\hat{\tau}^{\text{ACW-t}}(\mathbf{G}^*)$                       15    0.96 (0.005)    0.95 (0.007)
$\hat{\tau}^{\text{ACW-t}}(\mathbf{G}^*)$                       30    0.95 (0.005)    0.95 (0.005)
$\hat{\tau}^{\text{ACW-t}}(\tilde{\mathbf{D}})$                       0.95 (0.006)    0.95 (0.006)

Note: Simulation standard errors in parentheses.

Table 2 shows the RMSE of the column means for the different releases of the confidential auxiliary data. This gives some insight into the utility of releasing $\mathbf{G}^*$ with different privacy budgets $\varepsilon$, or synthetic data, for uses of the column means other than the data integration estimator considered here. The RMSE of the column means for synthetic data is larger than for the confidential data, but changes only a small amount when the number of covariates increases. We find that when $p = 10$, for $\varepsilon \geq 6$, the RMSE of the column means is essentially the same as for the original confidential data. However, there is a large jump in the RMSE when $\varepsilon = 1$, and for small privacy budgets when there are more covariates ($p = 20$).

Table 2

RMSE of column means for releases of confidential auxiliary (observational) data, varying the number of covariates p

Data release                  ε     RMSE of column means ($\times 10^{-2}$)
                                    p = 10          p = 20
$\mathbf{G}$ (no privacy)           0.98 (0.007)    0.99 (0.005)
$\tilde{\mathbf{D}}$                1.44 (0.009)    1.51 (0.008)
$\mathbf{G}^*$ (DP)           1     8.58 (0.089)    –
$\mathbf{G}^*$ (DP)           3     2.02 (0.026)    8.16 (0.059)
$\mathbf{G}^*$ (DP)           6     1.09 (0.009)    3.80 (0.030)
$\mathbf{G}^*$ (DP)           15    0.97 (0.007)    1.29 (0.009)
$\mathbf{G}^*$ (DP)           30    0.97 (0.007)    1.00 (0.005)

Note: Simulation standard errors in parentheses.

To summarize, when the PATE is the estimand of interest, leveraging auxiliary data in the analysis of an RCT with $\hat{\tau}^{\text{ACW-t}}(\cdot)$ can greatly reduce the MSE of the treatment effect estimate, and this continues to be true when using auxiliary data that have undergone disclosure limiting transformations.

4.2 Improving experimental precision

In the second simulation study, the goal is to estimate the average treatment effect for an RCT sample. We assume that there is an RCT evaluating a specific treatment, and a related observational study which includes the outcome of interest. We do not necessarily observe the treatment in the auxiliary observational study. However, given that observational studies typically have large sample sizes, we can expect to get good estimates of model parameters for a model of the outcome on observed covariates, resulting in good predictions of the outcome. Further, if there are a large number of covariates, an observational study will likely better estimate model parameters for all covariates than a small RCT. As discussed in Section 2.2, predictions of the outcome based on a model trained on the auxiliary data can therefore be a very powerful covariate to include in RCT covariate adjustment. For simplicity, we consider a setting where the RCT and observational samples arise from the same, linear, data generating model (there is no covariate shift) and where the regression models match this data generating model. There are other covariate adjustment methods incorporating auxiliary data that account for covariate shift (Appendix B).

4.2.1 Data generation

We emulate a hypothetical RCT with $n = 100$ and an auxiliary (observational) dataset with $m = 10{,}000$. For both, we generate $\mathbf{X}_i$ from i.i.d. $N(0, \mathbf{I}_p)$ distributions. Then, $Y_i^c = 0.5 + \boldsymbol{\beta} \mathbf{X}_i + \varepsilon_i$, where $\varepsilon_i \sim N(0, 0.3)$. To make some covariates contribute to the outcome, while others are noise, 60% of the elements of $\boldsymbol{\beta}_{1 \times p}$ are set to $\sqrt{\frac{0.7}{0.6p}}$, while the rest are set to 0. By this data generating process, $V[Y_i^c] = 1$, and the proportion of variance explained by the covariates is 0.7. We let $\tau_{\text{SATE}} = 0.5$, so $Y_i^t = Y_i^c + 0.5$ under the constant treatment effect assumption. In the auxiliary data, we assume that there is no treatment, so $Y_i = Y_i^c$ for $i \in \mathcal{O}$.
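For concreteness, here is a minimal sketch in R of one draw of the control potential outcomes under this design (the coefficient value $\sqrt{0.7/(0.6p)}$ is the reconstruction shown above).

```r
p <- 10; m <- 10000
beta <- sample(c(rep(sqrt(0.7 / (0.6 * p)), 0.6 * p), rep(0, 0.4 * p)))
X  <- matrix(rnorm(m * p), m, p)                         # X ~ N(0, I_p)
Yc <- 0.5 + drop(X %*% beta) + rnorm(m, sd = sqrt(0.3))  # Var[Yc] = 0.7 + 0.3 = 1
```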

4.2.2 Simulation procedure

For each value of $p$, we repeat the following procedure 100 times (100 data generations).

First, we generate an RCT and auxiliary dataset as described above. We then calculate each disclosure limiting transformation of the auxiliary data as described in Section 4.1.2. With each transformation of the auxiliary data, we calculate the OLS coefficient estimates for a model of the outcome on the covariates (the auxiliary model). With each of these coefficient vectors, we predict the outcome for subjects in the RCT ($i \in \mathcal{R}$) (the auxiliary predictions). Denote this prediction $\hat{Y}_i(\cdot)$, so $\hat{Y}_i(\mathbf{G})$ is the prediction of the outcome for subject $i$ based on the coefficients calculated from $\mathbf{G}$. Let $\hat{\mathbf{Y}}^{\mathcal{O}}(\mathbf{G})$ denote the $n \times 1$ vector of these predictions for the RCT subjects. The auxiliary data are then set aside.

Since we are interested in the average treatment effect for an RCT sample, we treat the outcomes and covariates as fixed. We then generate 1,000 treatment assignment vectors, $T_i \sim \text{Bern}(0.5)$, $i \in \mathcal{R}$. For each treatment assignment vector, we generate the observed outcomes and calculate each estimator: $\hat{\tau}^{\text{DM}}$, $\hat{\tau}^{\text{OLS}}(\mathbf{x}_{\{i \in \mathcal{R}\}})$, and $\hat{\tau}^{\text{OLS}}(\hat{\mathbf{Y}}^{\mathcal{O}}(\cdot))$ for each disclosure limiting transformation. Thus, we get an estimate of the distribution of the comparison estimators, based on a simulated distribution of the treatment assignment. Because $n = 100$, the regression estimator fit only on the RCT covariates, $\hat{\tau}^{\text{OLS}}(\mathbf{x}_{\{i \in \mathcal{R}\}})$, runs into dimensionality issues as $p$ grows. Therefore, we allow a maximum of 20 RCT covariates to be included in the regression model. With $p = 50$, we choose 20 covariates that have a non-zero coefficient in the data generating model, emulating an analysis of the RCT that uses variable selection and/or expert knowledge to select predictive covariates.

4.2.3 Utility metrics

Because we use unbiased estimators of the SATE, we evaluate the utility of the data integration approach, with different auxiliary data releases, using the variance of the given point estimator. We calculate the empirical variance of the estimates across the 1,000 treatment assignment vectors for each data generation. To more easily compare the variances, we consider the relative efficiency of each point estimator as compared to $\hat{\tau}^{\text{OLS}}(\mathbf{X}_{\{i \in \mathcal{R}\}})$, which relies only on RCT data. The relative efficiency is calculated as the ratio of the (simulated) variance of $\hat{\tau}^{\text{OLS}}(\mathbf{X}_{\{i \in \mathcal{R}\}})$ to the variance of each estimator. The relative efficiency can also be interpreted as a multiplicative effect on the required sample size for a given estimator to achieve the same efficiency as $\hat{\tau}^{\text{OLS}}(\mathbf{X}_{\{i \in \mathcal{R}\}})$. Relative efficiencies greater than 1 indicate that the corresponding estimator is more precise than $\hat{\tau}^{\text{OLS}}(\mathbf{X}_{\{i \in \mathcal{R}\}})$, while relative efficiencies less than 1 indicate the opposite. For the main results, we average the variances and relative efficiencies of each estimator across the 100 data generations and calculate a standard error as the standard deviation of the 100 metrics divided by $\sqrt{100}$.
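In code, for one data generation, the relative efficiency reduces to a ratio of empirical variances across the simulated treatment assignments; `tau_ols_rct` and `tau_candidate` are hypothetical vectors of the 1,000 estimates.

```r
# Relative efficiency versus the RCT-only regression estimator.
rel_eff <- var(tau_ols_rct) / var(tau_candidate)  # > 1: more precise than RCT-only OLS
```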

We additionally consider the utility of statistical summaries of the auxiliary data releases, which are the intermediate step of the data integration method. First, we look at the RMSE of the coefficient vector $\hat{\boldsymbol{\beta}}$ calculated from an auxiliary data release. This is a typical utility metric for OLS in the DP literature [40]. Finally, we consider the Frobenius and spectral norms of the difference between the confidential Gram matrix $\mathbf{G}$ and the Gram matrix resulting from each transformation.

4.2.4 Results

Figure 3 shows the relative efficiency of each estimator compared to $\hat{\tau}^{\text{OLS}}(\mathbf{X}_{\{i \in \mathcal{R}\}})$, for $p = 10$, 20, and 50 covariates. Let us first focus on the performance of the super-covariate data integration approach if we had access to the confidential auxiliary data ($\mathbf{G}$). Since the regression model fit with the RCT was correctly specified for $p = 10$ and $p = 20$, we expected only small gains in precision from integrating the observational data. When $p = 50$, the regression estimator fit on only the RCT covariates is limited in the number of predictive covariates that can be included in the model, so there are greater gains to using predictions fit on the auxiliary data. All of the estimators fall somewhere between the difference-in-means estimator ($\hat{\tau}^{\text{DM}}$, orange circle) and the data integration approach with the confidential auxiliary data ($\mathbf{G}$, purple triangle).

Figure 3

Simulated relative efficiency of the $\hat{\tau}^{\mathrm{OLS}}(\cdot)$ estimator with different covariates compared with the $\hat{\tau}^{\mathrm{OLS}}(\boldsymbol{X}_{\{i \in \mathcal{R}\}})$ estimator (using RCT covariates only). The relative efficiency, or sample size multiplier, is calculated as $\hat{\tau}^{\mathrm{OLS}}(\boldsymbol{X}_{\{i \in \mathcal{R}\}})/\hat{\tau}^{\mathrm{OLS}}(\cdot)$ (a ratio of simulated variances), so values larger than 1 mean that the estimator is more efficient. $\hat{\boldsymbol{Y}}_{\mathcal{O}}(\cdot)$ is the vector of predictions for the RCT sample from a model fit on the auxiliary data using a certain release of the auxiliary data. $\varepsilon$ is the privacy budget for DP transformations. Error bars represent two simulation standard errors.

Using synthetic data ($\tilde{D}$, blue square) in the super-covariate data integration approach performs similarly to using the confidential data itself ($G$), always outperforming the use of the RCT covariates alone. The performance of the DP Gram matrix release ($\tilde{G}$, shades of green triangles) depends on the number of covariates and the privacy budget, $\varepsilon$. When there are a small number of covariates ($p = 10$), using $\hat{\boldsymbol{Y}}_{\mathcal{O}}(\tilde{G})$ as a covariate in the regression estimator performs as well as using the RCT covariates when $\varepsilon$ is greater than 6. However, when $p = 20$, a privacy budget of $\varepsilon = 15$ or greater is needed for the data integration approach with $\hat{\boldsymbol{Y}}_{\mathcal{O}}(\tilde{G})$ to outperform regression with the RCT covariates alone. When $p = 50$, the data integration approach with the DP Gram matrix never outperforms using the RCT covariates alone, and for most of the privacy budgets, it has similar variance to the difference-in-means estimator. The DP transformations lose utility quickly as $p$ increases because the required magnitude of noise for the Gaussian mechanism increases with $p^2$. These results clearly illustrate the privacy-utility trade-off, with the relative efficiency of $\hat{\tau}^{\mathrm{OLS}}(\hat{\boldsymbol{Y}}_{\mathcal{O}}(\tilde{G}))$ increasing as the privacy budget increases.

We find that the utility of the super-covariate data integration is lost for the DP transformations of the auxiliary data when there are a large number of covariates. However, the noise necessary to achieve DP also depends on the sensitivity of the statistics to removing one observation from the data, and this sensitivity decreases as $m$ increases. Figure 4 shows the variance of $\hat{\tau}^{\mathrm{OLS}}$ with different covariates, varying the size of the auxiliary study. It illustrates that the utility of $\hat{\boldsymbol{Y}}_{\mathcal{O}}(\tilde{G})$ as a covariate approaches the utility of $\hat{\boldsymbol{Y}}_{\mathcal{O}}(G)$ as the size of the auxiliary study gets very large.
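
As a rough numerical illustration of this point (assuming coordinates bounded in $[-1, 1]$, the per-entry sensitivities from Appendix A, and a standard Gaussian-mechanism noise calibration; the paper's exact calibration may differ):

```r
## With per-entry sensitivity 1/m for a mean-scaled Gram entry, the Gaussian
## noise scale shrinks like 1/m as the auxiliary sample grows.
eps <- 6; delta <- 1e-5
for (m in c(1e3, 1e4, 1e5)) {
  sigma <- (1 / m) * sqrt(2 * log(1.25 / delta)) / eps
  cat(sprintf("m = %1.0e  sigma = %.2e\n", m, sigma))
}
```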

Figure 4

Simulated variance of $\hat{\tau}^{\mathrm{OLS}}(\cdot)$ with $p = 10$ covariates using only RCT data (circles) and incorporating auxiliary information (triangles), increasing the size of the auxiliary data, $m$. $\hat{\boldsymbol{Y}}_{\mathcal{O}}(\cdot)$ is the vector of predictions for the RCT sample from a model fit on the auxiliary data using a certain release of the auxiliary data. $\varepsilon$ is the privacy budget for DP transformations. Error bars represent two simulation standard errors.

As noted previously, these results are based on correctly specified regression models and on RCT and auxiliary samples that arise from the same data generating process. In practice, neither assumption may hold, and design-based estimators may be preferred. See Appendix B for a discussion and simulations showing that these results hold with a design-based estimator that is robust to model mis-specification and covariate shift.

Table 3 presents additional utility metrics for the disclosure limiting releases. As with Table 2, this can give us an idea of the utility of the releases for uses of the Gram matrix aside from the super-covariate data integration approach. First, in terms of the OLS coefficients estimated from a regression model for the auxiliary sample ($\hat{\beta}$), the synthetic data ($\tilde{D}$) perform very similarly to the original confidential auxiliary data ($G$). When $p = 10$ and $\varepsilon > 6$, the DP Gram matrix also estimates the coefficients with similar RMSE. With a large number of covariates and small privacy budgets, however, the RMSE of $\hat{\beta}$ is much larger for the DP Gram matrices ($\tilde{G}$) than for the confidential data ($G$). There is a similar pattern in the Frobenius and spectral norms of the different Gram matrix releases.

Table 3

Additional utility metrics for disclosure limiting transformations of confidential auxiliary data

Release           ε     RMSE of β̂    Frobenius norm   Spectral norm
p = 10
  G (no privacy)  —     0.01          0.00             0.00
  D̃ (synthetic)   —     0.02          0.17             0.10
  G̃ (DP)          1     0.24          5.69             3.05
  G̃ (DP)          3     0.16          2.18             1.02
  G̃ (DP)          6     0.08          0.80             0.39
  G̃ (DP)          15    0.03          0.25             0.14
  G̃ (DP)          30    0.02          0.14             0.08
  G̃ (DP)          50    0.01          0.11             0.06
p = 20
  G (no privacy)  —     0.01          0.00             0.00
  D̃ (synthetic)   —     0.03          0.34             0.19
  G̃ (DP)          1     0.22          28.20            12.03
  G̃ (DP)          3     0.18          10.30            4.36
  G̃ (DP)          6     0.15          6.30             2.48
  G̃ (DP)          15    0.11          2.33             0.84
  G̃ (DP)          30    0.05          0.72             0.30
  G̃ (DP)          50    0.03          0.44             0.19
p = 50
  G (no privacy)  —     0.01          0.00             0.00
  D̃ (synthetic)   —     0.03          0.76             0.32
  G̃ (DP)          1     0.15          373.32           105.51
  G̃ (DP)          3     0.14          125.09           35.37
  G̃ (DP)          6     0.14          63.09            17.93
  G̃ (DP)          15    0.13          26.15            7.42
  G̃ (DP)          30    0.12          14.18            3.97
  G̃ (DP)          50    0.10          9.79             2.60
β̂ is the coefficient vector from an OLS model of the outcome on the covariates. The matrix norms are the norms of the difference between the non-private Gram matrix G and the corresponding release. We compute a Gram matrix from D̃ to calculate the norms.

In summary, the utility of the super-covariate data integration approach is more affected by the type of disclosure limiting transformation applied to the auxiliary data than is the CW estimator. This makes sense because the data integration approach requires the full Gram matrix, whose $\frac{1}{2}(p^2 + 5p + 2)$ elements each have random noise added, versus only the mean vector of $p$ elements. If a large privacy budget $\varepsilon$ is acceptable, then releasing a DP Gram matrix for the purpose of improving efficiency with this data integration approach could still be effective.

5 Discussion

In this study, we presented disclosure limiting transformations of observational data that can be combined with experimental data to estimate the PATE and SATE. These transformations included a differentially private Gram matrix and synthetic data. We found that leveraging transformed versions of observational data to estimate the PATE greatly reduced the MSE (and eliminated bias) compared with methods using only the RCT data. We also found that transformed versions of auxiliary data could improve precision when estimating the SATE, beyond the precision achieved through covariate adjustment with RCT covariates alone, provided there are few covariates or a large privacy budget is used for the DP Gram matrix release.

To the best of our knowledge, there is no broad discussion in the literature of using the Gram matrix of a data matrix as a disclosure limiting transformation. The Gram matrix is a reasonable place to start for disclosure limiting data releases since it provides useful summary statistics of the data and some protection against disclosure, which can be augmented with additional random noise. The Gram matrix also supports flexible model fitting: any outcome and covariates can be chosen from the columns of the data matrix. As has been discussed in the DP literature, a DP covariance matrix can be used for PCA and for ridge and LASSO regularized regression in addition to standard OLS.
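
As a small illustration of this flexibility, the following R sketch recovers an OLS fit directly from the Gram matrix of the augmented data matrix $[\mathbf{1}, X, y]$; the construction is generic and not specific to any particular release:

```r
## OLS from the Gram matrix alone: any column of the augmented data matrix
## can play the role of the outcome.
set.seed(3)
n <- 500; p <- 4
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% c(1, -1, 0.5, 0)) + rnorm(n)
A <- cbind(1, X, y)                         # augmented data matrix [1, X, y]
G <- crossprod(A)                           # (p + 2) x (p + 2) Gram matrix
xy <- 1:(p + 1)                             # intercept and covariate block
beta_hat <- solve(G[xy, xy], G[xy, p + 2])  # = (X'X)^{-1} X'y, with intercept
all.equal(unname(beta_hat), unname(coef(lm(y ~ X))))  # TRUE
```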

In addition to disclosure risk and utility, there are practical considerations when choosing a disclosure limiting technique. An important one is that the additional uncertainty introduced by privacy transformations may be harder to communicate to data users for some techniques than for others. In the case of the DP Gram matrix release, $\tilde{G}$, the additional uncertainty is relatively clear: we add noise with a known distribution to the Gram matrix, the variance of which is governed by a small number of parameters. On the other hand, estimating valid standard errors associated with synthetic data requires multiple replicates of the synthetic data and survey sampling methods.

We focus on a small set of disclosure limiting techniques and causal estimators for the sake of discussion. Further evaluation of disclosure limiting transformations and causal estimators would be valuable. For example, the inverse propensity score weighted (IPSW) estimator is an estimator of the PATE that combines experimental and observational data [63]. The IPSW estimator requires pooling experimental and observational data and therefore requires subject-level data. Future work could evaluate the utility of different methods for generating synthetic versions of observational data for use in the IPSW estimator. The results also point to further work on generating differentially private Gram matrices with higher utility for OLS estimation and inference and for use in the data integration approach.

Integrating observational and experimental data in treatment effect estimation is a powerful and exciting direction in causal inference. There is a lost opportunity when RCT analysts cannot access relevant observational data. On the other hand, individuals who provide their data have the right to control their information and expect that sensitive information will not be released to the public. In practice, choosing a tolerable disclosure risk, balanced with data utility, is a policy decision. In this work, we present information to inform such a decision, illustrating the privacy-utility trade-off for different data privacy techniques when integrating private auxiliary data with experimental data for causal inference.

Acknowledgements

The authors thank the reviewers and editors for their thoughtful feedback and suggestions, which improved the manuscript. We also thank Jeremy Seeman for his valuable feedback.

  1. Funding information: The research reported here was supported by the Institute of Education Sciences, US Department of Education, through Grant R305D210031 to the University of Michigan. The opinions expressed are those of the authors and do not represent views of the Institute or the US Department of Education nor other funders. C.Z.M. was additionally supported by the National Science Foundation RTG grant DMS-1646108.

  2. Author contributions: All authors discussed and reviewed the manuscript and simulation results and have accepted responsibility for the entire content of this manuscript and consented to its submission. J.A.G.B. conceived the project. C.Z.M. and J.A.G.B. designed simulation studies. C.Z.M. conducted literature reviews, conducted all simulations, and drafted the manuscript. All authors discussed and provided feedback on the manuscript.

  3. Conflict of interest: The authors state no conflicts of interest.

  4. Data availability statement: All code to generate data and replicate simulations is available at https://github.com/manncz/exp-obs-priv.

Appendix A Sensitivity calculations for differentially private Gram matrix algorithm

Take an $n \times p$ dataset $X$. Let $x$ and $y$ be columns of $X$, and let $x_i$ be the $i$th element of the vector $x$. We are interested in calculating the $\ell_2$ sensitivity

$$\Delta f = \max_{X, X'} \Vert f(X) - f(X') \Vert_2,$$

where $d(X, X') = 1$, or in other words, $X$ and $X'$ differ by one observation. We will assume that $d(X, X') = 1$ indicates that $X'$ is a subset of $X$, with one observation removed. For simplicity, and without loss of generality, we take observation $i = 1$ to be the one removed in the calculations below.

A.1 Mean sensitivity

We will look at the sensitivity of the empirical mean of one column. We can also think of this as the sensitivity of the sum of the values of one column when one observation is removed. Assume that $x$ is bounded below by $b$ and above by $B$. Then

$$n \Delta f = \max_{X, X'} \left| \sum_{i=1}^{n} x_i - \sum_{i=2}^{n} x_i \right| = \max_{X, X'} |x_1| = \max(|B|, |b|).$$

So for the empirical mean, $\Delta f = \frac{1}{n} \max(|B|, |b|)$.

A.2 Second (non-central) moment sensitivity

We will look at the sensitivity of the empirical expectation of the product of $x$ and $y$, defined by $\frac{1}{n} \sum_{i=1}^{n} x_i y_i$. Assume that $x$ is bounded below by $b_x$ and above by $B_x$, and that $y$ is bounded below by $b_y$ and above by $B_y$. Then

$$n \Delta f = \max_{X, X'} \left| \sum_{i=1}^{n} x_i y_i - \sum_{i=2}^{n} x_i y_i \right| = \max_{X, X'} |x_1 y_1| = \max(|B_x|, |b_x|) \max(|B_y|, |b_y|).$$

So for the second non-central moment, $\Delta f = \frac{1}{n} \max(|B_x|, |b_x|) \max(|B_y|, |b_y|)$.
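
For concreteness, here is a hedged R sketch of how these sensitivities might calibrate per-entry Gaussian noise for a Gram matrix of sums; the entry-wise calibration and the assumed bounds are illustrative, not the paper's exact algorithm (a full-matrix $\ell_2$ calibration would aggregate across entries):

```r
## Illustrative Gaussian-mechanism release of a Gram matrix of sums, with
## coordinates clipped to [-1, 1] so each entry has sensitivity at most 1.
dp_gram <- function(D, eps = 6, delta = 1e-5) {
  D <- pmin(pmax(D, -1), 1)               # enforce the assumed coordinate bounds
  G <- crossprod(D)                       # Gram matrix of sums
  sigma <- sqrt(2 * log(1.25 / delta)) / eps  # standard (eps, delta) noise scale
  p <- ncol(D)
  E <- matrix(0, p, p)
  up <- upper.tri(E, diag = TRUE)
  E[up] <- rnorm(sum(up), sd = sigma)     # noise for each unique entry
  E <- E + t(E) - diag(diag(E))           # mirror to keep the release symmetric
  G + E
}
```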

Appendix B Design-based covariate adjustment with auxiliary data

Using a prediction of the outcome as a covariate in the regression estimator is effective for reducing variance in experimental estimates when the model is correctly specified and there is no covariate shift between the RCT and the auxiliary study. In practice, neither assumption may hold. Design-based methods for covariate adjustment do not require modeling assumptions beyond those typically employed in design-based analysis of randomized experiments. The covariate adjusted estimator proposed by Gagnon-Bartsch et al. [28] is robust to model mis-specification and accounts for possible covariate shift between the RCT and the auxiliary study.

Gagnon-Bartsch et al. [28] is part of a literature proposing to residualize the outcomes in the IPW estimator with some function of the covariates, as in the augmented IPW estimator [20,23,25,65–67]. In general, such adjusted estimators take the form

$$\hat{\tau}_{f\mathrm{IPW}} = \frac{1}{n} \sum_i \left[ \frac{T_i \left( Y_i - \hat{f}(x_i) \right)}{\pi} - \frac{(1 - T_i) \left( Y_i - \hat{f}(x_i) \right)}{1 - \pi} \right].$$

Authors have proposed different functions $f(\cdot)$ for adjustment. Aronow and Middleton [23] remained agnostic to an exact choice of $f(\cdot)$, but noted that the choice could impact efficiency. Sales et al. [25] used $f(x_i) = y_i^c$ (so $\hat{f}(x_i)$ estimates the control potential outcome for subject $i$). Wu and Gagnon-Bartsch [67] proposed $f(x_i) = m_i = \pi y_i^t + (1 - \pi) y_i^c$. $m_i$ is not observed, but if $\hat{f}(x_i) = m_i$, then $\hat{\tau}_{f\mathrm{IPW}} = \tau^{\mathrm{SATE}}$, minimizing the estimator's variance. Therefore, the better $\hat{f}(x_i)$ estimates $m_i$, the more precise $\hat{\tau}_{f\mathrm{IPW}}$ will be.
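
A minimal R implementation of the adjusted estimator above, assuming a Bernoulli($\pi$) design and a generic prediction vector `f_hat` (the function name is ours, e.g., predictions from a model fit on auxiliary data):

```r
## Adjusted IPW estimator: residualize outcomes by f_hat in both arms.
tau_fipw <- function(y, tr, f_hat, pi = 0.5) {
  mean(tr * (y - f_hat) / pi - (1 - tr) * (y - f_hat) / (1 - pi))
}
```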

Using auxiliary observational data to estimate $f(\cdot)$ can further improve precision because the auxiliary data likely have a much larger sample size than the randomized experiment. Aronow and Middleton [23] and Sales et al. [25] suggested using auxiliary data directly to estimate $f(\cdot)$, for example, letting $\hat{f}(\cdot)$ be a regression model fit on the auxiliary data. Gagnon-Bartsch et al. [28] instead suggested using predictions of the outcome from a regression model fit on the auxiliary data as a covariate when estimating $m_i$, following Wu and Gagnon-Bartsch [67].

Like the regression estimator discussed in Section 2.2, the estimator of Gagnon-Bartsch et al. [28] only requires a model of the outcome of interest fit on the auxiliary data. We denote this specific variation of the adjusted IPW estimator as $\hat{\tau}^{\mathrm{SS}}(\cdot)$. We run the same simulations as in Section 4.2, using $\{\hat{Y}_i(\cdot), \boldsymbol{x}_i\}$ as covariates in the ensemble approach of $\hat{\tau}^{\mathrm{SS}}(\cdot)$ instead of the regression estimator. The ensemble approach interpolates between a model using the RCT covariates and a model using only the auxiliary prediction, so it is robust to covariate shift. We implement the estimator using the loop.estimator package in R [68] (since replaced by the dRCT package [69]). Figure A1 shows that the adjusted IPW estimator $\hat{\tau}^{\mathrm{SS}}(\cdot)$ performs similarly to the regression estimator in this setting. However, $\hat{\tau}^{\mathrm{SS}}(\cdot)$ guards better against very noisy auxiliary predictions and is always more efficient than the difference-in-means estimator, even with small privacy budgets and $p = 20$ or $p = 50$.

Figure A1

Simulated relative efficiency of the $\hat{\tau}^{\mathrm{SS}}(\cdot)$ estimator with different covariates compared with the $\hat{\tau}^{\mathrm{OLS}}(\boldsymbol{X}_{\{i \in \mathcal{R}\}})$ estimator (using RCT covariates only). The relative efficiency, or sample size multiplier, is calculated as $\hat{\tau}^{\mathrm{OLS}}(\boldsymbol{X}_{\{i \in \mathcal{R}\}})/\hat{\tau}$ (a ratio of simulated variances), where $\hat{\tau}$ is the given estimator. Values larger than 1 mean that the estimator is more efficient. $\hat{\boldsymbol{Y}}_{\mathcal{O}}(\cdot)$ is the vector of predictions for the RCT sample from a model fit on the auxiliary data using a certain release of the auxiliary data. $\varepsilon$ is the privacy budget for DP transformations. Error bars represent two simulation standard errors.

References

[1] Colnet B, Mayer I, Chen G, Dieng A, Li R, Varoquaux G, et al. Causal inference methods for combining randomized trials and observational studies: a review. Stat Sci. 2024;39(1):165–91. 10.1214/23-STS889.

[2] Snoke J, Bowen CM. How statisticians should grapple with privacy in a changing data landscape. CHANCE. 2020;33(4):6–13. 10.1080/09332480.2020.1847947.

[3] Fellegi IP. On the question of statistical confidentiality. J Amer Stat Assoc. 1972;67(337):7–18. 10.1080/01621459.1972.10481199.

[4] Raghunathan TE. Synthetic data. Ann Rev Stat Appl. 2021;8(1):129–40. 10.1146/annurev-statistics-040720-031848.

[5] Schwartz PM. The EU-US privacy collision: a turn to institutions and procedures. Harvard Law Rev. 2013;126(7):1966–2009.

[6] Bellovin SM, Dutta PK, Reitinger N. Privacy and synthetic datasets. SSRN Scholarly Paper; 2018. https://papers.ssrn.com/abstract=3255766. 10.2139/ssrn.3255766.

[7] Wood A, Altman M, Bembenek A, Bun M, Gaboardi M, Honaker J, et al. Differential privacy: a primer for a non-technical audience. Vand J Ent & Tech L. 2018;21:209. 10.2139/ssrn.3338027.

[8] Kusner M, Sun Y, Sridharan K, Weinberger K. Private causal inference. In: 19th International Conference on Artificial Intelligence and Statistics. vol. 51. Cadiz, Spain: JMLR; 2016.

[9] Wang L, Pang Q, Song D. Towards practical differentially private causal graph discovery. In: 34th Conference on Neural Information Processing Systems. Vancouver, Canada; 2020.

[10] Ma P, Ji Z, Pang Q, Wang S. NoLeaks: differentially private causal discovery under functional causal model. IEEE Trans Inform Forensics Security. 2022;17:2324–38. 10.1109/TIFS.2022.3184263.

[11] Neyman J, Iwaskiewicz K, Kolodziejczyk S. Statistical problems in agricultural experimentation (with discussion). Suppl J R Stat Soc. 1935;2:107–80. 10.2307/2983637.

[12] Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educat Psychol. 1974;66(5):688–701. 10.1037/h0037350.

[13] Angrist JD, Imbens GW, Rubin DB. Identification of causal effects using instrumental variables. J Amer Stat Assoc. 1996;91(434):444–55. 10.1080/01621459.1996.10476902.

[14] Cole SR, Stuart EA. Generalizing evidence from randomized clinical trials to target populations: the ACTG 320 trial. Am J Epidemiol. 2010;172(1):107–15. 10.1093/aje/kwq084.

[15] Stuart EA, Cole SR, Bradshaw CP, Leaf PJ. The use of propensity scores to assess the generalizability of results from randomized trials. J R Stat Soc Ser A (Stat Soc). 2011;174(2):369–86. 10.1111/j.1467-985X.2010.00673.x.

[16] Lesko CR, Buchanan AL, Westreich D, Edwards JK, Hudgens MG, Cole SR. Generalizing study results: a potential outcomes perspective. Epidemiology. 2017;28(4):553–61. 10.1097/EDE.0000000000000664.

[17] Dahabreh IJ, Robertson SE, Tchetgen EJ, Stuart EA, Hernán MA. Generalizing causal inferences from individuals in randomized trials to all trial-eligible individuals. Biometrics. 2019;75(2):685–94. 10.1111/biom.13009.

[18] Lee D, Yang S, Dong L, Wang X, Zeng D, Cai J. Improving trial generalizability using observational studies. Biometrics. 2021;79(2):1–13. 10.1111/biom.13609.

[19] Horvitz DG, Thompson DJ. A generalization of sampling without replacement from a finite universe. J Amer Stat Assoc. 1952;47(260):663–85. 10.1080/01621459.1952.10483446.

[20] Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. J Amer Stat Assoc. 1994;89(427):846–66. 10.1080/01621459.1994.10476818.

[21] Pocock SJ. The combination of randomized and historical controls in clinical trials. J Chronic Diseases. 1976;29(3):175–88. 10.1016/0021-9681(76)90044-8.

[22] Viele K, Berry S, Neuenschwander B, Amzal B, Chen F, Enas N, et al. Use of historical control data for assessing treatment effects in clinical trials. Pharm Stat. 2014;13(1):41–54. 10.1002/pst.1589.

[23] Aronow PM, Middleton JA. A class of unbiased estimators of the average treatment effect in randomized experiments. J Causal Infer. 2013;1(1):135–54. 10.1515/jci-2012-0009.

[24] Deng A, Xu Y, Kohavi R, Walker T. Improving the sensitivity of online controlled experiments by utilizing pre-experiment data. In: Proceedings of the Sixth ACM International Conference on Web Search and Data Mining (WSDM '13). Rome, Italy: ACM Press; 2013. p. 123. 10.1145/2433396.2433413.

[25] Sales AC, Hansen BB, Rowan B. Rebar: reinforcing a matching estimator with predictions from high-dimensional covariates. J Educ Behav Stat. 2018;43(1):3–31. 10.3102/1076998617731518.

[26] Gui G. Combining observational and experimental data using first-stage covariates. 2020. http://arxiv.org/abs/2010.05117. 10.2139/ssrn.3662061.

[27] Opper IM. Improving average treatment effect estimates in small-scale randomized controlled trials. EdWorkingPapers.com, Annenberg Institute at Brown University; 2021. https://www.edworkingpapers.com/ai21-344. 10.7249/WRA1004-1.

[28] Gagnon-Bartsch JA, Sales AC, Wu E, Botelho AF, Erickson JA, Miratrix LW, et al. Precise unbiased estimation in randomized experiments using auxiliary observational data. J Causal Inference. 2023;11(1):20220011. 10.1515/jci-2022-0011.

[29] Sales AC, Prihar E, Gagnon-Bartsch J, Gurung A, Heffernan NT. More powerful A/B testing using auxiliary data and deep learning. In: Rodrigo MM, Matsuda N, Cristea AI, Dimitrova V, editors. Artificial intelligence in education. Lecture Notes in Computer Science. Cham: Springer International Publishing; 2022. p. 524–7. 10.1007/978-3-031-11647-6_107.

[30] Duncan GT, Pearson RW. Enhancing access to microdata while protecting confidentiality: prospects for the future. Stat Sci. 1991;6(3):219–32. 10.1214/ss/1177011681.

[31] Fienberg SE. Confidentiality and data protection through disclosure limitation: evolving principles and technical advances. Philippine Stat. 2000;49(1–4):1–12.

[32] Aggarwal CC, Yu PS. Privacy-preserving data mining: models and algorithms. New York, NY: Springer; 2008. 10.1007/978-0-387-70992-5_2.

[33] Matthews GJ, Harel O. Data confidentiality: a review of methods for statistical disclosure limitation and methods for assessing privacy. Stat Surveys. 2011;5:1–29. 10.1214/11-SS074.

[34] Salas J, Domingo-Ferrer J. Some basics on privacy techniques, anonymization and their big data challenges. Math Comput Sci. 2018;12(3):263–74. 10.1007/s11786-018-0344-6.

[35] Slavkovic A, Seeman J. Statistical data privacy: a song of privacy and utility. 2022. http://arxiv.org/abs/2205.03336.

[36] Bowen CM. Personal privacy and the public good: balancing data privacy and data utility. Urban Institute Research Report; 2021. https://www.urban.org/research/publication/personal-privacy-and-public-good-balancing-data-privacy-and-data-utility.

[37] Dwork C, McSherry F, Nissim K, Smith A. Calibrating noise to sensitivity in private data analysis. In: Halevi S, Rabin T, editors. Theory of cryptography. Berlin, Heidelberg: Springer; 2006. p. 265–84. 10.1007/11681878_14.

[38] Dwork C. Differential privacy. In: Bugliesi M, Preneel B, Sassone V, Wegener I, editors. Automata, languages and programming. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer; 2006. p. 1–12. 10.1007/11787006_1.

[39] Bowen CM, Garfinkel S. The philosophy of differential privacy. Notices Amer Math Soc. 2021;68(10). 10.1090/noti2363.

[40] Barrientos AF, Williams AR, Snoke J, Bowen CM. A feasibility study of differentially private summary statistics and regression analyses with evaluations on administrative and survey data. J Amer Stat Assoc. 2023. 10.1080/01621459.2023.2270795.

[41] Dwork C, Smith A. Differential privacy for statistics: what we know and what we want to learn. J Privacy Confidentiality. 2010;1(2). 10.29012/jpc.v1i2.570.

[42] Dwork C, Roth A. The algorithmic foundations of differential privacy. Found Trends Theoret Comput Sci. 2013;9(3–4):211–407. 10.1561/0400000042.

[43] McSherry FD. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In: Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data. Providence, Rhode Island; 2009. 10.1145/1559845.1559850.

[44] Jarmin RS. Disclosure avoidance for the 2020 census: an introduction. US Census Bureau; 2021. https://www.census.gov/library/publications/2021/decennial/2020-census-disclosure-avoidance-handbook.html.

[45] Rogers R, Subramaniam S, Peng S, Durfee D, Lee S, Kancha SK, et al. LinkedIn's audience engagements API: a privacy preserving data analytics system at scale. 2020. http://arxiv.org/abs/2002.05839. 10.29012/jpc.782.

[46] Rubin DB. Discussion: statistical disclosure limitation. J Official Stat. 1993;9(2):461–8. https://www.scb.se/contentassets/ca21efb41fee47d293bbee5bf7be7fb3/discussion-statistical-disclosure-limitation2.pdf.

[47] Little R. Statistical analysis of masked data. J Official Stat. 1993;9(2):407–26. https://www.scb.se/contentassets/ca21efb41fee47d293bbee5bf7be7fb3/statistical-analysis-of-masked-data.pdf.

[48] Snoke J, Raab GM, Nowok B, Dibben C, Slavkovic A. General and specific utility measures for synthetic data. J R Stat Soc Ser A (Stat Soc). 2018;181(3):663–88. 10.1111/rssa.12358.

[49] Nowok B, Raab GM, Dibben C. synthpop: bespoke creation of synthetic data in R. J Stat Softw. 2016;74(11):1–26. 10.18637/jss.v074.i11.

[50] Wilchek M, Wang Y. Synthetic differential privacy data generation for revealing bias modelling risks. In: 2021 IEEE Intl Conf on Parallel and Distributed Processing with Applications, Big Data and Cloud Computing, Sustainable Computing and Communications, Social Computing and Networking (ISPA/BDCloud/SocialCom/SustainCom); 2021. p. 1574–80. 10.1109/ISPA-BDCloud-SocialCom-SustainCom52081.2021.00211.

[51] Boedihardjo M, Strohmer T, Vershynin R. Private measures, random walks, and synthetic data. 2022. http://arxiv.org/abs/2204.09167.

[52] Kamath G, Li J, Singhal V, Ullman J. Privately learning high-dimensional distributions. In: Proceedings of the Thirty-Second Conference on Learning Theory. PMLR; 2019. p. 1853–902. https://proceedings.mlr.press/v99/kamath19a.html.

[53] Balle B, Wang YX. Improving the Gaussian mechanism for differential privacy: analytical calibration and optimal denoising. In: Proceedings of the 35th International Conference on Machine Learning. PMLR; 2018. p. 394–403.

[54] Wang YX. Revisiting differentially private linear regression: optimal and adaptive prediction & estimation in unbounded domain. 2018. http://arxiv.org/abs/1803.02596.

[55] Dwork C, Talwar K, Thakurta A, Zhang L. Analyze Gauss: optimal bounds for privacy-preserving principal component analysis. In: Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing (STOC '14). New York, NY, USA: Association for Computing Machinery; 2014. p. 11–20. 10.1145/2591796.2591883.

[56] Chanyaswad T, Dytso A, Poor HV, Mittal P. MVG mechanism: differential privacy under matrix-valued query. In: Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security; 2018. p. 230–46. 10.1145/3243734.3243750.

[57] Ferrando C, Wang S, Sheldon D. Parametric bootstrap for differentially private confidence intervals. 2021. http://arxiv.org/abs/2006.07749.

[58] Jiang W, Xie C, Zhang Z. Wishart mechanism for differentially private principal components analysis. Proc AAAI Conf Artif Intell. 2016;30(1). 10.1609/aaai.v30i1.10185.

[59] Foulds J, Geumlek J, Welling M, Chaudhuri K. On the theory and practice of privacy-preserving Bayesian data analysis. In: Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence (UAI '16). Arlington, Virginia, USA: AUAI Press; 2016. p. 192–201.

[60] Vu D, Slavkovic A. Differential privacy for clinical trial data: preliminary evaluations. In: 2009 IEEE International Conference on Data Mining Workshops; 2009. p. 138–43. 10.1109/ICDMW.2009.52.

[61] Alabi D, McMillan A, Sarathy J, Smith A, Vadhan S. Differentially private simple linear regression. 2020. http://arxiv.org/abs/2007.05157.

[62] Sheffet O. Old techniques in differentially private linear regression. In: Proceedings of the 30th International Conference on Algorithmic Learning Theory. PMLR; 2019. p. 789–827. https://proceedings.mlr.press/v98/sheffet19a.html.

[63] Colnet B, Mayer I, Chen G, Dieng A, Li R, Varoquaux G, et al. Causal inference methods for combining randomized trials and observational studies: a review. 2021. http://arxiv.org/abs/2011.08047.

[64] Yang S. genRCT; 2021. https://github.com/idasomm/genRCT.

[65] Scharfstein DO, Rotnitzky A, Robins JM. Rejoinder. J Amer Stat Assoc. 1999;94(448):1135–46. 10.1080/01621459.1999.10473869.

[66] Robins JM. Robust estimation in sequentially ignorable missing data and causal inference models. In: Proceedings of the American Statistical Association. Indianapolis, IN; 2000. p. 6–10. https://cdn1.sph.harvard.edu/wp-content/uploads/sites/343/2013/03/jsaprocpat1.pdf.

[67] Wu E, Gagnon-Bartsch JA. The LOOP estimator: adjusting for covariates in randomized experiments. Evaluat Rev. 2018;42(4):458–88. 10.1177/0193841X18808003.

[68] Wu E, Gagnon-Bartsch J, Sales A. loop.estimator; 2022. https://github.com/adamSales/rebarLoop.

[69] Wu E, Sales A, Mann CZ, Gagnon-Bartsch J. dRCT; 2023. https://github.com/manncz/dRCT.

Received: 2022-12-15
Revised: 2024-07-16
Accepted: 2025-01-24
Published Online: 2025-03-11

© 2025 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
