Abstract
We consider causal inference in observational studies with choice-based sampling, in which subject enrollment is stratified on treatment choice. Choice-based sampling has been considered mainly in the econometrics literature, but it can be useful for biomedical studies as well, especially when one of the treatments being compared is uncommon. We propose new methods for estimating the population average treatment effect under choice-based sampling, including doubly robust methods motivated by semiparametric theory. A doubly robust, locally efficient estimator may be obtained by replacing nuisance functions in the efficient influence function with estimates based on parametric models. The use of machine learning methods to estimate nuisance functions leads to estimators that are consistent and asymptotically efficient under broader conditions. The methods are compared in simulation experiments and illustrated in the context of a large observational study in obstetrics. We also make suggestions on how to choose the target proportion of treated subjects and the sample size in designing a choice-based observational study.
Acknowledgements
We thank the Editor (Dr. Melanie Prague) and three anonymous reviewers for constructive comments on earlier versions of the paper. Chunling Liu’s research was supported by Hong Kong Polytechnic University (grant # G-YBCU) and the General Research Fund of Hong Kong (grant # 15327216).
Appendix
A Semiparametric Theory
In this appendix, we derive the efficient influence functions for estimating
where ϖ is unknown and the conditional densities are unknown and unspecified. We will note later that the true value of ϖ, if known, does not change the efficient influence functions.
Consider a regular parametric submodel indexed by θ with
with
where
As a functional of the nonparametric model, the parameter
We will show that the canonical gradient of
It is easy to verify that
for every regular parametric submodel. By construction,
is equal to
This can be demonstrated by verifying the following:
where the last line follows from the fact that
Using a similar argument, it can be shown that the efficient influence function for
is given by
Consequently, the efficient influence function for
where
The preceding discussion treats ϖ as unknown. We now consider the case that ϖ is known and argue that
Clearly,
and similarly for the other two parameters.
To gain some insight into
where
It follows that
Noting that
Consequently,
Through conditioning on W (in addition to A), it can be shown that
B Asymptotic Theory for Sections 2.2–2.5
Standard regularity conditions from M-estimation theory (e.g., [38], Chapter 5) are assumed. These include identifiability and smoothness (in the parameters) of the parametric models, the existence of integrable envelopes that permit use of the dominated convergence theorem, and certain Donsker properties that help deal with random functions. Techniques for verifying the Donsker property can be found in van der Vaart and Wellner [39]. Let
Asymptotics for δ̂_or and δ̃_or
We assume here that the OR model m(a,w;β) is correct. Under regularity conditions, we have
The dominated convergence theorem can be used to show that the mapping of (β, ϖ) to
By the delta method,
where
Because
Similarly, it can be shown that
and
We note that
and therefore
It follows that
Asymptotics for δ̂_wt and δ̃_wt
We assume here that the sample PS model
and argue as in the preceding proof that
where
Similarly, for
If the population PS p(w) is fully known and substituted into the weighted estimators, then the term
Asymptotics for δ̂_dr.p and δ̃_dr.p
We write
We assume that
To demonstrate the DR property of
The first term on the right-hand side has mean δ, as argued in Section 2.2. The definition of m as a conditional mean function implies that the other two terms on the right-hand side have mean 0, regardless of
The first line on the right-hand side has mean δ, as argued in Section 2.3. The definition of r as a density ratio implies that the other two terms on the right-hand side have mean 0, regardless of
Assuming correct specification of one or both of the OR and PS models, it can be argued as before that
where
The DR property of
if the OR and PS models are both correct.
Similarly, it can be shown that
where
If the OR model is correct,
Asymptotics for δ̂_dr.np
We write
where
In what follows, we demonstrate that
assuming that condition (9) holds and that
where
Simple algebraic manipulations yield
It follows from the Cauchy-Schwarz inequality that
where
Comparing
Assumption (2) and equation (3) together imply that
Similarly, it can be shown that
Adding the last two equations, we obtain
The same arguments can be used to show that
under condition (9) and the Donsker condition stated earlier.
C Technical Details for Section 2.6
Alternative OR Estimators
The alternative OR estimators are given by
Under a correct OR model, it can be shown as before that
where
If
and therefore
Given the similarity of
Justification for Equation (11)
Simple algebraic manipulations yield
The last line is identically 0 because
which follows from equation (3).
Alternative Weighted Estimators
The alternative weighted estimators are given by
Assume the population PS model is correct, and define
Then
where
Alternative DR Estimators
The alternative DR.P estimators can be derived as follows. Without using sampling weights, a naive DR estimator of δ would be
which can be rearranged as
Replacing
Assuming correct specification of one or both of the OR and population PS models, it can be argued as before that
where
If both models are correct, then
Note that
Following similar arguments, alternative DR.NP estimators can be obtained as
where
The alternative DR.SS estimators are similar to the alternative DR.NP estimators except for the use of sample splitting. They also behave similarly to the (respective) alternative DR.NP estimators. A detailed discussion is omitted.
D Additional Simulation Results
Figures 6–11 present the additional simulation results mentioned but not reported in Section 3.1.

Figures 6–8: Estimation error (estimator minus true value of δ) distributions from the simulation experiments in Section 3.1.
Figures 9–11: Coverage proportions from the simulation experiments in Section 3.1.
E R Code
library(SuperLearner)

logit = function(p) log(p/(1-p))
expit = function(u) 1/(1+exp(-u))

# outcome regression method based on a parametric OR model
# W does not have one as the first column
# if non-zero, inter represents the columns of W that interact with a
est.or = function(y, a, W, pi, inter=0, family=gaussian) {
  n = length(y)
  treated = (a>0.5); control = !treated
  one = rep(1, n); zero = rep(0, n)
  D = cbind(one, a, W)
  if (inter[1]!=0) D = cbind(D, a*W[,inter])
  or.mod = glm(y~0+D, family=family)
  D1 = cbind(one, one, W)
  if (inter[1]!=0) D1 = cbind(D1, one*W[,inter])
  m1 = predict(or.mod, newdata=list(D=D1), type="response")
  mu1 = pi*mean(m1[treated])+(1-pi)*mean(m1[control])
  D0 = cbind(one, zero, W)
  if (inter[1]!=0) D0 = cbind(D0, zero*W[,inter])
  m0 = predict(or.mod, newdata=list(D=D0), type="response")
  mu0 = pi*mean(m0[treated])+(1-pi)*mean(m0[control])
  c(mu1, mu0, mu1-mu0)
}

# weighted method based on a parametric PS model
# W does not have one as the first column
est.wtd = function(y, a, W, pi, family=binomial) {
  treated = (a>0.5); control = !treated
  pi.s = mean(treated)
  ps.mod = glm(a~W, family=family)
  ps.s = ps.mod$fitted.values                     # PS in the sample
  ps = expit(logit(ps.s)-logit(pi.s)+logit(pi))   # PS in the population
  w1 = 1/ps[treated]; w0 = 1/(1-ps[control])
  mu1 = pi*mean(w1*y[treated])
  mu0 = (1-pi)*mean(w0*y[control])
  c(mu1, mu0, mu1-mu0)
}

# DR.P method based on parametric OR and PS models
# W.ps and W.or do not have one as the first column
# if non-zero, inter.or represents the columns of W.or
# that interact with a in the OR model
est.dr1 = function(y, a, W.ps, W.or, pi, inter.or=0,
                   family.or=gaussian, family.ps=binomial) {
  n = length(y)
  treated = (a>0.5); control = !treated
  pi.s = mean(treated)
  one = rep(1, n); zero = rep(0, n)
  D = cbind(one, a, W.or)
  if (inter.or[1]!=0) D = cbind(D, a*W.or[,inter.or])
  or.mod = glm(y~0+D, family=family.or)
  D1 = cbind(one, one, W.or)
  if (inter.or[1]!=0) D1 = cbind(D1, one*W.or[,inter.or])
  m1 = predict(or.mod, newdata=list(D=D1), type="response")
  D0 = cbind(one, zero, W.or)
  if (inter.or[1]!=0) D0 = cbind(D0, zero*W.or[,inter.or])
  m0 = predict(or.mod, newdata=list(D=D0), type="response")
  ps.mod = glm(a~W.ps, family=family.ps)
  ps.s = ps.mod$fitted.values
  ps = expit(logit(ps.s)-logit(pi.s)+logit(pi))
  r = exp(logit(ps.s)-logit(pi.s))
  mu1 = mean((pi*y[treated]/ps[treated])-((1-pi)*m1[treated]/r[treated]))+
    (1-pi)*mean(m1[control])
  mu0 = mean(((1-pi)*y[control]/(1-ps[control]))-pi*m0[control]*r[control])+
    pi*mean(m0[treated])
  c(mu1, mu0, mu1-mu0)
}

# DR.NP method based on super learner estimates of OR and PS functions
# cv is the number of folds in the super learners
est.dr2 = function(y, a, W.ps, W.or, pi, SL.lib.or, SL.lib.ps,
                   family.or=gaussian(), family.ps=binomial(), cv=5) {
  n = length(y)
  treated = (a>0.5); control = !treated
  pi.s = mean(treated)
  SL.run0 = SuperLearner(Y=y[control], X=W.or[control,], newX=W.or,
                         family=family.or, SL.library=SL.lib.or,
                         cvControl=list(V=cv))
  m0 = SL.run0$SL.predict
  mu0.1 = mean(m0[treated]); mu0.0 = mean(m0[control])
  SL.run1 = SuperLearner(Y=y[treated], X=W.or[treated,], newX=W.or,
                         family=family.or, SL.library=SL.lib.or,
                         cvControl=list(V=cv))
  m1 = SL.run1$SL.predict
  mu1.1 = mean(m1[treated]); mu1.0 = mean(m1[control])
  SL.run2 = SuperLearner(Y=a, X=W.ps, family=family.ps,
                         SL.library=SL.lib.ps, cvControl=list(V=cv))
  ps.s = SL.run2$SL.predict
  ps = expit(logit(ps.s)-logit(pi.s)+logit(pi))
  r = exp(logit(ps.s)-logit(pi.s))
  tau1 = (pi*mu1.1/pi.s)-((1-pi)*mu1.0/(1-pi.s))
  tau0 = (pi*mu0.1/pi.s)-((1-pi)*mu0.0/(1-pi.s))
  psi1 = (pi*a*y/(pi.s*ps))-((1-pi)*a*m1/(pi.s*r))+
    ((1-pi)*(1-a)*m1/(1-pi.s))-(a-pi.s)*tau1
  psi0 = ((1-pi)*(1-a)*y/((1-pi.s)*(1-ps)))-(pi*(1-a)*m0*r/(1-pi.s))+
    (pi*a*m0/pi.s)-(a-pi.s)*tau0
  psi = psi1-psi0; psi.mat = cbind(psi1,psi0,psi)
  pe = colMeans(psi.mat); se = sqrt(diag(var(psi.mat))/n)
  cbind(pe,se)
}

# DR.SS method based on sample splitting and
# super learner estimates of OR and PS functions
# K is the number of folds in sample splitting
# cv is the number of folds in the super learners
est.dr3 = function(y, a, W.ps, W.or, pi, SL.lib.or, SL.lib.ps,
                   family.or=gaussian(), family.ps=binomial(), cv=5, K=5) {
  n = length(y); m0 = rep(NA,n); m1 = m0; ps.s = m0
  treated = (a>0.5); control = !treated
  pi.s = mean(treated)
  st = sample(1:K, n, replace=TRUE)   # stratum in sample splitting
  for (k in 1:K) {
    val = (st==k); trn = !val
    trn.treated = trn&treated; trn.control = trn&control
    SL.run0 = SuperLearner(Y=y[trn.control], X=W.or[trn.control,],
                           newX=W.or[val,], family=family.or,
                           SL.library=SL.lib.or, cvControl=list(V=cv))
    m0[val] = SL.run0$SL.predict
    SL.run1 = SuperLearner(Y=y[trn.treated], X=W.or[trn.treated,],
                           newX=W.or[val,], family=family.or,
                           SL.library=SL.lib.or, cvControl=list(V=cv))
    m1[val] = SL.run1$SL.predict
    SL.run2 = SuperLearner(Y=a[trn], X=W.ps[trn,], newX=W.ps[val,],
                           family=family.ps, SL.library=SL.lib.ps,
                           cvControl=list(V=cv))
    ps.s[val] = SL.run2$SL.predict
  }
  mu0.1 = mean(m0[treated]); mu0.0 = mean(m0[control])
  mu1.1 = mean(m1[treated]); mu1.0 = mean(m1[control])
  ps = expit(logit(ps.s)-logit(pi.s)+logit(pi))
  r = exp(logit(ps.s)-logit(pi.s))
  tau1 = (pi*mu1.1/pi.s)-((1-pi)*mu1.0/(1-pi.s))
  tau0 = (pi*mu0.1/pi.s)-((1-pi)*mu0.0/(1-pi.s))
  psi1 = (pi*a*y/(pi.s*ps))-((1-pi)*a*m1/(pi.s*r))+
    ((1-pi)*(1-a)*m1/(1-pi.s))-(a-pi.s)*tau1
  psi0 = ((1-pi)*(1-a)*y/((1-pi.s)*(1-ps)))-(pi*(1-a)*m0*r/(1-pi.s))+
    (pi*a*m0/pi.s)-(a-pi.s)*tau0
  psi = psi1-psi0; psi.mat = cbind(psi1,psi0,psi)
  pe = colMeans(psi.mat); se = sqrt(diag(var(psi.mat))/n)
  cbind(pe,se)
}
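For illustration only, the following hypothetical call sequence is a minimal sketch of how the functions above might be used; it is not part of the analyses reported in the paper. The simulated data-generating values (n1 = n0 = 500 enrolled subjects, an assumed population treatment proportion pi = 0.1, the covariate shifts, and a constant treatment effect of 1, so that the true δ equals 1) are arbitrary choices made only for demonstration. The last line forms an approximate 95% Wald interval for δ from the point estimate and standard error returned by est.dr2.

# hypothetical choice-based sample: enrollment stratified on treatment choice
set.seed(123)
n1 = 500; n0 = 500                    # enrolled treated and control subjects
n = n1 + n0
pi = 0.1                              # assumed population proportion treated
a = c(rep(1, n1), rep(0, n0))
W = data.frame(w1 = rnorm(n, mean=0.5*a),    # covariates shifted in the
               w2 = rnorm(n, mean=-0.3*a))   # treated stratum (confounding)
y = 1 + a + 1.5*W$w1 - W$w2 + rnorm(n)       # constant effect, true delta = 1

est.or(y, a, as.matrix(W), pi)                  # OR estimates of (mu1, mu0, delta)
est.wtd(y, a, as.matrix(W), pi)                 # weighted estimates
est.dr1(y, a, as.matrix(W), as.matrix(W), pi)   # DR.P estimates
out = est.dr2(y, a, W, W, pi, SL.lib.or=c("SL.glm"), SL.lib.ps=c("SL.glm"))
out                                             # point estimates and SEs
out[3,1] + c(-1,1)*1.96*out[3,2]                # approximate 95% Wald CI for delta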
References
[1] Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol. 1974;66:688–701.
[2] Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70:41–55.
[3] Robins JM, Hernan MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000;11:550–60.
[4] Rosenbaum PR, Rubin DB. Reducing bias in observational studies using subclassification on the propensity score. J Am Stat Assoc. 1984;79:516–24.
[5] Rosenbaum PR, Rubin DB. Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. Am Stat. 1985;39:33–8.
[6] Bang H, Robins JM. Doubly robust estimation in missing data and causal inference models. Biometrics. 2005;61:962–72.
[7] Cao W, Tsiatis AA, Davidian M. Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data. Biometrika. 2009;96:723–34.
[8] Rotnitzky A, Lei Q, Sued M, Robins JM. Improved double-robust estimation in missing data and causal inference models. Biometrika. 2012;99:439–56.
[9] Scharfstein DO, Rotnitzky A, Robins JM. Adjusting for nonignorable drop-out using semiparametric nonresponse models (with discussion). J Am Stat Assoc. 1999;94:1096–146.
[10] Tan Z. Bounded, efficient, and doubly robust estimation with inverse weighting. Biometrika. 2010;97:661–82.
[11] van der Laan MJ, Robins JM. Unified methods for censored longitudinal data and causality. New York: Springer-Verlag, 2003.
[12] van der Laan MJ, Rose S. Targeted learning: causal inference for observational and experimental data. New York: Springer, 2011.
[13] Imai K, King G, Stuart EA. Misunderstandings between experimentalists and observationalists about causal inference. J Royal Stat Soc Ser A (Stat Soc). 2008;171:481–502.
[14] Wang W, Scharfstein D, Tan Z, MacKenzie EJ. Causal inference in outcome-dependent two-phase sampling designs. J Royal Stat Soc Ser B (Stat Method). 2009;71:947–69.
[15] Nie L, Zhang Z, Rubin D, Chu J. Likelihood reweighting methods to reduce potential bias in noninferiority trials which rely on historical data to make inference. Ann Appl Stat. 2013;7:1796–813.
[16] Zhang Z, Nie L, Soon G, Hu Z. New methods for treatment effect calibration, with applications to non-inferiority trials. Biometrics. 2016;72:20–9.
[17] Hu Z, Qin J. Generalizability of causal inference in observational studies under retrospective convenience sampling. Stat Med. 2018;37:2874–83.
[18] Heckman JJ, Todd PE. A note on adapting propensity score matching and selection models to choice based samples. Econom J. 2009;12:S230–4.
[19] Kennedy EH, Sjolander A, Small DS. Semiparametric causal inference in matched cohort studies. Biometrika. 2015;102:739–46.
[20] Bickel PJ, Klaassen CA, Ritov Y, Wellner JA. Efficient and adaptive estimation for semiparametric models. Baltimore, MD: Johns Hopkins University Press, 1993.
[21] Tsiatis AA. Semiparametric theory and missing data. New York: Springer, 2006.
[22] Hahn J. On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica. 1998;66:315–31.
[23] Shinozaki T, Matsuyama Y. Doubly robust estimation of standardized risk difference and ratio in the exposed population. Epidemiology. 2015;26:873–7.
[24] Hastie T, Tibshirani R, Friedman J. The elements of statistical learning: data mining, inference, and prediction, 2nd ed. New York: Springer-Verlag, 2009.
[25] Benkeser D, van der Laan MJ. The highly adaptive lasso estimator. In: Proceedings of the International Conference on Data Science and Advanced Analytics, 2016:689–96.
[26] Chen X, White H. Improved rates and asymptotic normality for nonparametric neural network estimators. IEEE Trans Inf Theory. 1999;45:682–91.
[27] Kennedy EH. Nonparametric causal effects based on incremental propensity score interventions. J Am Stat Assoc. in press, 2018. doi: 10.1080/01621459.2017.1422737.
[28] Ma S, Zhu L, Zhang Z, Tsai CL, Carroll RJ. A robust and efficient approach to causal inference based on sparse sufficient dimension reduction. Ann Stat. 2019;47:1505–35.
[29] van der Laan MJ. A generally efficient targeted minimum loss based estimator based on the highly adaptive lasso. Int J Biostat. 2017;13. doi: 10.1515/ijb-2015-0097.
[30] Polley EC, Rose S, van der Laan MJ. Super learning. In: van der Laan MJ, Rose S (eds.). Targeted learning. New York: Springer, 2011:43–66.
[31] van der Laan MJ, Polley EC, Hubbard AE. Super Learner. Stat Appl Genet Mol Biol. 2007;6, Article 5.
[32] van der Laan MJ, Dudoit S. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: finite sample oracle inequalities and examples. UC Berkeley Division of Biostatistics Working Paper Series, paper 130, 2003. http://biostats.bepress.com/ucbbiostat/paper130.
[33] Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey WK. Double machine learning for treatment and structural parameters. Technical report, cemmap working paper, Centre for Microdata Methods and Practice, 2016.
[34] Zheng W, van der Laan MJ. Cross-validated targeted minimum-loss-based estimation. In: van der Laan MJ, Rose S (eds.). Targeted learning. New York: Springer, 2011:459–74.
[35] Zhang J, Troendle J, Reddy UM, Laughon SK, Branch DW, Burkman R. Contemporary cesarean delivery practice in the United States. Am J Obstet Gynecol. 2010;203:326.e1–10.
[36] Benkeser D, Carone M, van der Laan MJ, Gilbert PB. Doubly robust nonparametric inference on the average treatment effect. Biometrika. 2017;104:863–80.
[37] Berger RL, Boos DD. P values maximized over a confidence set for the nuisance parameter. J Am Stat Assoc. 1994;89:1012–6.
[38] van der Vaart AW. Asymptotic statistics. Cambridge, UK: Cambridge University Press, 1998.
[39] van der Vaart AW, Wellner JA. Weak convergence and empirical processes: with applications to statistics. New York: Springer-Verlag, 1996.
[40] Breiman L, Friedman J, Olshen R, Stone C. Classification and regression trees. New York: Wadsworth, 1984.
[41] Hastie TJ, Tibshirani RJ. Generalized additive models. New York: Chapman & Hall/CRC, 1990.