Abstract
Randomized controlled trials (RCTs) admit unconfounded design-based inference – randomization largely justifies the assumptions underlying statistical effect estimates – but often have limited sample sizes. However, researchers may have access to big observational data on covariates and outcomes from RCT nonparticipants. For example, data from A/B tests conducted within an educational technology platform exist alongside historical observational data drawn from student logs. We outline a design-based approach to using such observational data for variance reduction in RCTs. First, we use the observational data to train a machine learning algorithm predicting potential outcomes using covariates and then use that algorithm to generate predictions for RCT participants. Then, we use those predictions, perhaps alongside other covariates, to adjust causal effect estimates with a flexible, design-based covariate-adjustment routine. In this way, there is no danger of biases from the observational data leaking into the experimental estimates, which are guaranteed to be exactly unbiased regardless of whether the machine learning models are “correct” in any sense or whether the observational samples closely resemble RCT samples. We demonstrate the method in analyzing 33 randomized A/B tests and show that it decreases standard errors relative to other estimators, sometimes substantially.
1 Introduction
Randomized controlled trials (RCTs) are famously free of confounding bias. Indeed, a class of estimators, often referred to as “design-based” [1] or “randomization based” [2], estimates treatment effects without assuming any statistical model other than whatever is implied by the experimental design itself. Design-based statistical estimators are typically guaranteed to be unbiased. Their associated inference – standard errors, hypothesis tests, and confidence intervals – also comes with accuracy guarantees. In many cases, these apply regardless of the sample size and require only very weak regularity conditions.
While RCTs can reliably provide unbiased estimates, they are often limited in terms of precision. The statistical precision of RCT-based estimates is inherently limited by the RCT’s sample size, which itself is typically subject to a number of practical constraints.
In contrast, large observational datasets can frequently be brought to bear on some of the same questions addressed by an RCT. Analysis of observational data, unlike RCTs, typically requires a number of untestable modeling assumptions, chief among them the assumption of no unmeasured confounding. Consequently, treatment effect estimates from observational data cannot boast the same guarantees to accuracy as estimates from RCTs. That said, in many cases, they boast a much larger sample – and, hence, greater precision – than equivalent RCTs.
In many cases, observational and RCT data coexist within the very same database. For instance, covariate and outcome data for a biomedical RCT may be drawn from a database of electronic health records, and that same database may contain equivalent records for patients who did not participate in the study and were not randomized. Along similar lines, covariate and outcome data for an RCT designed to evaluate the impact of an educational intervention might be drawn from a state administrative database, and that database may also contain information on hundreds of thousands of students who did not participate in the RCT. We refer to these individuals, who are nonparticipants of the RCT but who are in the same database, as the remnant from the study [3]. We ask, how can we use the remnant to improve power to detect effects in RCTs?
An example from the field of education is www.ETRIALStestbed.org (formerly the ASSISTments TestBed [4,5]). The TestBed is an A/B testing program designed for conducting education research that runs within ASSISTments and has been made accessible to third-party education researchers. Using the TestBed, a researcher can propose A/B tests to run within ASSISTments. That is, a researcher may specify two contrasting conditions, such as video- or text-based instructional feedback, and a particular homework topic, such as “Adding Whole Numbers,” or “Factoring Quadratic Equations.” Then, students working on that topic are individually randomized between the two conditions. The researcher could then compare the relative impact of video- vs text-based feedback on an outcome variable of interest such as homework completion. The anonymized data associated with the study, consisting of several levels of granularity and rich covariates describing both historical prestudy and within-study student interaction, are made available to the researcher. The TestBed currently hosts over 100 such RCTs, and several of these RCTs have recently been analyzed, e.g., refs [6–11].
In the ASSISTments TestBed example, a given RCT is likely to consist of just a few hundred students assigned to a specific homework assignment, limiting statistical power and precision. For instance, in one typical ASSISTments TestBed A/B test, a total of 294 students were randomized between two conditions, leading to a standard error of roughly four percentage points when estimating the effect on homework completion. This standard error is too large to either determine the direction of a treatment effect or rule out clinically meaningful effect sizes. But the ASSISTments database contains data on hundreds of thousands of other ASSISTments users, many of whom may have completed similar homework assignments, or who may have even completed an identical homework assignment but in a previous time period.
This article outlines an approach to estimate treatment effects in an RCT while incorporating high-dimensional covariate data, large observational remnant data, and machine learning prediction algorithms to improve precision. It does so without compromising the accuracy guarantees of traditional design-based RCT estimators, yielding unbiased point estimates and sampling variance estimates that are conservative in expectation; the approach is design-based, relying only on the randomization within the RCT to make these guarantees. In particular, the method prevents “bias leakage”: bias that might have occurred due to differences between the remnant and the experimental sample, biased or incorrect modeling of covariates, or other data analysis flaws, does not leak into the RCT estimator. We combine recent causal methods for within-RCT covariate adjustment with other methods that have sought to incorporate high-dimensional remnant data into RCT estimators. In particular, we focus on the challenge of precisely estimating treatment effects from a set of 33 TestBed experiments [12], using prior log data from experimental participants and nonparticipants in the ASSISTments system.
The nexus of machine learning and causal inference has recently experienced rapid and exciting development. This has included novel methods to analyze observational studies, e.g., ref. [13], to estimate subgroup effects, e.g., ref. [14], or to optimally allocate treatment, e.g., ref. [15]. Other developments share our goal, i.e., improving the precision of average treatment effect estimates from RCTs. These include the flexible approaches of refs [16–18], all of which can incorporate arbitrary prediction methods; [19], which uses the Lasso regression estimator to analyze experiments; and the targeted learning framework [20,21], which combines ensemble machine learning with semiparametric maximum likelihood estimation.
A large literature has explored the possibility of improving precision in RCTs by pooling the controls in the RCT with historical controls from observational datasets or from other similar RCTs. This literature dates back at least to ref. [22]; for a review, see ref. [23]. Much of this work uses a Bayesian framework, although frequentist approaches exist as well [24]. In many of these methods, biases can be arbitrarily large, depending on the choice of historical controls. Other recent efforts have sought to improve precision in RCT estimates by using the results of separate models fit on observational data. These include ref. [25], which fits a covariate model to preexperimental data and then uses it to reduce standard errors of online A/B tests; ref. [26], which uses the RCT to de-bias a broken instrumental variable (IV) estimate obtained from observational data and then further combines this with an independent RCT-based estimate; and ref. [27], which develops a variant of the sample-splitting estimator that we review below and suggests a role for auxiliary data as well.
Other literature has sought to combine effect estimates from experimental and observational studies, often under the framework of “data fusion” [28]; these methods require observational data on both treated and untreated subjects. In addition to variance reduction, these methods may also seek to generalize the results of RCTs to other populations or other outcome variables, improve the design of RCTs, detect problems in observational studies, or accomplish other goals [29–35]. For recent reviews, see refs [36,37].
A parallel literature in survey methodology discusses the possibility of combining probability and nonprobability samples in order to increase precision, especially for small area estimation [38–41].
In this article, our goal is to estimate the average treatment effect within the RCT, and our focus is on using observational data – nonrandomized subjects in the control or treatment conditions, or both, or neither – to improve the precision of the estimate. The main idea is to use observational data to train an algorithm that predicts RCT outcomes and to use the resulting predictions in the randomized sample as a new covariate. While this approach will work with any covariate adjustment technique, we suggest an approach based on the principle of “first, do no harm,” meaning that we prioritize retaining the advantages of randomized experiments highlighted earlier. In particular, we seek to ensure that our method (1) does not introduce any bias, (2) will not harm precision and ideally will improve precision, and (3) does not require any additional statistical assumptions beyond those typically made in design-based analysis of RCTs.
This article is organized as follows. Section 2 reviews background material, including design-based RCT analysis and covariate adjustment. Section 3 discusses incorporating remnant data and presents our main methodological contribution. In Section 4, we apply the method to estimate treatment effects in 33 TestBed experiments. Section 5 concludes this article.
2 Methodological background
2.1 Causal inference from experiments
Consider a randomized experiment to estimate the average effect of a binary treatment $T$ on an outcome $Y$. There are $N$ subjects, indexed by $i = 1, \ldots, N$.
Following refs [42,43], let potential outcomes $y_i(1)$ and $y_i(0)$ denote the values subject $i$’s outcome would take were $i$ assigned to the treatment condition ($T_i = 1$) or the control condition ($T_i = 0$), respectively; the observed outcome is $Y_i = T_i y_i(1) + (1 - T_i) y_i(0)$.
Define the treatment effect for $i$ as $\tau_i = y_i(1) - y_i(0)$, and the average treatment effect (ATE) as $\bar{\tau} = N^{-1} \sum_{i=1}^{N} \tau_i$.
If both $y_i(1)$ and $y_i(0)$ were observable for every subject, $\bar{\tau}$ could be computed directly; since each subject reveals only the potential outcome corresponding to his or her realized treatment assignment, $\bar{\tau}$ must instead be estimated.
We will use this framework to analyze the 33 TestBed experiments. These experiments are examples of “Bernoulli experiments,” in which each subject $i$ is randomized independently of all others, with a known probability $p_i = P(T_i = 1)$ of assignment to treatment, $0 < p_i < 1$.
We will now introduce some statistical elements that we will use as the ingredients for our approach. Let $n_T = \sum_i T_i$ and $n_C = N - n_T$ denote the numbers of treated and control subjects; the group means of observed outcomes, $\bar{Y}_T = n_T^{-1} \sum_{i: T_i = 1} Y_i$ and $\bar{Y}_C = n_C^{-1} \sum_{i: T_i = 0} Y_i$,
will also play a prominent role. Note that in a Bernoulli experiment all four of these quantities are random, since they depend on the realized treatment assignments.
Let
$$w_i = \frac{T_i}{p_i} - \frac{1 - T_i}{1 - p_i}$$
be subject $i$’s signed inverse probability weight; the estimator $\hat{\tau}^{\mathrm{HT}} = N^{-1} \sum_{i=1}^{N} w_i Y_i$ is exactly unbiased for $\bar{\tau}$,
since it is the difference between the Horvitz-Thompson estimates of the means of $y(1)$ and $y(0)$ [44].
The sampling variance of $\hat{\tau}^{\mathrm{HT}}$ is
$$\mathbb{V}(\hat{\tau}^{\mathrm{HT}}) = \frac{1}{N^2} \sum_{i=1}^{N} \left[ \frac{y_i(1)^2}{p_i} + \frac{y_i(0)^2}{1 - p_i} - \tau_i^2 \right],$$
and it can be estimated conservatively from the observed data, since the unidentifiable $\tau_i^2$ terms only reduce the variance.
Strangely, $\hat{\tau}^{\mathrm{HT}}$ is not invariant to location shifts of the outcome: adding the same constant to every subject’s outcome changes the estimate, because the realized weights $w_i$ need not sum to zero. The familiar difference-in-means estimator
$$\hat{\tau}^{\mathrm{DM}} = \bar{Y}_T - \bar{Y}_C \tag{1}$$
does not share this defect, and it is unbiased conditional on the group sizes (provided both groups are nonempty). We pair it with
its associated variance estimator
$$\hat{\mathbb{V}}(\hat{\tau}^{\mathrm{DM}}) = \frac{s_T^2}{n_T} + \frac{s_C^2}{n_C}, \tag{2}$$
where $s_T^2$ and $s_C^2$ are the sample variances of the observed outcomes in the treatment and control groups, respectively.
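To fix ideas, the following minimal sketch computes $\hat{\tau}^{\mathrm{DM}}$ and the variance estimate (2); the function and variable names are ours, purely for illustration:

```python
import numpy as np

def difference_in_means(Y, T):
    """Difference-in-means estimate with the two-sample variance
    estimate of equation (2): s_T^2 / n_T + s_C^2 / n_C."""
    Y, T = np.asarray(Y, float), np.asarray(T, int)
    yT, yC = Y[T == 1], Y[T == 0]
    tau_hat = yT.mean() - yC.mean()
    v_hat = yT.var(ddof=1) / len(yT) + yC.var(ddof=1) / len(yC)
    return tau_hat, v_hat
```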
2.2 Design-based covariate adjustment
The reason for error when estimating $\bar{\tau}$ is that each subject reveals only one of his or her two potential outcomes. Covariates measured prior to randomization carry information about the unobserved potential outcomes; covariate adjustment attempts to use that information to reduce the variance of the effect estimate without compromising its unbiasedness.
The approach we will take to combining covariate adjustment with randomization has antecedents in [2,16–18,21,45–53], among others. We will focus on exactly unbiased estimators, despite the fact that a small amount of bias in finite samples is often acceptable, especially in the presence of other considerations. In fact, the covariate adjustment techniques we will develop have advantageous properties beyond unbiasedness (see, e.g., Section 4.3.3). That said, our main methodological contributions (in Section 3) are compatible with alternative techniques, including those that may be biased in finite samples. We will frame our arguments around bias since we find it to be the easiest way to formalize confounding, which we see as the most pressing threat to estimators that include observational data.
In a Bernoulli experiment, note that $E[w_i] = 0$, and hence $E[w_i c] = 0$ for any constant $c$. This therefore suggests using imputations $\hat{y}_i(0)$ and $\hat{y}_i(1)$ of each subject’s two potential outcomes, built from covariates, to reduce the variability of the summands in $\hat{\tau}^{\mathrm{HT}}$ without disturbing its expectation.
For these imputations, we require that
$$(\hat{y}_i(0), \hat{y}_i(1)) \text{ be independent of } T_i \text{ for each } i. \tag{5}$$
Write $\hat{y}_i = T_i \hat{y}_i(1) + (1 - T_i) \hat{y}_i(0)$ for the imputation of subject $i$’s observed arm, and define the adjusted estimator
$$\hat{\tau} = \frac{1}{N} \sum_{i=1}^{N} [\hat{y}_i(1) - \hat{y}_i(0)] + \frac{1}{N} \sum_{i=1}^{N} w_i (Y_i - \hat{y}_i). \tag{6}$$
Since by design the distribution of $T$ is known, and is independent of the potential outcomes and (by requirement (5)) of the imputations, $\hat{\tau}$ is exactly unbiased:
$$E[\hat{\tau}] = \frac{1}{N} \sum_{i=1}^{N} E[\hat{y}_i(1) - \hat{y}_i(0)] + \frac{1}{N} \sum_{i=1}^{N} \{\tau_i - E[\hat{y}_i(1) - \hat{y}_i(0)]\} = \bar{\tau},$$
where we use the facts that $E[w_i Y_i] = \tau_i$ and that $E[w_i \hat{y}_i] = E[\hat{y}_i(1) - \hat{y}_i(0)]$ whenever (5) holds.
The unbiasedness of $\hat{\tau}$ is therefore a consequence of the randomization alone.
Crucially, this unbiasedness holds even if the imputations are poor – biased, noisy, or derived from a badly misspecified model – so long as they satisfy requirement (5).
The estimate $\hat{\tau}$ of (6) comes with a natural variance estimator,
$$\hat{\mathbb{V}}(\hat{\tau}) = \frac{1}{n_T} \left[ \frac{1}{n_T} \sum_{i: T_i = 1} (Y_i - \hat{y}_i)^2 \right] + \frac{1}{n_C} \left[ \frac{1}{n_C} \sum_{i: T_i = 0} (Y_i - \hat{y}_i)^2 \right], \tag{7}$$
which is typically conservative in expectation.
Compare (7) with the difference-in-means variance estimator (2).
Compared with (2), (7) replaces the sample variances of the raw outcomes with the mean squared errors of the imputation residuals $Y_i - \hat{y}_i$; the more accurately the imputations predict the observed outcomes, the smaller the estimated variance.
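The adjusted estimator (6) is easy to express in code. A minimal sketch, assuming imputations `yhat0` and `yhat1` that satisfy requirement (5):

```python
import numpy as np

def adjusted_estimate(Y, T, p, yhat0, yhat1):
    """Equation (6): the mean imputed effect plus a signed-IPW
    correction based on the residuals of the observed arm."""
    Y, T = np.asarray(Y, float), np.asarray(T, int)
    yhat0, yhat1 = np.asarray(yhat0, float), np.asarray(yhat1, float)
    w = T / p - (1 - T) / (1 - p)              # signed IPW weights
    yhat_obs = np.where(T == 1, yhat1, yhat0)  # imputation of the observed arm
    return np.mean(yhat1 - yhat0) + np.mean(w * (Y - yhat_obs))
```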
2.3 Sample splitting
Successful covariate adjustment requires imputations $\hat{y}_i$ that predict the potential outcomes accurately, which tempts the analyst to fit a flexible prediction model to the experimental data themselves. Done naively, this violates requirement (5): if subject $i$’s own outcome helps determine $\hat{y}_i$, then $\hat{y}_i$ is no longer independent of $T_i$, and unbiasedness is no longer guaranteed. The challenge is to exploit the experimental data for accuracy while keeping each subject’s imputations independent of his or her own treatment assignment.
This may be achieved by sample splitting, also referred to in this context as cross-estimation or cross-fitting. In a Bernoulli experiment, rather than fitting global imputation algorithms to the full experimental sample, we fit a separate pair of imputation models for each subject $i$, using only the data from the other $N - 1$ subjects.
In this leave-one-out context, $\hat{y}_i(0)$ and $\hat{y}_i(1)$ are functions of subject $i$’s covariates and of the other subjects’ data only; they are therefore independent of $T_i$, so requirement (5) is satisfied,
and the estimated average treatment effect is then again given by (6). We denote the resulting sample-splitting estimator by $\hat{\tau}^{\mathrm{SS}}$.
When we wish to explicitly specify the covariates and imputation method that are used within the leave-one-out models, we will write them in brackets; for instance, $\hat{\tau}^{\mathrm{SS}}[{\boldsymbol{x}}; \mathrm{RF}]$ denotes the estimator whose potential outcomes are imputed by random forests [57] fit to the covariates ${\boldsymbol{x}}$.
Building upon (7), and following [53], the variance of $\hat{\tau}^{\mathrm{SS}}$ may be estimated as follows. Let
$$\widehat{\mathrm{MSE}}_C = \frac{1}{n_C} \sum_{i: T_i = 0} (Y_i - \hat{y}_i(0))^2$$
be the mean-squared-error of control imputations among control subjects, define $\widehat{\mathrm{MSE}}_T$ analogously for treatment imputations among treated subjects, and take
$$\hat{\mathbb{V}}(\hat{\tau}^{\mathrm{SS}}) = \frac{\widehat{\mathrm{MSE}}_T}{n_T} + \frac{\widehat{\mathrm{MSE}}_C}{n_C}. \tag{9}$$
This variance estimate will typically be somewhat conservative. This is due to the fact that the true sampling variance involves products of the two potential outcomes, which are never jointly observed and hence cannot be estimated without further assumptions; bounding the unidentifiable terms induces an upward bias [61].
Note that by (9), the estimated variance shrinks as the imputations become more accurate, and that (9)
is similar in form to the variance estimate typically used in a two-sample t-test, namely, $s_T^2 / n_T + s_C^2 / n_C$.
A special case occurs when the potential outcomes are imputed by simply taking the mean of the observed outcomes (after dropping observation $i$). That is, we set
$$\hat{y}_i(0) = \frac{1}{\#\{j : j \ne i, T_j = 0\}} \sum_{j \ne i : T_j = 0} Y_j,$$
and similarly for $\hat{y}_i(1)$, using the treated subjects other than $i$.
In short, when using mean imputation for the potential outcomes, the leave-one-out sample splitting procedure essentially simplifies to a standard t-test. The effect estimate is identical, and the variance estimate is nearly identical.[1] This is highly reassuring. Any imputation strategy that improves upon mean imputation in terms of mean squared error will reduce the variance of $\hat{\tau}^{\mathrm{SS}}$ below that of the simple difference-in-means estimator.
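The leave-one-out routine is straightforward, if computationally heavy, to implement directly. The sketch below uses random forests as the imputation algorithm, though any learner could be substituted; it refits two models per subject and is meant only to make the procedure concrete:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def loo_sample_splitting(Y, T, X, p=0.5,
                         make_model=lambda: RandomForestRegressor(n_estimators=50)):
    """Leave-one-out sample splitting: impute subject i's potential
    outcomes using models fit to everyone except i, then combine the
    imputations via the adjusted estimator (6)."""
    Y, T, X = np.asarray(Y, float), np.asarray(T, int), np.asarray(X, float)
    N = len(Y)
    parts = np.empty(N)
    for i in range(N):
        others = np.arange(N) != i
        fit_t = make_model().fit(X[others & (T == 1)], Y[others & (T == 1)])
        fit_c = make_model().fit(X[others & (T == 0)], Y[others & (T == 0)])
        y1, y0 = fit_t.predict(X[[i]])[0], fit_c.predict(X[[i]])[0]
        w = T[i] / p - (1 - T[i]) / (1 - p)
        parts[i] = (y1 - y0) + w * (Y[i] - (y1 if T[i] == 1 else y0))
    return parts.mean()
```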
3 Incorporating observational data
Modern field trials are often conducted within a very data-rich context, in which rich high-dimensional covariate data are automatically, or already, collected for all experiment participants. For instance, in the TestBed experiments, system administrators have access to log data for every problem and skill builder each participating student worked on before the onset of the experiment. In other contexts, such as healthcare or education, rich administrative data are often available. In fact, these covariates are available for a much wider population than just the experimental participants – in the TestBed case, there are log data for all ASSISTments users. In other education or healthcare examples, administrative data are often available for every student or patient in the system, not just for those who were randomized to a treatment or control condition. Often, as in the TestBed case, the outcome variable Y is also drawn from administrative or log data. We refer to subjects within the same data system in which the experiment took place – i.e., for whom covariate and outcome data are available – but who were not part of the experiment, as the “remnant” from the experiment. The remnant from a TestBed experiment consists of all ASSISTments users for whom log data are available but who did not participate in the experiment, of whom there are several hundred thousand.
Simply pooling data from the remnant with data from the experiment undermines the randomization, since students in the remnant were not randomized between conditions. This section will describe an alternative approach – a set of unbiased effect estimators that use the remnant to improve precision. The estimators all begin by using the remnant to fit or train a model predicting potential outcomes as a function of covariates and using that model to impute potential outcomes for units in the experiment. They differ in how they use those imputations, and each builds on the last. The following subsection discusses a simple residualizing estimator, Section 3.2 discusses sample splitting to improve that estimator, and Section 3.3 discusses incorporating an additional set of covariate-adjustment models fit to data from the experimental subjects themselves.
We will focus on the case in which the treatment condition in the remnant is constant, irrelevant, or just unobserved. For instance, in the TestBed dataset, the RCTs typically test an experimental intervention against “business as usual,” and subjects in the remnant were all exposed to the control condition. Extension to cases in which T is observed in the remnant is straightforward and will be discussed briefly in Section 5.
3.1 Covariate adjustment using the remnant
Design-based covariate adjustment requires imputation models that are independent of the treatment assignments within the experiment, as in requirement (5). A model fit exclusively to data from the remnant satisfies this requirement automatically, since the remnant is untouched by the experiment’s randomization. Accordingly, let $\hat{y}^r(\cdot)$ be a prediction algorithm trained on the remnant’s covariates and outcomes, and let $x_i^r = \hat{y}^r({\boldsymbol{x}}_i)$ denote its prediction for experimental subject $i$. Because treatment in the remnant is constant, irrelevant, or unobserved (Section 3), $\hat{y}^r$ is perhaps best interpreted as a prediction of outcomes under the control condition, though in practice it may function as a generic prediction of the outcome regardless of condition.
Regardless of the interpretation, the logic of Section 2.2 would suggest using $x_i^r$ as the imputation for both of subject $i$’s potential outcomes, i.e., setting $\hat{y}_i(0) = \hat{y}_i(1) = x_i^r$ in (6). The imputed effects $\hat{y}_i(1) - \hat{y}_i(0)$ then vanish, leaving
$$\hat{\tau}^{\mathrm{RE}} = \frac{1}{N} \sum_{i=1}^{N} w_i (Y_i - x_i^r). \tag{12}$$
In what follows, we will refer specifically to (12) as “the remnant estimator.”
The remnant estimator $\hat{\tau}^{\mathrm{RE}}$ is simply $\hat{\tau}^{\mathrm{HT}}$ applied to the residuals $Y_i - x_i^r$; it is exactly unbiased for $\bar{\tau}$ by the argument of Section 2.2, with sampling variance
$$\mathbb{V}(\hat{\tau}^{\mathrm{RE}}) = \frac{1}{N^2} \sum_{i=1}^{N} \left[ \frac{e_i(1)^2}{p_i} + \frac{e_i(0)^2}{1 - p_i} - \tau_i^2 \right],$$
where $e_i(1) = y_i(1) - x_i^r$
and $e_i(0) = y_i(0) - x_i^r$ are the residuals of the two potential outcomes.
The goal of residualization is to improve precision. Since accurate predictions produce residuals that are smaller in magnitude than the potential outcomes themselves, the variance of $\hat{\tau}^{\mathrm{RE}}$ will then be smaller than that of the unadjusted estimator.
Comparing this expression to the sampling variance of $\hat{\tau}^{\mathrm{HT}}$ makes the point directly: the potential outcomes are simply replaced by their residuals.
Importantly for practitioners, as long as only remnant data are used to train $\hat{y}^r$, the predictions $x_i^r$ are fixed with respect to the experiment’s randomization. Consequently, $\hat{\tau}^{\mathrm{RE}}$ is exactly unbiased no matter how poor, overfit, or confounded the remnant model may be: no bias can leak from the remnant into the experimental estimate.
Unfortunately, in some cases (see, e.g., Section 4), the remnant estimator may have greater sampling variance than the simple unadjusted estimator. This occurs when the remnant-based predictions fit the experimental outcomes poorly – for instance, because the remnant differs too much from the experimental sample – so that the residuals $Y_i - x_i^r$ are more variable than the outcomes themselves.
Thus, residualizing with $x^r$ can do harm as well as good, running afoul of our “first, do no harm” principle. The next subsection develops an estimator that incorporates $x^r$ only to the extent that it actually helps.
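For concreteness, the remnant estimator (12) in code; because `yhat_r` is produced entirely from remnant data, it enters the experiment as a fixed quantity:

```python
import numpy as np

def remnant_estimator(Y, T, p, yhat_r):
    """Equation (12): residualize outcomes against fixed remnant-based
    predictions, then apply the unadjusted design-based estimator."""
    Y, T = np.asarray(Y, float), np.asarray(T, int)
    w = T / p - (1 - T) / (1 - p)
    return np.mean(w * (Y - np.asarray(yhat_r, float)))
```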
3.2 Flexibly incorporating Remnant-based imputations
Consider a “generalized remnant estimator”
$$\hat{\tau}^{\mathrm{GRE}}(b) = \frac{1}{N} \sum_{i=1}^{N} w_i (Y_i - b\, x_i^r),$$
where $b$ is some prespecified constant. Note that in the special case $b = 0$, $\hat{\tau}^{\mathrm{GRE}}(b)$ reduces to the unadjusted estimator $\hat{\tau}^{\mathrm{HT}}$, while $b = 1$ recovers the remnant estimator $\hat{\tau}^{\mathrm{RE}}$; intermediate values of $b$ discount the remnant-based predictions. Because $b$ is a constant, $\hat{\tau}^{\mathrm{GRE}}(b)$ is exactly unbiased for every choice of $b$, and a well-chosen $b$ can yield lower sampling variance than either special case.
The challenge is that we do not know a priori how well $x^r$ will predict outcomes in the RCT, and hence which value of $b$ is best. Our solution is to estimate the best rescaling from the experimental data itself, using the leave-one-out strategy of Section 2.3: impute subject $i$’s potential outcomes with simple linear regressions on $x^r$,
$$\hat{y}_i(1) = \hat{a}_i^T + \hat{b}_i^T x_i^r, \qquad \hat{y}_i(0) = \hat{a}_i^C + \hat{b}_i^C x_i^r, \tag{10}$$
where we obtain $\hat{a}_i^T$ and $\hat{b}_i^T$ by least squares among the treated subjects other than $i$,
and similarly for $\hat{a}_i^C$ and $\hat{b}_i^C$ among the control subjects other than $i$. Plugging (10) into (6) yields an estimator that automatically calibrates how heavily to lean on the remnant-based predictions: the fitted slopes play the role of $b$.
The estimator defined by (6) and (10) is precisely the sample-splitting estimator of Section 2.3 with the single covariate $x^r$ and least-squares imputation; in the notation introduced there, it is $\hat{\tau}^{\mathrm{SS}}[x^r, \mathrm{LS}]$. The following proposition characterizes its asymptotic sampling variance.
Proposition 1
Let $\hat{\tau}^{\mathrm{SS}}[x^r, \mathrm{LS}]$ be as defined above. Then, under mild regularity conditions, as $N \to \infty$,
$$\mathbb{V}(\hat{\tau}^{\mathrm{SS}}[x^r, \mathrm{LS}]) \le \mathbb{V}(\hat{\tau}^{\mathrm{GRE}}(b^{\ast})) + o(N^{-1}),$$
where $b^{\ast}$ is the variance-minimizing choice of $b$ in the generalized remnant estimator.
Proof
See Appendix B.□
Notably, although this proposition is asymptotic in nature, we expect it to be relevant even in relatively small samples, given that only a small number of parameters – a slope and an intercept within each treatment arm – must be estimated.
Importantly, because the imputations (10) are computed leaving subject $i$ out, the asymptotics play no role in the estimator’s unbiasedness, which is exact in finite samples. Moreover, since $b = 0$ and $b = 1$ are among the available choices, the proposition implies that $\hat{\tau}^{\mathrm{SS}}[x^r, \mathrm{LS}]$ is asymptotically at least as precise as both the unadjusted estimator and the remnant estimator.
In any event, regardless of the properties of the remnant or of $x^r$, the estimator remains exactly unbiased, and its variance estimate (9) remains conservative in expectation.
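A sketch of the generalized remnant estimator for a fixed $b$; in practice, $b$ is replaced by the leave-one-out regression fits of (10), which preserves exact unbiasedness:

```python
import numpy as np

def generalized_remnant(Y, T, p, yhat_r, b):
    """tau-hat-GRE(b): residualize outcomes against b times the remnant
    predictions. b = 0 recovers the unadjusted estimator; b = 1 recovers
    the remnant estimator."""
    Y, T = np.asarray(Y, float), np.asarray(T, int)
    w = T / p - (1 - T) / (1 - p)
    return np.mean(w * (Y - b * np.asarray(yhat_r, float)))
```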
3.3 Combining Remnant-based and within-RCT covariate adjustment
The estimator $\hat{\tau}^{\mathrm{SS}}[x^r, \mathrm{LS}]$ adjusts for the remnant-based predictions $x^r$ but ignores the other covariates ${\boldsymbol{x}}$ available within the RCT, which may carry additional predictive information.
Define
$$\tilde{{\boldsymbol{x}}}_i = ({\boldsymbol{x}}_i, x_i^r),$$
or in other words, the vector of within-RCT covariates augmented with the remnant-based prediction as one additional covariate. Any of the sample-splitting estimators of Section 2.3 may then be applied with $\tilde{{\boldsymbol{x}}}$ as the covariate set.
In general, the precision of the estimator will depend on the performance of the imputation strategy, and in particular, its ability to integrate information from the remnant, via $x^r$, with the information carried by the within-RCT covariates ${\boldsymbol{x}}$. If the remnant-based predictions are accurate, a flexible learner fit to all of $\tilde{{\boldsymbol{x}}}$ may dilute their influence by spreading attention across many covariates.
On the other hand, if the remnant-based predictions are inaccurate, an imputation model that leans heavily on $x^r$ will squander the information in ${\boldsymbol{x}}$.
This suggests imputing potential outcomes using a specialized ensemble learner [64]: a weighted average of linear regression using just $x^r$ and of random forests using all of $\tilde{{\boldsymbol{x}}}$,
$$\hat{y}_i(\cdot) = \hat{\alpha}\, \hat{y}_i^{\mathrm{LS}[x^r]}(\cdot) + (1 - \hat{\alpha})\, \hat{y}_i^{\mathrm{RF}[\tilde{{\boldsymbol{x}}}]}(\cdot),$$
which is an interpolation between the two imputation strategies,
where $\hat{\alpha} \in [0, 1]$ is chosen within the leave-one-out routine to minimize the estimated prediction error of the combination. We denote the resulting estimator by $\hat{\tau}^{\mathrm{SS}}[\tilde{{\boldsymbol{x}}}, \mathrm{EN}]$.
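A sketch of such an ensemble imputer; the weight-selection details shown here (a grid search over cross-validated mean squared error, with five folds) are our own simplification of the leave-one-out routine:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

def ensemble_imputer(X, xr, Y):
    """Weighted average of (i) linear regression on the remnant
    prediction x^r alone and (ii) a random forest on the augmented
    covariates x-tilde = (x, x^r)."""
    Y = np.asarray(Y, float)
    xr = np.asarray(xr, float).reshape(-1, 1)
    Xt = np.hstack([np.asarray(X, float), xr])  # augmented covariates
    # out-of-fold predictions from each component learner
    p_ls = cross_val_predict(LinearRegression(), xr, Y, cv=5)
    p_rf = cross_val_predict(RandomForestRegressor(200, random_state=0), Xt, Y, cv=5)
    # interpolation weight minimizing cross-validated MSE
    grid = np.linspace(0, 1, 21)
    alpha = grid[np.argmin([np.mean((Y - a * p_ls - (1 - a) * p_rf) ** 2)
                            for a in grid])]
    ls = LinearRegression().fit(xr, Y)
    rf = RandomForestRegressor(200, random_state=0).fit(Xt, Y)
    def impute(X_new, xr_new):
        xr_new = np.asarray(xr_new, float).reshape(-1, 1)
        Xt_new = np.hstack([np.asarray(X_new, float), xr_new])
        return alpha * ls.predict(xr_new) + (1 - alpha) * rf.predict(Xt_new)
    return impute
```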
4 Estimating effects in 33 online experiments
4.1 Data from the ASSISTments TestBed
We apply and evaluate the methods described in this work to a set of 33 randomized controlled experiments run within the ASSISTments TestBed, described in Section 1. These A/B tests contrast a variety of pedagogical conditions in modules teaching 6th, 7th, and 8th grade mathematics content. For our purposes, the outcome of interest was completion of the module, a binary variable.
In general, once a TestBed proposal is approved, based on Institutional Review Board and content quality criteria, its experimental conditions are embedded into an ASSISTments assignment. This is then assigned to students, either by a group of teachers recruited by the researcher or, more commonly, by the existing population of teachers using ASSISTments in their classrooms. As an example, consider an experiment comparing text-based hints to video hints. The proposing researcher would create the alternative hints and embed them into particular assignable content, a “problem set.” Then, any time a teacher assigns that problem set to his or her students, those students are randomized to one of the conditions, and, when they request hints, receive them as either text or video.
There are several types of problem sets that researchers can utilize when developing their experiments. In the case of the 33 experiments observed in this work, the problem sets are mastery-based assignments called skill builders. As opposed to more traditional assignments requiring students to complete all problems assigned, skill builders require students to demonstrate a sufficient level of understanding in order to complete the assignment. By default, students must simply answer three consecutive problems correctly without the use of computer-provided aid such as hints or scaffolding (a type of aid that breaks the problem into smaller steps). In this way, completion acts as a measure of knowledge and understanding as well as persistence and learning, as students will be continuously given more problems until they are able to reach the completion threshold. ASSISTments also includes a daily limit of ten problems to encourage students to seek help if they are struggling to reach the threshold.
After the completion of a TestBed experiment, the proposing researcher may download a dataset which includes students’ treatment assignments and their performance within the skill builder, including an indicator for completion. In addition, the dataset includes aggregated features that describe student performance within the learning platform prior to random assignment for each respective experiment. Summary statistics for the nine covariates we used in our analyses, pooled across experiments, are displayed in Table 1. These include the number of problems worked; the percent of problems correct on the first try; the numbers of assignments and homework assignments assigned; the percents of assignments and homework completed, at both the student and the class level; and students’ genders, as guessed by an internal ASSISTments algorithm based on first names. We imputed missing covariate values separately within each experiment (see the sketch following Table 1). When possible, we used the mean of observed values from students in the same classroom; otherwise we used the grand mean. We combined these data with disaggregated log data from students’ individual prior assignments.
Summary statistics for aggregate prior ASSISTments performance used as within-sample covariates: number of problems worked, percent of problems correct on first try, numbers of assignments and homework assignments assigned, percents of assignments and homework completed at the student and class levels, and students’ genders, as guessed by ASSISTments based on first names
| | Mean | SD | % missing |
|---|---|---|---|
| Problem count | 601.13 | 784.45 | 2 |
| Percent correct | 0.68 | 0.13 | 2 |
| Assignments assigned | 104.25 | 413.94 | 13 |
| Percent completion | 0.89 | 0.21 | 13 |
| Class percent completion | 0.90 | 0.13 | 22 |
| Homework assigned | 25.97 | 29.90 | 50 |
| Homework percent completion | 0.93 | 0.16 | 59 |
| Class homework percent completion | 0.93 | 0.09 | 56 |
| Guessed gender | Male: 36% | Female: 36% | Unknown: 28% |
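The missing-covariate imputation just described is simple to express in code. A sketch in pandas, where `class_id` is a hypothetical column identifying each student’s classroom:

```python
import pandas as pd

def impute_covariates(df, covariates, class_col="class_id"):
    """Fill each missing covariate value with its classroom mean when any
    classmate's value is observed, else with the experiment-wide mean."""
    df = df.copy()
    for c in covariates:
        grand_mean = df[c].mean()  # mean of the observed values
        class_means = df.groupby(class_col)[c].transform("mean")
        df[c] = df[c].fillna(class_means).fillna(grand_mean)
    return df
```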
4.2 Imputations from the Remnant
We also gathered analogous data from a large remnant of students who did not participate in any of the 33 experiments we analyzed. Ideally, the remnant would consist of previous ASSISTments students who had worked on the skill builders on which the 33 experiments had been run. If that were the case, we would have considered 33 outcomes of interest, say $Y_1, \ldots, Y_{33}$ – completion of the skill builder on which each experiment was run – and trained a remnant-based model to predict each one.
Rather than use the entire set of past ASSISTments users to build a remnant, we selected students who resembled those who participated in the 33 experiments. For the 11 experiments that we were able to match to other prior work, the remnant consisted of previous students who had worked on at least one of the skill builders in the experiments. For the remaining 22 experiments, we first gathered the collection of problem sets that had been given to students in the experiments prior to their random assignment; the remnant consisted of all other ASSISTments users who had been assigned at least one of those problem sets. In other words, the remnant consisted of students who did not participate in any of the 33 experiments, but had worked on some of the same content as those who did. In all, the remnant consisted of 141,039 distinct students. Sample sizes and skill builder completion rates in the 33 experiments are given in Table A1.
We gathered records of up to 10 assigned skill builders for each student in the remnant, and for each skill builder recorded the number of problems the student started, completed, requested help on, and answered correctly, the total amount of time spent, and assignment completion (i.e., skill mastery). Then, we fit a type of recurrent neural network [65] called long short-term memory (LSTM) [66] to the resulting panel data. The model attempts to detect within-student trends in assignment completion and speed (i.e., the number of problems needed for skill mastery); please see Appendix C for further details. By using 10-fold cross validation within the remnant, we estimated the area under the ROC curve as 0.82 and a root mean squared error of 0.34 for the dependent measure of next assignment completion.
After fitting and validating the model in the remnant, we used it to predict skill builder completion for each subject in each of the 33 experiments. To do so, we gathered log data for each student from up to ten previously assigned skill builders. (Students in the experiments with no prior data were dropped from all analyses.) By using the model fit in the remnant, we predicted whether each student would complete his or her next assigned skill builder. The resulting predictive probabilities were used as the remnant-based predictions $x^r$ in the analyses that follow.
4.3 Results
In each of the 33 experiments, we calculated five different unbiased ATE estimates: (1) the simple difference-in-means estimator $\hat{\tau}^{\mathrm{DM}}$; (2) the remnant estimator $\hat{\tau}^{\mathrm{RE}}$; (3) the sample-splitting estimator adjusting for the remnant-based predictions alone, $\hat{\tau}^{\mathrm{SS}}[x^r, \mathrm{LS}]$; (4) the sample-splitting estimator adjusting for within-RCT covariates with random forests, $\hat{\tau}^{\mathrm{SS}}[{\boldsymbol{x}}; \mathrm{RF}]$; and (5) the ensemble estimator using the augmented covariate vector, $\hat{\tau}^{\mathrm{SS}}[\tilde{{\boldsymbol{x}}}, \mathrm{EN}]$.
Since each of these estimates is unbiased, we will focus on their estimated sampling variances. To aid interpretability, we will express contrasts between the sampling variances of two methods in terms of sample size. The estimated sampling variance of each estimator we consider is inversely proportional to sample size (see, e.g., equation (9)). Therefore, reducing the sampling variance of an estimator by, say, 1/2 is equivalent to doubling its sample size. Under that reasoning, the following discussion will refer to the ratio of estimated sampling variances as a “sample size multiplier.”
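In code, the conversion is a simple ratio; a sketch:

```python
def sample_size_multiplier(v_hat_baseline, v_hat_alternative):
    """Ratio of estimated sampling variances, read as an effective
    sample-size gain: a value of 2 means the alternative estimator is
    roughly as precise as the baseline would be with double the data."""
    return v_hat_baseline / v_hat_alternative
```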
4.3.1 Remnant-based adjustment: comparing $\hat{\tau}^{\mathrm{RE}}$ and $\hat{\tau}^{\mathrm{SS}}[x^r, \mathrm{LS}]$
Figure 1 compares the estimated sampling variances of $\hat{\tau}^{\mathrm{DM}}$, $\hat{\tau}^{\mathrm{RE}}$, and $\hat{\tau}^{\mathrm{SS}}[x^r, \mathrm{LS}]$ across the 33 experiments.
Figure 1: A dotplot showing sample size multipliers (i.e., sampling variance ratios) comparing $\hat{\tau}^{\mathrm{DM}}$, $\hat{\tau}^{\mathrm{RE}}$, and $\hat{\tau}^{\mathrm{SS}}[x^r, \mathrm{LS}]$ on the 33 ASSISTments TestBed experiments.
The leftmost plot contrasts $\hat{\tau}^{\mathrm{RE}}$ with $\hat{\tau}^{\mathrm{DM}}$: residualizing against the remnant-based predictions reduced the estimated sampling variance in some experiments but increased it in others, illustrating the danger discussed in Section 3.1.
In contrast, the $\hat{\tau}^{\mathrm{SS}}[x^r, \mathrm{LS}]$ estimator, which learns from the experimental data how heavily to lean on the remnant-based predictions, was protected from such harm: its estimated sampling variance was never meaningfully larger than that of $\hat{\tau}^{\mathrm{DM}}$.
The rightmost panel of Figure 1 compares $\hat{\tau}^{\mathrm{SS}}[x^r, \mathrm{LS}]$ directly with $\hat{\tau}^{\mathrm{RE}}$.
4.3.2 Incorporating standard covariates
Figure 2 compares the estimated sampling variance of $\hat{\tau}^{\mathrm{SS}}[\tilde{{\boldsymbol{x}}}, \mathrm{EN}]$, which incorporates both the remnant-based predictions and the within-RCT covariates, to those of $\hat{\tau}^{\mathrm{SS}}[x^r, \mathrm{LS}]$, $\hat{\tau}^{\mathrm{SS}}[{\boldsymbol{x}}; \mathrm{RF}]$, and $\hat{\tau}^{\mathrm{DM}}$.
Figure 2: A dotplot showing sample size multipliers (i.e., sampling variance ratios) comparing $\hat{\tau}^{\mathrm{SS}}[\tilde{{\boldsymbol{x}}}, \mathrm{EN}]$ to $\hat{\tau}^{\mathrm{SS}}[x^r, \mathrm{LS}]$, $\hat{\tau}^{\mathrm{SS}}[{\boldsymbol{x}}; \mathrm{RF}]$, and $\hat{\tau}^{\mathrm{DM}}$, respectively, on the 33 ASSISTments TestBed experiments.
The middle panel compares the sampling variances of $\hat{\tau}^{\mathrm{SS}}[\tilde{{\boldsymbol{x}}}, \mathrm{EN}]$ and $\hat{\tau}^{\mathrm{SS}}[{\boldsymbol{x}}; \mathrm{RF}]$, which adjusts for the within-RCT covariates but makes no use of the remnant.
The rightmost panel compares the sampling variances of $\hat{\tau}^{\mathrm{SS}}[\tilde{{\boldsymbol{x}}}, \mathrm{EN}]$ and the unadjusted estimator $\hat{\tau}^{\mathrm{DM}}$.
4.3.3 Covariate adjustment with ANCOVA
The methodological development in Section 3 focused on the covariate-adjusted estimator $\hat{\tau}^{\mathrm{SS}}$, which is design-based and exactly unbiased. A more conventional alternative is the analysis of covariance (ANCOVA): estimate the ATE as $\hat{\beta}$, the coefficient on $T_i$ in the regression
$$Y_i = \beta_0 + \beta T_i + {\boldsymbol{\gamma}}' {\boldsymbol{z}}_i + \epsilon_i,$$
fit with ordinary least squares, where ${\boldsymbol{z}}_i$ is a vector of covariates: either the remnant-based prediction alone, yielding $\hat{\beta}[x^r]$, or the augmented vector $\tilde{{\boldsymbol{x}}}_i$, yielding $\hat{\beta}[\tilde{{\boldsymbol{x}}}]$. We estimated standard errors with the HC2 heteroskedasticity-consistent estimator [68], as implemented in the estimatr package [69] for R [70].
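To illustrate, a minimal statsmodels version of the ANCOVA estimator with HC2 standard errors:

```python
import numpy as np
import statsmodels.api as sm

def ancova_hc2(Y, T, Z):
    """OLS of Y on (1, T, Z); returns the coefficient on T and its HC2
    heteroskedasticity-consistent standard error."""
    design = sm.add_constant(np.column_stack([np.asarray(T, float),
                                              np.asarray(Z, float)]))
    fit = sm.OLS(np.asarray(Y, float), design).fit(cov_type="HC2")
    return fit.params[1], fit.bse[1]
```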
Figure 3 compares the estimated sampling variances of the ANCOVA estimators $\hat{\beta}[x^r]$ and $\hat{\beta}[\tilde{{\boldsymbol{x}}}]$ with those of $\hat{\tau}^{\mathrm{DM}}$ and the sample-splitting estimators $\hat{\tau}^{\mathrm{SS}}[x^r, \mathrm{LS}]$ and $\hat{\tau}^{\mathrm{SS}}[\tilde{{\boldsymbol{x}}}, \mathrm{EN}]$.
Figure 3: A dotplot showing sample size multipliers (i.e., sampling variance ratios), from contrasts between the difference-in-means estimator $\hat{\tau}^{\mathrm{DM}}$, the sample-splitting estimators $\hat{\tau}^{\mathrm{SS}}[x^r, \mathrm{LS}]$ and $\hat{\tau}^{\mathrm{SS}}[\tilde{{\boldsymbol{x}}}, \mathrm{EN}]$, and the ANCOVA estimators $\hat{\beta}[x^r]$ and $\hat{\beta}[\tilde{{\boldsymbol{x}}}]$ with HC2 standard errors, on the 33 ASSISTments TestBed experiments.
Across the board, the precision gains afforded by the ANCOVA estimators were broadly similar to those of the corresponding design-based sample-splitting estimators; the design-based estimators, however, achieve their precision while retaining exact unbiasedness and design-based inference.
5 Discussion
Randomized experiments and observational studies have complementary strengths. Randomized experiments allow for unbiased estimates with minimal statistical assumptions, but often suffer from small sample sizes. Observational studies, by contrast, may offer huge sample sizes, but typically suffer from confounding biases, which must be adjusted for, often through statistical modeling with questionable assumptions. In this article, we have attempted to combine the strengths of both. More specifically, we have sought to improve the precision of randomized experiments by exploiting the rich information available in a large observational dataset.
Our approach may be summarized as “first, do no harm.” A randomized experiment may be analyzed by taking a simple difference in means, which on its own provides a valid design-based unbiased estimate. The rationale for a more complicated analysis would be to improve precision. Our goal has therefore been to ensure that, in attempting to improve precision by incorporating observational data, we have not actually made matters worse. In particular, we have sought to ensure that (1) no biases in the observational data may “leak” into the analysis, (2) we can reasonably expect to improve precision, not harm it, and (3) inference may be justified by the experimental randomization, without the need for additional statistical modeling assumptions.
In this article, we focused on covariate adjustment using $x^r$, the remnant-based predictions of experimental subjects’ outcomes, within a design-based, sample-splitting framework; the same predictions could also be incorporated into other covariate-adjustment routines, such as ANCOVA (Section 4.3.3).
The results from the 33 A/B tests we analyzed suggest that incorporating information gleaned from the remnant of an experiment can indeed improve causal inference – but it does not always do so. The extent to which the remnant can help improve precision depends on the quality of the remnant-based predictions, and this in turn depends on both the quality of the remnant data and the algorithm used to generate the predictions.
The focus of this article was to show that these methods can improve statistical precision without incurring a statistical cost – i.e., without potentially increasing bias or standard errors. However, gathering remnant data and using it to train an algorithm may require substantial human and/or computational resources. Therefore, it is crucial for applied researchers to be able to anticipate in advance the extent to which our methods will outperform estimators that use only RCT data. These cost–benefit calculations can take place at two different points in the research process: before collecting any remnant data, and after collecting data from the remnant but before using it to train a predictive algorithm. Before collecting data from the remnant, researchers may be able to use observed properties of RCT data, along with anticipated, but yet unobserved, properties of the remnant to decide whether to proceed. For instance, some initial empirical results, currently under review, suggest that our methods have the potential to improve statistical precision across a wide range of RCT sample sizes, but that the most dramatic improvements tend to occur when the RCT sample size is small or moderate. Intuition suggests that the greatest contribution of auxiliary data will occur when a large number of covariates are available, but there is little prior information on which covariates are the most important. If remnant data are available, analysts may decide whether to use them to train a predictive algorithm based on explicit comparisons between covariate distributions in the remnant and in the RCT (Appendix D). Intuition suggests that our methods hold the greatest promise when covariates in the remnant and RCT are most similar.
These and other questions will be best answered by applying our methods in a wide variety of contexts. While we have focused on the ASSISTments platform in this article, future work will explore what other sources of auxiliary data, and corresponding prediction algorithms, may be particularly well suited to improving the precision of RCTs typically encountered in education research. Indeed, one of the advantages of developing models on observational data in this manner is that a wide variety of models may be explored, tested, and iteratively improved upon before they are applied to an RCT.
In particular, it will be interesting to consider cases in which the experimental condition varies – and is recorded – in the remnant. For instance, the remnant from an RCT contrasting two common medical procedures may include medical records from previous patients who underwent one or the other procedure. In that case, analysts may train remnant models to impute both potential outcomes as, say, $\hat{y}^r(1)$ and $\hat{y}^r(0)$, and use those separate imputations directly within the sample-splitting estimator (6).
Acknowledgements
We would like to thank Ben Hansen and Charlotte Mann for helpful discussions and Ethan Prihar for help with computation. We would also like to thank the two anonymous reviewers for their comments.
-
Funding information: The research reported here was supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305D210031. The opinions expressed are those of the authors and do not represent views of the Institute or the U.S. Department of Education. E. Wu was supported by NSF RTG grant DMS-1646108. N. Heffernan oversaw the creation of the 33 experiments and provided the data from ASSISTments; we want to acknowledge the funding that created/related to ASSISTments from (1) NSF (e.g., 2118725, 2118904, 1950683, 1917808, 1931523, 1940236, 1917713, 1903304, 1822830, 1759229, 1724889, 1636782, and 1535428), (2) IES (e.g., R305N210049, R305D210031, R305A170137, R305A170243, R305A180401, R305D210036, R305A120125, and R305R220012), (3) GAANN (e.g., P200A180088 and P200A150306), (4) EIR (U411B190024 and S411B210024), (5) ONR (N00014-18-1-2768), and (6) Schmidt Futures. None of the opinions expressed here are those of the funders.
-
Conflict of interest: The authors state no conflict of interest.
-
Data availability statement: Code and data are available at https://osf.io/d9ujq/.
Appendix A Summary of A/B test data
Table A1 gives sample sizes and skill builder completion rates in the 33 experiments discussed in the paper.
B Proof of Proposition 1
Proposition 1
Let $\hat{\tau}^{\mathrm{SS}}[x^r, \mathrm{LS}]$ be as defined in Section 3.2. Then, under mild regularity conditions, as $N \to \infty$,
$$\mathbb{V}(\hat{\tau}^{\mathrm{SS}}[x^r, \mathrm{LS}]) \le \mathbb{V}(\hat{\tau}^{\mathrm{GRE}}(b^{\ast})) + o(N^{-1}),$$
where $b^{\ast}$ is the variance-minimizing choice of $b$ in the generalized remnant estimator.
We first explicitly define
Comparing (A1) to (10), we see that in order to prove the desired result, it is sufficient to show that
Let
Sample sizes and % homework completion – the outcome of interest – by treatment group in each of the 33 A/B tests
| Experiment | Trt n | Ctl n | Trt % complete | Ctl % complete | Experiment | Trt n | Ctl n | Trt % complete | Ctl % complete |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 956 | 961 | 94 | 93 | 18 | 165 | 170 | 92 | 89 |
| 2 | 329 | 363 | 98 | 96 | 19 | 259 | 246 | 82 | 85 |
| 3 | 649 | 610 | 86 | 88 | 20 | 199 | 213 | 85 | 88 |
| 4 | 201 | 228 | 97 | 95 | 21 | 258 | 276 | 82 | 80 |
| 5 | 910 | 887 | 73 | 72 | 22 | 188 | 193 | 89 | 85 |
| 6 | 931 | 900 | 61 | 64 | 23 | 242 | 266 | 81 | 76 |
| 7 | 360 | 344 | 88 | 88 | 24 | 279 | 235 | 72 | 69 |
| 8 | 492 | 463 | 79 | 81 | 25 | 269 | 288 | 65 | 59 |
| 9 | 215 | 211 | 93 | 92 | 26 | 225 | 232 | 73 | 74 |
| 10 | 231 | 197 | 92 | 91 | 27 | 267 | 256 | 63 | 62 |
| 11 | 607 | 578 | 68 | 63 | 28 | 228 | 244 | 68 | 64 |
| 12 | 370 | 384 | 83 | 82 | 29 | 239 | 258 | 54 | 48 |
| 13 | 338 | 289 | 88 | 84 | 30 | 74 | 92 | 91 | 84 |
| 14 | 478 | 476 | 76 | 73 | 31 | 69 | 67 | 91 | 87 |
| 15 | 193 | 209 | 89 | 93 | 32 | 76 | 81 | 62 | 70 |
| 16 | 404 | 451 | 73 | 69 | 33 | 15 | 11 | 73 | 55 |
| 17 | 264 | 274 | 84 | 85 | |||||
To complete the proof, it suffices to show that
where
Here, the
where
C Deep learning in the Remnant to impute completion
We used the remnant to train a variant of a recurrent neural network [65] called a long short-term memory (LSTM) network [66] to predict students’ assignment completion. Deep learning models, and particularly LSTM networks, have been previously applied successfully to model similar temporal relationships in various areas of educational research [74,75].
Neural networks, including recurrent networks such as those explored here, are universal function approximators [76,77]. These models are commonly represented as “layers” of neurons; these feed from a set of inputs, through one or more “hidden” layers, to an output layer, where, in the basic case, the output of each layer is determined by
$$h = \phi(W x + b), \tag{A6}$$
where $x$ is the layer’s input vector, $W$ is a set of learned weights, comparable to the coefficients learned in a regression model, $b$ is a learned bias term, and the activation function $\phi$ is a nonlinear function, such as a sigmoid or hyperbolic tangent, applied elementwise to produce the layer’s output $h$.
Recurrent networks build upon this formulation to add layers that utilize not only the outputs of preceding layers, but also incorporate values from previous time steps within a supplied series; in time series data, the model estimates for a particular time step may be better informed by information from previous time steps, and a recurrent network structure is designed to take advantage of this likelihood. The LSTM networks explored here incorporate a set of “gates” that regulate the flow of data from both preceding layers and a “cell memory” that is calculated through previous time steps. The output of this LSTM layer is given by equations (A7)–(A12):
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f), \tag{A7}$$
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i), \tag{A8}$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o), \tag{A9}$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c), \tag{A10}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \tag{A11}$$
$$h_t = o_t \odot \tanh(c_t), \tag{A12}$$
where $t$ indexes time steps within the supplied series, $x_t$ is the recurrent layer’s input at step $t$, $h_{t-1}$ is its output at the previous step, $c_t$ is the cell memory, $\sigma$ is the logistic sigmoid, and $\odot$ denotes elementwise multiplication.
In the aforementioned equations, the gates $f_t$, $i_t$, and $o_t$ – the forget, input, and output gates – regulate, respectively, how much of the previous cell memory is retained, how much of the candidate update $\tilde{c}_t$ is written to memory, and how much of the updated memory is passed on as the layer’s output.
As a recurrent network, the model is trained by iteratively updating the weight matrices ($W$ and $U$ in the above equations) through a procedure known as backpropagation through time [78] combined with a stochastic gradient descent method called Adam [79]. These methods are informed by a cost function (sometimes called a loss function) that is calculated through the comparison of model predictions with supplied ground truth labels. In this work, we adopted a network structure that incorporates multi-task learning [80] as a means of regularization. In other words, our model ultimately produces two sets of predictions corresponding with two outcomes of interest: student completion and inverse mastery speed, each on the subsequent assignment. By optimizing model weights in regard to these two outcomes, the process helps prevent the model from overfitting to either outcome; as student completion of their next assignment is the outcome explored in this work, the second outcome of inverse mastery speed is used only for this regularization purpose and is not utilized in subsequent analyses. Given that student completion is binary and inverse mastery speed is a continuous measure, the formula of which is described in Table A2, the cost function for our model training was calculated as a linear combination of two separate cost functions. Binary cross-entropy is used in the case of next assignment completion,
$$\mathcal{L}_{\mathrm{comp}} = -\frac{1}{n} \sum_{j=1}^{n} [y_j \log \hat{y}_j + (1 - y_j) \log(1 - \hat{y}_j)], \tag{A13}$$
while root mean squared error,
$$\mathcal{L}_{\mathrm{speed}} = \sqrt{\frac{1}{n} \sum_{j=1}^{n} (s_j - \hat{s}_j)^2}, \tag{A14}$$
is used in the case of inverse mastery speed on the next assignment. The final cost function is then given as the linear combination
$$\mathcal{L} = \lambda_1 \mathcal{L}_{\mathrm{comp}} + \lambda_2 \mathcal{L}_{\mathrm{speed}}, \tag{A15}$$
where $\lambda_1$ and $\lambda_2$ weight the two tasks, and $\mathcal{L}$ is calculated over smaller “batches” of samples over multiple training cycles known as epochs.
The training of the model continues by calculating the cost and iteratively updating model weights over multiple epochs until a stopping criterion is met. In this regard, we hold out 30% of the training data as a validation set. Model performance is calculated on this validation set after each epoch of training. Training ceases once the model performance on this validation set stops improving (i.e., the difference in model performance from one epoch to the next falls below a designated threshold). To avoid stopping the training process too early due to small fluctuations in model performance on the validation set early in the training procedure, a 5-epoch moving average of validation cost is used as the stopping criterion.
Assignment-level features in LSTM model
| Input feature | Description |
|---|---|
| Problems started | The number of problems started by the student. (Untransformed & Sq.Root) |
| Problems completed | The number of problems completed by the student. (Untransformed & Sq.Root) |
| Inverse mastery speed | The inverse of the number of problems needed to complete the mastery assignment, or 0 where the student did not complete. (Untransformed & Sq.Root) |
| Percent correct | The percentage of problems answered correctly on the first attempt without the use of hints. (Untransformed & Sq.Root) |
| Assignment completion | Whether the current assignment was completed by the student. |
| Attempts per problem | The number of attempts taken to correctly answer each problem. (Avg. & Sq.Root) |
| First response time | The time taken per problem before making the first action. (Avg.) |
| Problem duration | The time, in seconds, needed to solve each problem. (Avg.) |
| Days with activity | The number of distinct days on which the student worked on each problem in the assignment. (Avg.) |
| Attempted problem first | Whether, on each problem, the first action was an attempt to answer (as opposed to a help request). (Avg.) |
| Requested answer hint | Whether, on each problem, the student needed to be given the answer to progress. (Avg.) |
The specific model structure used in this work was an LSTM network comprising three layers: 16 input covariates describing each time step feed into a hidden LSTM layer of 100 nodes, which in turn informs an output layer of two units corresponding with the two previously described outcomes of interest. The input features used in this model, described in Table A2, represent transformed and nontransformed versions of several metrics that describe different aspects of student performance within a single assignment. We considered sequences of at most ten worked skill builder assignments (cf. Section 4.2) to predict student completion on a subsequent skill builder assignment.
We specified the LSTM model’s hyperparameters (e.g., number of LSTM nodes, delta of stopping criterion, weight update step size) based on previously successful model structures and training procedures within the context of education. We evaluated the model using a 10-fold cross validation within the remnant to gain a measure of model fit (leading to an ROC area under the curve of 0.82 and root mean squared error of 0.34 for the dependent measure of next assignment completion). After this evaluation, the model is then re-trained using the full set of remnant data. This trained model is then used within the analyses described in Section 4.
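To make the architecture concrete, here is a minimal Keras sketch of the multi-task network described above. Unstated hyperparameters (output activations, loss weights, batch size) are illustrative assumptions, and mean squared error stands in for the RMSE of (A14) for simplicity:

```python
from tensorflow.keras import Model, layers

def build_multitask_lstm(n_steps=10, n_features=16, n_hidden=100):
    """A shared LSTM layer feeding two heads: next-assignment completion
    (binary) and inverse mastery speed (continuous); cf. (A13)-(A15)."""
    inputs = layers.Input(shape=(n_steps, n_features))
    h = layers.LSTM(n_hidden)(inputs)  # recurrent layer, eqs. (A7)-(A12)
    completion = layers.Dense(1, activation="sigmoid", name="completion")(h)
    speed = layers.Dense(1, name="inv_mastery_speed")(h)
    model = Model(inputs, [completion, speed])
    model.compile(
        optimizer="adam",                            # Adam [79]
        loss={"completion": "binary_crossentropy",   # eq. (A13)
              "inv_mastery_speed": "mse"},           # in place of (A14)
        loss_weights={"completion": 1.0, "inv_mastery_speed": 1.0},
    )
    return model
```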
D Comparing covariates in the remnant to the RCT
The requirement (5) that imputations of experimental subjects’ potential outcomes be independent of their treatment assignments implies that the remnant-based model $\hat{y}^r$ must be constructed without using outcome or treatment-assignment data from the experiment itself.
This restriction, however, does not extend to covariate data ${\boldsymbol{x}}$, which are measured prior to randomization: analysts may freely compare the covariate distributions of the remnant and the RCT, and may even use the RCT’s covariates to guide the construction of the remnant-based model, without compromising unbiasedness. In principle, such comparisons might predict when remnant-based adjustment will be most helpful.
Here, we discuss a technique we attempted, although we do not believe that it achieved its aim.
The intuition behind our approach is based roughly on “nearest neighbors”: if an experimental subject’s covariate profile lies close to those of many remnant members, the remnant-based model should extrapolate well to that subject, whereas a subject far from the remnant in covariate space may be served poorly. As a measure of proximity, for each subject $i$ we consider $\bar{d}_i^5$, the average distance between $i$’s covariate vector and those of its five nearest neighbors in the remnant.
To calculate this measure for TestBed A/B tests, we first flattened each subject’s covariate data by averaging their assignment-level statistics and also including a covariate equal to the number of included assignments. Then, we chose the five remnant members closest to each subject in covariate space and averaged those five distances to form $\bar{d}_i^5$; we computed the analogous quantity for each member of the remnant as well.
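A sketch of this computation; the standardization and Euclidean metric shown here are our assumptions, since the exact scaling of the flattened features is a judgment call:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def mean_knn_distance(X_query, X_remnant, k=5):
    """d-bar_i^k: for each row of X_query, the mean distance to its k
    nearest neighbors among the (standardized) remnant covariates."""
    scaler = StandardScaler().fit(X_remnant)
    nn = NearestNeighbors(n_neighbors=k).fit(scaler.transform(X_remnant))
    dists, _ = nn.kneighbors(scaler.transform(X_query))
    return dists.mean(axis=1)
```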
Figure A1 shows the results. Each panel corresponds to a different A/B test and displays a boxplot of $\bar{d}_i^5$ for the subjects in that experiment alongside a boxplot of the analogous measure for members of the corresponding remnant.
Figure A1: Boxplots comparing the distribution of $\bar{d}_i^5$ for each ASSISTments TestBed A/B test against the analogous distribution for the corresponding remnant. Panels are ordered from lowest to highest according to $\hat{\mathbb{V}}(\hat{\tau}^{\mathrm{DM}}) / \hat{\mathbb{V}}(\hat{\tau}^{\mathrm{SS}}[x^r, \mathrm{LS}])$.
Unfortunately, no pattern is apparent, suggesting that, at least in these experiments, $\bar{d}_i^5$ does not predict when remnant-based adjustment will improve precision.
References
[1] Schochet PZ. Statistical theory for the RCT-YES software: Design-based causal inference for RCTs. NCEE 2015-4011. Washington, D.C.: National Center for Education Evaluation and Regional Assistance; 2015.Search in Google Scholar
[2] Rosenbaum PR. Covariance adjustment in randomized experiments and observational studies. Stat Sci. 2002;17(3):286–327.10.1214/ss/1042727942Search in Google Scholar
[3] Sales AC, Hansen BB, Rowan B. Rebar: Reinforcing a matching estimator with predictions from high-dimensional covariates. J Educ Behav Stat. 2018;43(1):3–31.10.3102/1076998617731518Search in Google Scholar
[4] Heffernan NT, Heffernan CL. The ASSISTments ecosystem: building a platform that brings scientists and teachers together for minimally invasive research on human learning and teaching. Int J Artif Intell Educ. 2014;24(4):470–97.10.1007/s40593-014-0024-xSearch in Google Scholar
[5] Ostrow KS, Selent D, Wang Y, Van Inwegen EG, Heffernan NT, Williams JJ. The assessment of learning infrastructure (ALI): the theory, practice, and scalability of automated assessment. In: Proceedings of the Sixth International Conference on Learning Analytics & Knowledge. ACM; 2016. p. 279–88.10.1145/2883851.2883872Search in Google Scholar
[6] Fyfe ER. Providing feedback on computer-based algebra homework in middle-school classrooms. Comput Human Behav. 2016;63:568–74.10.1016/j.chb.2016.05.082Search in Google Scholar
[7] Walkington C, Clinton V, Sparks A. The effect of language modification of mathematics story problems on problem-solving in online homework. Instruct Sci. 2019;47:1–31.10.1007/s11251-019-09481-6Search in Google Scholar
[8] Prihar E, Syed M, Ostrow K, Shaw S, Sales A, Heffernan N. Exploring common trends in online educational experiments. In: Proceedings of the 15th International Conference on Educational Data Mining; 2022. p. 27.Search in Google Scholar
[9] Vanacore K, Gurung A, Mcreynolds A, Liu A, Shaw S, Heffernan N. Impact of non-cognitive interventions on student learning behaviors and outcomes: an analysis of seven large-scale experimental inventions. In: LAK23: 13th International Learning Analytics and Knowledge Conference. LAK2023. New York, NY, USA: Association for Computing Machinery; 2023. p. 165–74. 10.1145/3576050.3576073.Search in Google Scholar
[10] Gurung A, Baral S, Vanacore KP, Mcreynolds AA, Kreisberg H, Botelho AF, et al. Identification, exploration, and remediation: can teachers predict common wrong answers? In: LAK23: 13th International Learning Analytics and Knowledge Conference. LAK2023. New York, NY, USA: Association for Computing Machinery; 2023. p. 399–410. 10.1145/3576050.3576109.Search in Google Scholar
[11] Gurung A, Vanacore KP, McReynolds AA, Ostrow KS, Sales AC, Heffernan N. How common are common wrong answers? exploring remediation at scale. In: Proceedings of the Tenth ACM Conference on Learning@ Scale (L@S’23). New York, NY, USA: ACM; 2023.10.1145/3573051.3593390Search in Google Scholar
[12] Selent D, Patikorn T, Heffernan N. Assistments dataset from multiple randomized controlled experiments. In: Proceedings of the Third (2016) ACM Conference on Learning@ Scale. ACM; 2016. p. 181–4.10.1145/2876034.2893409Search in Google Scholar
[13] Diamond A, Sekhon JS. Genetic matching for estimating causal effects: a general multivariate matching method for achieving balance in observational studies. Rev Econom Stat. 2013;95(3):932–45.10.1162/REST_a_00318Search in Google Scholar
[14] Künzel SR, Stadie BC, Vemuri N, Ramakrishnan V, Sekhon JS, Abbeel P. Transfer learning for estimating causal effects using neural networks. INFORMS. 2019.Search in Google Scholar
[15] Rzepakowski P, Jaroszewicz S. Decision trees for uplift modeling with single and multiple treatments. Knowledge Inform Syst. 2012;32(2):303–27.10.1007/s10115-011-0434-0Search in Google Scholar
[16] Aronow PM, Middleton JA. A class of unbiased estimators of the average treatment effect in randomized experiments. J Causal Inference. 2013;1(1):135–54.10.1515/jci-2012-0009Search in Google Scholar
[17] Wager S, Du W, Taylor J, Tibshirani RJ. High-dimensional regression adjustments in randomized experiments. Proc Natl Academy Sci. 2016;113(45):12673–8.10.1073/pnas.1614732113Search in Google Scholar PubMed PubMed Central
[18] Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey W, et al. Double/debiased machine learning for treatment and structural parameters. Econometrics J. 2018;21(1):C1–68.10.1111/ectj.12097Search in Google Scholar
[19] Bloniarz A, Liu H, Zhang CH, Sekhon JS, Yu B. Lasso adjustments of treatment effect estimates in randomized experiments. Proc Natl Acad Sci. 2016;113(27):7383–90.10.1073/pnas.1510506113Search in Google Scholar PubMed PubMed Central
[20] Rosenblum M, Van Der Laan MJ. Simple, efficient estimators of treatment effects in randomized trials using generalized linear models to leverage baseline variables. Int J Biostat. 2010;6(1). https://doi.org/10.2202/1557-4679.1138.10.2202/1557-4679.1138Search in Google Scholar PubMed PubMed Central
[21] Van der Laan MJ, Rose S. Targeted learning: causal inference for observational and experimental data. New York: Springer Science & Business Media; 2011.10.1007/978-1-4419-9782-1Search in Google Scholar
[22] Pocock SJ. The combination of randomized and historical controls in clinical trials. J Chronic Diseases. 1976;29(3):175–88.10.1016/0021-9681(76)90044-8Search in Google Scholar PubMed
[23] Viele K, Berry S, Neuenschwander B, Amzal B, Chen F, Enas N, et al. Use of historical control data for assessing treatment effects in clinical trials. Pharmaceut Stat. 2014;13(1):41–54.10.1002/pst.1589Search in Google Scholar PubMed PubMed Central
[24] Yuan J, Liu J, Zhu R, Lu Y, Palm U. Design of randomized controlled confirmatory trials using historical control data to augment sample size for concurrent controls. J Biopharmaceut Stat. 2019;29(3):558–73.10.1080/10543406.2018.1559853Search in Google Scholar PubMed
[25] Deng A, Xu Y, Kohavi R, Walker T. Improving the sensitivity of online controlled experiments by utilizing pre-experiment data. In: Proceedings of the Sixth ACM International Conference on Web Search and Data Mining; 2013. p. 123–32.10.1145/2433396.2433413Search in Google Scholar
[26] Gui G. Combining observational and experimental data using first-stage covariates. 2020. arXiv: http://arXiv.org/abs/arXiv:201005117.10.2139/ssrn.3662061Search in Google Scholar
[27] Opper IM. Improving average treatment effect estimates in small-scale randomized controlled trials. EdWorkingPapers. 2021. https://edworkingpapers.org/sites/default/files/ai21-344.pdf.10.7249/WRA1004-1Search in Google Scholar
[28] Bareinboim E, Pearl J. Causal inference and the data-fusion problem. Proce Natl Acad Sci. 2016;113(27):7345–52.10.1073/pnas.1510507113Search in Google Scholar PubMed PubMed Central
[29] Hartman E, Grieve R, Ramsahai R, Sekhon JS. From sample average treatment effect to population average treatment effect on the treated: combining experimental with observational studies to estimate population treatment effects. J R Stat Soc Ser A. 2015;10:1111.10.1111/rssa.12094Search in Google Scholar
[30] Athey S, Chetty R, Imbens G. Combining experimental and observational data to estimate treatment effects on long term outcomes. 2020. http://arXiv.org/abs/arXiv:200609676.Search in Google Scholar
[31] Rosenman ET, Owen AB. Designing experiments informed by observational studies. J Causal Inference. 2021;9(1):147–71.10.1515/jci-2021-0010Search in Google Scholar
[32] Rosenman ET, Basse G, Owen AB, Baiocchi M. Combining observational and experimental datasets using shrinkage estimators. Biometrics. 2020;1–13. https://doi.org/10.1111/biom.13827.10.1111/biom.13827Search in Google Scholar PubMed
[33] Rosenman ET, Owen AB, Baiocchi M, Banack HR. Propensity score methods for merging observational and experimental datasets. Stat Med. 2022;41(1):65–86.10.1002/sim.9223Search in Google Scholar PubMed
[34] Chen S, Zhang B, Ye T.Minimax rates and adaptivity in combining experimental and observational data. 2021. http://arXiv.org/abs/arXiv:210910522.Search in Google Scholar
[35] Kallus N, Puli AM, Shalit U. Removing hidden confounding by experimental grounding. In: Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, editors. Advances in neural information processing systems. Vol. 31. Curran Associates, Inc.; 2018. p. 10888–97.Search in Google Scholar
[36] Degtiar I, Rose S. A review of generalizability and transportability. Annual Rev Stat Appl. 2023;10:501–24.10.1146/annurev-statistics-042522-103837Search in Google Scholar
[37] Colnet B, Mayer I, Chen G, Dieng A, Li R, Varoquaux G, et al. Causal inference methods for combining randomized trials and observational studies: a review. 2020. arXiv: http://arXiv.org/abs/arXiv:201108047.Search in Google Scholar
[38] Breidt FJ, Opsomer JD. Model-assisted survey estimation with modern prediction techniques. Stat Sci. 2017;32(2):190–205.10.1214/16-STS589Search in Google Scholar
[39] Erciulescu AL, Cruze NB, Nandram B. Statistical challenges in combining survey and auxiliary data to produce official statistics. J Official Stat (JOS). 2020;36(1):63–88.10.2478/jos-2020-0004Search in Google Scholar
[40] Dagdoug M, Goga C, Haziza D. Model-assisted estimation through random forests in finite population sampling. J Amer Stat Assoc. 2021;118:1234–51.10.1080/01621459.2021.1987250Search in Google Scholar
[41] McConville KS, Moisen GG, Frescino TS. A tutorial on model-assisted estimation with application to forest inventory. Forests. 2020;11(2):244.10.3390/f11020244Search in Google Scholar
[42] Neyman J. On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Stat Sci. 1923;5:463–80. 1990; transl. by D.M. Dabrowska and T.P. Speed.10.1214/ss/1177012031Search in Google Scholar
[43] Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol. 1974;66(5):688.10.1037/h0037350Search in Google Scholar
[44] Horvitz DG, Thompson DJ. A generalization of sampling without replacement from a finite universe. J Amer Stat Assoc. 1952;47(260):663–85.10.1080/01621459.1952.10483446Search in Google Scholar
[45] Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. J Amer Stat Assoc. 1994;89(427):846–66.10.1080/01621459.1994.10476818Search in Google Scholar
[46] Scharfstein DO, Rotnitzky A, Robins JM. Rejoinder. J Amer Stat Assoc. 1999;94(448):1135–46.10.1080/01621459.1999.10473869Search in Google Scholar
[47] Robins JM. Robust estimation in sequentially ignorable missing data and causal inference models. In: Proceedings of the American Statistical Association. vol. 1999. Indianapolis, IN; 2000. p. 6–10.Search in Google Scholar
[48] Bang H, Robins JM. Doubly robust estimation in missing data and causal inference models. Biometrics. 2005;61(4):962–73.10.1111/j.1541-0420.2005.00377.xSearch in Google Scholar PubMed
[49] van der Laan MJ, Rubin D. Targeted maximum likelihood learning. Int J Biostat. 2006;2(1). https://doi.org/10.2202/1557-4679.1043.10.2202/1557-4679.1043Search in Google Scholar
[50] Tsiatis AA, Davidian M, Zhang M, Lu X. Covariate adjustment for two-sample treatment comparisons in randomized clinical trials: a principled yet flexible approach. Stat Med. 2008;27(23):4658–77.10.1002/sim.3113Search in Google Scholar PubMed PubMed Central
[51] Moore KL, van der Laan MJ. Covariate adjustment in randomized trials with binary outcomes: targeted maximum likelihood estimation. Stat Med. 2009;28(1):39–64.10.1002/sim.3445Search in Google Scholar PubMed PubMed Central
[52] Belloni A, Chernozhukov V, Hansen C. Inference on treatment effects after selection among high-dimensional controls. Rev Econom Stud. 2014;81(2):608–50.10.1093/restud/rdt044Search in Google Scholar
[53] Wu E, Gagnon-Bartsch JA. The LOOP estimator: adjusting for covariates in randomized experiments. Evaluat. Rev. 2018;42(4):458–88.10.1177/0193841X18808003Search in Google Scholar PubMed
[54] Freedman DA. On regression adjustments to experimental data. Adv Appl Math. 2008;40(2):180–93.10.1016/j.aam.2006.12.003Search in Google Scholar
[55] Hahn J. On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica. 1998;66:315–31.10.2307/2998560Search in Google Scholar
[56] Rothe C. The value of knowing the propensity score for estimating average treatment effects. IZA Discussion Papers. 2016. (9989).10.2139/ssrn.2797560Search in Google Scholar
[57] Breiman L. Random forests. Machine Learn. 2001;45(1):5–32.10.1023/A:1010933404324Search in Google Scholar
[58] Jiang K, Mukherjee R, Sen S, Sur P. A new central limit theorem for the augmented IPW estimator: variance inflation, cross-fit covariance and beyond. 2022. arXiv: http://arXiv.org/abs/arXiv:220510198.Search in Google Scholar
[59] Smucler E, Rotnitzky A, Robins JM. A unifying approach for doubly-robust ℓ1 regularized estimation of causal contrasts. 2019. arXiv: http://arXiv.org/abs/arXiv:19040373.Search in Google Scholar
[60] Wu E, Gagnon-Bartsch JA. Design-based covariate adjustments in paired experiments. J Educ Behav Stat. 2021;46(1):109–32. doi: 10.3102/1076998620941469.
[61] Aronow PM, Green DP, Lee DKK. Sharp bounds on the variance in randomized experiments. Ann Statist. 2014;42(3):850–71. doi: 10.1214/13-AOS1200.
[62] Freedman D, Pisani R, Purves R, Adhikari A. Statistics. New York: WW Norton & Company; 2007.
[63] Sales AC, Botelho A, Patikorn TM, Heffernan NT. Using big data to sharpen design-based inference in A/B tests. In: Proceedings of the 11th International Conference on Educational Data Mining. International Educational Data Mining Society; 2018. p. 479–86.
[64] Opitz D, Maclin R. Popular ensemble methods: an empirical study. J Artif Intell Res. 1999;11:169–98. doi: 10.1613/jair.614.
[65] Williams RJ, Zipser D. A learning algorithm for continually running fully recurrent neural networks. Neural Comput. 1989;1(2):270–80. doi: 10.1162/neco.1989.1.2.270.
[66] Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735–80. doi: 10.1162/neco.1997.9.8.1735.
[67] Walsh D, Miller D, Hall D, Walsh J, Fisher C, Schuler A. Prognostic covariate adjustment: a novel method to reduce trial sample sizes while controlling type I error. Talk presented at the Joint Statistical Meetings; 2022. https://ww2.amstat.org/meetings/jsm/2022/onlineprogram/AbstractDetails.cfm?abstractid=320608.
[68] MacKinnon JG, White H. Some heteroskedasticity-consistent covariance matrix estimators with improved finite sample properties. J Econom. 1985;29(3):305–25. doi: 10.1016/0304-4076(85)90158-7.
[69] Blair G, Cooper J, Coppock A, Humphreys M, Sonnet L. estimatr: fast estimators for design-based inference. R package version 0.30.2; 2021. https://CRAN.R-project.org/package=estimatr.
[70] R Development Core Team. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2011. ISBN 3-900051-07-0. http://www.R-project.org/.
[71] Lin W. Agnostic notes on regression adjustments to experimental data: reexamining Freedman's critique. Ann Appl Stat. 2013;7(1):295–318. doi: 10.1214/12-AOAS583.
[72] Guo K, Basse G. The generalized Oaxaca-Blinder estimator. J Amer Stat Assoc. 2021;118:1–13. doi: 10.1080/01621459.2021.1941053.
[73] Seber GA, Lee AJ. Linear regression analysis. Vol. 329. Hoboken, NJ: John Wiley & Sons; 2012.
[74] Piech C, Bassen J, Huang J, Ganguli S, Sahami M, Guibas LJ, et al. Deep knowledge tracing. In: Advances in Neural Information Processing Systems. Red Hook, NY: Curran Associates, Inc.; 2015. p. 505–13.
[75] Botelho AF, Baker RS, Heffernan NT. Improving sensor-free affect detection using deep learning. In: International Conference on Artificial Intelligence in Education. Cham, Switzerland: Springer; 2017. p. 40–51. doi: 10.1007/978-3-319-61425-0_4.
[76] Hornik K, Stinchcombe M, White H. Multilayer feedforward networks are universal approximators. Neural Netw. 1989;2(5):359–66. doi: 10.1016/0893-6080(89)90020-8.
[77] Schäfer AM, Zimmermann HG. Recurrent neural networks are universal approximators. In: International Conference on Artificial Neural Networks. Berlin: Springer; 2006. p. 632–40. doi: 10.1007/11840817_66.
[78] Werbos PJ. Backpropagation through time: what it does and how to do it. Proc IEEE. 1990;78(10):1550–60. doi: 10.1109/5.58337.
[79] Kingma DP, Ba J. Adam: a method for stochastic optimization. 2014. arXiv:1412.6980.
[80] Caruana R. Multitask learning. Machine Learn. 1997;28(1):41–75. doi: 10.1023/A:1007379606734.
© 2023 the author(s), published by De Gruyter
This work is licensed under the Creative Commons Attribution 4.0 International License.