
Treatment effect estimation with observational network data using machine learning

Corinne Emmenegger, Meta-Lina Spohn, Timon Elmer and Peter Bühlmann
Published/Copyright: April 3, 2025

Abstract

Causal inference methods for treatment effect estimation usually assume independent units. However, this assumption is often questionable because units may interact, resulting in spillover effects between them. We develop augmented inverse probability weighting (AIPW) for estimation and inference of the expected average treatment effect (EATE) with observational data from a single (social) network with spillover effects. In contrast to overall effects such as the global average treatment effect, the EATE measures, in expectation and on average over all units, how the outcome of a unit is causally affected by its own treatment, marginalizing over the spillover effects from other units. We develop cross-fitting theory with plugin machine learning to obtain a semiparametric treatment effect estimator that converges at the parametric rate and asymptotically follows a Gaussian distribution. The asymptotics are developed using the dependency graph rather than the network graph, which makes explicit that we allow for spillover effects beyond immediate neighbors in the network. We apply our AIPW method to the Swiss StudentLife Study data to investigate the effect of hours spent studying on exam performance accounting for the students’ social network.

MSC 2010: 62D20; 62G20

1 Introduction

Classical causal inference approaches for treatment effect estimation with observational data usually assume independent units. This assumption is part of the common stable unit treatment value assumption (SUTVA) [1]. However, independence is often violated in practice due to interactions among units that lead to so-called spillover effects. For example, the vaccination against an infectious disease (treatment) of a person (unit) may not only influence this person’s health status (outcome) but may also protect the health status of other people the person is interacting with [2,3]. In the presence of spillover effects, standard algorithms fail to separate correlation from causation, and spurious associations due to network dependence contribute to the replication crisis [4] and may yield biased causal effect estimators and invalid inference [2,48]. New approaches are required to guarantee valid causal inference from observational data with spillover effects.

We consider the following types of spillover effects: (i) causal effects of other units’ treatments on a given unit’s outcome, referred to as interference in the literature [5,9], and (ii) causal effects of other units’ covariates on a given unit’s treatment or outcome.[1] The spillover effects a unit receives are governed by the proximity of this unit to other units in a known undirected network $G$. The edges of this network represent some kind of interaction or relationship between the respective units, such as friendship, geographical closeness, or a shared department in a company.

In this article, the causal effect of interest and target of inference is the expected average treatment effect (EATE) [3] in an observational setting. The EATE measures, in expectation and on average over all units, how the outcome of a unit is causally affected by its own treatment in the presence of spillover effects from other units. The EATE is the relevant statistical parameter when the question is how, on average over all units, the outcome of a unit is influenced when only its own treatment is altered. In the infectious disease example, the EATE measures the average expected difference in health status of an individual assigned to the vaccination versus not, marginalizing over unit-specific covariates and spillover effects of other people. This corresponds to the medical effect of the vaccine in a person’s body. This interpretation highlights that the EATE is not an estimand for policy evaluation, where, for example, one is interested in capturing the effect of jointly vaccinating a sample of the population.

We now formalize the EATE following the study by Sofrygin and van der Laan [10]. For each unit $i = 1, 2, \ldots, N$, let $W_i \in \{0, 1\}$ be the dichotomous treatment, $Y_i$ be the response, and $C_i$ be the covariates of unit $i$. The $N$ units are connected in a fixed undirected network $G$ in which they may exhibit spillover effects of the two aforementioned types (i) and (ii) from their immediate neighbors and/or units further away. Let $P_L^N$ be the observational distribution of $O = (W_i, C_i, Y_i)_{i=1,\ldots,N}$, where $L$ is the distribution of $W = (W_1, \ldots, W_N)$ given $C = (C_1, \ldots, C_N)$. Let $P_{\tilde{L}}^N$ be the distribution of $\tilde{O} = (\tilde{W}_i, C_i, \tilde{Y}_i)_{i=1,\ldots,N}$, where the conditional distribution $L$ of $W$ given $C$ has been replaced by the user-defined distribution $\tilde{L}$. This distribution $\tilde{L}$ describes the intervention on the treatment vector $W$ that the researcher is interested in. We can then define the EATE as

$$\theta_N^0 = \theta_N^0(1) - \theta_N^0(0),$$

where

$$\theta_N^0(w) := \frac{1}{N} \sum_{i=1}^N \mathbb{E}_{P^N_{\tilde{L}_i(w)}}\big[\, Y_i \mid \mathrm{do}(W = \tilde{L}_i(w)) \,\big],$$

where we use the do-notation of Pearl [11] and

$$\tilde{L}_i(w) = (W_1, \ldots, W_{i-1}, w, W_{i+1}, \ldots, W_N), \quad \text{for } w \in \{0, 1\},$$

represents the intervention on the unit-specific treatment $W_i$ (setting it to the constant $w$), while the distribution of the treatments $W_j$ of the other $N - 1$ units $j$ in the network is left unchanged. In particular, the intervention $\tilde{L}_i(w)$ is independent of $C$. Thus, $\theta_N^0(1)$ evaluates a collection of unit-specific distributions, $(\tilde{L}_1(1), \ldots, \tilde{L}_N(1))$, which cannot be rewritten as a single intervention $\tilde{L}$ on the whole treatment vector $W$. By denoting the EATE by $\theta_N^0$, it remains implicit that it is defined conditional on a specific network $G$, while it is explicit that it is a function of the given sample size $N$. Consequently, the EATE's true value can vary depending on the sample size and the network structure.

To simplify notation, we rewrite the EATE as

$$\theta_N^0 = \frac{1}{N} \sum_{i=1}^N \theta_i^0,$$

where

$$\theta_i^0 = \mathbb{E}_{W_{-i}, C_i, C_{-i}}\Big[ \mathbb{E}\big[ Y_i^{\mathrm{do}(W_i = 1)} - Y_i^{\mathrm{do}(W_i = 0)} \,\big|\, W_{-i}, C_i, C_{-i} \big] \Big],$$

and $W_{-i} = (W_1, \ldots, W_{i-1}, W_{i+1}, \ldots, W_N)$ and $C_{-i} = (C_1, \ldots, C_{i-1}, C_{i+1}, \ldots, C_N)$. Thus, the EATE equals the average of the unit-specific treatment effects $\theta_i^0$, i.e., the expected difference in the outcome $Y_i$ if the treatment was assigned to unit $i$ versus if it was withheld from unit $i$. The unit-specific treatment effects may not be the same for all units because the outcomes may have different distributions conditional on $W_{-i}$ and $C_{-i}$ across units due to the spillover effects. In the setting without spillover effects, the distribution of $Y_i$ does not depend on $W_{-i}$ and $C_{-i}$, for each $i = 1, \ldots, N$, and thus the EATE coincides with the average treatment effect (ATE) [12,13].

We impose the following key assumption (that is standard in this literature [6,10,14]): the spillover effects can be summarized by lower-dimensional features, i.e., we will use domain knowledge-informed features that are arbitrary functions of the network G and the treatment and covariate vectors of all units [15,16]. The features are assumed to capture all pathways through which spillover effects take place. For example, Cai et al. [17] and Leung [18] model the purchase of a weather insurance (outcome) of farmers in rural China as a function of attending a training session (treatment) and the proportion of friends who attend the session (feature on direct neighbors in the network).

In the following, we will assume a structural equation model (SEM) to impose our assumptions on the data-generating mechanism of the joint distribution of $(W_i, C_i, Y_i)_{i=1,\ldots,N}$. The outcome and propensity score models of the SEM may be highly complex and nonsmooth and may include interactions and high-dimensional variables. We then follow an augmented inverse probability weighting (AIPW) [19] approach to estimate the EATE $\theta_N^0$ in the context of this model. We estimate the outcome and propensity score models with arbitrary machine learning algorithms and plug them into our AIPW score identifying $\theta_N^0$. These machine learning estimators may be highly complex and suffer from regularization bias and slow convergence rates. However, the use of sample splitting with cross-fitting [20] allows us to address these issues. Provided the growth of dependencies between units is limited, our estimator of the EATE is consistent, converges at the $\sqrt{N}$-rate, and asymptotically follows a Gaussian distribution. This allows us to construct confidence intervals and p-values.

1.1 Our contribution and comparison to literature

Our work is most related to the literature on semiparametric treatment effect estimation and inference with observational data from a single network. Tchetgen Tchetgen et al. [21] developed a network version of the g-formula [22] and performed outcome regression, assuming that the data can be represented as a chain graph, which is a graphical model that is generally incompatible with our SEM approach [23]. An SEM approach is also used by van der Laan [14], Sofrygin and van der Laan [10], and Ogburn et al. [6]. These works considered a similar model as we do and proposed semiparametric treatment effect estimation by targeted maximum likelihood estimation (TMLE) [24–26]. van der Laan [14] and Ogburn et al. [6] primarily considered global effects that compare two hypothetical interventions on the whole treatment vector. An example of such an effect is the global average treatment effect (GATE), which contrasts the interventions of treating all units of the population versus treating no unit of the population. In contrast, we consider the EATE, which is the average effect of assigning the treatment to one unit versus not, integrating out the treatment selections of the other units. Causal effects like the EATE that summarize the effect of $N$ unit-specific interventions generally cannot be described using a single intervention on the whole treatment vector, as is done for global effects. The behavior of estimators of the EATE under the wrong independent and identically distributed assumption is studied by Sävje et al. [3]. Sofrygin and van der Laan [10] mentioned a possible extension to estimate the EATE with TMLE, but all their results are for global effects such as the GATE. Their theory assumes a bounded entropy integral condition, which is difficult to verify for machine learning methods.

Our contribution includes the following. First, we present a semiparametric, machine learning-based approach to estimate the EATE with observational data from a single network. Our approach enables performing inference, including confidence intervals and p-values. Particularly, we do not require multiple disjoint networks. We develop a cross-fitting algorithm under interference and reason in terms of the dependency graph to explicitly allow for different interactions, also specifically ones that reach beyond immediate neighbors in the network. Second, the limiting asymptotic Gaussian distribution and the optimal $\sqrt{N}$-convergence rate of the EATE estimator are achieved even if the number of ties of a unit diverges asymptotically. To reach this optimal convergence rate for estimating global effects, Ogburn et al. [6] need to uniformly bound the neighborhood size of a unit. Third, our algorithm based on sample splitting is easy to understand and implement, and the user may choose any machine learning algorithm. Fourth, we analyze the Swiss StudentLife Study data [27,28] and estimate the effect of study time on the grade point average (GPA) of freshmen students after their first-year examinations at one of the world's leading universities.

Outline of this article. Section 2 presents the model assumptions, characterizes the treatment effect of interest, outlines the procedures for the point estimation of the EATE and estimation of its variance, and establishes asymptotic results. Section 3 demonstrates our methodological and theoretical developments in a simulation study and on empirical data of the StudentLife Study.

2 Framework and our network AIPW estimator

2.1 Model formulation

We consider $i = 1, \ldots, N$ units interacting in a known undirected network $G$. For each unit $i$, we observe a binary treatment $W_i \in \{0, 1\}$, a univariate outcome $Y_i$, and a possibly multivariate vector of observed covariates $C_i$ that may causally affect $W_i$ and $Y_i$. The outcome $Y_i$ may be dichotomous or continuous, and the potentially multivariate covariates $C_i$ may consist of discrete and continuous components. Irrespective of whether the outcomes are continuous or dichotomous, we can consider the following SEM with additive error terms for $i = 1, \ldots, N$:

(1) $$\begin{aligned} C_i &\leftarrow \varepsilon_{C_i}, \\ Z_i &\leftarrow \big(f_z^1(C_{-i}, G), \ldots, f_z^t(C_{-i}, G)\big), \\ W_i &\leftarrow h^0(C_i, Z_i) + \varepsilon_{W_i}, \\ X_i &\leftarrow \big(f_x^1(W_{-i}, C_{-i}, G), \ldots, f_x^r(W_{-i}, C_{-i}, G)\big), \\ Y_i &\leftarrow W_i\, g_1^0(C_i, X_i) + (1 - W_i)\, g_0^0(C_i, X_i) + \varepsilon_{Y_i}, \end{aligned}$$

where the errors $\varepsilon_{W_i}$ and $\varepsilon_{Y_i}$ are jointly independent, we have $\mathbb{E}[\varepsilon_{W_i} \mid C_i, Z_i] = 0$ and $\mathbb{E}[\varepsilon_{Y_i} \mid W_i, C_i, X_i] = 0$, the errors satisfy the assumptions in the study by Bühlmann [29] (required for the bootstrap variance results in Appendix F), and the $\varepsilon_{W_i}$'s are identically distributed and the $\varepsilon_{Y_i}$'s are identically distributed (required for the alternative variance results in Appendix G). We note that the identical distribution of the error terms is only required for our approach to estimate standard errors. The vector $C_{-i} = (C_1, C_2, \ldots, C_{i-1}, C_{i+1}, \ldots, C_N)$ denotes the vector of covariates of the units $j \neq i$, and $W_{-i}$ is defined similarly. The binary treatments $W_i$ can be thought of as $\mathrm{Bernoulli}(h^0(C_i, Z_i))$ realizations. A constant $h^0$ corresponds to a Bernoulli experiment. This SEM encodes the assumption that the covariates $C_i$ and features $X_i$ suffice to control for confounding of the effect of the treatment on the outcome. The propensity score function $h^0(\cdot, \cdot)$ and the outcome model consisting of $g_1^0(\cdot, \cdot)$ and $g_0^0(\cdot, \cdot)$ are fixed but unknown functions that all units share. Nevertheless, the distribution of the responses may differ across units due to the $X$-spillover that captures effects from, e.g., a unit's neighbors' covariates and treatment assignments, as described next. Because every unit may have a different number of neighbors, the $X_i$'s may follow different distributions across units, resulting in non-identical distributions of the responses across units. Furthermore, the individual equations in (1) have to be understood in a distributional sense in that, if, e.g., $g_1^0 \equiv 0 \equiv g_0^0$, we have $Y_i = \varepsilon_{Y_i}$ in distribution only.

The functions $f_z^l$, $l \in [t]$, and $f_x^l$, $l \in [r]$, which are shared by all units and used to build the $Z$- and $X$-features, are assumed to be known, and their concatenations are assumed to be of fixed dimensions $t$ and $r$, respectively. This is analogous to the in-practice considerations in Ogburn et al. [6]. We also allow for features of further-degree neighbors: for example, $f_x^1$ might capture the fraction of treated units that are at a distance of 2 from a given unit in the network $G$. Making use of an implied dependency graph gives a more transparent formulation (Section 2.3). Since the network $G$ is undirected, our spillover effects are assumed to be reciprocal, i.e., if unit $i$ receives spillover effects from unit $j$ through $W_j$ and/or $C_j$, then unit $j$ also receives spillover effects from unit $i$ through $W_i$ and/or $C_i$. Example 1 illustrates the construction of 2-dimensional $X$-features. Importantly, the $X$- and $Z$-features render the unit-level data dependent. In addition, the distributions of propensity scores and outcomes are not generally identical across units due to distributional differences of these features.

Figure 1

Network on nine units where the node label represents the number of a unit. Gray nodes receive the treatment, corresponding to $W_i = 1$, and white ones do not, corresponding to $W_i = 0$.

Example 1

Consider the network in Figure 1, where gray nodes take the treatment and white ones do not. We choose $r = 2$ $X$-features and discard any influence of $C_j$ in $X_i$, i.e., $f_x^l(\{(W_j, C_j)\}_{j \in [N] \setminus \{i\}}, G) = f_x^l(\{W_j\}_{j \in [N] \setminus \{i\}}, G)$ for $l = 1, 2$. Given a unit $i$, we choose the first feature in $X_i$ as the fraction of treated neighbors of unit $i$ and the second feature as the fraction of treated neighbors of neighbors of $i$. Let us consider unit $i = 6$ in Figure 1. Its neighbors are units 2, 5, and 7, and its neighbors of neighbors are units 1 and 3 (neighbors of unit 2) and unit 8 (neighbor of unit 7), where we exclude $i = 6$ from its second-degree neighborhood by definition. Therefore, we have $X_6 = (1/3, 2/3)$ because one out of three neighbors is treated and two out of three neighbors of neighbors are treated. The whole $9 \times 2$-dimensional $X$-feature matrix is obtained by applying the same computations to all other units $i$.
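As a concrete reference, the following is a minimal R sketch of this feature construction, assuming an igraph object G (the network of Figure 1) and a 0/1 treatment vector W; the function name and the reading of "neighbors of neighbors" as units at distance exactly 2 are ours, for illustration only.

```r
library(igraph)

x_features <- function(G, W) {
  t(sapply(seq_len(vcount(G)), function(i) {
    nb1 <- neighbors(G, i)                                 # direct neighbors of unit i
    nb2 <- ego(G, order = 2, nodes = i, mindist = 2)[[1]]  # units at distance exactly 2
    c(if (length(nb1) > 0) mean(W[nb1]) else 0,            # fraction of treated neighbors
      if (length(nb2) > 0) mean(W[nb2]) else 0)            # fraction of treated neighbors of neighbors
  }))
}

# For the network and treatments of Figure 1, row 6 of x_features(G, W)
# should equal c(1/3, 2/3), as computed in Example 1.
```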

2.2 Treatment effect and identification

Plugging in the outcome equation of the SEM (1), we can rewrite the treatment effect of interest, the EATE, as

(2) $$\theta_N^0 = \frac{1}{N} \sum_{i=1}^N \mathbb{E}_{W_{-i}, C_i, C_{-i}}\Big[ \mathbb{E}\big[ Y_i^{\mathrm{do}(W_i = 1)} - Y_i^{\mathrm{do}(W_i = 0)} \,\big|\, W_{-i}, C_i, C_{-i} \big] \Big] = \frac{1}{N} \sum_{i=1}^N \mathbb{E}_{C_i, X_i}\Big[ \mathbb{E}_{\varepsilon_{Y_i}}\big[ Y_i \,\big|\, \mathrm{do}(W_i = 1), C_i, X_i \big] - \mathbb{E}_{\varepsilon_{Y_i}}\big[ Y_i \,\big|\, \mathrm{do}(W_i = 0), C_i, X_i \big] \Big] = \frac{1}{N} \sum_{i=1}^N \mathbb{E}_{C_i, X_i}\big[ g_1^0(C_i, X_i) - g_0^0(C_i, X_i) \big],$$

where we obtain that the unit-specific treatment effect of unit $i$ is $\theta_i^0 = \mathbb{E}_{C_i, X_i}[g_1^0(C_i, X_i) - g_0^0(C_i, X_i)]$. Particularly, we assume that, given the observable confounders $C_i$ and features $X_i$, we can replace the do-operator by respective conditioning. The expectation $\mathbb{E}_{C_i, X_i}$ over $C_i$ and $X_i$ is with respect to the observational distributions of $C_i$ and $X_i$, as defined by the SEM (1). This notation makes explicit that the EATE is conditional on $N$, whereas it remains implicit that it is also conditional on the network $G$. We refer to the study by Ogburn et al. [6] for a discussion of the interpretation of such conditional effects.

Estimating $g_1^0$ and $g_0^0$ by regression machine learning algorithms and plugging them into (2) would not result in a parametric convergence rate and an asymptotic Gaussian distribution of the so-obtained estimator. To obtain asymptotic normality with convergence at the $\sqrt{N}$-rate, a centered correction term involving the propensity score $h^0$ is added to $g_1^0(C_i, X_i) - g_0^0(C_i, X_i)$, and we can identify the EATE as follows.

Lemma 2.1

Let $i \in [N]$. Let

(3) $$S_i = (C_i, Z_i, W_i, X_i, Y_i)$$

be the concatenation of the observed variables for unit $i$. For concatenations $\eta = (g_1, g_0, h)$ of general nuisance functions $g_1$, $g_0$, and $h$, consider the score

(4) $$\varphi(S_i, \eta) = g_1(C_i, X_i) - g_0(C_i, X_i) + \frac{W_i}{h(C_i, Z_i)}\big(Y_i - g_1(C_i, X_i)\big) - \frac{1 - W_i}{1 - h(C_i, Z_i)}\big(Y_i - g_0(C_i, X_i)\big),$$

including the aforementioned correction term. For the true nuisance functions $\eta^0 = (g_1^0, g_0^0, h^0)$, we have $\mathbb{E}[\varphi(S_i, \eta^0)] = \theta_i^0$ and can consequently identify the EATE (2) by

(5) $$\theta_N^0 = \frac{1}{N} \sum_{i=1}^N \mathbb{E}[\varphi(S_i, \eta^0)].$$

The aforementioned expectation is with respect to the law of $S_i$, but we omit this dependence for notational simplicity.

The proof of Lemma 2.1 is provided in Appendix E. Based on this lemma, we will present our estimator of $\theta_N^0$ in Section 2.4. The true nuisance functions $\eta^0 = (g_1^0, g_0^0, h^0)$ are not of statistical interest, but they have to be estimated to build an estimator of $\theta_N^0$, and we will estimate them using regression machine learning algorithms. Such machine learning estimators might suffer from regularization bias and converge more slowly than at the $\sqrt{N}$-rate. However, the two correction terms $\frac{W_i}{h(C_i, Z_i)}(Y_i - g_1(C_i, X_i))$ and $\frac{1 - W_i}{1 - h(C_i, Z_i)}(Y_i - g_0(C_i, X_i))$ make the score $\varphi$ Neyman orthogonal, which counteracts the effect of regularization bias. Moreover, the machine learning estimators are only required to converge at a moderate rate; please see Section 2.4 for further details.
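For reference, the following is a minimal R helper evaluating the score (4) elementwise, assuming vectors g1hat and g0hat of outcome-regression predictions at $(C_i, X_i)$ and a vector hhat of propensity predictions at $(C_i, Z_i)$; the function and argument names are ours, for illustration only.

```r
aipw_score <- function(Y, W, g1hat, g0hat, hhat) {
  # AIPW score (4): outcome-regression difference plus the two correction terms
  g1hat - g0hat +
    W / hhat * (Y - g1hat) -
    (1 - W) / (1 - hhat) * (Y - g0hat)
}
```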

Scharfstein et al. [30] and Bang and Robins [31] considered a similar score φ for causal effect estimation and inference under the SUTVA assumption, and their function is based on the influence function for the mean for missing data from Robins and Rotnitzky [32]. Moreover, it is also used to compute the AIPW estimator under SUTVA, and our score φ defined in (4) coincides with one of the AIPW approaches under SUTVA if we omit the X - and Z -spillover features. In this case, we can reformulate φ as

$$\varphi(S_i, \eta^0) = \frac{W_i Y_i}{e(C_i)} - \frac{(1 - W_i) Y_i}{1 - e(C_i)} - \frac{W_i - e(C_i)}{e(C_i)\big(1 - e(C_i)\big)} \Big( \big(1 - e(C_i)\big)\, \mathbb{E}[Y_i \mid W_i = 1, C_i] + e(C_i)\, \mathbb{E}[Y_i \mid W_i = 0, C_i] \Big),$$

where $e(C_i) = \mathbb{E}[W_i \mid C_i] = h^0(C_i)$ denotes the propensity score, $\mathbb{E}[Y_i \mid W_i = 1, C_i] = g_1^0(C_i, X_i)$, and $\mathbb{E}[Y_i \mid W_i = 0, C_i] = g_0^0(C_i, X_i)$. This equivalence remains true if the true nuisance functions are replaced by their estimators.

2.3 Dependency graph

Depending on the feature functions that are used, if an edge connects two units in the network G , the units may be dependent. However, the absence of an edge in G does not necessarily imply independence of the respective units. Subsequently, we present a second graph where the presence of an edge represents dependence and its absence independence of the variables of the two respective units. Our theoretical results will be established based on this so-called dependency graph [3]. Example 2 illustrates the concept.

Definition 1

Dependency graph on $S_i$, $i \in [N]$ [3]: the dependency graph $G_D = (V, E_D)$ on the unit-level data $S_i$, $i \in [N]$, defined in (3), is an undirected graph on the node set $V$ of the network $G = (V, E)$ with a potentially larger edge set $E_D$ than $E$. An undirected edge $\{i, j\}$ between two nodes $i$ and $j$ from $V$ belongs to $E_D$ if at least one of the following two conditions holds: (1) there exists an $m \in [N] \setminus \{i, j\}$ such that $W_m$ and/or $C_m$ are present in both $X_i$ and $X_j$ or are present in both $Z_i$ and $Z_j$; or (2) $W_i$ is present in $X_j$, or $C_i$ is present in $X_j$ or in $Z_j$. That is, units $i$ and $j$ receive spillover effects from at least one common third unit, or they receive spillover effects from each other.

Figure 2

Network $G$ on four units (left), where the spillover effects come from the treatments of the direct neighbors, which results in a distance-two dependence, displayed in the corresponding dependency graph $G_D$ (middle). The underlying causal DAG is displayed on the right, where arrows due to $X$-spillover effects are gray.

Example 2

Consider the chain-shaped network $G$ in Figure 2 on the left. We consider as the 1-dimensional $X$-spillover effect the fraction of treated direct neighbors in the network $G$ and no $Z$-spillover. The resulting dependency graph $G_D$ is displayed in the middle of Figure 2. In $G_D$, unit 2 shares an edge with units 1 and 3 because these units are neighbors of 2 in the network. Unit 2 also shares an edge with 4 in $G_D$ because it shares its neighbor 3 with unit 4. The right-hand panel of Figure 2 displays the causal directed acyclic graph (DAG) on all units corresponding to this model, including confounders $C$. Due to the definition of the $X$-spillover effect, we have $X_1 = W_2$ and $X_4 = W_3$. Consequently, using graphical criteria [33–37], we infer that the unit-level data $S_1 = (C_1, W_1, X_1, Y_1)$ are independent of $S_4 = (C_4, W_4, X_4, Y_4)$.
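In this example, where the only spillover is through the treatments of direct neighbors, two units are dependent exactly when they are within distance 2 of each other in the network, so the dependency graph is the distance-2 closure of $G$. A minimal R sketch using igraph (with the chain network of Figure 2 as an assumed input) is:

```r
library(igraph)

G  <- make_ring(4, circular = FALSE)            # chain network 1-2-3-4 of Figure 2
GD <- simplify(connect(G, order = 2))           # dependency graph G_D: join all pairs within distance 2
max(degree(GD))                                  # maximal degree of G_D, used in the asymptotics of Section 2.4
```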

The dependency graph is a function of the network $G$ as well as of the $Z$- and $X$-features. Constraining the growth of the maximal degree of this graph allows us to obtain a CLT result for our treatment effect estimator.

2.4 Estimation procedure and asymptotics

Subsequently, we describe our estimation procedure and its asymptotic properties. We use sample splitting and cross-fitting to estimate the EATE $\theta_N^0$ identified by equation (5) as follows. We randomly partition $[N]$ into $K \ge 2$ sets of approximately equal size that we call $I_1, \ldots, I_K$. We split the unit-level data according to this partition into the sets $S_{I_k} = \{S_i\}_{i \in I_k}$, $k \in [K]$. For each $k \in [K]$, we perform the following steps. First, we estimate the nuisance functions $g_1^0$, $g_0^0$, and $h^0$ on the complement set of $S_{I_k}$, which we define as

(6) $$S_{I_k^c} = \{S_j\}_{j \in [N]} \setminus \big( S_{I_k} \cup \{ S_m \mid \exists\, i \in I_k : (i, m) \in E_D \} \big),$$

where $E_D$ denotes the edge set of the dependency graph $G_D$. Particularly, $S_{I_k^c}$ consists of unit-level data $S_j$ from units $j$ that do not share an edge with any unit $i \in I_k$ in the dependency graph. Consequently, the set $S_{I_k^c}$ contains all $S_j$'s that are independent of the data in $S_{I_k}$. To estimate $g_1^0$, we select the $S_i$'s from $S_{I_k^c}$ whose $W_i$ equals 1 and regress the corresponding outcomes $Y_i$ on the confounders $C_i$ and the features $X_i$, which yields the estimator $\hat g_1^{I_k^c}$. Similarly, to estimate $g_0^0$, we select the $S_i$'s from $S_{I_k^c}$ whose $W_i$ equals 0 and perform an analogous regression, which yields the estimator $\hat g_0^{I_k^c}$. To estimate $h^0$, we use the whole set $S_{I_k^c}$ and regress $W_i$ on the confounders $C_i$ and the features $Z_i$, which yields the estimator $\hat h^{I_k^c}$.[2] These regressions may be carried out with any machine learning algorithm. We concatenate these nuisance function estimators into the nuisance parameter estimator $\hat\eta^{I_k^c} = (\hat g_1^{I_k^c}, \hat g_0^{I_k^c}, \hat h^{I_k^c})$ and plug it into $\varphi$ defined in (4). We then evaluate the so-obtained function $\varphi(\cdot, \hat\eta^{I_k^c})$ on the data $S_{I_k}$, which yields the terms $\varphi(S_i, \hat\eta^{I_k^c})$ for $i \in I_k$, i.e., we evaluate $\varphi(\cdot, \hat\eta^{I_k^c})$ on unit-level data $S_i$ that are independent of the data that were used to estimate the nuisance parameter $\hat\eta^{I_k^c}$. Finally, we estimate the EATE by the cross-fitting estimator

(7) $$\hat\theta = \frac{1}{K} \sum_{k=1}^K \frac{1}{|I_k|} \sum_{i \in I_k} \varphi(S_i, \hat\eta^{I_k^c}),$$

that averages over all $K$ folds. The estimator $\hat\theta$ converges at the parametric rate, $N^{-1/2}$, and asymptotically follows a Gaussian distribution with limiting variance $\sigma_\infty^2$, as stated in Theorem 2.2.

The partition $I_1, \ldots, I_K$ is random. To alleviate the effect of this randomness, the whole procedure is repeated $B$ times, and the median of the individual point estimators over the $B$ repetitions is our final estimator of $\theta_N^0$. The asymptotic results for this median estimator remain the same as for $\hat\theta$ (see Chernozhukov et al. [20]). For each repetition $b \in [B]$, we compute a point estimator $\hat\theta_b$, a variance estimator $\hat\sigma_{\infty,b}^2$ (for details, please see Section 2.5), and a p-value $p_b$ for the two-sided test $H_0: \theta_N^0 = 0$ versus $H_A: \theta_N^0 \neq 0$. The $B$ p-values $p_1, \ldots, p_B$ from the individual repetitions are aggregated according to

$$p_{\mathrm{aggr}}^0 = 2\, \mathrm{median}_{b \in [B]}(p_b).$$

This aggregation scheme yields a valid overall p-value for the same two-sided test [38]. The corresponding confidence interval is constructed as

(8) $$\mathrm{CI}(\hat\theta) = \big\{ \theta \in \mathbb{R} \mid p_{\mathrm{aggr}}^{\theta} \text{ of testing } H_0: \theta_N^0 = \theta \text{ vs } H_A: \theta_N^0 \neq \theta \text{ satisfies } p_{\mathrm{aggr}}^{\theta} > \alpha \big\},$$

where typically $\alpha = 0.05$. This set contains all values $\theta$ for which the null hypothesis $H_0: \theta_N^0 = \theta$ cannot be rejected at level $\alpha$ against the two-sided alternative $H_A: \theta_N^0 \neq \theta$.

Next, we describe how $\mathrm{CI}(\hat\theta)$ can easily be computed. Due to the asymptotic result of Theorem 2.2, the aggregated p-value $p_{\mathrm{aggr}}^{\theta}$ for $\theta \in \mathbb{R}$ can be represented as

$$p_{\mathrm{aggr}}^{\theta} = 4\, \mathrm{median}_{b \in [B]}\Big( 1 - \Phi\big( \sqrt{N}\, \hat\sigma_{\infty,b}^{-1}\, |\hat\theta_b - \theta| \big) \Big),$$

where $\Phi$ denotes the cumulative distribution function of a standard Gaussian random variable. Consequently, we have

$$p_{\mathrm{aggr}}^{\theta} > \alpha \iff \Phi^{-1}\Big(1 - \frac{\alpha}{4}\Big) > \mathrm{median}_{b \in [B]}\big( \sqrt{N}\, \hat\sigma_{\infty,b}^{-1}\, |\hat\theta_b - \theta| \big),$$

which can be solved for feasible values of θ using root search. A full description of our method is presented in Algorithm 1.
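As an illustration of this root search, the following is a minimal R sketch, assuming vectors theta_hat_b and sigma_hat_b holding the $B$ split-specific point estimates and standard error estimates; the function names, the cap of the p-value at 1, and the bracketing width are ours and not part of the method.

```r
p_aggr <- function(theta, theta_hat_b, sigma_hat_b, N) {
  # aggregated p-value for H_0: theta_N^0 = theta (capped at 1; the cap does not affect the interval)
  min(1, 4 * median(1 - pnorm(sqrt(N) / sigma_hat_b * abs(theta_hat_b - theta))))
}

ci_endpoints <- function(theta_hat_b, sigma_hat_b, N, alpha = 0.05, width = 10) {
  center <- median(theta_hat_b)
  f <- function(theta) p_aggr(theta, theta_hat_b, sigma_hat_b, N) - alpha
  # assumes the confidence set is non-empty, contains the median point estimate,
  # and lies within +/- width of it
  c(lower = uniroot(f, c(center - width, center))$root,
    upper = uniroot(f, c(center, center + width))$root)
}
```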

Algorithm 1: Estimating the EATE from observational data on networks with spillover effects using plugin machine learning
Input: $N$ unit-level observations $S_i = (W_i, C_i, X_i, Z_i, Y_i)$ from the model (1), network $G$, feature functions $f_x^l$, $l \in [r]$, and $f_z^l$, $l \in [t]$, corresponding dependency graph $G_D$, natural number $K$, natural number $B$, significance level $\alpha \in [0, 1]$, machine learning algorithms.
Output: Estimator of the EATE $\theta_N^0$ and a valid p-value and confidence interval for the two-sided test $H_0: \theta_N^0 = 0$ vs $H_A: \theta_N^0 \neq 0$.
1  for $b \in [B]$ do
2      Randomly split the index set $[N]$ into $K$ sets $I_1, \ldots, I_K$ of approximately equal size.
3      for $k \in [K]$ do
4          Compute the nuisance function estimators $\hat g_1^{I_k^c}$, $\hat g_0^{I_k^c}$, and $\hat h^{I_k^c}$ with the machine learning algorithm and data from $S_{I_k^c}$.
5      end
6      Compute the point estimator of $\theta_N^0$ according to (7), and call it $\hat\theta_b$.
7      Estimate the asymptotic variance of $\hat\theta_b$ using the bootstrap procedure described in Section 2.5 (or according to Theorem G.1 in Appendix G), and call it $\hat\sigma_{\infty,b}^2$.
8      Compute the p-value $p_b$ for the two-sided test $H_0: \theta_N^0 = 0$ vs $H_A: \theta_N^0 \neq 0$ using $\hat\theta_b$, $\hat\sigma_{\infty,b}^2$, and the asymptotic Gaussian approximation.
9  end
10 Compute $\hat\theta = \mathrm{median}_{b \in [B]}(\hat\theta_b)$.
11 Compute the aggregated p-value $p_{\mathrm{aggr}}^0 = 2\, \mathrm{median}_{b \in [B]}(p_b)$.
12 Compute the confidence interval according to (8), and call it $\mathrm{CI}(\hat\theta)$.
13 Return $\hat\theta$, $p_{\mathrm{aggr}}^0$, $\mathrm{CI}(\hat\theta)$.
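To make the cross-fitting step and the use of the complement sets $S_{I_k^c}$ concrete, here is a minimal R sketch of one repetition (one value of $b$) of Algorithm 1. It assumes that Y and W are length-$N$ vectors, that CX and CZ are data frames holding the regressors $(C_i, X_i)$ and $(C_i, Z_i)$ with column names distinct from "Y" and "W", and that GD is the dependency graph as an igraph object. Random forests from the ranger package stand in for an arbitrary machine learning algorithm; all function and argument names are ours, for illustration only.

```r
library(ranger)
library(igraph)

network_aipw <- function(Y, W, CX, CZ, GD, K = 5) {
  N     <- length(Y)
  folds <- split(sample(seq_len(N)), rep_len(seq_len(K), N))
  adj   <- as_adjacency_matrix(GD, sparse = FALSE)
  fold_means <- numeric(0)
  for (Ik in folds) {
    # S_{I_k^c}: units neither in the fold nor adjacent to it in GD (eq. (6))
    nb  <- which(rowSums(adj[, Ik, drop = FALSE]) > 0)
    Ikc <- setdiff(seq_len(N), union(Ik, nb))
    # nuisance estimators fitted on the complement set only
    d1   <- data.frame(Y = Y, CX)[Ikc, ][W[Ikc] == 1, ]
    d0   <- data.frame(Y = Y, CX)[Ikc, ][W[Ikc] == 0, ]
    dh   <- data.frame(W = W, CZ)[Ikc, ]
    fit1 <- ranger(Y ~ ., data = d1, num.trees = 500, min.node.size = 5)
    fit0 <- ranger(Y ~ ., data = d0, num.trees = 500, min.node.size = 5)
    fith <- ranger(W ~ ., data = dh, num.trees = 500)  # regression forest on the 0/1 treatment
    # evaluate the fitted nuisance functions on the held-out fold I_k
    g1 <- predict(fit1, data = CX[Ik, , drop = FALSE])$predictions
    g0 <- predict(fit0, data = CX[Ik, , drop = FALSE])$predictions
    h  <- predict(fith, data = CZ[Ik, , drop = FALSE])$predictions
    h  <- pmin(pmax(h, 0.01), 0.99)  # crude overlap safeguard, not part of the original method
    # AIPW score (4) evaluated on the held-out fold, then averaged within the fold
    phi <- g1 - g0 + W[Ik] / h * (Y[Ik] - g1) -
      (1 - W[Ik]) / (1 - h) * (Y[Ik] - g0)
    fold_means <- c(fold_means, mean(phi))
  }
  mean(fold_means)  # cross-fitting estimator (7)
}
```

Repeating this routine over $B$ random splits and taking the median of the resulting point estimates yields the final estimator of Algorithm 1.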

Before we present our main theorem, which we already referred to in the construction of the confidence intervals above, we present and discuss key assumptions. First, we require that products of machine learning errors decay fast enough, namely,

$$\big\| h^0(C_i, Z_i) - \hat h^{I_k^c}(C_i, Z_i) \big\|_{P,2} \cdot \Big( \big\| g_1^0(C_i, X_i) - \hat g_1^{I_k^c}(C_i, X_i) \big\|_{P,2} + \big\| g_0^0(C_i, X_i) - \hat g_0^{I_k^c}(C_i, X_i) \big\|_{P,2} + \big\| h^0(C_i, Z_i) - \hat h^{I_k^c}(C_i, Z_i) \big\|_{P,2} \Big) \ll N^{-\frac{1}{2}}$$

(see Assumption A4 in the appendix for more details). In particular, the individual error terms may vanish at a rate smaller than $N^{-\frac{1}{4}}$. This is achieved by many machine learning methods under suitable assumptions (see, for instance, Chernozhukov et al. [20]): $\ell_1$-penalized and related methods in a variety of sparse models [39–44], forward selection in sparse models [45], $L_2$-boosting in sparse linear models [46], a class of regression trees and random forests [47], and neural networks [48]. Second, to ensure enough sparsity in the dependency structure of the data, the maximal degree $d_{\max}$ in the dependency graph is assumed to grow at most at the rate $d_{\max} = o(N^{\frac{1}{4}})$, which implies that the dependencies are not too far reaching. This assumption allows us to bound the Wasserstein distance of our (centered and scaled) treatment effect estimator to a standard Gaussian random variable using Stein's method [49].

Assumption 1

The maximal degree $d_{\max}$ of a node in the dependency graph satisfies $d_{\max} = o(N^{\frac{1}{4}})$.

Ogburn et al. [6] only required $d_{\max} = o(N^{\frac{1}{2}})$, but achieved a slower convergence rate of their treatment effect estimator. To recover the $\sqrt{N}$-rate, they require that $d_{\max}$ is bounded by a constant, meaning $d_{\max} = O(1)$.

Furthermore, we require that this dependency structure is not too strong moment-wise in the sense that the variance term given in the following assumption converges.

Assumption 2

Let $\{\mathcal{P}_N\}_{N \ge 1}$ be a sequence of sets of probability distributions $P$ of the $N$ units. There exists $\sigma_\infty^2$, possibly depending on $P \in \mathcal{P}_N$, satisfying $0 < L \le \sigma_\infty^2 \le U < \infty$ with fixed constants $L$ and $U$, such that for all $P \in \mathcal{P}_N$, we have

(9) $$\lim_{N \to \infty} \left( \mathrm{Var}\left( \frac{1}{\sqrt{N}} \sum_{i=1}^N \psi(S_i, \theta_i^0, \eta^0) \right) - \sigma_\infty^2 \right) = 0,$$

where $\psi(S_i, \theta_i^0, \eta^0) = \varphi(S_i, \eta^0) - \theta_i^0$ is a centered version of $\varphi$.

Assuming bounded second moments, $\sum_{j=1}^N \mathrm{Cov}\big(\psi(S_i, \theta_i^0, \eta^0), \psi(S_j, \theta_j^0, \eta^0)\big)$ can be bounded, up to constants, by $d(i)$, where $d(i)$ denotes the degree of node $i$ in the dependency graph. Consequently, we have

(10) $$\mathrm{Var}\left( \frac{1}{\sqrt{N}} \sum_{i=1}^N \psi(S_i, \theta_i^0, \eta^0) \right) \le \gamma\, \frac{1}{N} \sum_{i=1}^N d(i),$$

where $\gamma$ denotes some universal constant. Subsequently, we consider two special cases. First, if the maximal degree of the dependency graph is uniformly bounded by some constant $D$, we can bound (10) by the constant $\gamma D$. Second, assume that the dependency graph has some nodes with finite degree, $d(i) \le D$ for $i$ in some set $S_{\max}^c$, while the degrees $d(i)$ of the other nodes $i \in S_{\max}$ are bounded by $d_{\max} = o(N^{\frac{1}{4}})$ with $|S_{\max}| = O(N/d_{\max}) = O(N^{\frac{3}{4}})$. Then, (10) is also of bounded order $O(1)$.

Theorem 2.2

(Asymptotic distribution of $\hat\theta$) Assume Assumptions 1 and 2 as well as A3 and A4 stated in Section A of the appendix. Then, the estimator $\hat\theta$ of the EATE $\theta_N^0$ given in (7) converges at the parametric rate, $N^{-1/2}$, and asymptotically follows a Gaussian distribution, namely,

(11) $$\sqrt{N}\, \sigma_\infty^{-1} \big(\hat\theta - \theta_N^0\big) \xrightarrow{d} \mathcal{N}(0, 1) \quad (N \to \infty),$$

where $\sigma_\infty$ is characterized in Assumption 2. The convergence in (11) in fact holds uniformly over the laws $P \in \mathcal{P}_N$ ($N \to \infty$).

Please see Section E in the appendix for a proof of Theorem 2.2. The asymptotic variance $\sigma_\infty^2$ in Theorem 2.2 can be consistently estimated using a bootstrap approach (see Section 2.5). Alternatively, it is possible to consistently estimate it using a plugin approach (see Theorem G.1 in Appendix G). However, empirical simulations have revealed that the bootstrap procedure described in the next section performs better.

Our estimator $\hat\theta$ is robust in two senses. First, it is $\sqrt{N}$-consistent and asymptotically normal provided only that the product property (9) of the machine learning estimators holds. Second, it can be shown that it remains consistent if either the propensity model or the outcome model is correctly specified. These properties are also called rate double robustness and model double robustness, respectively [50].

2.5 Bootstrap variance estimator

We use the residual bootstrap as follows to estimate the asymptotic variance. First, we use the estimated nuisance functions to compute the outcome regression residuals. More precisely, for $i \in [N]$, denote by $k(i)$ the index in $[K]$ specifying the partition unit $i$ belongs to, namely, $i \in I_{k(i)}$. Then, we estimate the $\varepsilon_Y$'s by the centered residuals $\hat\varepsilon_{Y_i} = \tilde\varepsilon_{Y_i} - \frac{1}{N}\sum_{j=1}^N \tilde\varepsilon_{Y_j}$, where $\tilde\varepsilon_{Y_i} = Y_i - W_i\, \hat g_1^{I_{k(i)}^c}(C_i, X_i) - (1 - W_i)\, \hat g_0^{I_{k(i)}^c}(C_i, X_i)$. Next, we sample confounders $\{C_i^*\}_{i \in [N]}$ with replacement from $\{C_i\}_{i \in [N]}$, and we sample $\hat\varepsilon_{Y_i}^*$ with replacement from $\{\hat\varepsilon_{Y_i}\}_{i \in [N]}$. These sampled covariates and error terms are then propagated through the SEM (1), i.e., we compute $Z_i^* = (f_z^1(C_{-i}^*, G), \ldots, f_z^t(C_{-i}^*, G))$, sample $W_i^* \sim \mathrm{Bernoulli}\big(\hat h^{I_{k(i)}^c}(C_i^*, Z_i^*)\big)$, compute $X_i^* = (f_x^1(W_{-i}^*, C_{-i}^*, G), \ldots, f_x^r(W_{-i}^*, C_{-i}^*, G))$, and build $Y_i^* = W_i^*\, \hat g_1^{I_{k(i)}^c}(C_i^*, X_i^*) + (1 - W_i^*)\, \hat g_0^{I_{k(i)}^c}(C_i^*, X_i^*) + \hat\varepsilon_{Y_i}^*$. Subsequently, we concatenate these values to obtain the bootstrap datapoints $S_i^* = (C_i^*, Z_i^*, W_i^*, X_i^*, Y_i^*)$, $i \in [N]$. Then, we apply our treatment effect estimation procedure to the $S_i^*$'s to obtain a bootstrap estimator $\hat\theta^*$. This procedure is repeated $R$ times, and the bootstrap variance estimator is given by the empirical variance of the $\hat\theta_r^*$ over $r \in [R]$.
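The following is a minimal R sketch of one replicate of this residual bootstrap. It assumes functions g1hat(C, X), g0hat(C, X), and hhat(C, Z) that evaluate the fitted nuisance estimators (each unit's prediction coming from the complement of its own fold), feature maps fx(W, C, G) and fz(C, G) returning the X- and Z-features for all units, a data frame C of confounders, and the estimator network_aipw() sketched after Algorithm 1; all names are ours, for illustration only.

```r
bootstrap_replicate <- function(Y, W, C, G, GD, g1hat, g0hat, hhat, fx, fz, K = 5) {
  N   <- length(Y)
  X   <- fx(W, C, G)
  eps <- Y - W * g1hat(C, X) - (1 - W) * g0hat(C, X)   # outcome regression residuals
  eps <- eps - mean(eps)                               # centred residuals
  Cb  <- C[sample(N, replace = TRUE), , drop = FALSE]  # resampled confounders
  Zb  <- fz(Cb, G)                                     # propagate through the SEM (1)
  Wb  <- rbinom(N, 1, pmin(pmax(hhat(Cb, Zb), 0), 1))  # resampled treatments
  Xb  <- fx(Wb, Cb, G)
  Yb  <- Wb * g1hat(Cb, Xb) + (1 - Wb) * g0hat(Cb, Xb) +
         sample(eps, N, replace = TRUE)                # rebuilt outcomes
  # re-apply the full estimation procedure to the bootstrap sample
  network_aipw(Yb, Wb, data.frame(Cb, Xb), data.frame(Cb, Zb), GD, K)
}

# The variance estimate is the empirical variance of R such replicates, e.g.
# var(replicate(300, bootstrap_replicate(Y, W, C, G, GD, g1hat, g0hat, hhat, fx, fz))).
```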

Theorem 2.3

The bootstrap scheme described in Section 2.5 consistently estimates the asymptotic variance (9) under Assumption A7 stated in the Appendix.

The proof of Theorem 2.3 can be found in Appendix F.

3 Empirical validation

We demonstrate our method in a simulation study and on a real-world dataset. In the simulation study, we validate the performance of our method on different network structures and compare it to two popular treatment effect estimators. Then, we investigate the effect of study time on exam performance in the Swiss StudentLife Study [27,28] taking into account the effect of social ties.

3.1 Simulation study

We investigate a fairly simple data-generating mechanism with 1-dimensional $X$-features and no $Z$-features. The $X$-interference effects a unit receives come from an interaction between the treatments and the confounders of its immediate neighbors in the network (we consider Erdős–Rényi and Watts–Strogatz networks). We compare the performance of our method to two popular off-the-shelf alternative schemes with respect to the bias of the point estimator and the coverage and length of the respective two-sided confidence intervals: the Hájek estimator and an inverse propensity weighting estimator. Our aim is to illustrate that these standard estimators may suffer in the presence of interference and to demonstrate that our easy-to-implement estimator overcomes their shortcomings.

We first describe the two competitors and then detail the simulation setting and present the results. Our code is available on GitHub (https://github.com/corinne-rahel/networkAIPW).

The Hájek estimator (denoted by “Hajek” in Figure 4) without incorporation of confounders [51] equals

$$\frac{1}{N} \sum_{i=1}^N \left( \frac{W_i Y_i}{\frac{1}{N}\sum_{j=1}^N W_j} - \frac{(1 - W_i)\, Y_i}{\frac{1}{N}\sum_{j=1}^N (1 - W_j)} \right).$$

The parametric convergence rate and asymptotic Gaussian distribution of the Hájek estimator are preserved under $X$-spillover effects that equal the fraction of treated neighbors in a randomized experiment [52]. The IPW estimator [53] has been developed under SUTVA and accounts for observed confounding by creating a "pseudo-population" in which the treatment is independent of the confounders [54]. We compute it using sample splitting and cross-fitting according to

$$\frac{1}{K} \sum_{k=1}^K \frac{1}{|I_k|} \sum_{i \in I_k} \left( \frac{W_i Y_i}{\hat e^{I_k^c}(C_i)} - \frac{(1 - W_i)\, Y_i}{1 - \hat e^{I_k^c}(C_i)} \right),$$

where $\hat e^{I_k^c}$ is the fitted propensity score obtained by regressing $W_i$ on $C_i$ using the data in $S_{I_k^c}$. In our simulation, $\hat e^{I_k^c}$ coincides with $\hat h^{I_k^c}$ because we consider no $Z$-features. We denote this estimator by "IPW" in Figure 4. These estimators are not designed for the interference structures we consider, but we would like to investigate the performance of these off-the-shelf and easy-to-implement estimators, also in comparison with our proposed method.
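For reference, minimal R sketches of the two competitors are given below; for the IPW estimator, ehat is assumed to be a length-$N$ vector of cross-fitted propensity score estimates (unit $i$'s prediction obtained from the complement set of its fold, as in the netAIPW sketch after Algorithm 1). Function names are ours.

```r
hajek <- function(Y, W) {
  # difference of treatment-group and control-group weighted means
  mean(W * Y) / mean(W) - mean((1 - W) * Y) / mean(1 - W)
}

ipw <- function(Y, W, ehat) {
  # with folds of (approximately) equal size, this equals the fold-averaged form displayed above
  mean(W * Y / ehat - (1 - W) * Y / (1 - ehat))
}
```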

Figure 3

Different network structures on $N = 200$ units: Erdős–Rényi network (left), where two nodes are connected with probability $3/N$ (every node is connected to three other nodes in expectation); Watts–Strogatz network (right) with a rewiring probability of 0.05, a 1-dimensional ring-shaped starting lattice where each node is connected to two neighbors on both sides (i.e., every node is connected to four other nodes), no loops, and no multiple edges. The graphs are generated using the R-package igraph [55].

We investigate two network structures that govern our interference effects: Erdős–Rényi networks [56] and Watts–Strogatz networks [57]. Erdős–Rényi networks randomly form edges between units with a fixed probability and are a simple example of a random mathematical network model. These networks play an important role as a standard against which to compare more complicated models. Watts–Strogatz networks, also called small-world networks, share two properties with many networks in the real world: a small average shortest path length and a large clustering coefficient. To construct such a network, the vertices are first arranged in a regular fashion and linked to a fixed number of their neighbors. Then, some randomly chosen edges are rewired with a constant rewiring probability. A representative of each network type is provided in Figure 3. For each of these two network types, we consider one case where the dependency in the network does not increase with N (denoted by “const” in Figure 4) and one where it increases with N (denoted by N ^(1/15) in Figure 4).
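A sketch of the two simulated network types using the igraph generators and the parameters of Figure 3 (with $N = 200$ shown for illustration) is:

```r
library(igraph)

N    <- 200
G_er <- sample_gnp(N, p = 3 / N)             # Erdős–Rényi: expected degree 3
G_ws <- sample_smallworld(dim = 1, size = N, # Watts–Strogatz: ring lattice with two
                          nei = 2, p = 0.05) # neighbors per side, rewiring prob. 0.05
```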

Figure 4

Coverage (fraction of times the true, and in general unknown, $\theta_N^0$ was inside the confidence interval), log mean length of two-sided 95% confidence intervals for $\theta_N^0$, and mean bias over 1,000 simulation runs for Erdős–Rényi and Watts–Strogatz networks of different complexities (Erdős–Rényi: expected degree 3 and $3N^{1/15}$ for "const" and "N^(1/15)," respectively; Watts–Strogatz: before rewiring, nodes have degree 4 and $4N^{1/15}$ for "const" and "N^(1/15)," respectively, and the rewiring probability is 0.05). We compare the performance of our method, netAIPW, with the Hájek and an IPW estimator, indicated by color. The variances of the competitors are empirical variances over the 1,000 repetitions, whereas we computed confidence intervals for netAIPW according to (8) with $B = 1$ and 300 bootstrap samples. The shaded regions in the coverage plot represent 95% confidence bands with respect to the 1,000 simulation runs.

The specific unit-level structural equations (1) we consider are as follows. For each unit $i \in [N]$, we sample independent and identically distributed confounders $C_i \sim \mathrm{Unif}(0, 1)$ from the uniform distribution. The treatment selections $W_i$ are drawn from a Bernoulli distribution with arbitrarily chosen success probability $p_i = p_i(C_i) = 0.15\, \mathbf{1}_{C_i < 0.33} + 0.5\, \mathbf{1}_{0.33 \le C_i < 0.66} + 0.85\, \mathbf{1}_{0.66 \le C_i}$. Let $\alpha(i)$ denote the neighbors of unit $i$ in the network (without $i$ itself). Then, we let the 1-dimensional $X$-features $X_i$ denote the shifted average number of neighbors assigned to treatment weighted by their confounder, namely,

$$X_i = \frac{1}{|\alpha(i)|} \sum_{j \in \alpha(i)} \big(\mathbf{1}_{W_j = 1} - \mathbf{1}_{W_j = 0}\big)\, C_j$$

if $\alpha(i)$ is non-empty, and $X_i = 0$ otherwise. We do not consider $Z$-features. For real numbers $x$ and $c$, we consider the arbitrary functions

$$g_1^0(x, c) = 1.5\, \mathbf{1}_{x \ge 0.5,\, c \ge 0.2,\, x < 0.7} + 4\, \mathbf{1}_{c \ge 0.2,\, x \ge 0.7} + 0.5\, \mathbf{1}_{x \ge 0.5,\, c < 0.2} + 3.5\, \mathbf{1}_{x < 0.5,\, c \ge 0.2} + 2.5\, \mathbf{1}_{x < 0.5,\, c < 0.2}$$

and

$$g_0^0(x, c) = 0.5\, \mathbf{1}_{x \ge 0.4,\, c \ge 0.2} - 0.75\, \mathbf{1}_{x \ge 0.4,\, c < 0.2} + 0.25\, \mathbf{1}_{x < 0.4,\, c \ge 0.2} - 0.5\, \mathbf{1}_{x < 0.4,\, c < 0.2},$$

i.e., the functions $g_1^0$, $g_0^0$, and $h^0$ are step functions. For independent and identically distributed error terms $\varepsilon_{Y_i} \sim \mathrm{Unif}(-0.12/2,\, 0.12/2)$, we consider the outcomes $Y_i = W_i\, g_1^0(C_i, X_i) + (1 - W_i)\, g_0^0(C_i, X_i) + \varepsilon_{Y_i}$.
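A sketch of this data-generating mechanism in R is given below, assuming an igraph network G (e.g., one of the two types shown in Figure 3); the threshold placements, signs, the pairing of the function arguments with $(C_i, X_i)$, and the error bounds follow our reading of the displayed formulas, and all names are ours.

```r
library(igraph)

simulate_sem <- function(G) {
  N <- vcount(G)
  C <- runif(N)                                            # confounders C_i ~ Unif(0, 1)
  p <- 0.15 * (C < 0.33) + 0.5 * (C >= 0.33 & C < 0.66) + 0.85 * (C >= 0.66)
  W <- rbinom(N, 1, p)                                     # treatments
  X <- sapply(seq_len(N), function(i) {                    # signed, confounder-weighted
    nb <- neighbors(G, i)                                  # average over direct neighbors
    if (length(nb) == 0) 0 else mean((2 * W[nb] - 1) * C[nb])
  })
  g1 <- function(x, c) 1.5 * (x >= 0.5 & c >= 0.2 & x < 0.7) + 4 * (c >= 0.2 & x >= 0.7) +
    0.5 * (x >= 0.5 & c < 0.2) + 3.5 * (x < 0.5 & c >= 0.2) + 2.5 * (x < 0.5 & c < 0.2)
  g0 <- function(x, c) 0.5 * (x >= 0.4 & c >= 0.2) - 0.75 * (x >= 0.4 & c < 0.2) +
    0.25 * (x < 0.4 & c >= 0.2) - 0.5 * (x < 0.4 & c < 0.2)
  Y <- W * g1(C, X) + (1 - W) * g0(C, X) + runif(N, -0.06, 0.06)
  data.frame(Y = Y, W = W, C = C, X = X)
}
```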

For the sample sizes $N = 625$, 1,250, 2,500, and 5,000, we perform 1,000 simulation runs redrawing the data according to the SEM and consider $B = 1$, $K = 5$, and $R = 300$ bootstrap samples to estimate the variance in Algorithm 1, i.e., we consider one split per generated dataset and consequently do not aggregate p-values in these simulations. However, the empirical analysis in Section 3.2 aggregates p-values over 100 datasplits. We estimate the nuisance functions by random forests consisting of 500 trees with a minimal node size of 5 and other default parameters using the R-package ranger [58]. To estimate the propensity score, we limit the depth of the trees to 2. Our results for the Erdős–Rényi and Watts–Strogatz networks are displayed in Figure 4. Two different panels are used to display the results for different ranges of the bias of the methods. For all network types and complexities, we observe the following. The IPW estimator incurs some bias, as can be expected, because it does not account for network spillover, and even under SUTVA, it is not Neyman orthogonal, which means that plugging in machine learning estimators of the nuisance functions is not justified. Furthermore, it is known to have poor finite-sample performance due to estimated propensity scores $\hat e^{I_k^c}$ that may be close to 0 or 1. The Hájek estimator incurs some bias because it does not adjust for observed confounding and assumes a randomized treatment instead. The bias of our method (denoted by "netAIPW" in Figure 4) decreases as the sample size increases. As the dependency graph becomes more complex, our method requires more observations to achieve a small bias because the datasets $S_{I_k^c}$ in (6), which are used to estimate the nuisance functions, are smaller in denser networks. In terms of coverage, the two competitors perform poorly, whereas our method attains the desired coverage.

Simulation results involving spillover effects from second-degree neighbors and misspecified spillover effects are presented in Appendix C. Furthermore, for a $\mathrm{Bernoulli}(1/2)$ treatment assignment and with the "const" Watts–Strogatz setting presented in the main article, we found that the AIPW approach leads to variances that are about a factor of 23 smaller than the ones obtained with IPW. This suggests that AIPW is helpful in reducing the variance of IPW even in the randomized case.

3.2 Empirical analysis: Swiss StudentLife study data

Subsequently, we estimate the causal effect of study time on the academic success of university students with our newly developed estimator. We quantify this causal effect by the EATE, that is, the average difference in the expected GPA of the final exam had a student studied much versus little, allowing for potential spillover effects from the student's friends on the student's study time. Among the factors that determine academic success are person-specific traits, such as intelligence [59], willingness to work hard [60], and socioeconomic background [61]. The Swiss StudentLife Study data [27,28] were collected to investigate the impact of various factors on academic achievement. They consist of observations from freshmen undergraduate students pursuing a degree in the natural sciences at a Swiss university. Instead of a university entrance test, these students had to pass a demanding examination after 1 year of studying. At several time points throughout this year, the students were asked to fill out questionnaires about their student life, social network, and well-being. The data consist of three cohorts of students. Cohort 1 was observed in 2016 and cohorts 2 and 3 in 2017. Importantly, for all three cohorts, the data contain friendship information among the students. We build the corresponding undirected network by drawing an edge between two students if at least one of them mentioned the other one as being a friend. We believe that spillover effects arise due to students interacting in this network, and thus, we have to control for them when estimating the EATE described earlier. Figure 5 displays the resulting network consisting of the three cohorts.

Figure 5

Friendship networks per cohort, with black dots representing $W_i = 1$ (a weekly study time of at least 8 hours), white dots representing $W_i = 0$ (a weekly study time of less than 8 hours), and larger node sizes representing a higher GPA.

GPA ($Y_i$) constitutes our outcome variable and represents the average grade of seven to nine exams, depending on the study program. It ranges from 1 to 6, with passing grades of 4 or higher. The average GPA in the data we used was 4.266 with a standard deviation of 0.872. The remaining variables were measured 5 to 6 months before the exam period and correspond to wave four of the Swiss StudentLife Study data. The self-reported number of hours spent studying per week during the semester ($W_i$) constitutes the treatment variable. It was dichotomized into studying many ($W_i = 1$) and few ($W_i = 0$) hours. We considered a setting where $W_i = 1$ corresponds to studying at least 8 hours per week, which is the 20% quantile, and one where $W_i = 1$ corresponds to studying at least 20 hours per week, which is the 80% quantile. We consider spillover effects from the friends of a student, who are a student's direct neighbors in the friendship network. We consider $Z$-spillover effects that account for the effect of befriended students' study motivation and stress variables on a student's treatment. We do not consider spillover effects on the outcome GPA (no $X$-features). The $Z_i$-spillover variable of a student $i$ is a vector of length 6, where each entry corresponds to the average of the following six variables across the friends of the student: (a) study motivation, measured with the learning objectives subscale of the SELLMO-ST[3] [62]; (b) work avoidance, measured with the work avoidance subscale of the student version of the SELLMO-ST[3]; (c) the average of ten perceived stress items [63]; (d, e) two items specifically on exam-related stress; and (f) whether one was perceived as clever by at least one other student. In addition to these network effects, we control on the unit level ($C_i$) for the variables just mentioned observed on an individual unit as well as the cohort number, gender, having Swiss nationality, speaking German, and the financial situation. From all the data of the three cohorts combined, we only considered individuals for whom all the mentioned variables, i.e., treatment, outcome, covariates, and $Z$-spillover variables, are observed. The final sample consisted of $N = 526$ individuals: 113 from cohort 1, 119 from cohort 2, and 294 from cohort 3. In our algorithm, we used $S = 1,000$ sample splits (from which we aggregate p-values as in Section 2.4) with $K = 10$ groups each and random forests consisting of 5,000 trees to learn $g_0^0$, $g_1^0$, and $h^0$, whose leaf size was initially determined by fivefold cross-validation. Also, we used the variance estimator from Appendix G, which relies on fewer assumptions.

We estimated the EATE with two different definitions of $W_i = 1$, defined by a study time of either at least 8 or at least 20 hours per week, corresponding to the 20% and 80% quantiles, respectively, and Table 1 displays the results. Table 1(a) displays our estimated EATE with $W_i = 1$ representing a weekly study time of at least 8 hours. Our EATE estimator is positive and significant. On average, students would have received a 0.362 point higher GPA had they studied at least 8 hours per week compared to studying less. Consequently, a significantly higher GPA can be achieved by studying more. If we apply the same procedure but exclude the $Z$-spillover covariates (no spillover), the EATE estimator is higher and also significant. Table 1(b) displays our results with $W_i = 1$ representing a weekly study time of at least 20 hours. Our EATE estimator is positive but not significant anymore. Hence, our results do not suggest a significantly higher GPA had a student studied at least 20 hours per week compared to studying less. Without spillover, the treatment effect is significant. In both cases in Table 1, the estimate of the EATE is higher under the assumption of no spillover effects compared to the estimator that allows for possible $Z$-spillover effects. This potentially relevant difference highlights the importance of not ruling out spillover effects a priori. Overall, the model including spillover effects seems more realistic than the one excluding them. Finally, when interpreting the results, it is important to recall that study time captures the learning time during the semester. There is an additional 8-week lecture-free preparation period, and our study time variable does not reflect this preparation time. Consequently, our results only describe the EATE of study time during the semester on GPA.

Table 1

EATE and 95% confidence intervals for $\theta_N^0$ for different settings with different control groups, namely, studying less than 8 (a) or less than 20 (b) hours per week

Spillover | EATE | 95% CI for $\theta_N^0$
(a) $W_i = 1$ if studied at least 8 h per week (20% quantile)
Yes | 0.362 | [0.283, 0.442]
No | 0.451 | [0.364, 0.528]
(b) $W_i = 1$ if studied at least 20 h per week (80% quantile)
Yes | 0.078 | [−0.096, 0.252]
No | 0.163 | [0.011, 0.311]

4 Conclusion

Causal inference with observational data usually assumes independent units. However, having independent observations is often questionable, and so-called spillover effects among units are common in practice. Our aim was to develop point estimation and asymptotic inference for the expected average treatment effect (EATE) with observational data from a single (social) network. We would like to point out the hardness of this problem: we consider treatment effect estimation on data with increasing dependence among units, where the data-generating mechanism can be highly nonlinear and include confounders. We use an augmented inverse probability weighting (AIPW) principle and account for spillover effects that we capture by features, which are functions of the known network and the treatment and covariate vectors. There may be several features, and one feature may capture spillover effects from different units than another feature; these units might be direct neighbors to compute one feature and neighbors of neighbors to compute another feature. We consider the dependency graph to pose assumptions on these features in our asymptotic theory. Units may interact beyond their direct neighborhoods, interactions may become increasingly complex as the sample size increases, and we consider arbitrary networks. Using ideas of double machine learning [20], we develop a cross-fitting algorithm under interference that allows us to estimate the nuisance components of our model by arbitrary machine learning algorithms. Although we employ machine learning algorithms, our EATE estimator converges at the parametric $\sqrt{N}$ rate and asymptotically follows a Gaussian distribution, which allows us to perform inference.

In a simulation study, we demonstrated that commonly employed methods for treatment effect estimation suffer from the presence of spillover effects, whereas our method could account for the complex dependence structures in the data so that the bias vanished with increasing sample size and coverage was guaranteed. In the Swiss StudentLife Study, we investigated the EATE of study time on the GPA of university examinations, accounting for spillover effects due to friendship relations. Omitting this spillover may lead to biased results due to spurious association.

In this work, we focused on estimating the EATE. Other effects may be estimated in a similar manner, for instance, the global average treatment effect (GATE) where all units are jointly intervened on. We develop an estimator of the GATE in Appendix H.

Acknowledgements

We thank the associate editor and reviewers for detailed and constructive comments. We also thank Leonard Henckel and Dominik Rothenhäusler for useful comments.

  1. Funding information: CE and PB received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (Grant Agreement No. 786461), and M-LS received funding from the Swiss National Science Foundation (SNF) (Project No. 200021_172485). The Swiss StudentLife data collection was supported by Swiss National Science Foundation Grant 10001A 169965 and the rectorate of ETH Zurich.

  2. Author contributions: All authors have accepted responsibility for the entire content of this manuscript and approved its submission.

  3. Conflict of interest: The authors state no conflict of interest.

  4. Data availability statement: The data and code used in the simulation study are available on GitHub (https://github.com/corinne-rahel/networkAIPW). The Swiss StudentLife data analyzed in the empirical analysis are cited in the main text of our paper.

A Assumptions and additional definitions

We consider the following notation. We denote by $[N]$ the set $\{1, 2, \ldots, N\}$. We add the probability law as a subscript to the probability operator $P$ and the expectation operator $\mathbb{E}$ whenever we want to emphasize the corresponding dependence. We denote the $L_p(P)$-norm by $\|\cdot\|_{P,p}$ and the Euclidean or operator norm by $\|\cdot\|$, depending on the context. We implicitly assume that given expectations and conditional expectations exist. We denote convergence in distribution by $\stackrel{d}{\rightarrow}$. The symbol $\perp\!\!\!\perp$ denotes independence of random variables.

We observe $N$ units, connected by an underlying network, that are generated according to the structural equations (1). For each unit $i \in [N]$, we concatenate the variables relevant for unit $i$ into $S_i = (W_i, C_i, X_i, Z_i, Y_i)$. For notational simplicity, we abbreviate $D_i = (C_i, X_i)$ and $U_i = (C_i, Z_i)$ for $i \in [N]$.

Let the number of sample splits $K \ge 2$ be a fixed integer independent of $N$. We assume that $N \ge K$ holds. Consider a partition $I_1, \ldots, I_K$ of $[N]$. We assume that all sets $I_1, \ldots, I_K$ are of equal cardinality $n$. We make this assumption for the sake of notational simplicity, but our results hold without it.

Let $\{\delta_N\}_{N \ge K}$ and $\{\Delta_N\}_{N \ge K}$ be two sequences of non-negative numbers that converge to $0$ as $N \to \infty$. Let $\{\mathcal{P}_N\}_{N \ge 1}$ be a sequence of sets of probability distributions $P$ of the $N$ units.

For completeness, we recall the following two assumptions from the main text. Assumption A1 limits the growth rate of the maximal degree of a node in the dependency graph. Assumption A2 characterizes the asymptotic variance in Theorem G.1 as the limit of the population variance on the N units.

Assumption A1

The maximal degree $d_{\max}$ of a node in the dependency graph satisfies $d_{\max} = o(N^{1/4})$.

Assumption A2

Let $\{\mathcal{P}_N\}_{N \ge 1}$ be a sequence of sets of probability distributions $P$ of the $N$ units. There exists $\sigma^2$, possibly depending on $P \in \mathcal{P}_N$, satisfying $0 < L \le \sigma^2 \le U < \infty$ with fixed constants $L$ and $U$, such that for all $P \in \mathcal{P}_N$, we have

(A1) $\lim_{N \to \infty} \Bigl| \operatorname{Var}\Bigl( \frac{1}{\sqrt{N}} \sum_{i=1}^N \psi(S_i, \theta_i^0, \eta^0) \Bigr) - \sigma^2 \Bigr| = 0,$

where $\psi(S_i, \theta_i^0, \eta^0) = \varphi(S_i, \eta^0) - \theta_i^0$ is a centered version of $\varphi$.

We make the following additional sets of assumptions. Assumption A3 recalls that we use the model (1) and specifies regularity assumptions on the involved random variables. Assumptions 3.2 and 3.3 ensure that the random variables are integrable enough. Assumption 3.4 ensures that the true underlying function $h^0$ of the treatment selection model is bounded away from 0 and 1, which allows us to divide by $h^0$ and $1 - h^0$.

Assumption A3

Let $p \ge 4$. For all $N$, all $i \in [N]$, all $P \in \mathcal{P}_N$, and all $k \in [K]$, we have the following:

  1. The structural equations (1) hold, where the treatment $W_i \in \{0, 1\}$ is binary.

  2. There is a finite real constant $C_1$ independent of $P$ satisfying $\|W_i\|_{P,p} + \|C_i\|_{P,p} + \|X_i\|_{P,p} + \|Z_i\|_{P,p} + \|Y_i\|_{P,p} \le C_1$.

  3. There is a finite real constant $C_2$ independent of $P$ such that we have $\|Y_i\|_{P,\infty} + \|g_1^0(D_i)\|_{P,\infty} + \|g_0^0(D_i)\|_{P,\infty} + \|h^0(U_i)\|_{P,\infty} \le C_2$.

  4. There is a finite real constant $C_3$ independent of $P$ such that $P(C_3 \le h^0(U_i) \le 1 - C_3) = 1$ holds.

  5. There is a finite real constant $C_4$ such that we have $|\theta_i^0| \le C_4$.

Assumption A4 characterizes the realization set of the nuisance functions and the $N^{-1/2}$ convergence rate of products of the machine learning errors from estimating the nuisance functions $g_1^0$, $g_0^0$, and $h^0$; a worked example illustrating this product condition is given after the assumption.

Assumption A4

Consider the $p \ge 4$ from Assumption A3. For all $N \ge K$ and all $P \in \mathcal{P}_N$, consider a nuisance function realization set $\mathcal{T}$ such that the following conditions hold:

  1. The set $\mathcal{T}$ consists of $P$-integrable functions $\eta = (g_1, g_0, h)$ whose $p$th moment exists and whose $\|\cdot\|_{P,\infty}$-norm is in fact uniformly bounded, and $\mathcal{T}$ contains $\eta^0 = (g_1^0, g_0^0, h^0)$. Furthermore, there is a finite real constant $C_5$ such that for all $i \in [N]$ and all elements $\eta = (g_1, g_0, h) \in \mathcal{T}$, we have

     $\|h^0(U_i) - h(U_i)\|_{P,2} \bigl( \|g_1^0(D_i) - g_1(D_i)\|_{P,2} + \|g_0^0(D_i) - g_0(D_i)\|_{P,2} + \|h^0(U_i) - h(U_i)\|_{P,2} \bigr) \le C_5\, \delta_N N^{-1/2}.$

  2. Assumption 3.4 also holds with $h^0$ replaced by $h$.

  3. Let $\kappa$ be the largest real number such that for all $i \in [N]$ and all $\eta \in \mathcal{T}$, we have

     $\|h^0(U_i) - h(U_i)\|_{P,2} + \|g_1^0(D_i) - g_1(D_i)\|_{P,2} + \|g_0^0(D_i) - g_0(D_i)\|_{P,2} \le \delta_N N^{-\kappa},$

     i.e., $\kappa$ represents the slowest convergence rate of our machine learners. Then, there is a finite real constant $C_6$ such that $d_{\max} N^{-2\kappa} \le C_6$ holds, where $d_{\max}$ denotes the maximal degree of the dependency graph.

  4. For all $k \in [K]$, the nuisance parameter estimate $\hat{\eta}_{I_k^c} = \hat{\eta}_{I_k^c}(S_{I_k^c})$ belongs to the nuisance function realization set $\mathcal{T}$ with $P$-probability no less than $1 - \Delta_N$.
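As an illustration of how conditions 1 and 3 of Assumption A4 interact (a worked example of ours, not taken from the article), suppose all three nuisance learners attain the same $L_2(P)$ error rate $\delta_N N^{-1/4}$, so that $\kappa = 1/4$. Then

$\|h^0(U_i) - h(U_i)\|_{P,2} \Bigl( \|g_1^0(D_i) - g_1(D_i)\|_{P,2} + \|g_0^0(D_i) - g_0(D_i)\|_{P,2} + \|h^0(U_i) - h(U_i)\|_{P,2} \Bigr) \lesssim \delta_N N^{-1/4} \cdot \delta_N N^{-1/4} = \delta_N^2 N^{-1/2} \le \delta_N N^{-1/2}$

for $\delta_N \le 1$, so the product condition in item 1 holds. Item 3 then reduces to $d_{\max} N^{-1/2} \le C_6$, which is satisfied for all sufficiently large $N$ whenever $d_{\max} = o(N^{1/4})$ as in Assumption A1.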

Assumptions A5 and A6 are only required to establish that our plugin estimator of the asymptotic variance is consistent in Appendix G. (However, please recall that we recommend using the bootstrap procedure presented in Section 2.5 unless the sample size is large.) They are not required to establish the asymptotic Gaussian distribution of our plugin machine learning estimator. Assumption A5 characterizes the order of the minimal size of the sets $A_d$ for $d \ge 0$. These sets are required to contain a sufficient number of units such that the degree-specific treatment effects $\theta_d^0$ for $d \ge 0$ can be estimated at a fast enough rate. These estimators are required to give a consistent estimator of the asymptotic variance $\sigma^2$.

Assumption A5

For $d \ge 0$, the cardinality of $A_d$ is at least of order $N^{3/4}$, denoted by $\Omega(N^{3/4})$ according to the Bachmann–Landau notation [64].

Assumption A6 specifies that all individual machine learning estimators of the nuisance functions converge at rate $N^{-1/4}$ or faster.

Assumption A6

The slowest convergence rate $\kappa$ in Assumption 4.3 satisfies $\kappa \ge 1/4$.

B Network effects in the social sciences

We consider models related to spillover effects. However, another notion of spillover effects has prevailed within the social science networks literature, namely, social influence effects. In this appendix, we describe social influence effects and how their modeling differs from our approach. Whereas spillover effects represent new covariates on the unit level that are built from variables of other units along network paths, social influence effects mostly concern the effect that a specific variable $A_j$ of neighboring units has on $A_i$ of the $i$th unit. In the statistics literature, this is called contagion [65,66]. In the social sciences, there are two important models to investigate social influence/contagion processes: the autologistic actor attribute model (ALAAM; Robins et al. [67], Daraganova and Robins [68]) and the stochastic actor-oriented model (SAOM; Snijders [69], Snijders et al. [70], Steglich et al. [71]). Both models aim at estimating the degree to which a variable $A_i$ of a focal individual is associated with its neighbors' values of $A$. Whereas ALAAMs only consider cross-sectional data, SAOMs additionally allow estimating longitudinal social influence effects.

In contrast, the spillover features that we consider summarize variables from neighboring units. They represent a new variable that is used in the treatment or outcome regression models. For example, in our empirical analysis, we consider the spillover effect of the study motivation of unit $i$'s neighbors on the learning hours of unit $i$. We do not consider spillover from the learning hours of unit $i$'s neighbors on unit $i$'s own learning hours (i.e., social influence/contagion). Instead, we model such associations of the individual units' learning hours by constructing features from other variables and units that act as observed confounders. Moreover, we are not interested in estimating the effect of, say, other units' study motivation on the learning hours of unit $i$ as such; this is possible with ALAAMs and SAOMs. Rather, we consider spillover as a tool to control for spurious associations due to the network structure when estimating treatment effects.

C Additional simulation results

First, we present simulation results involving spillover effects from second-degree neighbors and misspecified spillover effects. We consider the same data-generating mechanism and estimation framework as in Section 3.1 apart from the following change: the “neighborhood” α ( i ) defining X i contains all second-degree neighbors of unit i , i.e., all units that are a distance 2 away from unit i in the network (neighbors of neighbors). For incorrectly specified spillover effects, we assumed that α ( i ) contains the direct neighbors of unit i instead. We consider an Erdős–Rényi network as in Section 3.1 except that the average degree of a unit is now 2.5. The results are displayed in Figure A1. Our method, netAIPW, does not seem to suffer much from the misspecified spillover effects in terms of coverage, whereas the other methods do. In general, we observed that it is advantageous to include spillover effects even if they are not entirely correctly specified.
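To make the construction of such second-degree spillover features concrete, here is a minimal sketch under our own simplifying assumption about the feature choice (averaging a single covariate over the units at graph distance exactly 2); it is not the article's simulation code.

```python
import numpy as np
import networkx as nx


def second_degree_spillover(G, covariate):
    """For each node i, average `covariate` over the units at graph distance
    exactly 2 from i (neighbors of neighbors, excluding i and its direct
    neighbors); returns 0 for units whose distance-2 set is empty."""
    features = np.zeros(G.number_of_nodes())
    for i in G.nodes():
        direct = set(G.neighbors(i))
        second = {k for j in direct for k in G.neighbors(j)} - direct - {i}
        if second:
            features[i] = np.mean([covariate[k] for k in second])
    return features


# Erdos-Renyi network with expected degree 2.5, i.e., edge probability 2.5 / (N - 1).
rng = np.random.default_rng(0)
N = 500
G = nx.gnp_random_graph(N, 2.5 / (N - 1), seed=0)
C = rng.normal(size=N)                # a unit-level covariate
X = second_degree_spillover(G, C)     # spillover feature from second-degree neighbors
```

The misspecified variant in the simulation would instead average over the direct neighbors of each unit.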

Figure A1

Coverage (fraction of times the true, and in general unknown, θ N 0 was inside the confidence interval), log mean length of two-sided 95% confidence intervals for θ N 0 , and mean bias over 1,000 simulation runs for the “const” Erdős–Rényi network as in Section 3.1, except for the average degree of 2.5. We compare the performance of our method, netAIPW, with the Hájek and an IPW estimator, indicated by color, for correctly and incorrectly specifying the spillover effects from second-degree neighbors. The variance of the competitors are empirical variances over the 1,000 repetitions, whereas we computed confidence intervals for netAIPW according to (8) with B = 1 and 300 bootstrap samples. The shaded regions in the coverage plot represent 95% confidence bands with respect to the 1,000 simulation runs.

Next, we present networks and different kinds of spillover effects to show when Assumption A2 holds and when it fails to hold. We consider the same second-degree spillover effects and Erdős–Rényi network as mentioned earlier in this section. In Figure A2, the "const" network has an expected degree of 2.5, and the "$N$^(1/15)" one has an expected degree of $2.5 N^{1/15}$. The maximal degree of the dependency graph of the "const" network, divided by $N^{1/4}$, decreases with $N$, whereas the respective quantity increases with $N$ for the non-constant-degree network; i.e., only the constant-degree network satisfies Assumption A2. The non-constant-degree network implies a dependency graph that is "too dense" to satisfy this assumption. We would like to remark that satisfying Assumption A2 is an interplay of the underlying network and the chosen spillover effects because together they determine the dependency graph, and hence its maximal degree. A given network might lead to a dependency graph satisfying Assumption A2 with one kind of spillover effects (e.g., only from neighbors), whereas the same network might lead to a dependency graph violating this assumption with another kind of spillover effects (e.g., also including second-degree neighbors, i.e., neighbors of neighbors).
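The following sketch illustrates how one could reproduce the qualitative comparison of Figure A2, under the simplifying assumption (ours, not necessarily the article's exact construction) that each $S_i$ is a function of the variables of units within graph distance 2 of $i$; then $S_i$ and $S_j$ share an edge in the dependency graph whenever $i$ and $j$ are within distance 4 of each other.

```python
import networkx as nx


def dependency_max_degree(G, radius=2):
    """Maximal degree of the dependency graph under the illustrative assumption
    that S_i is built from units within `radius` of i, so that S_i and S_j are
    dependent iff dist(i, j) <= 2 * radius."""
    d_max = 0
    for i in G.nodes():
        reachable = nx.single_source_shortest_path_length(G, i, cutoff=2 * radius)
        d_max = max(d_max, len(reachable) - 1)  # exclude i itself
    return d_max


for N in [250, 500, 1000, 2000]:
    G_const = nx.gnp_random_graph(N, 2.5 / (N - 1), seed=0)
    G_grow = nx.gnp_random_graph(N, 2.5 * N ** (1 / 15) / (N - 1), seed=0)
    print(N,
          dependency_max_degree(G_const) / N ** 0.25,
          dependency_max_degree(G_grow) / N ** 0.25)
```

Under this simplification, the printed ratio mimics the two regimes discussed above: roughly flat or decreasing for the constant expected degree and slowly growing for the $2.5 N^{1/15}$ expected degree.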

Figure A2

Simulated maximal degree of the dependency graph from second-degree spillover on Erdős–Rényi networks with an expected degree of either 2.5 (“const”) or 2.5 N 1 15 (“ N  ^ (1/15)”) divided by N 1 4 , averaged over 1,000 simulation runs.

D Supplementary lemmata

In this section, we prove two results on conditional independence relationships of the variables from our model. We argue based on the DAG of our model (1) and use graphical criteria [33–37,72,73]. We denote the direct causes of $W_i$ by $\mathrm{pa}(W_i)$, the parents of $W_i$. Analogously, we denote the parents of $Y_i$ by $\mathrm{pa}(Y_i)$; see, for instance, Lauritzen [33]. We assume that $\mathrm{pa}(W_i)$ consists of $C_i$ and the variables used to compute the spillover feature $Z_i$ and that $\mathrm{pa}(Y_i)$ consists of $W_i$, $C_i$, and the variables used to compute the spillover feature $X_i$.

Lemma D.1

Let $i \in [N]$, and let $C_j \notin \mathrm{pa}(Y_i)$. Then, we have $Y_i \perp\!\!\!\perp C_j \mid \mathrm{pa}(Y_i)$.

Proof of Lemma D.1

The parents of Y i are a valid adjustment set [35]. Because Y i has no descendants, the claim follows.□

Lemma D.2

Let $i \in [N]$, and let $C_j \notin \mathrm{pa}(W_i)$. Then, we have $W_i \perp\!\!\!\perp C_j \mid \mathrm{pa}(W_i)$. Furthermore, for $j \ne i$, we have $W_i \perp\!\!\!\perp W_j \mid \mathrm{pa}(W_i)$.

Proof of Lemma D.2

The parents of W i are a valid adjustment set [35]. The treatment variable W i has no descendants apart from outcomes Y , which are colliders on any path from W i to C j or W j , and thus, the empty set blocks these paths. Consequently, the two claims follow.□

E Proof of Theorem 2.2

Proof of Lemma 2.1

Let $i \in [N]$. We have

$\mathbb{E}[\psi(S_i, \theta_i^0, \eta^0)] = \mathbb{E}\Bigl[ \frac{W_i}{h^0(U_i)} \bigl( Y_i - g_1^0(D_i) \bigr) \Bigr] - \mathbb{E}\Bigl[ \frac{1 - W_i}{1 - h^0(U_i)} \bigl( Y_i - g_0^0(D_i) \bigr) \Bigr].$

We have

(A2) $\mathbb{E}\Bigl[ \frac{W_i}{h^0(U_i)} \bigl( Y_i - g_1^0(D_i) \bigr) \Bigr] = \mathbb{E}\Bigl[ \frac{W_i}{h^0(U_i)} \bigl( \mathbb{E}[Y_i \mid \mathrm{pa}(Y_i) \cup \mathrm{pa}(W_i)] - g_1^0(D_i) \bigr) \Bigr] = \mathbb{E}\Bigl[ \frac{1}{h^0(U_i)}\, \mathbb{E}\bigl[ W_i Y_i - W_i g_1^0(D_i) \mid \mathrm{pa}(Y_i) \bigr] \Bigr] = \mathbb{E}\Bigl[ \frac{W_i}{h^0(U_i)}\, \mathbb{E}[\varepsilon_{Y_i} \mid \mathrm{pa}(Y_i)] \Bigr] = 0$

due to Lemma D.1 and because $\mathbb{E}[\varepsilon_{Y_i} \mid \mathrm{pa}(Y_i)] = 0$ holds by assumption. Analogous computations for $\mathbb{E}\bigl[ \frac{1 - W_i}{1 - h^0(U_i)} \bigl( Y_i - g_0^0(D_i) \bigr) \bigr]$ conclude the proof.□

The following lemma shows that the score function φ is Neyman orthogonal in the sense that its Gateaux derivative vanishes [20].

Lemma E.1

(Neyman orthogonality) Assume that the assumptions of Theorem 2.2 hold. Let $\eta \in \mathcal{T}$, and let $i \in [N]$. Then, we have

$\frac{\partial}{\partial r}\Big|_{r=0} \mathbb{E}[\varphi(S_i, \eta^0 + r(\eta - \eta^0))] = 0.$

Proof of Lemma E.1

Let $r \in (0, 1)$, let $i \in [N]$, and let $\eta \in \mathcal{T}$. Then, we have

(A3) r E [ φ ( S i , η 0 + r ( η η 0 ) ) ] = r E [ g 1 0 ( D i ) g 0 0 ( D i ) + r ( g 1 ( D i ) g 0 ( D i ) g 1 0 ( D i ) + g 0 0 ( D i ) ) + W i h 0 ( U i ) + r ( h ( U i ) h 0 ( U i ) ) ( Y i g 1 0 ( D i ) r ( g 1 ( D i ) g 1 0 ( D i ) ) ) 1 W i 1 h 0 ( U i ) r ( h ( U i ) h 0 ( U i ) ) ( Y i g 0 0 ( D i ) r ( g 0 ( D i ) g 0 0 ( D i ) ) ) = E [ ( g 1 ( D i ) g 0 ( D i ) ) ( g 1 0 ( D i ) g 0 0 ( D i ) ) + W i ( h 0 ( U i ) + r ( h ( U i ) h 0 ( U i ) ) ) 2 ( ( g 1 ( D i ) g 1 0 ( D i ) ) ( h 0 ( U i ) + r ( h ( U i ) h 0 ( U i ) ) ) ( Y i g 1 0 ( D i ) r ( g 1 ( D i ) g 1 0 ( D i ) ) ) ( h ( U i ) h 0 ( U i ) ) ) 1 W i ( 1 h 0 ( U i ) r ( h ( U i ) h 0 ( U i ) ) ) 2 ( ( g 0 ( D i ) g 0 0 ( D i ) ) ( 1 h 0 ( U i ) r ( h ( U i ) h 0 ( U i ) ) ) + ( Y i g 0 0 ( D i ) r ( g 0 ( D i ) g 0 0 ( D i ) ) ) ( h ( U i ) h 0 ( U i ) ) ) ] .

We evaluate this expression at r = 0 and obtain

r r = 0 E [ φ ( S i , η 0 + r ( η η 0 ) ) ] = E [ ( g 1 ( D i ) g 0 ( D i ) ) ( g 1 0 ( D i ) g 0 0 ( D i ) ) 1 + ε W i h 0 ( U i ) ( g 1 ( D i ) g 1 0 ( D i ) ) W i ( h 0 ( U i ) ) 2 ( Y i g 1 0 ( D i ) ) ( h ( U i ) h 0 ( U i ) ) + 1 ε W i 1 h 0 ( U i ) ( g 0 ( D i ) g 0 0 ( D i ) ) 1 W i ( 1 h 0 ( U i ) ) 2 ( Y i g 0 0 ( D i ) ) ( h ( U i ) h 0 ( U i ) ) = 0

due to (A2) and because

E ε W i h 0 ( U i ) ( g 1 ( D i ) g 1 0 ( D i ) ) = E ( E [ W i pa ( W i ) pa ( Y i ) ] h 0 ( U i ) ) 1 h 0 ( U i ) ( g 1 ( D i ) g 1 0 ( D i ) ) = E E [ W i h 0 ( U i ) pa ( W i ) ] 1 h 0 ( U i ) ( g 1 ( D i ) g 1 0 ( D i ) ) = E E [ ε W i pa ( W i ) ] 1 h 0 ( U i ) ( g 1 ( D i ) g 1 0 ( D i ) ) = 0

holds due to Lemma D.2 and because we assumed E [ ε W i pa ( W i ) ] = 0, and similarly, for E [ ε W i ( 1 h 0 ( U i ) ) ( g 0 ( D i ) g 0 0 ( D i ) ) ] .□

The following lemma bounds the second directional derivative of the score function. Its proof uses that products of the errors of the machine learners are of smaller order than $N^{-1/2}$.

Lemma E.2

(Product property) Assume the assumptions of Theorem 2.2 hold. Let $r \in (0, 1)$, let $\eta \in \mathcal{T}$, and let $i \in [N]$. Then, we have

$\Bigl| \frac{\partial^2}{\partial r^2} \mathbb{E}[\varphi(S_i, \eta^0 + r(\eta - \eta^0))] \Bigr| \lesssim \delta_N N^{-1/2}.$

Proof of Lemma E.2

We use the first directional derivative we derived in (A3) to compute the second directional derivative

2 r 2 E [ φ ( S i , η 0 + r ( η η 0 ) ) ] = 2 E W i ( h 0 ( U i ) + r ( h ( U i ) h 0 ( U i ) ) ) 4 ( ( g 1 ( D i ) g 1 0 ( D i ) ) ( h 0 ( U i ) + r ( h ( U i ) h 0 ( U i ) ) ) + ( Y i g 1 0 ( D i ) r ( g 1 ( D i ) g 1 0 ( D i ) ) ) ( h ( U i ) h 0 ( U i ) ) ) ( h 0 ( U i ) + r ( h ( U i ) h 0 ( U i ) ) ) ( h ( U i ) h 0 ( U i ) ) ] + 2 E 1 W i ( 1 h 0 ( U i ) r ( h ( U i ) h 0 ( U i ) ) ) 4 ( ( g 0 ( D i ) g 0 0 ( D i ) ) ( 1 h 0 ( U i ) r ( h ( U i ) h 0 ( U i ) ) ) ( Y i g 0 0 ( D i ) r ( g 0 ( D i ) g 0 0 ( D i ) ) ) ( h ( U i ) h 0 ( U i ) ) ) ( 1 h 0 ( U i ) r ( h ( U i ) h 0 ( U i ) ) ) ( h ( U i ) h 0 ( U i ) ) ] .

Due to Hölder’s inequality and Assumptions 3.1, 3.3, 3.4, and 4.1, we have

2 r 2 E [ φ ( S i , η 0 + r ( η η 0 ) ) ] ( g 1 ( D i ) g 1 0 ( D i ) P , 2 + h ( U i ) h 0 ( U i ) P , 2 ) h ( U i ) h 0 ( U i ) P , 2 + ( g 0 ( D i ) g 0 0 ( D i ) P , 2 + h ( U i ) h 0 ( U i ) P , 2 ) h ( U i ) h 0 ( U i ) P , 2 .

Due to Assumption 4.1, both summands mentioned earlier are bounded by $\delta_N N^{-1/2}$, and hence, we conclude the proof.□

The following lemma describes how we apply Stein’s method [74] to obtain the asymptotic Gaussian distribution of our estimator although the data are highly dependent.

Lemma E.3

(Asymptotic distribution with Stein’s method) Assume the assumptions of Theorem 2.2 hold. Denote by

$\sigma_N^2 = \operatorname{Var}\Bigl( \frac{1}{\sqrt{N}} \sum_{i=1}^N \psi(S_i, \theta_i^0, \eta^0) \Bigr).$

Observe that by Assumption A2, we have $\lim_{N \to \infty} (\sigma_N^2 - \sigma^2) = 0$. Then, we have

$\sigma_N^{-1} \frac{1}{\sqrt{N}} \sum_{i=1}^N \psi(S_i, \theta_i^0, \eta^0) \stackrel{d}{\rightarrow} \mathcal{N}(0, 1).$

Proof of Lemma E.3

According to Lemma 2.1, we have $\mathbb{E}[\psi(S_i, \theta_i^0, \eta^0)] = 0$. According to Assumption A3, the fourth moment of $\psi(S_i, \theta_i^0, \eta^0)$ exists for all $i \in [N]$ and is uniformly bounded over $i \in [N]$. Recall that we denote by $d_{\max}$ the maximal degree in the dependency graph on $S_i$, $i \in [N]$. Based on Ross [75, Theorem 3.6], we can thus bound the Wasserstein distance of $\sigma_N^{-1} \frac{1}{\sqrt{N}} \sum_{i=1}^N \psi(S_i, \theta_i^0, \eta^0)$ to $\mathcal{N}(0,1)$ as follows: there exist finite real constants $c_1$ and $c_2$ such that we have

(A4) $d_W\Bigl( \sigma_N^{-1} \frac{1}{\sqrt{N}} \sum_{i=1}^N \psi(S_i, \theta_i^0, \eta^0),\ \mathcal{N}(0,1) \Bigr) \le \frac{c_1 d_{\max}^{3/2}}{\sigma_N^2} \sqrt{ \sum_{i=1}^N \mathbb{E}\Bigl[ \Bigl( \frac{1}{\sqrt{N}} \psi(S_i, \theta_i^0, \eta^0) \Bigr)^4 \Bigr] } + \frac{c_2 d_{\max}^2}{\sigma_N^3} \sum_{i=1}^N \mathbb{E}\Bigl[ \Bigl| \frac{1}{\sqrt{N}} \psi(S_i, \theta_i^0, \eta^0) \Bigr|^3 \Bigr] = \frac{c_1 d_{\max}^{3/2}}{\sqrt{N}\, \sigma_N^2} \sqrt{ \frac{1}{N} \sum_{i=1}^N \mathbb{E}[\psi^4(S_i, \theta_i^0, \eta^0)] } + \frac{c_2 d_{\max}^2}{\sqrt{N}\, \sigma_N^3}\, \frac{1}{N} \sum_{i=1}^N \mathbb{E}\bigl[ |\psi(S_i, \theta_i^0, \eta^0)|^3 \bigr].$

By assumption, we have $d_{\max} = o(N^{1/4})$. Thus, we have $d_{\max}^{3/2}/\sqrt{N} = o(N^{-1/8})$ and $d_{\max}^2/\sqrt{N} = o(1)$. Because the terms $\mathbb{E}[\psi^4(S_i, \theta_i^0, \eta^0)]$ and $\mathbb{E}[|\psi(S_i, \theta_i^0, \eta^0)|^3]$ are uniformly bounded over all $i \in [N]$ and because $\sigma_N \to \sigma$ as $N \to \infty$ according to Assumption A2, the Wasserstein distance in (A4) is of order $o(1)$. Consequently, we infer the statement of the lemma.□

Lemma E.4

(Vanishing covariance due to sparse dependency graph) Assume the assumptions of Theorem 2.2 hold. Let $k \in [K]$, and recall that $n = |I_k|$ holds. Then, we have

$\frac{1}{\sqrt{n}} \sum_{i \in I_k} \bigl( \varphi(S_i, \hat{\eta}_{I_k^c}) - \mathbb{E}[\varphi(S_i, \hat{\eta}_{I_k^c}) \mid S_{I_k^c}] \bigr) - \frac{1}{\sqrt{n}} \sum_{i \in I_k} \bigl( \varphi(S_i, \eta^0) - \mathbb{E}[\varphi(S_i, \eta^0)] \bigr) = o_P(1).$

Proof of Lemma E.4

Let k [ K ] . We have

(A5) E 1 n i I k ( φ ( S i , η ˆ I k c ) E [ φ ( S i , η ˆ I k c ) S I k c ] ) 1 n i I k ( φ ( S i , η 0 ) E [ φ ( S i , η 0 ) ] ) 2 S I k c = 1 n i I k E [ ( φ ( S i , η ˆ I k c ) φ ( S i , η 0 ) ) 2 S I k c ] 1 n i I k E [ φ ( S i , η ˆ I k c ) φ ( S i , η 0 ) S I k c ] 2 + 1 n i , j I k , i j E [ ( φ ( S i , η ˆ I k c ) φ ( S i , η 0 ) ) ( φ ( S j , η ˆ I k c ) φ ( S j , η 0 ) ) S I k c ] 1 n i , j I k , i j E [ φ ( S i , η ˆ I k c ) φ ( S i , η 0 ) S I k c ] E [ φ ( S j , η ˆ I k c ) φ ( S j , η 0 ) S I k c ] .

Let $i \in [N]$. The nuisance parameter estimator $\hat{\eta}_{I_k^c}$ belongs to $\mathcal{T}$ with $P$-probability at least $1 - \Delta_N$ by Assumption 4.4. Therefore, with $P$-probability at least $1 - \Delta_N = 1 - o(1)$, we have

E [ ( φ ( S i , η ˆ I k c ) φ ( S i , η 0 ) ) 2 S I k c ] sup η T g 1 0 ( D i ) + g 1 ( D i ) + g 0 0 ( D i ) g 0 ( D i ) + W i h 0 ( U i ) ε Y i W i h ( U i ) ( g 1 0 ( D i ) g 1 ( D i ) + ε Y i ) 1 W i 1 h 0 ( U i ) ε Y i + 1 W i 1 h ( U i ) ( g 0 0 ( D i ) g 0 ( D i ) + ε Y i ) P , 2 sup η T g 1 0 ( D i ) g 1 ( D i ) P , 2 + sup η T g 0 0 ( D i ) g 0 ( D i ) P , 2 + sup η T h ( U i ) h 0 ( U i ) h 0 ( U i ) h ( U i ) W i ε Y i P , 2 + sup η T W i h ( U i ) ( g 1 0 ( D i ) g 1 ( D i ) ) P , 2 + sup η T h 0 ( U i ) h ( U i ) ( 1 h 0 ( U i ) ) ( 1 h ( U i ) ) ( 1 W i ) ε Y i P , 2 + sup η T 1 W i 1 h ( U i ) ( g 0 0 ( D i ) g 0 ( D i ) ) P , 2 .

Assumptions 3.1, 3.3, 3.4, and 4.2 bound the terms $\|W_i \varepsilon_{Y_i}/(h^0(U_i) h(U_i))\|_{P,\infty}$, $\|W_i/h(U_i)\|_{P,\infty}$, $\|(1 - W_i)\varepsilon_{Y_i}/((1 - h^0(U_i))(1 - h(U_i)))\|_{P,\infty}$, and $\|(1 - W_i)/(1 - h(U_i))\|_{P,\infty}$. Assumption 4.3 specifies that the error terms $\|h^0(U_i) - h(U_i)\|_{P,2}$, $\|g_1^0(D_i) - g_1(D_i)\|_{P,2}$, and $\|g_0^0(D_i) - g_0(D_i)\|_{P,2}$ are upper bounded by $\delta_N N^{-\kappa}$. Due to Hölder's inequality, we infer

(A6) $\mathbb{E}\bigl[ (\varphi(S_i, \hat{\eta}_{I_k^c}) - \varphi(S_i, \eta^0))^2 \mid S_{I_k^c} \bigr] \lesssim \delta_N N^{-\kappa},$

with $P$-probability at least $1 - \Delta_N$.

Subsequently, we bound the summands in (A5). Due to (A6), we have

1 n i I k E [ ( φ ( S i , η ˆ I k c ) φ ( S i , η 0 ) ) 2 S I k c ] 1 n i I k E [ φ ( S i , η ˆ I k c ) φ ( S i , η 0 ) S I k c ] 2 δ N N 2 κ

with P -probability at least 1 Δ N . Observe that we have

1 n i , j I k , i j E [ ( φ ( S i , η ˆ I k c ) φ ( S i , η 0 ) ) ( φ ( S j , η ˆ I k c ) φ ( S j , η 0 ) ) S I k c ] 1 n i , j I k , i j E [ φ ( S i , η ˆ I k c ) φ ( S i , η 0 ) S I k c ] E [ φ ( S j , η ˆ I k c ) φ ( S j , η 0 ) S I k c ] = 1 n i , j I k , i j Cov ( φ ( S i , η ˆ I k c ) φ ( S i , η 0 ) , φ ( S j , η ˆ I k c ) φ ( S j , η 0 ) S I k c ) = 1 n i , j I k , { i , j } E D Cov ( φ ( S i , η ˆ I k c ) φ ( S i , η 0 ) , φ ( S j , η ˆ I k c ) φ ( S j , η 0 ) S I k c ) ,

where E D denotes the edge set of the dependency graph, because the S i with i I k are independent of data in S I k c and because, given S I k c , φ ( S i η ˆ I k c ) φ ( S i , η 0 ) and φ ( S j , η ˆ I k c ) φ ( S j , η 0 ) are uncorrelated if there is no edge between i and j in the dependency graph. In the dependency graph, each node has a maximal degree of d max . Thus, there are at most 1 2 N d max many edges in E D . With P -probability at least 1 Δ N , the term

Cov ( φ ( S i , η ˆ I k c ) φ ( S i , η 0 ) , φ ( S j , η ˆ I k c ) φ ( S j , η 0 ) S I k c )

can be bounded by δ N N 2 κ up to constants for all i and j due to (A6). Therefore, with P -probability at least 1 Δ N , we have

1 n i , j I k , i j E [ ( φ ( S i , η ˆ I k c ) φ ( S i , η 0 ) ) ( φ ( S j , η ˆ I k c ) φ ( S j , η 0 ) ) S I k c ] 1 n i , j I k , i j E [ φ ( S i , η ˆ I k c ) φ ( S i , η 0 ) S I k c ] E [ φ ( S j , η ˆ I k c ) φ ( S j , η 0 ) S I k c ] δ N d max N 2 κ δ N ,

where the last bound holds due to Assumption 4.3. Consequently, we have

E 1 n i I k ( φ ( S i , η ˆ I k c ) E [ φ ( S i , η ˆ I k c ) S I k c ] ) 1 n i I k ( φ ( S i , η 0 ) E [ φ ( S i , η 0 ) ] ) 2 S I k c δ N ,

with P -probability at least 1 Δ N , and we infer the statement of the lemma based on Chernozhukov et al. [20, Lemma 6.1].□

Lemma E.5

(Taylor expansion) Assume the assumptions of Theorem 2.2 hold. Let $k \in [K]$. We have

$\frac{1}{\sqrt{n}} \sum_{i \in I_k} \bigl( \mathbb{E}[\varphi(S_i, \hat{\eta}_{I_k^c}) \mid S_{I_k^c}] - \mathbb{E}[\varphi(S_i, \eta^0)] \bigr) = o_P(1).$

Proof of Lemma E.5

Let $k \in [K]$. For $r \in [0, 1]$, let us define the function

$f_k(r) = \frac{1}{n} \sum_{i \in I_k} \bigl( \mathbb{E}[\varphi(S_i, \eta^0 + r(\hat{\eta}_{I_k^c} - \eta^0)) \mid S_{I_k^c}] - \mathbb{E}[\varphi(S_i, \eta^0)] \bigr).$

We have

$\mathbb{E}\Bigl[ \Bigl| \frac{1}{\sqrt{n}} \sum_{i \in I_k} \bigl( \mathbb{E}[\varphi(S_i, \hat{\eta}_{I_k^c}) \mid S_{I_k^c}] - \mathbb{E}[\varphi(S_i, \eta^0)] \bigr) \Bigr| \,\Big|\, S_{I_k^c} \Bigr] = \Bigl| \frac{1}{\sqrt{n}} \sum_{i \in I_k} \bigl( \mathbb{E}[\varphi(S_i, \hat{\eta}_{I_k^c}) \mid S_{I_k^c}] - \mathbb{E}[\varphi(S_i, \eta^0)] \bigr) \Bigr| = \sqrt{n}\, |f_k(1)|.$

We apply a Taylor expansion to $f_k(1)$ at $0$ and obtain

$f_k(1) = f_k(0) + f_k'(0) + \tfrac{1}{2} f_k''(\tilde{r}),$

for some $\tilde{r} \in (0, 1)$. Thus, we have

$\mathbb{E}\Bigl[ \Bigl| \frac{1}{\sqrt{n}} \sum_{i \in I_k} \bigl( \mathbb{E}[\varphi(S_i, \hat{\eta}_{I_k^c}) \mid S_{I_k^c}] - \mathbb{E}[\varphi(S_i, \eta^0)] \bigr) \Bigr| \,\Big|\, S_{I_k^c} \Bigr] \le \sqrt{n}\, \Bigl( |f_k(0)| + |f_k'(0)| + \sup_{r \in (0,1)} \tfrac{1}{2} |f_k''(r)| \Bigr).$

Due to the definition of $f_k$, we have $f_k(0) = 0$. Due to the Neyman orthogonality that we established in Lemma E.1, we have $f_k'(0) = 0$. Due to the product property that we established in Lemma E.2, we have $\sup_{r \in (0,1)} \tfrac{1}{2} |f_k''(r)| \lesssim \delta_N N^{-1/2}$ with $P$-probability at least $1 - \Delta_N$ because $\hat{\eta}_{I_k^c}$ belongs to $\mathcal{T}$ with $P$-probability at least $1 - \Delta_N$. Consequently, we have

$\mathbb{E}\Bigl[ \Bigl| \frac{1}{\sqrt{n}} \sum_{i \in I_k} \bigl( \mathbb{E}[\varphi(S_i, \hat{\eta}_{I_k^c}) \mid S_{I_k^c}] - \mathbb{E}[\varphi(S_i, \eta^0)] \bigr) \Bigr| \,\Big|\, S_{I_k^c} \Bigr] \lesssim \delta_N,$

with $P$-probability at least $1 - \Delta_N$. We infer the statement of the lemma based on Chernozhukov et al. [20, Lemma 6.1].□

Proof of Theorem 2.2

We have

$\sqrt{N}(\hat{\theta} - \theta_N^0) = \sqrt{N}\, \frac{1}{nK} \sum_{k=1}^K \sum_{i \in I_k} \psi(S_i, \theta_i^0, \hat{\eta}_{I_k^c}) = \frac{1}{\sqrt{K}} \sum_{k=1}^K \frac{1}{\sqrt{n}} \sum_{i \in I_k} \bigl( \psi(S_i, \theta_i^0, \hat{\eta}_{I_k^c}) - \psi(S_i, \theta_i^0, \eta^0) \bigr) + \frac{1}{\sqrt{N}} \sum_{i=1}^N \psi(S_i, \theta_i^0, \eta^0)$

because the disjoint sets $I_k$ are of equal size $n$, so that we have $N = nK$. Let $k \in [K]$. We have

$\Bigl| \frac{1}{\sqrt{n}} \sum_{i \in I_k} \bigl( \psi(S_i, \theta_i^0, \hat{\eta}_{I_k^c}) - \psi(S_i, \theta_i^0, \eta^0) \bigr) \Bigr| \le \Bigl| \frac{1}{\sqrt{n}} \sum_{i \in I_k} \bigl( \psi(S_i, \theta_i^0, \hat{\eta}_{I_k^c}) - \mathbb{E}[\psi(S_i, \theta_i^0, \hat{\eta}_{I_k^c}) \mid S_{I_k^c}] \bigr) - \frac{1}{\sqrt{n}} \sum_{i \in I_k} \bigl( \psi(S_i, \theta_i^0, \eta^0) - \mathbb{E}[\psi(S_i, \theta_i^0, \eta^0)] \bigr) \Bigr| + \Bigl| \frac{1}{\sqrt{n}} \sum_{i \in I_k} \bigl( \mathbb{E}[\psi(S_i, \theta_i^0, \hat{\eta}_{I_k^c}) \mid S_{I_k^c}] - \mathbb{E}[\psi(S_i, \theta_i^0, \eta^0)] \bigr) \Bigr| = o_P(1),$

due to Hölder’s inequality and Lemmas E.4 and E.5. Because K is a constant independent of N , we have

$\frac{1}{\sqrt{K}} \sum_{k=1}^K \frac{1}{\sqrt{n}} \sum_{i \in I_k} \bigl( \psi(S_i, \theta_i^0, \hat{\eta}_{I_k^c}) - \psi(S_i, \theta_i^0, \eta^0) \bigr) = o_P(1).$

Due to Lemma E.3, we have $\frac{1}{\sqrt{N}\, \sigma_N} \sum_{i=1}^N \psi(S_i, \theta_i^0, \eta^0) \stackrel{d}{\rightarrow} \mathcal{N}(0, 1)$ as $N \to \infty$. Due to Assumption A2, we therefore have

$\frac{1}{\sqrt{N}\, \sigma} \sum_{i=1}^N \psi(S_i, \theta_i^0, \eta^0) = \frac{1}{\sqrt{N}\, \sigma_N} \sum_{i=1}^N \psi(S_i, \theta_i^0, \eta^0) \cdot \frac{\sigma_N}{\sigma} \stackrel{d}{\rightarrow} \mathcal{N}(0, 1),$

as $N \to \infty$. Consequently, we have $\sqrt{N}\, \sigma^{-1}(\hat{\theta} - \theta_N^0) \stackrel{d}{\rightarrow} \mathcal{N}(0, 1)$ as claimed.□

F Bootstrap variance estimator

We use the following assumption to establish the consistency of the bootstrap variance estimator. It is a high-level assumption, and we will not verify it in terms of the model (1); yet, assuming some form of continuity (as below) seems to be essentially necessary for the bootstrap to be consistent.

Assumption A7

To make explicit the dependence of $\sigma^2$ in (9) on the law of the response error terms $\varepsilon_Y$, the law of the covariates $C$, the nuisance functions $\eta^0$, and the network $G$, we introduce the functional

$\sigma^2(P_\varepsilon, P_C, \eta^0; G) = \lim_{N \to \infty} \operatorname{Var}\Bigl( \frac{1}{\sqrt{N}} \sum_{i=1}^N \psi(S_i, \theta_i^0, \eta^0) \Bigr),$

which can be represented as

$\sigma^2(P_\varepsilon, P_C, \eta^0; G) = \lim_{N \to \infty} \operatorname{Var}\Bigl( \frac{1}{\sqrt{N}} \sum_{i=1}^N \varphi(S_i, \eta^0) \Bigr),$

due to $\psi(S_i, \theta_i^0, \eta^0) = \varphi(S_i, \eta^0) - \theta_i^0$ and because the $\theta_i^0$'s are non-random.

We assume that $\sigma^2(P_\varepsilon, P_C, \eta^0; G)$ is continuous with respect to Mallows' distance $d_2(\cdot, \cdot)$ in the first and second arguments and with respect to $\|\cdot\|_{P,2}$ in the third argument.

Proof of Theorem 2.3

The bootstrap variance relies on the same dependency structure induced by the network as σ 2 and can be represented by

σ 2 ( P ˆ ε ˆ , P ˆ C , η ˆ ; G ) = lim N Var * 1 N i = 1 N ψ ( S i * , θ ˆ i 0 , η ˆ ) ,

where the construction of S i * is described in Section 2.5. Similarly, we can rewrite this bootstrap variance as

σ 2 ( P ˆ ε ˆ , P ˆ C , η ˆ ; G ) = lim N Var * 1 N i = 1 N φ ( S i * , η ˆ ) .

Due to Assumption 3.1 and [76], we have $d_2(\hat{P}_C, P_C) \stackrel{P}{\rightarrow} 0$, where $d_2(\cdot, \cdot)$ denotes Mallows' distance. Furthermore, due to $\|\hat{\varepsilon}_Y - \varepsilon_Y\|_{P,2} \stackrel{P}{\rightarrow} 0$, we also have $d_2(\hat{P}_{\hat{\varepsilon}}, P_\varepsilon) \stackrel{P}{\rightarrow} 0$ (see [76] and [29, Lemma 5.4]). Due to $\|\hat{\eta}_{I_k^c} - \eta^0\|_{P,2} \stackrel{P}{\rightarrow} 0$ for $k \in [K]$, we obtain

lim N Var * 1 N i = 1 N φ ( S i * , η ˆ ) Var 1 N i = 1 N ψ ( S i , η 0 ) P 0 ,

which consequently establishes consistency of the bootstrap variance.□

G Consistent plugin variance estimator

An alternative to the bootstrap variance estimator can be constructed as described below. We do not recommend this estimator unless the sample size is large relative to the network connectivity, but its consistency can be derived under different and more explicit conditions than Assumption A7.

The challenge is that the unit-level effects θ i 0 for i [ N ] are not all equal. This is because the unit-level data points S i are typically not identically distributed. The difference in distributions originates from the X - and Z -features that generally depend on a varying number of other units. If two unit-level data points S i and S j have the same distribution, then their unit-level treatment effects θ i 0 and θ j 0 coincide. If enough of these unit-level treatment effects coincide, we can use the corresponding unit-level data to estimate them. Subsequently, we describe this procedure.

We partition $[N]$ into sets $A_d$ for $d \ge 0$ such that all unit-level data points $S_i$ for $i \in A_d$ have the same distribution. Provided that the sets $A_d$ are large enough, we can consistently estimate the corresponding $\theta_d^0$ for $d \ge 0$ by

(A7) $\hat{\theta}_d = \frac{1}{|A_d|} \sum_{i \in A_d} \varphi(S_i, \hat{\eta}_{I_{k(i)}^c}),$

where $k(i)$ denotes the index in $[K]$ such that $i \in I_{k(i)}$. The convergence rate of these estimators is at least $N^{-1/4}$ (see Lemma G.3 in Section G.1 in the appendix). To achieve this rate, we require that the sets $A_d$ contain at least of order $N^{3/4}$ many indices (see Assumption A5 in Section A in the appendix). The parametric convergence rate cannot be achieved in general because $A_d$ is of smaller size than $N$, but the corresponding units may have the maximal $d_{\max}$ many ties in the network.

Subsequently, we characterize a situation in which the index $d$ corresponds to the degree in the dependency graph $G_D$. This is the case if two unit-level data points $S_i$ and $S_j$ have the same distribution if and only if the units $i$ and $j$ have the same degree in $G_D$. We assume, given a unit $i$ and some $m \in [N] \setminus \{i\}$, that (1) if $C_m$ is part of $Z_i$, then $C_m$ is also part of $X_i$, and vice versa; and (2) if $W_m$ is part of $X_i$, then $C_m$ is part of $X_i$ and $Z_i$, and vice versa. Consequently, if two units $i \ne j$ have the same degree in the dependency graph, then their $X$- and their $Z$-features are computed using the same number of random variables. Hence, $X_i$ and $X_j$ as well as $Z_i$ and $Z_j$ are identically distributed, and therefore, $S_i$ and $S_j$ have the same distribution. Thus, the sets $A_d$ form a partition of the units according to their degree in the dependency graph, i.e., $A_d = \{i \in [N] : d(i) = d\}$ for $d \ge 0$, where $d(i)$ denotes the degree of $i$ in the dependency graph. There are $d_{\max} + 1 = o(N^{1/4})$ many such sets, and each of them is required to be of size at least of order $N^{3/4}$ in Lemma G.3. This is feasible because there are $N$ units in total. Provided that the machine learning estimators of the nuisance functions converge at a rate of at least $N^{-1/4}$, as specified by Assumption A6 in the appendix, we obtain the consistent estimator of the asymptotic variance given in Theorem G.1. Algorithm 1 summarizes the whole procedure of point estimation and inference for the EATE where the variance is estimated as given in Theorem G.1. Nevertheless, this estimation scheme can be extended to general sets $A_d$.

Theorem G.1

Denote by $G_D = (V, E_D)$ the dependency graph on $S_i$, $i \in [N]$. For a unit $i \in [N]$, denote by $d(i)$ its degree in $G_D$ and by $k(i)$ the number in $[K]$ such that $i \in I_{k(i)}$. In addition to the assumptions made in Theorem 2.2, also assume that Assumptions A5 and A6 stated in Section A in the appendix hold. Based on $\varphi$ defined in (4), we define the score function $\psi(S_i, \theta, \eta) = \varphi(S_i, \eta) - \theta$ for some general $\theta \in \mathbb{R}$ and the nuisance function triple $\eta = (g_1, g_0, h)$. Then,

$\frac{1}{N} \sum_{i=1}^N \psi^2(S_i, \hat{\theta}_{d(i)}, \hat{\eta}_{I_{k(i)}^c}) + \frac{2}{N} \sum_{\{i,j\} \in E_D} \psi(S_i, \hat{\theta}_{d(i)}, \hat{\eta}_{I_{k(i)}^c})\, \psi(S_j, \hat{\theta}_{d(j)}, \hat{\eta}_{I_{k(j)}^c})$

is a consistent estimator of the asymptotic variance σ 2 in Theorem 2.2.
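A minimal sketch of this plugin variance estimator follows (our own illustration; it assumes that the cross-fitted scores $\varphi(S_i, \hat{\eta}_{I_{k(i)}^c})$, the degree-specific estimates $\hat{\theta}_d$, and the edge list of the dependency graph have already been computed).

```python
import numpy as np


def plugin_variance(phi, theta_by_degree, degrees, dep_edges):
    """Plugin estimate of sigma^2 as in Theorem G.1 (sketch).

    phi:             (N,) cross-fitted scores phi(S_i, eta_hat)
    theta_by_degree: dict mapping a dependency-graph degree d to theta_hat_d
    degrees:         (N,) degree d(i) of each unit in the dependency graph
    dep_edges:       iterable of edges (i, j) of the dependency graph
    """
    N = len(phi)
    psi = phi - np.array([theta_by_degree[d] for d in degrees])  # centered scores
    sigma2_hat = np.mean(psi ** 2)                               # variance part
    for i, j in dep_edges:                                       # covariance part
        sigma2_hat += 2.0 / N * psi[i] * psi[j]
    return sigma2_hat
```

A two-sided 95% confidence interval for $\theta_N^0$ is then $\hat{\theta} \pm 1.96 \sqrt{\hat{\sigma}^2 / N}$, in line with the $\sqrt{N}$ asymptotics of Theorem 2.2.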

G.1 Proof of Theorem G.1

Lemma G.2

Assume the assumptions of Theorem G.1 hold. Let $i \in [N]$. There exists a finite real constant $C_7$ independent of $i$ such that $\|\psi(S_i, \theta_{d(i)}^0, \eta^0)\|_{P,4} \le C_7$ holds. Consequently, for $i, j, m, r \in [N]$, we can also bound the following terms by finite uniform constants:

  • $\|\psi(S_i, \theta_{d(i)}^0, \eta^0)\|_{P,2}$,

  • $\operatorname{Var}(\varphi(S_i, \eta^0))$,

  • $\operatorname{Var}(\psi^2(S_i, \theta_{d(i)}^0, \eta^0))$,

  • $\operatorname{Cov}(\varphi(S_i, \eta^0), \varphi(S_j, \eta^0))$,

  • $\operatorname{Var}(\psi(S_i, \theta_{d(i)}^0, \eta^0)\, \psi(S_j, \theta_{d(j)}^0, \eta^0))$,

  • $\operatorname{Cov}(\psi(S_i, \theta_{d(i)}^0, \eta^0)\, \psi(S_j, \theta_{d(j)}^0, \eta^0),\ \psi(S_m, \theta_{d(m)}^0, \eta^0)\, \psi(S_r, \theta_{d(r)}^0, \eta^0))$.

Moreover, we have $\varphi^2(S_i, \eta^0) = O_P(1)$. Furthermore, we have $\psi^2(S_i, \theta_{d(i)}^0, \hat{\eta}_{I_{k(i)}^c}) = O_P(1)$.

Proof of Lemma G.2

We have

(A8) ψ ( S i , θ d ( i ) 0 , η 0 ) P , 4 g 1 0 ( D i ) P , 4 + g 0 0 ( D i ) P , 4 + W i h 0 ( U i ) P , 4 Y i g 1 0 ( D i ) P , + 1 W i 1 h 0 ( U i ) P , 2 Y i g 0 0 ( D i ) P , + θ d ( i ) 0 .

All individual summands in the aforementioned decomposition are bounded by a finite real constant independent of i due to Assumption A3. Therefore, there exists a finite real constant C 7 independent of i such that ψ ( S i , θ i 0 , η 0 ) P , 4 C 7 holds.

The other terms in the statement of the present lemma are bounded as well by finite real constants independent of i , j , m , r [ N ] due to Hölder’s inequality.

Moreover, we have ψ 2 ( S i , η 0 ) = O P ( 1 ) because ψ 2 ( S i , η 0 ) P , 2 is bounded by a constant that is independent of i .

Furthermore, with P -probability at least 1 Δ N , we have

E [ ψ 2 ( S i , θ d ( i ) 0 , η ˆ I k ( i ) c ) S I k ( i ) c ] sup η T E [ ψ 2 ( S i , θ d ( i ) 0 , η ) ] = sup η T ψ ( S i , θ d ( i ) 0 , η ) P , 2 2 .

The term ψ ( S i , θ d ( i ) 0 , η ) P , 2 2 is bounded by a real constant that is independent of i and η because the derivation in (A8) also holds with η 0 replaced by η T due to Assumption A4.□

Lemma G.3

(Convergence rate of unit-level effect estimators) Assume the assumptions of Theorem G.1 hold. Let $d \ge 0$, and assume that all assumptions of Section A in the appendix hold. Then, we have $\hat{\theta}_d - \theta_d^0 = o_P(N^{-1/4})$, where $\hat{\theta}_d$ is as in (A7).

Proof of Lemma G.3

Let d 0 . Due to the definition of θ ˆ d given in (A7) and Lemma 2.1, we have

(A9) N 1 4 ( θ ˆ d θ d 0 ) = N 1 4 A d i A d ( φ ( S i , η ˆ I k ( i ) c ) E [ φ ( S i , η 0 ) ] ) = N 1 4 A d i A d ( φ ( S i , η ˆ I k ( i ) c ) φ ( S i , η 0 ) ) + N 1 4 A d i A d ( φ ( S i , η 0 ) E [ φ ( S i , η 0 ) ] ) .

Subsequently, we show that the two sets of summands in (A9) are of order o P ( 1 ) . We start with the first set of summands. Let i A d . With P -probability at least 1 Δ N , we have

E [ ( φ ( S i , η ˆ I k ( i ) c ) φ ( S i , η 0 ) ) 2 S I k c ] δ N N κ ,

due to equation (A6). Hence, we have φ ( S i , η ˆ I k ( i ) c ) φ ( S i , η 0 ) = O P ( δ N N κ ) based on Chernozhukov et al. [20, Lemma 6.1]. Consequently, we have

N 1 4 A d i A d φ ( S i , η ˆ I k ( i ) c ) φ ( S i , η 0 ) = O P ( δ N N 1 4 κ ) = o P ( 1 ) ,

because we have κ 1 4 by Assumption A6. Next, we show that the second set of summands in (A9) is of order o P ( 1 ) . Let ε > 0 . We have

P N 1 4 A d i A d ( φ ( S i , η 0 ) E [ φ ( S i , η 0 ) ] ) 2 > ε 2 N 1 2 ε 2 A d 2 i A d Var ( φ ( S i , η 0 ) ) + i , j A d , i j Cov ( φ ( S i , η 0 ) , φ ( S i , η 0 ) ) = N 1 2 ε 2 A d 2 ( A d + 2 E D A d 2 ) O ( 1 ) ,

because $\operatorname{Var}(\varphi(S_i, \eta^0))$ and $\operatorname{Cov}(\varphi(S_i, \eta^0), \varphi(S_j, \eta^0))$ are bounded by constants uniformly over $i$ and $j$ due to Lemma G.2, and because $\operatorname{Cov}(\varphi(S_i, \eta^0), \varphi(S_j, \eta^0))$ does not equal $0$ only if $\{i,j\} \in E_D \cap A_d^2$, where $E_D$ denotes the edge set of the dependency graph. There are $|A_d|$ many nodes in $A_d$, and each node has a maximal degree of $d_{\max}$. Thus, we have $|E_D \cap A_d^2| \le \frac{1}{2} |A_d| d_{\max}$. Due to $d_{\max} = o(N^{1/4})$ and $|A_d| = \Omega(N^{3/4})$, which hold according to Assumptions A1 and A5, we obtain

N 1 2 ε 2 A d 2 ( A d + 2 E A d 2 ) O ( 1 ) = o ( 1 ) .

Consequently, we also have

N 1 4 A d i A d ( φ ( S i , η 0 ) E [ φ ( S i , η 0 ) ] ) = o P ( 1 ) .

Lemma G.4

(Consistent variance estimator part I) Assume the assumptions of Theorem G.1 hold. We have

$\frac{1}{N} \sum_{i=1}^N \bigl( \psi^2(S_i, \hat{\theta}_{d(i)}, \hat{\eta}_{I_{k(i)}^c}) - \mathbb{E}[\psi^2(S_i, \theta_{d(i)}^0, \eta^0)] \bigr) = o_P(1).$

Proof of Lemma G.4

We have

(A10) 1 N i = 1 N ( ψ 2 ( S i , θ ˆ d ( i ) , η ˆ I k ( i ) c ) E [ ψ 2 ( S i , θ d ( i ) 0 , η 0 ) ] ) = 1 N i = 1 N ( ψ 2 ( S i , θ ˆ d ( i ) , η ˆ I k ( i ) c ) ψ 2 ( S i , θ ˆ d ( i ) , η 0 ) ) + 1 N i = 1 N ( ψ 2 ( S i , θ ˆ d ( i ) , η 0 ) ψ 2 ( S i , θ d ( i ) 0 , η 0 ) ) + 1 N i = 1 N ( ψ 2 ( S i , θ d ( i ) 0 , η 0 ) E [ ψ 2 ( S i , θ d ( i ) 0 , η 0 ) ] ) .

We bound the three sets of summands in (A10) individually. The first set of summands can be expressed as

1 N i = 1 N ( ψ 2 ( S i , θ ˆ d ( i ) , η ˆ I k ( i ) c ) ψ 2 ( S i , θ ˆ d ( i ) , η 0 ) ) = 1 N i = 1 N ( φ 2 ( S i , η ˆ I k ( i ) c ) φ 2 ( S i , η 0 ) ) 2 N i = 1 N θ ˆ d ( i ) ( φ ( S i , η ˆ I k ( i ) c ) φ ( S i , η 0 ) ) .

We have

(A11) 1 N i = 1 N ( φ 2 ( S i , η ˆ I k ( i ) c ) φ 2 ( S i , η 0 ) ) = o P ( 1 )

because the function R x x 2 R is continuous and due to equation (A6). Indeed, let ε > 0 . Because the function R x x 2 R is continuous, there exists δ > 0 such that if φ ( S i , η ˆ I k ( i ) c ) φ ( S i , η 0 ) < δ , then also φ 2 ( S i , η ˆ I k ( i ) c ) φ 2 ( S i , η 0 ) < ε . Consequently, we have

P ( φ 2 ( S i , η ˆ I k ( i ) c ) φ 2 ( S i , η 0 ) > ε S I k ( i ) c ) P ( φ ( S i , η ˆ I k ( i ) c ) φ ( S i , η 0 ) > δ S I k ( i ) c ) 1 δ sup η T φ ( S i , η ) φ ( S i , η 0 ) P , 1 ,

with P -probability at least 1 Δ N , and we infer (A11) due to (A6). The estimator θ ˆ d ( i ) is a consistent estimator of θ d ( i ) 0 due to Lemma G.3, and θ d ( i ) 0 is bounded independent of i due to Assumption 3.5. Moreover, we have φ ( S i , η ˆ I k ( i ) c ) φ ( S i , η 0 ) = o P ( 1 ) due to (A6) and Chernozhukov et al. [20, Lemma 6.1]. Consequently, we have

2 N i = 1 N θ ˆ d ( i ) ( φ ( S i , η ˆ I k ( i ) c ) φ ( S i , η 0 ) ) = o P ( 1 ) ,

due to Hölder’s inequality. Hence, the first set of summands in (A10) is of order o P ( 1 ) . The second set of summand in (A10) can be decomposed as

1 N i = 1 N ( ψ 2 ( S i , θ ˆ d ( i ) , η 0 ) ψ 2 ( S i , θ d ( i ) 0 , η 0 ) ) = 1 N i = 1 N ( θ ˆ d ( i ) 2 ( θ d ( i ) 0 ) 2 ) 2 N i = 1 N ( θ ˆ d ( i ) θ d ( i ) 0 ) φ ( S i , η 0 ) .

We have 1 N i = 1 N ( θ ˆ d ( i ) 2 ( θ d ( i ) 0 ) 2 ) = o P ( 1 ) due to Lemma G.3. Lemma G.2 bounds φ 2 ( S i , η 0 ) in probability. Due to Hölder’s inequality, we obtain

2 N i = 1 N ( θ ˆ d ( i ) θ d ( i ) 0 ) φ ( S i , η 0 ) = o P ( 1 ) .

Consequently, the second set of summands in (A10) is of order o P ( 1 ) . Finally, we bound the third set of summands in (A10). Let ε > 0 . We have

P 1 N i = 1 N ( ψ 2 ( S i , θ d ( i ) 0 , η 0 ) E [ ψ 2 ( S i , θ d ( i ) 0 , η 0 ) ] ) 2 > ε 2 1 ε 2 N 2 i = 1 N Var ( ψ 2 ( S i , θ d ( i ) 0 , η 0 ) ) + i , j [ N ] , { i , j } E D Cov ( ψ 2 ( S i , θ d ( i ) 0 , η 0 ) , ψ 2 ( S j , θ d ( j ) 0 , η 0 ) ) 1 ε 2 N 2 ( N O ( 1 ) + N d max O ( 1 ) ) = o ( 1 )

because Var ( ψ 2 ( S i , θ d ( i ) 0 , η 0 ) ) and Cov ( ψ 2 ( S i , θ d ( i ) 0 , η 0 ) , ψ 2 ( S j , θ d ( j ) 0 , η 0 ) ) are bounded uniformly over i and j by Lemma G.2, because Cov ( ψ 2 ( S i , θ d ( i ) 0 , η 0 ) , ψ 2 ( S j , θ d ( j ) 0 , η 0 ) ) does not vanish only if { i , j } E D , and because d max = o ( N 1 4 ) by Assumption A1. Consequently, also the third set of summands in (A10) is of order o P ( 1 ) , and we have established the statement of the present lemma.□

Lemma G.5

(Consistent variance estimator part II) Assume the assumptions of Theorem G.1 hold. Denote by E D the edge set of the dependency graph. We have

$\frac{1}{N} \sum_{i, j \in [N],\, \{i,j\} \in E_D} \bigl( \psi(S_i, \hat{\theta}_{d(i)}, \hat{\eta}_{I_{k(i)}^c})\, \psi(S_j, \hat{\theta}_{d(j)}, \hat{\eta}_{I_{k(j)}^c}) - \mathbb{E}[\psi(S_i, \theta_{d(i)}^0, \eta^0)\, \psi(S_j, \theta_{d(j)}^0, \eta^0)] \bigr) = o_P(1).$

Proof of Lemma G.5

We have the decomposition

(A12) 1 N i , j [ N ] , { i , j } E D ( ψ ( S i , θ ˆ d ( i ) , η ˆ I k ( i ) c ) ψ ( S j , θ ˆ d ( j ) , η ˆ I k ( j ) c ) E [ ψ ( S i , θ d ( i ) 0 , η 0 ) ψ ( S j , θ d ( j ) 0 , η 0 ) ] ) = 2 N { i , j } E D ( ψ ( S i , θ ˆ d ( i ) , η ˆ I k ( i ) c ) ψ ( S j , θ ˆ d ( j ) , η ˆ I k ( j ) c ) ψ ( S i , θ d ( i ) 0 , η ˆ I k ( i ) c ) ψ ( S j , θ d ( j ) 0 , η ˆ I k ( j ) c ) ) + 2 N { i , j } E D ( ψ ( S i , θ d ( i ) 0 , η ˆ I k ( i ) c ) ψ ( S j , θ d ( j ) 0 , η ˆ I k ( j ) c ) ψ ( S i , θ d ( i ) 0 , η 0 ) ψ ( S j , θ d ( j ) 0 , η 0 ) ) + 2 N { i , j } E D ( ψ ( S i , θ d ( i ) 0 , η 0 ) ψ ( S j , θ d ( j ) 0 , η 0 ) E [ ψ ( S i , θ d ( i ) 0 , η 0 ) ψ ( S j , θ d ( j ) 0 , η 0 ) ] ) .

Subsequently, we bound the three sets of summands in (A12) individually. We start by bounding the first set of summands. We have

1 N { i , j } E D ( ψ ( S i , θ ˆ d ( i ) , η ˆ I k ( i ) c ) ψ ( S j , θ ˆ d ( j ) , η ˆ I k ( j ) c ) ψ ( S i , θ d ( i ) 0 , η ˆ I k ( i ) c ) ψ ( S j , θ d ( j ) 0 , η ˆ I k ( j ) c ) ) = 2 N { i , j } E D ( θ d ( i ) 0 θ ˆ d ( i ) ) ψ ( S j , θ d ( j ) 0 , η ˆ I k ( j ) c ) + 1 N { i , j } E D ( θ d ( i ) 0 θ ˆ d ( i ) ) ( θ d ( j ) 0 θ ˆ d ( j ) ) .

We have

1 N { i , j } E D ( θ d ( i ) 0 θ ˆ d ( i ) ) ψ ( S j , θ d ( j ) 0 , η ˆ I k ( j ) c )

1 N { i , j } E D ( θ d ( i ) 0 θ ˆ d ( i ) ) 2 1 N { i , j } E D ψ ( S j , θ d ( j ) 0 , η ˆ I k ( j ) c ) = 1 N E D o P ( N 1 4 ) = d max o P ( N 1 4 ) = o P ( 1 )

due to Hölder’s inequality, Lemma G.3, Lemma G.2, and Assumption A1. Moreover, we have

1 N { i , j } E D ( θ d ( i ) 0 θ ˆ d ( i ) ) ( θ d ( j ) 0 θ ˆ d ( j ) ) = 1 N E D o P ( N 1 2 ) = o P ( 1 )

due to Hölder’s inequality, Lemma G.3, and Assumption A1. Consequently, the first set of summands in (A12) is of order o P ( 1 ) . We proceed to bound the second set of summands in (A12). Let { i , j } E D . Due to the construction of S I k ( i ) and S I k ( i ) c , we have S i = ( W i , C i , X i , Z i , Y i ) S I k ( i ) , and none of W i , C i , Y i , or the variables used to compute X i belong to S I k ( i ) c . Moreover, the variables W i , C i , Y i , and the variables used to compute X i also cannot belong to S I k ( j ) c as otherwise we would have S i S j , and consequently, { i , j } E D . Therefore, we have

E [ ψ ( S i , θ d ( i ) 0 , η ˆ I k ( i ) c ) ψ ( S j , θ d ( j ) 0 , η ˆ I k ( j ) c ) ψ ( S i , θ d ( i ) 0 , η 0 ) ψ ( S j , θ d ( j ) 0 , η 0 ) S I k ( i ) c , S I k ( j ) c ] sup η 1 , η 2 T E [ ψ ( S i , θ d ( i ) 0 , η 1 ) ψ ( S j , θ d ( j ) 0 , η 2 ) ψ ( S i , θ d ( i ) 0 , η 0 ) ψ ( S j , θ d ( j ) 0 , η 0 ) ] sup η 1 T φ ( S i , η 1 ) φ ( S i , η 0 ) P , 2 ψ ( S j , θ d ( j ) 0 , η 0 ) 2 + sup η 2 T ψ ( S i , θ d ( i ) 0 , η 0 ) P , 2 φ ( S j , η 2 ) φ ( S j , η 0 ) P , 2 + sup η 1 , η 2 T φ ( S i , η 1 ) φ ( S i , η 0 ) P , 2 φ ( S j , η 2 ) φ ( S j , η 0 ) P , 2 ,

with P -probability at least 1 Δ N due to Hölder’s inequality. Because all terms mentioned earlier are uniformly bounded due to Lemma G.2, we infer that the second set of summands in (A12) is of order o P ( 1 ) based on Chernozhukov et al. [20, Lemma 6.1]. Finally, we bound the third set of summands in (A12). Let ε > 0 . We have

(A13) P 1 N { i , j } E D ( ψ ( S i , θ d ( i ) 0 , η 0 ) ψ ( S j , θ d ( j ) 0 , η 0 ) E [ ψ ( S i , θ d ( i ) 0 , η 0 ) ψ ( S j , θ d ( j ) 0 , η 0 ) ] ) 2 > ε 2 1 ε 2 N 2 { i , j } E D Var ( ψ ( S i , θ d ( i ) 0 , η 0 ) ψ ( S i , θ d ( j ) 0 , η 0 ) ) + { i , j } , { m , r } E D , unequal Cov ( ψ ( S i , θ d ( i ) 0 , η 0 ) ψ ( S j , θ d ( j ) 0 , η 0 ) , ψ ( S m , θ d ( m ) 0 , η 0 ) ψ ( S r , θ d ( r ) 0 , η 0 ) ) .

Due to Lemma G.2, the variance and covariance terms in (A13) are uniformly bounded by constants. Furthermore, the covariance terms do only not equal 0 if S i depends on S m or S r , or if S j depends on S m or S r . In order to better describe these dependency relationships, we build a graph on the edge set of the dependency graph. We consider the graph G = ( V , E ) with V = E D and such that an edge { { i , j } , { m , r } } E if and only if at least one of { i , m } , { i , r } , { j , m } , { j , r } belongs to E D . Consequently, { { i , j } , { m , r } } E if and only if ( S i , S j ) ⊥̸ ( S m , S r ) , in which case the covariance term in (A13) corresponding to { i , j } and { m , r } does not vanish. Furthermore, we have E = 1 2 E D d max , where d max denotes the maximal degree of a node in G . We have d max 2 d max . Consequently, we have

P 1 N { i , j } E D ( ψ ( S i , θ d ( i ) 0 , η 0 ) ψ ( S j , θ d ( j ) 0 , η 0 ) E [ ψ ( S i , θ d ( i ) 0 , η 0 ) ψ ( S j , θ d ( j ) 0 , η 0 ) ] ) 2 > ε 2

1 ε 2 N 2 ( E D + E ) O ( 1 ) 1 ε 2 N 2 ( N d max + N d max 2 ) O ( 1 ) = 1 ε 2 N ( o ( N 1 4 ) + o ( N 1 2 ) ) O ( 1 ) = o ( 1 )

due to Assumption A1. Therefore, we have established the statement of the present lemma because we have verified that all three sets of summands in (A12) are of order o P ( 1 ) .□

Proof of Theorem G.1

The proof follows from Lemmas G.4 and G.5.□

H Extension to estimate global effects

So far, we focused on the EATE. We intervened on each individual unit and left the treatment selections of the other units as they were.

Subsequently, we consider another type of treatment effect where we assess the effect of a single intervention that intervenes on all subjects simultaneously. Instead of the EATE in (2), we now consider the GATE with respect to the binary vector $\pi \in \{0, 1\}^N$ of treatment selections

(A14) $\xi_N^0(\pi) = \frac{1}{N} \sum_{i=1}^N \mathbb{E}\bigl[ Y_i^{do(W = \pi)} - Y_i^{do(W = 1 - \pi)} \bigr],$

where $W = (W_1, \ldots, W_N)$ denotes the complete vector of treatment selections of all units. In practice, the most common choice is where all components of $\pi$ equal 1, i.e., the treatment effect comes from comparing the situation where all units are assigned to the treatment versus where no one gets the treatment.

We use the same definition for $S_i$, $i \in [N]$, as before and denote the dependency graph on $S_i$, $i \in [N]$, by $G_D = (V, E_D)$. Furthermore, we let $\alpha(i) = \{j \in [N] : \{i,j\} \in E_D\} \cup \{i\}$ for $i \in [N]$ denote the nodes that share an edge with $i$ in the dependency graph together with $i$ itself. For some real number $\xi \in \mathbb{R}$ and a nuisance function triple $\eta = (g_1, g_0, h)$, consider the score function

(A15) $\psi(S_i, \xi, \eta) = g_1(C_i, X_i) - g_0(C_i, X_i) + \Bigl( \prod_{j \in \alpha(i)} \frac{W_j}{h(C_j, Z_j)} \Bigr) \bigl( Y_i - g_1(C_i, X_i) \bigr) - \Bigl( \prod_{j \in \alpha(i)} \frac{1 - W_j}{1 - h(C_j, Z_j)} \Bigr) \bigl( Y_i - g_0(C_i, X_i) \bigr) - \xi.$

In contrast to the score that we used for the EATE, this score includes additional factors $\frac{W_j}{h(C_j, Z_j)}$ and $\frac{1 - W_j}{1 - h(C_j, Z_j)}$ for units $j$ that share an edge with $i$ in the dependency graph. With the GATE, when we globally intervene on all treatment selections at the same time, this also influences the $X_i$ that are present in $g_1$ and $g_0$. In the score (A15), the "correction terms" $\bigl( \prod_{j \in \alpha(i)} \frac{W_j}{h(C_j, Z_j)} \bigr) \bigl( Y_i - g_1(C_i, X_i) \bigr)$ and $\bigl( \prod_{j \in \alpha(i)} \frac{1 - W_j}{1 - h(C_j, Z_j)} \bigr) \bigl( Y_i - g_0(C_i, X_i) \bigr)$ are only active if $i$ and the units from which it receives spillover effects have the same observed treatment selection.

Let us denote by

$\xi_i^0 = \mathbb{E}\bigl[ Y_i^{do(W = \pi)} - Y_i^{do(W = 1 - \pi)} \bigr] = \mathbb{E}\bigl[ g_1^0(C_i, X_i^\pi) - g_0^0(C_i, X_i^{1 - \pi}) \bigr]$

the i th contribution in (A14). Here,

X i π = ( f x 1 ( π i , C i , A ) , , f x r ( π i , C i , A ) )

denotes the feature vector where W j is replaced by π j , and

X i 1 π = ( f x 1 ( 1 π i , C i , A ) , , f x r ( 1 π i , C i , A ) )

denotes the feature vector where W j is replaced by 1 π j . The features Z i π and Z i 1 π are defined analogously. Similar to Lemma 2.1, it can be shown that E [ ψ ( S i , ξ i 0 , η 0 ) ] = 0 holds, which lets us identify the global treatment effect ξ N 0 by

ξ N 0 = 1 N i = 1 N E [ φ ( S i , η 0 ) ] ,

where

φ ( S i , η ) = g 1 ( C i , X i ) g 0 ( C i , X i ) + j α ( i ) W j h ( C j , Z j ) ( Y i g 1 ( C i , X i ) ) j α ( i ) 1 W j 1 h ( C j , Z j ) ( Y i g 0 ( C i , X i ) ) .

To estimate $\xi_N^0$, we apply the same procedure as for the EATE. The only difference is that when we evaluate the machine learning estimates, we do not use the observed treatment selections but instead insert the respective components of $\pi$ and $1-\pi$. However, we insert the actually observed treatment selections in the product terms $\prod_{j\in\alpha(i)}W_j/h(C_j,Z_j)$ and $\prod_{j\in\alpha(i)}(1-W_j)/(1-h(C_j,Z_j))$. This gives the estimator $\hat{\xi}$. Analogously to Theorem 2.2 for the EATE, the estimator of the GATE with respect to $\pi$ converges at the parametric rate and is asymptotically Gaussian.
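
The plug-in step can be sketched as follows. This is only an illustration under assumptions: `features` is a hypothetical helper that rebuilds the feature vector of a unit from an arbitrary treatment vector, the nuisance estimates g1_hat, g0_hat, h_hat are taken as given callables, cross-fitting is omitted, and we use the reading that $\pi$ and $1-\pi$ enter wherever the outcome regressions are evaluated.

```python
import numpy as np

def gate_estimate(pi, W, Y, C, Z, A, alpha, features, g1_hat, g0_hat, h_hat):
    """Illustrative plug-in GATE estimator: pi and 1 - pi enter the machine
    learning estimates through the features, whereas the observed treatments W
    enter the product weights over alpha(i)."""
    pi = np.asarray(pi)
    N = len(Y)
    phi = np.empty(N)
    for i in range(N):
        X_pi = features(pi, C, A, i)        # components of pi inserted
        X_1mpi = features(1 - pi, C, A, i)  # components of 1 - pi inserted
        w1 = np.prod([W[j] / h_hat(C[j], Z[j]) for j in alpha(i)])
        w0 = np.prod([(1.0 - W[j]) / (1.0 - h_hat(C[j], Z[j])) for j in alpha(i)])
        phi[i] = (g1_hat(C[i], X_pi) - g0_hat(C[i], X_1mpi)
                  + w1 * (Y[i] - g1_hat(C[i], X_pi))
                  - w0 * (Y[i] - g0_hat(C[i], X_1mpi)))
    return phi.mean()  # estimate of xi_N^0
```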

Theorem H.1

(Asymptotic distribution of $\hat{\xi}$) Assume that Assumptions A3 (with $\theta$ replaced by $\xi$), 1, and A4 in Section A in the appendix hold. Furthermore, assume that there exists a finite real constant $L$ such that $|\alpha(i)|\le L$ holds for all $i\in[N]$.

Then, the estimator $\hat{\xi}$ of the GATE with respect to $\pi\in\{0,1\}^N$, $\xi_N^0$, satisfies

$$\sqrt{N}\,(\hat{\xi}-\xi_N^0)\stackrel{d}{\longrightarrow}\mathcal{N}(0,\sigma^2),$$

where $\sigma$ is characterized in Assumption A2 with the $\psi$ in (A15). The convergence in (H.1) in fact holds uniformly over the law $P$ of the observations.

This theorem requires that the number of spillover effects a unit receives is bounded. Theorem 2.2, which establishes the parametric convergence rate and asymptotic Gaussian distribution of the EATE estimator, did not require such an assumption. The reason is that $h^0(C_i,Z_i)$ represents the conditional expectation of $W_i$ given $C_i$ and $Z_i$ and is consequently a probability taking values in the interval $(0,1)$. If we allowed $|\alpha(i)|$ to grow with $N$, the products $\prod_{j\in\alpha(i)}W_j/h(C_j,Z_j)$ and $\prod_{j\in\alpha(i)}(1-W_j)/(1-h(C_j,Z_j))$ would diverge.
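
As a small numerical illustration of this point (not from the paper): if the propensity were $1/2$ for every unit in $\alpha(i)$ and all of these units were treated, the first product would equal $2^{|\alpha(i)|}$, which grows exponentially in $|\alpha(i)|$.

```python
# Toy calculation: inverse-propensity product for propensity 1/2 when all
# units in alpha(i) are treated; it grows exponentially in |alpha(i)|.
for size in (1, 5, 10, 20):
    print(size, (1.0 / 0.5) ** size)  # 2.0, 32.0, 1024.0, 1048576.0
```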

To estimate $\sigma^2$ in Theorem H.1, we can apply the procedure described in Section G, where we replace $\psi$, $\varphi$, and the point estimators by the respective new quantities. An analog of Theorem G.1 also holds, where we now assume that the setting of Theorem H.1 holds and that $A_d\to\infty$ as $N\to\infty$ for all $d\ge 0$. In particular, we do not require Assumptions A5 and A6 formulated in Section A in the appendix. Furthermore, to prove consistency of the variance estimator, it is sufficient to establish that the degree-specific causal effect estimators $\hat{\xi}_d$, which are defined analogously to $\hat{\theta}_d$, are consistent; they are not required to converge at any particular rate.

Van der Laan [14], Sofrygin and van der Laan [10], and Ogburn et al. [6] also consider semiparametric estimation of the GATE using TMLE. They likewise require a uniform bound on the number of spillover effects a unit receives to achieve the parametric convergence rate of their estimators.

References

[1] Rubin D. Comment on: Randomization analysis of experimental data in the Fisher randomization test by D. Basu. J Amer Stat Assoc. 1980;75:591–3. doi: 10.2307/2287653.

[2] Perez-Heydrich C, Hudgens MG, Halloran E, Clemens JD, Ali M, Emch ME. Assessing effects of cholera vaccination in the presence of interference. Biometrics. 2014;70(3):731–41. doi: 10.1111/biom.12184.

[3] Sävje F, Aronow PM, Hudgens MG. Average treatment effects in the presence of unknown interference. Ann Stat. 2021;49(2):673–701. doi: 10.1214/20-AOS1973.

[4] Lee Y, Ogburn EL. Network dependence can lead to spurious associations and invalid inference. J Amer Stat Assoc. 2021;116(535):1060–74. doi: 10.1080/01621459.2020.1782219.

[5] Sobel ME. What do randomized studies of housing mobility demonstrate? J Amer Stat Assoc. 2006;101(476):1398–407. doi: 10.1198/016214506000000636.

[6] Ogburn EL, Sofrygin O, Díaz I, van der Laan MJ. Causal inference for social network data. J Amer Stat Assoc. 2022;0(0):1–15.

[7] Eckles D, Bakshy E. Bias and high-dimensional adjustment in observational studies of peer effects. J Amer Stat Assoc. 2021;116(534):507–17. doi: 10.1080/01621459.2020.1796393.

[8] Ogburn EL, VanderWeele TJ. Vaccines, contagion, and social networks. Ann Appl Stat. 2017;11(2):919–48. doi: 10.1214/17-AOAS1023.

[9] Hudgens MG, Halloran E. Toward causal inference with interference. J Amer Stat Assoc. 2008;103(482):832–42. doi: 10.1198/016214508000000292.

[10] Sofrygin O, van der Laan MJ. Semi-parametric estimation and inference for the mean outcome of the single time-point intervention in a causally connected population. J Causal Inference. 2017;5(1):1–35. doi: 10.1515/jci-2016-0003.

[11] Pearl J. Causal diagrams for empirical research. Biometrika. 1995;82(4):669–88. doi: 10.1093/biomet/82.4.669.

[12] Splawa-Neyman J, Dabrowska DM, Speed TP. On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Stat Sci. 1990;5(4):465–72. doi: 10.1214/ss/1177012031.

[13] Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educat Psychol. 1974;66(5):688–701. doi: 10.1037/h0037350.

[14] van der Laan M. Causal inference for a population of causally connected units. J Causal Inference. 2014;2(1):13–74. doi: 10.1515/jci-2013-0002.

[15] Manski CF. Identification of endogenous social effects: the reflection problem. Rev Econ Stud. 1993;60(3):531–42. doi: 10.2307/2298123.

[16] Chin A. Regression adjustments for estimating the global treatment effect in experiments with interference. J Causal Inference. 2019;7(2):20180026. doi: 10.1515/jci-2018-0026.

[17] Cai J, De Janvry A, Sadoulet E. Social networks and the decision to insure. Amer Econ J Appl Econ. 2015;7(2):81–108. doi: 10.1257/app.20130442.

[18] Leung M. Treatment and spillover effects under network interference. Rev Econ Stat. 2020;102(2):368–80. doi: 10.1162/rest_a_00818.

[19] Robins JM, Rotnitzky A, Zhao LP. Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. J Amer Stat Assoc. 1995;90(429):106–21. doi: 10.1080/01621459.1995.10476493.

[20] Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey W, et al. Double/debiased machine learning for treatment and structural parameters. Econom J. 2018;21(1):C1–68. doi: 10.1111/ectj.12097.

[21] Tchetgen Tchetgen EJ, Fulcher IR, Shpitser I. Auto-G-computation of causal effects on a network. J Amer Stat Assoc. 2021;116(534):833–44. doi: 10.1080/01621459.2020.1811098.

[22] Robins J. A new approach to causal inference in mortality studies with a sustained exposure period-application to control of the healthy worker survivor effect. Math Model. 1986;7(9):1393–512. doi: 10.1016/0270-0255(86)90088-6.

[23] Lauritzen SL, Richardson TS. Chain graph models and their causal interpretations. J R Stat Soc Ser B (Stat Methodol). 2002;64(3):321–48. doi: 10.1111/1467-9868.00340.

[24] van der Laan MJ, Rubin D. Targeted maximum likelihood learning. Int J Biostat. 2006;2(1). doi: 10.2202/1557-4679.1043.

[25] van der Laan MJ, Rose S. Targeted learning. Springer Series in Statistics. New York: Springer; 2011. doi: 10.1007/978-1-4419-9782-1.

[26] van der Laan MJ, Rose S. Targeted learning in data science. Springer Series in Statistics. New York: Springer; 2018. doi: 10.1007/978-3-319-65304-4.

[27] Stadtfeld C, Vörös A, Elmer T, Boda Z, Raabe IJ. Integration in emerging social networks explains academic failure and success. Proc Nat Acad Sci. 2019;116(3):792–7. doi: 10.1073/pnas.1811388115.

[28] Vörös A, Boda Z, Elmer T, Hoffman M, Mepham K, Raabe IJ, et al. The Swiss StudentLife Study: Investigating the emergence of an undergraduate community through dynamic, multidimensional social network data. Soc Netw. 2021;65:71–84. doi: 10.1016/j.socnet.2020.11.006.

[29] Bühlmann P. Sieve bootstrap for time series. Bernoulli. 1997;3(2):123–48. doi: 10.2307/3318584.

[30] Scharfstein DO, Rotnitzky A, Robins JM. Adjusting for nonignorable drop-out using semiparametric nonresponse models. J Amer Stat Assoc. 1999;94(448):1096–120. doi: 10.1080/01621459.1999.10473862.

[31] Bang H, Robins JM. Doubly robust estimation in missing data and causal inference models. Biometrics. 2005;61(4):962–72. doi: 10.1111/j.1541-0420.2005.00377.x.

[32] Robins JM, Rotnitzky A. Semiparametric efficiency in multivariate regression models with missing data. J Amer Stat Assoc. 1995;90(429):122–9. doi: 10.1080/01621459.1995.10476494.

[33] Lauritzen SL. Graphical models. Oxford Statistical Science Series. Oxford: Clarendon Press; 1996. doi: 10.1093/oso/9780198522195.001.0001.

[34] Pearl J. Graphs, causality, and structural equation models. Sociol Meth Res. 1998;27(2):226–84. doi: 10.1177/0049124198027002004.

[35] Pearl J. Causality: Models, reasoning, and inference. 2nd ed. Cambridge: Cambridge University Press; 2009. doi: 10.1017/CBO9780511803161.

[36] Pearl J. An introduction to causal inference. Int J Biostat. 2010;6(2):7. doi: 10.2202/1557-4679.1203.

[37] Perković E, Textor J, Kalisch M, Maathuis MH. Complete graphical characterization and construction of adjustment sets in Markov equivalence classes of ancestral graphs. J Machine Learn Res. 2018;18(220):1–62.

[38] Meinshausen N, Meier L, Bühlmann P. p-Values for high-dimensional regression. J Amer Stat Assoc. 2009;104(488):1671–81. doi: 10.1198/jasa.2009.tm08647.

[39] Bickel PJ, Ritov Y, Tsybakov AB. Simultaneous analysis of lasso and Dantzig selector. Ann Stat. 2009;37(4):1705–32. doi: 10.1214/08-AOS620.

[40] Bühlmann P, van de Geer S. Statistics for high-dimensional data: methods, theory and applications. Springer Series in Statistics. Heidelberg: Springer; 2011. doi: 10.1007/978-3-642-20192-9.

[41] Belloni A, Chernozhukov V, Wang L. Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika. 2011;98(4):791–806. doi: 10.1093/biomet/asr043.

[42] Belloni A, Chernozhukov V. ℓ1-penalized quantile regression in high-dimensional sparse models. Ann Stat. 2011;39(1):82–130. doi: 10.1214/10-AOS827.

[43] Belloni A, Chen D, Chernozhukov V, Hansen C. Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica. 2012;80(6):2369–429. doi: 10.3982/ECTA9626.

[44] Belloni A, Chernozhukov V. Least squares after model selection in high-dimensional sparse models. Bernoulli. 2013;19(2):521–47. doi: 10.3150/11-BEJ410.

[45] Kozbur D. Analysis of testing-based forward model selection. Econometrica. 2020;88(5):2147–73. doi: 10.3982/ECTA16273.

[46] Luo Y, Spindler M. High-dimensional L2 boosting: rate of convergence; 2016. Preprint arXiv:1602.08927.

[47] Wager S, Walther G. Adaptive concentration of regression trees, with application to random forests; 2016. Preprint arXiv:1503.06388.

[48] Chen X, White H. Improved rates and asymptotic normality for nonparametric neural network estimators. IEEE Trans Inform Theory. 1999;45:682–91. doi: 10.1109/18.749011.

[49] Stein C. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In: Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2: Probability Theory. vol. 6. University of California Press; 1972. p. 583–603. doi: 10.1525/9780520423671-036.

[50] Smucler E, Rotnitzky A, Robins JM. A unifying approach for doubly-robust ℓ1 regularized estimation of causal contrasts; 2019. Preprint arXiv:1904.03737.

[51] Hájek J. Comment on "An essay on the logical foundations of survey sampling, part one" by Basu. In: Godambe VP, Sprott DA, editors. Foundations of Statistical Inference. Toronto: Holt, Rinehart and Winston; 1971. p. 236.

[52] Li S, Wager S. Random graph asymptotics for treatment effect estimation under network interference. Ann Stat. 2022;50(4):2334–58. doi: 10.1214/22-AOS2191.

[53] Rosenbaum PR. Model-based direct adjustment. J Amer Stat Assoc. 1987;82(398):387–94. doi: 10.1080/01621459.1987.10478441.

[54] Hirano K, Imbens G, Ridder G. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica. 2003;71(4):1161–89. doi: 10.1111/1468-0262.00442.

[55] Csardi G, Nepusz T. The igraph software package for complex network research. InterJournal. 2006;Complex Systems:1695. https://igraph.org.

[56] Erdős P, Rényi A. On random graphs I. Publicationes Mathematicae. 1959;6:290–7. doi: 10.5486/PMD.1959.6.3-4.12.

[57] Watts DJ, Strogatz S. Collective dynamics of 'small-world' networks. Nature. 1998;393:440–2. doi: 10.1038/30918.

[58] Wright MN, Ziegler A. A fast implementation of random forests for high dimensional data in C++ and R. J Stat Softw. 2017;77(1):1–17. doi: 10.18637/jss.v077.i01.

[59] Chamorro-Premuzic T, Furnham A. Personality, intelligence and approaches to learning as predictors of academic performance. Personality Individual Differ. 2008;44(7):1596–603. doi: 10.1016/j.paid.2008.01.003.

[60] Los R, Schweinle A. The interaction between student motivation and the instructional environment on academic outcome: a hierarchical linear model. Soc Psychol Educat. 2019;22(2):471–500. doi: 10.1007/s11218-019-09487-5.

[61] Heckman JJ. Skill formation and the economics of investing in disadvantaged children. Science. 2006;312(5782):1900–2. doi: 10.1126/science.1128898.

[62] Spinath B, Stiensmeier-Pelster J, Schoene C, Dickhäuser O. Skalen zur Erfassung der Lern- und Leistungsmotivation: SELLMO. Bern: Hogrefe Verlag; 2002.

[63] Cohen S, Williamson G. Perceived stress in a probability sample of the United States. In: Spacapam S, Oskamp S, editors. The Social Psychology of Health: Claremont Symposium on Applied Social Psychology. Newbury Park, CA: Sage; 1988.

[64] Lattimore T, Szepesvári C. Bandit algorithms. Cambridge: Cambridge University Press; 2020. doi: 10.1017/9781108571401.

[65] Ugander J, Karrer B, Backstrom L, Kleinberg J. Graph cluster randomization: network exposure to multiple universes; 2013. Preprint arXiv:1305.6979. doi: 10.1145/2487575.2487695.

[66] Eckles D, Karrer B, Ugander J. Design and analysis of experiments in networks: reducing bias from interference. J Causal Inference. 2017;5(1):20150021. doi: 10.1515/jci-2015-0021.

[67] Robins G, Pattison P, Elliott P. Network models for social influence processes. Psychometrika. 2001;66(2):161–89. doi: 10.1007/BF02294834.

[68] Daraganova G, Robins G. Autologistic actor attribute models. In: Lusher D, Koskinen J, Robins G, editors. Structural Analysis in the Social Sciences. Cambridge: Cambridge University Press; 2012. p. 102–14. doi: 10.1017/CBO9780511894701.011.

[69] Snijders TAB. Models for longitudinal network data. In: Carrington PJ, Scott J, Wasserman S, editors. Structural Analysis in the Social Sciences. Cambridge: Cambridge University Press; 2005. p. 215–47. doi: 10.1017/CBO9780511811395.011.

[70] Snijders TAB, van de Bunt GG, Steglich CEG. Introduction to stochastic actor-based models for network dynamics. Soc Netw. 2010;32(1):44–60. doi: 10.1016/j.socnet.2009.02.004.

[71] Steglich C, Snijders TAB, Pearson M. Dynamic networks and behavior: separating selection from influence. Sociol Methodol. 2010;40(1):329–93. doi: 10.1111/j.1467-9531.2010.01225.x.

[72] Peters J, Janzing D, Schölkopf B. Elements of causal inference: Foundations and learning algorithms. Adaptive Comput Machine Learn. Cambridge, MA: The MIT Press; 2017.

[73] Maathuis M, Drton M, Lauritzen S, Wainwright M, editors. Handbook of graphical models. Handbooks of Modern Statistical Methods. Boca Raton, FL: Chapman & Hall/CRC; 2019. doi: 10.1201/9780429463976.

[74] Chin A. Central limit theorems via Stein's method for randomized experiments under interference; 2018. Preprint arXiv:1804.03105.

[75] Ross N. Fundamentals of Stein's method. Probability Surveys. 2011;8:210–93. doi: 10.1214/11-PS182.

[76] Bickel PJ, Freedman DA. Some asymptotic theory for the bootstrap. Ann Stat. 1981;9(6):1196–217. doi: 10.1214/aos/1176345637.

Received: 2023-12-17
Revised: 2024-12-09
Accepted: 2025-01-04
Published Online: 2025-04-03

© 2025 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
