
The functional average treatment effect

  • Shane Sparkes, Erika Garcia and Lu Zhang
Published/Copyright: November 9, 2024

Abstract

This article establishes the functional average as an important estimand for causal inference. The significance of the estimand lies in its robustness against traditional issues of confounding. We prove that this robustness holds even when the probability distribution of the outcome, conditional on treatment or some other vector of adjusting variables, differs almost arbitrarily from its counterfactual analogue. This article also examines possible estimators of the functional average, including the sample mid-range, and proposes a new type of bootstrap for robust statistical inference: the Hoeffding bootstrap. After this, the article explores a new class of variables, the U class, that simplifies the estimation of functional averages. This class of variables is also used to establish mean exchangeability in some cases and to provide the results of elementary statistical procedures, such as linear regression and the analysis of variance, with causal interpretations. Simulation evidence is provided. The methods of this article are also applied to a National Health and Nutrition Examination Survey data set to investigate the causal effect of exercise on the blood pressure of adult smokers.

MSC 2010: 60E05; 62J99; 62G30; 62G32

1 Introduction

The Neyman-Rubin (NR) model is an important framework for causal inference with observational designs [1–4]. Say we are interested in studying a population of random variables $\{Y_i\}_{i \in I}$, where $I = \{1, \ldots, N\}$, conditional on exposure to $T_i$. Here, we specify that each $T_i$ is an indicator variable for simplicity and denote an arbitrary outcome variable $Y_i$ that is observed conditional on $T_i = t$ as $Y_i^t$. At any point in time, it is impossible to observe both $Y_i^1$ and $Y_i^0$ for an arbitrary unit $i$: a fact that makes individual contrasts, such as $Y_i^1 - Y_i^0$, undefined. This problem also characterizes the "fundamental problem of causal inference" since differences of this type are also undefined in an experimental setup [5]. Since each $Y_i^t$ is observed in the absence of experimental randomization, there is no guarantee that their properties align with the properties of their experimental counterparts: a fact that is problematic for scientific inference. The NR model frames these challenges in the language of missing outcomes and addresses them by constructing populations of counterfactual probability distributions, say $\{Y_i(1)\}_{i \in I}$ and $\{Y_i(0)\}_{i \in I}$. These populations represent a hypothetical situation such that (s.t.) the treatment status of the entire population has been experimentally fixed to a sequence of treatment exposures. The goal under the NR framework is to identify summary causal effects that would have occurred, provided one could actually have constructed and manipulated both $\{Y_i(1)\}_{i \in I}$ and $\{Y_i(0)\}_{i \in I}$ [6]. More specifically, under this rubric, for some sample $\zeta$ of $n$ outcome variables s.t. $\zeta = \{Y_1^1, \ldots, Y_{n_1}^1\} \cup \{Y_1^0, \ldots, Y_{n_0}^0\}$ and $n = n_1 + n_0$, we wish to identify some multivariate function $g$ and some vector of adjusting variables $\mathbf{L}$ s.t. $E\{g(Y_1^1, \ldots, Y_{n_1}^1) \mid \mathbf{L}\} - E\{g(Y_1^0, \ldots, Y_{n_0}^0) \mid \mathbf{L}\} = E[g\{Y_1(1), \ldots, Y_N(1)\} \mid \mathbf{L}] - E[g\{Y_1(0), \ldots, Y_N(0)\} \mid \mathbf{L}]$, at least asymptotically [7]. Here, we again use the difference of estimands only as an example. When this is possible, unmeasured variables are said to be ignorable, conditional on $\mathbf{L}$, and the conclusions of the observational study are equivalent to those of an experimental one with randomized treatment exposures [8,9].

Although $g$ can be any function that has scientifically meaningful properties, a small number of summary functions and estimands have dominated the literature. Functions and estimands related to quantiles have commandeered some attention [7,10–12]. A lion's share, however, has been claimed by the arithmetic mean and expected values. Unfortunately, the adage that "there is no free lunch" applies to these common setups. When a researcher wishes to exchange $E\{Y(t)\}$ with an equivalent and identifiable expected value to estimate an expected treatment effect, for instance, a set of sufficient conditions is usually required: consistency (C1), strong ignorability (C2), and positivity (C3) [13]. These conditions will soon be defined in detail. For now, it suffices to state that $E(Y \mid t, \mathbf{L}) = E\{Y(t) \mid \mathbf{L}\}$ for an appropriately chosen $\mathbf{L}$ under these assumptions. Since the expectation on the left of the equality sign is identified, it can be estimated directly "in place" of the counterfactual expected value on the right: a condition that has typically been labeled mean exchangeability. Standardization (iterated expectation) is then used to yield the quantity of interest since $E\{E\{Y(t) \mid \mathbf{L}\}\} = E\{Y(t)\}$. A cardinal problem, however, is that C2 is exceptionally nontrivial to achieve [14–16]. Put succinctly, properly specifying a vector of adjusting variables $\mathbf{L}$ that is sufficient for supplying C2 requires a comprehensive understanding of the causal dynamics involved. However, this level of understanding is absent by design in most circumstances. Consequently, it is reasonable to assert that most researcher-specified $\mathbf{L}$ are likely to fall short of this mark. Finally, $\mathbf{L}$ is often high in dimension. Parametric methods must then be employed to approximate the mean model of $Y$ in conjunction with standardization, and bias likely ensues outside of toy examples [6].

One of our cardinal contributions is to highlight a different summary causal effect – the functional average causal effect – as a valuable estimand for the NR framework. While the expected causal effect can detect a change in the weighted average of the observable values of an outcome variable under treatment, the functional average causal effect can capture a change in the uniform average only. This makes the functional average causal effect a coarser estimand that is less flexible but more fundamental than its weighted counterpart. Put otherwise, although it can capture changes in the underlying structure of a random variable, it cannot detect changes in the distribution of these values when the distribution itself is not uniform. Hence, functional average causal effects are relevant for statisticians in research contexts s.t. the treatment or exposure being investigated is expected to transform the set of values that are possible to observe. Examples of such effects are provided in Section 2.

Since functional averages are much more elementary than traditional estimands, however, they are identified under very mild conditions and therefore avoid many of the identification challenges previously discussed. Moreover, we show that they impart a causal interpretation to the results of standard statistical procedures when used in conjunction with a more general type of causal theory that allows for unknown and unmeasured variables. For example, if $Y^t$ and $Y(t)$ have the same image in the traditional analysis sense for all $t$ of interest, this is sufficient for identifying functional average causal effects insofar as the researcher has grounds for defending the assertion that $T$ causally impacts the outcome. The probability distributions of $Y^t$ and $Y(t)$ can otherwise differ arbitrarily. Confounding is immaterial, insofar as it does not change the image of the underlying function(s), and the same is true of informative sampling more generally. All that matters for identification is that, theoretically, $Y^t$ and $Y(t)$ pull from the same set of real numbers, at least conditional on some $\mathbf{L}$. Although this is not a necessary condition for what we call functional average exchangeability, it is at least sufficient for establishing it.

The remainder of this article goes as follows. For clarity, we mathematically define functional averages and prove an elementary but fundamental claim in Section 2. Moreover, we show that the functional average treatment effect is salient when the researcher believes that an intervention alters the set of possible values that an outcome can achieve as aforementioned. After this, we examine a small set of functional average estimators, including the sample mid-range, and establish their statistical consistency under general conditions. Since their sampling distributions are largely intractable, we re-purpose the bootstrap as a method for conservative inference. In Section 3, we provide elucidation on a particular class of bounded random variables – the U class – that generalizes the notion of symmetry and assists in the estimation of functional averages. We also show that U random variables possess many favorable properties when it comes to causal inference. These facts allow us to also prove – under the auspices of a causal theory – that linear regressions estimate causal effects under a standard set of assumptions already employed for association studies. Section 4 presents simulation evidence that substantiates our claims.

Finally, in Section 5, we use our strategies in conjunction with data from the National Health and Nutrition Examination Survey I Epidemiologic Follow-up Study (NHEFS) to investigate if a history of exercise activity causally impacts the systolic blood pressure (SBP) of adult smokers. Plentiful evidence exists that smoking is associated with cardiovascular disease processes and mortality [17,18]. Evidence has also been presented that smoking is a factor in arterial stiffening [19]. However, while some literature has supported the proposition that exercise lowers arterial blood pressure [20] and that smokers who exercise show fewer signs of arterial stiffening [21], the evidence is not yet definitive. Functional average estimation targets deterministic changes in the structure of an outcome variable and is thus an informative tool in this context.

2 Functional average treatment effect

In this section, we first introduce important definitions and notation, although some concepts will be left implicit for readability. For instance, we leave the underlying probability space of the form $(\Omega, \mathcal{F}, P)$ for an arbitrary random variable $Y(\omega): \Omega \to \mathbb{R}$ unstated, and the same goes for probability spaces defining joint distributions. Recall that the support of a random variable is the smallest closed set $S$ s.t. $\Pr(Y \in S) = 1$. Alternatively, it can also be defined as the closure of the set of values $S$ s.t. the density or mass function satisfies $f(y) > 0$ for $y \in S$. Here, we will be dealing with bounded random variables, which means that $S$ is a strict subset of the real numbers. This is not a limiting constraint. Anything that can be empirically measured is necessarily bounded.

With these concepts, we can revisit the functional average. If $S$ is discrete, define $R = |S|$, where $|\cdot|$ in this context denotes the number of elements in the set. If $Y$ is continuous, then $R = \int_{\mathbb{R}} \mathbf{1}_{y \in S}\, dy$ and the functional average $Av(\cdot)$ is $Av(Y) = R^{-1}\int_{\mathbb{R}} y\, \mathbf{1}_{y \in S}\, dy$. For discrete variables, it is $Av(Y) = R^{-1}\sum_{y \in S} y$. Note that we have avoided the use of general measures for purposes of accessibility.
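These definitions reduce to elementary computations. A minimal Python sketch follows; the supports used here are hypothetical stand-ins:

```python
import numpy as np

# Discrete case: Av(Y) is the unweighted mean of the support S.
S = np.array([1, 2, 3, 4, 5])              # hypothetical discrete support
av_discrete = S.mean()                     # R^{-1} * sum_{y in S} y = 3.0

# Continuous case on a single interval [a, b]:
# R = b - a and Av(Y) = R^{-1} * integral of y over [a, b] = (a + b) / 2.
a, b = 0.0, 10.0                           # hypothetical interval support
av_continuous = (b**2 - a**2) / (2 * (b - a))   # 5.0, the midpoint
```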

Sometimes it will be the case that, for some measurable function $g$, $Y = g(X_1, \ldots, X_k)$. Then the support of $Y$ with respect to (w.r.t.) the joint distribution of $(X_1, \ldots, X_k) = \mathbf{X} \in \mathbb{R}^k$ is some general region $S_1 \times \cdots \times S_k$, where each $S_i$ indicates the support of $X_i$. Without loss of generality (WLOG), we will henceforth deal only with the continuous case. In this context, $R = \int_{\mathbb{R}^k} \mathbf{1}_{(x_1, \ldots, x_k) \in S_1 \times \cdots \times S_k}\, dx_1 \cdots dx_k$ and $Av_x(Y) = R^{-1}\int g(x_1, \ldots, x_k)\, dx_1 \cdots dx_k$.

Now, let $E_h Y$ indicate that the expectation of $Y$ is taken w.r.t. a different density or mass function $h(y)$ that is also defined on $S$. Then it is also apparent that $Av(Y) = E_h Y$ when $h(y)$ is a uniform density or mass function since it then follows that $h(y) = R^{-1}$ and hence $E_h Y = \int_S y R^{-1}\, dy = Av(Y)$. When a subscript is omitted and $Y \sim f(y)$, it will be understood that the expectation is taken w.r.t. the baseline density (mass) function $f$, provided it exists. Otherwise, we say that $E_U Y = Av(Y)$ as a special case, although we will avoid this notation after this section. This is because $Av(Y)$ is best interpreted through the lens of basic, deterministic analysis. The exception to this statement is when $Y$ truly follows a uniform probability law.

A functional average treatment effect, then, for any two treatment values of interest $t, t'$, can be defined as $h\{Av\{Y(t)\}, Av\{Y(t')\}\} = h\{E_U\{Y(t)\}, E_U\{Y(t')\}\}$ for a user-specified function $h$. In this article, we set $h$ to a simple difference for exploratory purposes, i.e.,

$$h\{Av\{Y(t)\}, Av\{Y(t')\}\} = \Delta_{t,t'} = Av\{Y(t)\} - Av\{Y(t')\}.$$

2.1 Examples of applicability

The average functional value is not a usual focus in statistical settings. The expected value w.r.t. the baseline measure has instead largely been the object of interest. Hence, we offer a short argument and demonstration of its importance as a preliminary apologia. We start with the meaning of an expected value. By definition, an expected value is a sum of all possible values that an outcome variable can take, where each value is weighted by the probability of observing it or its density. However, the chance (or density) of observation is extraneous to causal relationships that are unrelated to altered probabilities. Put otherwise, we are not always directly interested in the chance of observing an individual outcome under a treatment condition. Mostly, we conceptualize these values to identify summary effects. While these summary effects are interpretable, their meaning is somewhat obscure. For instance, observe $E\{Y(1)\} = \sum_{y \in S_{Y(1)}} f_{Y(1)}(y)\, y$. We often interpret $E\{Y(1)\}$ as the "average value of $Y$" under experimental treatment. However, this is not entirely accurate since it is a weighted average. Language relating to the weighted nature of the sum is often omitted since it does not possess a straightforward empirical interpretation in nontrivial observational contexts. At least partially, this is because the weights are often unknown and cannot be approximated from the data without fairly strong assumptions.

This is not the case for functional averages. Although the functional average can also be construed, albeit counterfactually in most cases, as an expected value, the uniform measure imbues it with a more deterministic interpretation. To appreciate this further, again observe that $Av\{Y(1)\} = |S_{Y(1)}|^{-1}\sum_{y \in S_{Y(1)}} y$ in our current context. Hence, in contrast to $E\{Y(1)\}$, it truly is the "average value of $Y$" under experimental treatment. The unknown probabilities (or densities) do not enter the picture. As a result, functional averages do not require a probabilistic framework to acquire meaning. This fact helps with their interpretation. Moreover, it potentially makes functional averages very salient estimands in the presence of confounding.

We now provide three concrete examples of functional average treatment effects and contrast them with expected treatment effects. The first example instantiates a situation where a functional average effect exists, but an expected one does not. After this, a second example explores the situation where both exist and are nonidentical. Finally, the third example discusses modeling contexts where functional averages are inappropriate estimands in comparison to traditional ones.

2.1.1 Example 1

Say $T$ is a binary treatment variable s.t. $T = 1$ when a particular psychotropic medication is received and $Y$ is a Likert scale measuring anxiety in individuals with clinical depression. Also say that $Y(0)$ can take any integer between one and ten with the following probabilities, respectively:

$$\{0.01, 0.04, 0.05, 0.1, 0.15, 0.15, 0.3, 0.1, 0.05, 0.05\}.$$

Then $E\{Y(0)\} = 6.14$ and $Av\{Y(0)\} = 10^{-1}\sum_{i=1}^{10} i = 5.5$. Now, say $Y(1)$ has nonzero mass only on integers between one and eight with the following probabilities:

$$\{0.01, 0.01, 0.01, 0.05, 0.1, 0.5, 0.18, 0.14\}.$$

Under this scheme, it is also the case that $E\{Y(1)\} = 6.14$. This makes the detection of a causal effect impossible if only the expected treatment effect is utilized. However, $Av\{Y(1)\} = 8^{-1}\sum_{i=1}^{8} i = 4.5$, which means that $\Delta_{1,0} = 4.5 - 5.5 = -1$. This value partially reflects the elimination of the probability that the two most extreme levels of anxiety are observed under treatment, albeit at the cost of a higher probability of mild to moderate anxiety experiences. In other words, after treatment, the respondent will never select nine or ten as measures of their symptom experience, even if provided the opportunity to do so. Changes in functional averages are capable of capturing an aspect of this effect.
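These quantities are straightforward to verify numerically; a minimal check in Python:

```python
import numpy as np

# Example 1: the expected values coincide, but the functional averages differ.
s0 = np.arange(1, 11)
p0 = np.array([.01, .04, .05, .1, .15, .15, .3, .1, .05, .05])
s1 = np.arange(1, 9)
p1 = np.array([.01, .01, .01, .05, .1, .5, .18, .14])

print(s0 @ p0, s1 @ p1)        # 6.14 and 6.14: no expected treatment effect
print(s1.mean() - s0.mean())   # Delta_{1,0} = 4.5 - 5.5 = -1.0
```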

2.1.2 Example 2

Again, let $T$ be a binary treatment variable for simplicity and say $Y(0)$ is a random variable for the untreated systolic blood pressure (SBP) of individuals who have been diagnosed with high blood pressure. We will assume that $Y(0)$ follows a truncated normal distribution with a mean at 155 mmHg and support $S_{Y(0)} = [110, 370]$ and that $Y(1)$ follows a truncated normal distribution with mean 125 mmHg on support $S_{Y(1)} = [90, 250]$. Here, both an expected and a functional average treatment effect are present and relevant. WLOG, note that $\int_{S_{Y(0)}} dy = \int_{110}^{370} dy = 370 - 110 = 260$. Then $Av\{Y(0)\} = 260^{-1}\int_{110}^{370} y\, dy = 240$, while $Av\{Y(1)\} = 170$ by a similar calculation. Therefore, $\Delta_{1,0} = 170 - 240 = -70$, while $E\{Y(1)\} - E\{Y(0)\} = -30$. The functional average in this example offers additional information about changes in the possibilities of extremes that cannot be captured by expected values with the baseline probability measure, insofar as it is not uniform.
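A hedged numerical sketch of this example follows. The standard deviations are assumptions (the example does not specify them), and the truncation shifts the realized means slightly away from the location parameters; the functional averages, by contrast, depend only on the supports.

```python
import numpy as np
from scipy.stats import truncnorm

def trunc_normal(loc, lo, hi, sd=40.0):    # sd = 40 is an assumed value
    a, b = (lo - loc) / sd, (hi - loc) / sd
    return truncnorm(a, b, loc=loc, scale=sd)

y0 = trunc_normal(155, 110, 370)           # untreated SBP
y1 = trunc_normal(125, 90, 250)            # treated SBP

print(y1.mean() - y0.mean())               # roughly -30, the expected effect
print((90 + 250) / 2 - (110 + 370) / 2)    # exactly -70.0, the functional effect
```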

Pertinently, these examples elucidate how functional averages and their differences remain invariant to any redistribution of the presented probabilities insofar as they remain nonzero on the same set. For instance, say we employed a biased sampling mechanism (such as convenience sampling) and we also failed to measure the confounders that are sufficient for mean exchangeability in this example. As a consequence, say $Y^1$ follows a truncated normal distribution s.t. $E(Y^1) = 110$ and $Y^0$ follows a truncated normal distribution s.t. $E(Y^0) = 180$: a state of affairs that evidences confounding. This would not matter for our purposes, however, insofar as the convenience sample was executed in such a way as to preserve the sets of values that the functions could theoretically materialize under an experimental design. Under this scenario, although the expected causal effect is confounded, the functional average causal effect remains identified. It is resistant to latent unknowns.

2.1.3 Example 3

As previously stated, the functional average is not a useful estimand in all contexts. For example, in any context s.t. $S_{Y(1)} = S_{Y(0)}$, a functional average causal effect will not exist, i.e., it will be zero. In fact, this point generalizes to any estimand that is a function of the supports. In a temporary abuse of set notation, this is because if $S_{Y(1)} = S_{Y(0)}$, it then follows that $g\{S_{Y(1)}\} - g\{S_{Y(0)}\} = 0$ for any function $g$.

To show this more concretely, again consider the content of Example 2, except say $Y(1)$, treated SBP, follows a truncated normal distribution with an expected value of 125 mmHg and $Y(0)$, untreated SBP, follows a truncated normal distribution with an expected value of 180 mmHg. Furthermore, assume that both distributions are now supported on $S_{Y(0)} = S_{Y(1)} = [80, 350]$. Note that the behavior of the densities can differ greatly, depending on how we specify the variances. We simply require for any $\varepsilon > 0$ that $f(80 + \varepsilon) > 0$ and $f(350 - \varepsilon) > 0$ WLOG. Consequently, it is immediately observable that $E\{Y(1)\} - E\{Y(0)\} = -55$ and an expected causal effect is present. However, $\Delta_{1,0} = 0$ by construction.

In addition, consider a situation s.t. the outcome is a binary variable and the treatment exposure does not eliminate the chance that the event associated with the outcome variable is observed. For instance, state that $Y(1)$ is the occurrence of a mood episode while taking a mood-stabilizing medication and $E\{Y(1)\} > 0$. Under the premise that $Y(0)$ is also nondegenerate, it thus follows that $S_{Y(1)} = S_{Y(0)} = \{0, 1\}$ and $\Delta_{1,0} = 0$. An expected causal effect, if identified, would be able to detect if the medication under investigation lowers the probability that a mood episode occurs. Functional averages would not be capable of detecting a change.

Not only do these examples show that the expected and functional average treatment effects are not equivalent when the supports of the experimental outcomes are the same, but they also illustrate that functional average causal effects – and any causal effect predicated upon functions of supports – are inappropriate estimands when it is suspected that the effect of exposure is “weak” in the sense that it only causes perturbations to the distribution of the outcome probabilities. In this sense, functional average effects and causal effects based upon support alterations more generally represent a class of stronger causal effects that communicate the impact of the treatment exposure on the more fundamental structure, or “bones,” of the outcome variable. In other words, while expected causal effects attempt to summarily capture changes in what one is likely to observe or experience, functional average causal effects attempt to capture what, on average, is even possible to observe or experience.

2.2 Identifying and estimating counterfactual functional averages

Next, we prove some basic statements about functional averages under mild conditions and the rubric of informative sampling. For this, we specify a conditional population of interest $P = \{Y_1^t, \ldots, Y_N^t\}$ s.t. $Y_1^t \sim f(y \mid T = t)$ WLOG. This setup can be defined with additional conditioning or extended to unconditional circumstances, but this is omitted here for brevity. Additionally, observe a complete-case sample $\zeta \subseteq P$ and a complementary vector of indicator variables $\delta = (\delta_1, \ldots, \delta_N)$ s.t. $\delta_i = 1$ if and only if $Y_i^t \in \zeta$. Essentially, each $\delta_i$ variable determines if $Y_i^t$ is selected for observation by the researcher. We also assume that $E(\delta_i \mid y_i, t) > 0$ for all $i$, which implies that $E(\delta_i \mid t) = \pi_i > 0$ for all $i$: an assumption that is typically called sampling positivity. It is well known that an arbitrary $Y_i^t \in \zeta$ does not, in general, follow the distribution of the theoretical population [22–25]. Instead, $Y_i^t \in \zeta$ possesses a weighted density or mass function $f_\delta(y_i \mid t) = \pi_i^{-1} E(\delta_i \mid y_i, t) f(y_i \mid t)$. Utilizing this fact, it is important to note, then, that $E(Y_i \mid t, \delta_i = 1) = \pi_i^{-1}\sigma_{Y_i^t, E(\delta_i \mid Y_i, t)} + E(Y_i \mid t)$ for a fixed $t$. Here, the notation $\sigma_{Y_i^t, E(\delta_i \mid Y_i, t)}$ denotes the covariance: $E\{Y_i^t E(\delta_i \mid Y_i, t)\} - E(Y_i^t)\pi_i$. To see why this last assertion holds, observe WLOG:

$$E(Y_i \mid t, \delta_i = 1) = \int_{S^t} y\, \pi_i^{-1} E(\delta_i \mid y_i, t) f(y_i \mid t)\, dy = \int_{S^t}\{y\, \pi_i^{-1} E(\delta_i \mid y_i, t)\} f(y_i \mid t)\, dy = \pi_i^{-1} E\{Y_i^t E(\delta_i \mid Y_i, t)\} = \pi_i^{-1}\sigma_{Y_i^t, E(\delta_i \mid Y_i, t)} + E(Y_i \mid t).$$

It is also important to consider a basic sufficient condition for when $f_\delta(y \mid t)$ and $f(y \mid t)$ share the same support. Insofar as $E(\delta_i \mid y_i, t) > 0$ for all $y_i$, this condition is met. This follows since $f_\delta(y_i \mid t) = \pi_i^{-1} E(\delta_i \mid y_i, t) f(y_i \mid t)$. Therefore, when $E(\delta_i \mid y_i, t) > 0$ for all $y_i \in S_{Y_i^t}$, $f_\delta(y_i \mid t)$ can only be nonzero on precisely the same set of values as $f(y_i \mid t)$.
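A small simulation helps fix ideas; the outcome-dependent selection mechanism below is hypothetical. Selection biases the weighted mean, but because $E(\delta \mid y) > 0$ everywhere on the support, the observed support, and hence the functional average, is eventually preserved.

```python
import numpy as np

rng = np.random.default_rng(0)

y = rng.integers(1, 11, size=200_000)       # population outcomes on {1, ..., 10}
p_select = 0.05 + 0.09 * y                  # E(delta | y) > 0 for every y
sampled = y[rng.random(y.size) < p_select]  # informative (biased) sample

print(y.mean(), sampled.mean())                        # selection inflates the mean
print(np.unique(y).mean(), np.unique(sampled).mean())  # both 5.5: support preserved
```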

For conciseness, we often denote $Y_i^t \in \zeta$ as $Y_{\delta i}^t$ under the implicit assumption that $\delta_i = 1$. We also use $Y_i^T \in \zeta$ or $Y_{\delta i}^T$ with the understanding that $T$ is fixed to whatever value it takes for unit $i \in I$. We now specify our short list of assumptions more formally. Altogether, we have three sets. The first set is a re-statement of C1–C3 for clarity. The second set (A1–A4) is mostly constituted by strictly weaker versions of C1–C3; its conditions are sufficient for establishing the preservation of experimental supports and therefore the identifiability of functional average causal effects. The last set, which possesses a single assumption (D1), is important for proving the statistical consistency of estimators under very general conditions of probabilistic dependence. The usual provisos that referenced mathematical objects exist are mostly omitted.

  1. $Y_{\delta i}^t = Y_{\delta i}(t) \mid t$ for all contrasted $t \in S_{T_i}$ and $Y_i^{T_i} \in \zeta$ (Consistency).

  2. Let $\mathbf{L}_i$ be a vector of random variables. Then for all contrasted $t \in S_{T_i}$ and $Y_i^{T_i} \in \zeta$, it is true that $Y_i(T_i) \perp T_i \mid \mathbf{L}_i$ (Strong ignorability).

  3. $0 < \Pr(T_i = 1 \mid \mathbf{L}_i) < 1$ for all indexes shared with each $Y_i^{T_i} \in \zeta$ (Positivity).

Next, we introduce conditions A1–A4, which are more important for this investigation.
  1. Let $\mathbf{L}_i$ be a vector of observable random variables. Then there exists a possibly unknown random vector $\mathbf{U}_i$ s.t. $Y_{\delta i} \mid t, l, u = Y_{\delta i}(t) \mid t, l, u$ for all contrasted $t \in S_{T_i}$, $u \in S_{\mathbf{U}_i}$, $l \in S_{\mathbf{L}_i}$, and $\delta_i$ (Existential consistency).

  2. Let $\mathbf{U}_i$ and $\mathbf{L}_i$ be the same as specified earlier. Then for all contrasted $t \in S_{T_i}$, it is true that $Y_i(T_i) \perp T_i \mid \mathbf{L}_i, \mathbf{U}_i$ (Existential strong ignorability).

  3. $0 < \Pr(T_i = 1 \mid \mathbf{L}_i, \mathbf{U}_i) < 1$ for all indexes shared with each $Y_i^{T_i} \in \zeta$ (Existential positivity).

  4. The supports of $Y_{\delta i} \mid T_i, \mathbf{L}_i$ and $Y_i(T_i) \mid \mathbf{L}_i$ are the same, i.e., for all $Y_i^{T_i} \in \zeta$, $S_{Y_i \mid T_i, \mathbf{L}_i} \subseteq S_{Y_i(T_i) \mid \mathbf{L}_i}$ and $S_{Y_i(T_i) \mid \mathbf{L}_i} \subseteq S_{Y_i \mid T_i, \mathbf{L}_i}$.

Finally, we now introduce the assumption that is used to establish statistical consistency.
  1. For simplicity, now specify that $I = \{1, 2, 3, \ldots, n\}$ labels the elements of $\zeta$. Let $Y_{\delta i}^{T_i}$ be an outcome random variable and say that $\mathcal{G}_n = (I, E_n)$ is an undirected graph with node set $I$ and link set $E_n$ s.t. a link $e_{i,j} \in E_n$ between two nodes $i, j \in I$ is present if and only if $\sigma_{Y_{\delta i}^{T_i}, Y_{\delta j}^{T_j}} \neq 0$. Then the mean degree of this graph is $n^{-1}\sum_{i=1}^n\sum_{j=1}^n \mathbf{1}_{e_{i,j} \in E_n} = \mu_n = o(n)$, where $\mathbf{1}_{e_{i,j} \in E_n} = 0$ when $i = j$ by convention and each indicator variable is nonstochastic. This statement can be generalized to $Y_{\delta i}$ conditional on $\mathbf{L}_i$.

A contrast of each C* with its respective A* for $* \in \{1, 2, 3\}$ is apropos. In general, we can assert that each A* is weaker than its traditional counterpart. For instance, C1 posits that the value for $Y_{\delta i}(t)$, which is the value that would have been observed for this sampled variable had the entire population's exposure been experimentally fixed to $t$, is the same value that is conditionally observed for $Y_{\delta i}^t$. A1 states something that is much weaker; it asserts, provided a researcher has also fixed some $\mathbf{L}_i = l$, that there merely exists some random vector $\mathbf{U}_i$ that could have been measured s.t. this same state of affairs holds within the $(t, u, l)$ stratum of $Y_{\delta i}$. Importantly, $\mathbf{U}_i$ need not be known or measured. Setting $(\mathbf{U}_i, \mathbf{L}_i)$ to a vector of degenerate variables recovers C1. We also note that A1 could be renamed "consistency upon correction" since it implies that, even if traditional consistency does not hold, it can be achieved provided the right oracle information. Contrarily, A1 cannot be true when consistency is impossible to achieve.

The comparisons between C2 and A2, and between C3 and A3, are analogous in spirit. C2 posits, conditional on some measured $\mathbf{L}_i = l$, that the counterfactual outcomes are conditionally independent of the probability law governing treatment exposure. A2 does not necessarily posit this. Instead, it says this state of affairs is possible to achieve provided oracle information about an additional vector of otherwise latent random variables $\mathbf{U}_i$. Once again, $\mathbf{U}_i$ only needs to exist; the researcher does not need to know or measure it. Similar to the previous assumption, A2 fails when strong ignorability is false for any assortment of variables. In lieu of repeating this same sequence of logic, we mostly omit it for C3 and A3. We only add that, although it can seem like A1–A3 are stricter than usual since they require the statements to apply to an unknown set of confounders, this is not the case. Anytime a researcher asserts that they have measured some $\mathbf{X}_i$, for instance, s.t. C2 and C3 are true, this simply means that there is some $(\mathbf{L}_i, \mathbf{U}_i)$ s.t. $\mathbf{X}_i = (\mathbf{L}_i, \mathbf{U}_i)$ and the researcher believes that they have successfully identified and measured these variables. This can also be rephrased to say that $\mathbf{X}_i = \mathbf{L}_i$ and $\mathbf{U}_i$ is degenerate at no loss.

Reiterating the meaning of A4 is useful as a stepping stone for further consideration. Put succinctly, when A4 holds, it means that the counterfactual distribution and the conditional distribution possess the same support, conditional on some $\mathbf{L}_i = l$. Rejecting this notion is equivalent to positing that certain values in the support of $Y_i(T_i) \mid \mathbf{L}_i$ can never materialize with $Y_i \mid T_i, \mathbf{L}_i$. This is a strong assertion with nontrivial epistemic consequences, especially in the context of noninformative sampling. If it is believed that A4 cannot be obtained, then those values that exist in the counterfactual support alone have no real-world meaning. In this circumstance, we can simply condition on those that can materialize at no empirical loss. It is also important to state that, although A1–A3 are important for reasons soon revealed, A4 alone is sufficient for identifying functional average effects and causal effects based upon supports more generally. Since A1 and A2 might not be necessary conditions for A4, the latter is important as a stand-alone condition. For instance, a researcher can assert A4 when they can feasibly defend their sampling process.

Next, we informally establish that A1–A3 imply A4. This is important since it proves that if a researcher thinks it is even possible to try to identify and estimate expected causal effects via the traditional set of assumptions (C1–C3), the experimental supports are already preserved and all causal estimands based upon them are identified. For demonstrative purposes, we prove this informally for the case s.t. $\mathbf{U}$ is an absolutely continuous variable, also at no loss, and assume that all referenced densities exist. To this end, observe the following identity under the stated premises:

$$f\{y(t) \mid l\} = \int_{S_U} f^{-1}(l) f\{y(t), l, u\}\, du = \int_{S_U} f^{-1}(l) f\{y(t) \mid l, u\} f(l, u)\, du = \int_{S_U} f\{y(t) \mid t, l, u\} f(u \mid l)\, du \quad \text{(A2 and A3)}$$
$$= \int_{S_U} f(y \mid t, l, u) f(u \mid l)\, du. \quad \text{(A1)}$$

Now, note that the following statement is also true by similar strokes of logic:

$f(y \mid t, l) = \int_{S_U} f(y \mid t, l, u) f(u \mid t, l)\, du$. This also follows by A2 and A3.

Since both $f\{y(t) \mid l\}$ and $f(y \mid t, l)$ are functions of $y$ alone and otherwise share $f(y \mid t, l, u)$ as a basis, it is then implied that $f\{y(t) \mid l\}$ and $f(y \mid t, l)$ are strictly positive on the same set of values. Furthermore, since $E(\delta \mid y, t, l) > 0$ implies that $f_\delta(y \mid t, l) > 0$ on the same set of $y$ values s.t. $f(y \mid t, l) > 0$, by transitivity, $f_\delta(y \mid t, l)$ also shares the same support with $f\{y(t) \mid l\}$. This supplies A4 and corroborates our assertions.

We conclude the discussion of A4 with one additional point. Namely, we explicitly draw attention to the fact that A4 implies at least a restricted form of positivity when the necessary conditional probabilities are posited to exist. This follows because $f(y \mid t, l, u)$ WLOG requires that $f(t, l, u) > 0$ by construction, which implies that $f(t \mid l, u) > 0$. When $T$ is a binary variable, this also necessitates that $f(t \mid l, u) < 1$. Pertinently, however, A4 does not need positivity to hold for all possible strata of $\mathbf{L}_i$; it only needs to hold for observed strata. This fact becomes useful in later sections. Finally, we conjecture that A4 does not imply A1 and A2 generally, although no further investigation on this matter is pursued here.

Before discussing D1, we offer one more pivotal exploration, especially since it is a segue into the utility and attitude accompanying our approach. In short, we formulate $Y_i(T_i) \mid \mathbf{L}_i$ and $Y_{\delta i} \mid T_i, \mathbf{L}_i$ in the context of test theory, i.e., we consider a model s.t. the conditional observational outcome is equal to the conditional experimental outcome plus possibly nonindependent error. We do this because we also wish to point out that functional average causal effects are identified outside the paradigms built upon C1–C3 or even A4.

Formalizing this entails the following statement: $Y_{\delta i} \mid T_i, \mathbf{L}_i = Y_i(T_i) \mid \mathbf{L}_i + \varepsilon_i$. Mathematically, this statement is well defined since the observational and experimental outcomes share a probability space. This model posits that the observational outcome, conditional on some vector of scientifically salient variables, is equal to the measurement of the experimental outcome plus biasing noise that results from random measurement error or lurking variables. It is also apropos to state that, unless $\varepsilon_i = 0$ almost surely, consistency is violated by this construction under C2. However, if we additionally assert that, although possibly dependent, $Y_i(T_i) \mid \mathbf{L}_i$ and $\varepsilon_i$ are supported on $S_{Y_i(T_i) \mid \mathbf{L}_i} \times S_{\varepsilon_i}$ s.t. $Av(\varepsilon_i) = 0$, it then follows that $Av\{Y_{\delta i} \mid T_i, \mathbf{L}_i\} = Av\{Y_i(T_i) \mid \mathbf{L}_i\}$. The stipulation that $Av(\varepsilon_i) = 0$ essentially means that the two distributions possess the same values on average, provided the right $\mathbf{L}_i$. Again, this should not be misconstrued as a sufficient condition for mean exchangeability since it is possible that $E\varepsilon_i \neq 0$ within this setup.
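A hypothetical simulation illustrates the distinction: below, the error is asymmetric, so $E\varepsilon \neq 0$ and mean exchangeability fails, yet $Av(\varepsilon) = 0$ and the functional averages still coincide.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
y_exp = rng.choice([1, 2, 3, 4, 5], p=[.4, .3, .15, .1, .05], size=n)  # experimental
eps = rng.choice([-1, 0, 1], p=[.1, .2, .7], size=n)   # Av(eps) = 0, E(eps) = 0.6
y_obs = y_exp + eps                                    # observational outcome

print(y_obs.mean() - y_exp.mean())                         # about 0.6: means diverge
print(np.unique(y_obs).mean() - np.unique(y_exp).mean())   # 0.0: Av values agree
```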

We also introduce this conceptualization, albeit briefly, because it has good potential for development and can provide an intuitive medium for causal inference in the presence of pervasive interference and model inaccuracy. Finally, it can also be interpreted using the framework of structural causal theory. If $\mathbf{L}_i$ is a sufficient set for adjustment for the "true" but possibly unknown causal graph, then it does follow that $\varepsilon_i = 0$ almost surely. Hence, one can feasibly surmise that the aforementioned conditions and functional average exchangeability will hold even when one's working causal model is incorrect and $\mathbf{L}_i$ is insufficient for adjustment, provided the conceptualization is "close enough" to render the two outcome variables the same on average nevertheless.

Although we do not develop this route in any major way within this article, appreciating this type of modeling still contextualizes our main approach and allows us to pay homage to a fundamental caveat. We do the latter first. Identifying a parameter statistically is insufficient for the provision of causal meaning. Causal relationships cannot be inferred from statistical relationships alone [26]. A theory of causation, perhaps represented by a structural causal model, is therefore still required if causal meanings are to be supplied to the functional average or other estimands based upon supports [2,6]. The pivotal difference with our approach is that it is relatively forgiving of incomplete or inaccurate theories of causation. As previously stated, while other approaches require the correct specification of some random vector $\mathbf{L}$ that is sufficient for adjustment or C2, our approach only requires that such an adjustment is theoretically possible. Estimands based upon supports are identified, for example, with a graph constituted by the node set $\{T, Y, U\}$ and the following links, insofar as the researcher is ready to defend the causal assertions implicit within them: $T \to Y$, $U \to T$, and $U \to Y$. In other words, if A4 holds, then, provided a more general structural causal model that allows for unmeasured confounders, the functional average effect, and others, are identified and estimable from data without accounting for the latent forces empirically. In a similar vein to the test error setup introduced in the previous paragraph, which only feasibly requires us to "get close enough" to the experimental outcome, this main approach only requires that we can even try to get close at all.

Finally, we turn to the last assumption, D1. Put succinctly, its purpose is to establish statistical consistency for estimators of potential mean outcomes and functional average outcomes in addition. Results are often proven under the premises of mutual independence and noninformative sampling. This restricts their utility, especially since many modern research settings depend upon nonprobability samples of outcome variables that partake in complicated and unknown systems of possibly "long-range" probabilistic dependence. Furthermore, this is also restricting since informative sampling can induce statistical dependencies. By proving our results under more general conditions, we expand their reliability into these contexts. D1 essentially asserts that the mean number of outcome variables in a sample that a typical one is correlated with is sublinear in $n$, i.e., that $n^{-1}\mu_n \to 0$ as $n \to \infty$. Note that this is a very mild assumption since it still allows for the mean number of statistical dependencies present in the sample to diverge with sample size. It places no additional constraint on the exact form of the probabilistic dependencies. We also reference an alternative: D1′. This assumption is exactly the same, except it makes use of a dependence graph s.t. a link exists between two nodes if and only if their corresponding outcome variables are statistically dependent.

A proof of our first proposition, which formally establishes functional average exchangeability via A4, concludes this section. Although it is trivial mathematically, it provides a useful foundation.

Proposition 1

(Functional average exchangeability) Suppose A4. Then $Av(Y_\delta \mid T, \mathbf{L}) = Av\{Y(T) \mid \mathbf{L}\}$.

Proof

The result follows directly from the premises. Since $S_{Y_\delta \mid T, \mathbf{L}} = S_{Y(T) \mid \mathbf{L}}$, we have $\int_{S_{Y_\delta \mid T, \mathbf{L}}} 1\, dy = \int_{S_{Y(T) \mid \mathbf{L}}} 1\, dy = R$ WLOG for the continuous case and $\int_{S_{Y_\delta \mid T, \mathbf{L}}} y\, dy = \int_{S_{Y(T) \mid \mathbf{L}}} y\, dy$.□

Proposition 1 is extendable to $Av_l\{Y(T) \mid \mathbf{L}\}$. However, again, this is omitted. From here, the notation for $\mathbf{L}$ is sometimes omitted for brevity. We do the same, when possible, for sample indexes.

2.3 The problem of estimation

The simplicity of Proposition 1 and the relative mildness of A1–A3 unfortunately coexist with the difficulty of estimating $Av(Y_\delta \mid T)$. Here, the "there is no free lunch" adage returns. A theoretical estimator can be constructed, nevertheless, using the following two identities. We tacitly condition on $\mathbf{1}_{S_Y}$ for ease of reading: $E\{f^{-1}(Y)Y\} = \int_S y\, dy$ and $E\{f^{-1}(Y)\} = R$. This naturally suggests an estimator of the following form:

$$\widetilde{Av}(Y_\delta) = \left\{\sum_{i=1}^n f_{\delta i}^{-1}(Y_{\delta i})\right\}^{-1}\sum_{i=1}^n f_{\delta i}^{-1}(Y_{\delta i})\, Y_{\delta i}. \tag{2.1}$$

In this section, we investigate some of the features of plug-in estimators for equation (2.1). After exploring the discrete case, we offer brief commentary on the difficulties of the continuous one. Then we re-visit the sample mid-range estimator. After completing these explorations, we introduce a bootstrapping strategy for conducting inference. Pertinently, we also explore these estimators and the challenges surrounding them to foster more appreciation for the utility of an alternative route that is proposed in Section 3.

2.3.1 Discrete estimators of $\widetilde{Av}(Y_\delta)$

When $Y_{\delta i}$ is discrete and $\hat f_n(y) = n^{-1}\sum_i^n \mathbf{1}_{Y_{\delta i} = y} = n^{-1}M_y$, the empirical plug-in for equation (2.1) reduces to an intuitive estimator. Say $\mathbf{1}_{y \in S_\zeta}$ is an indicator that a value $y \in S_Y$ is observed and therefore in the support of the empirical distribution, $S_\zeta$. Then the empirical plug-in for equation (2.1) reduces to $\widehat{Av}(Y_\delta) = \{\sum_{y \in S_Y}\mathbf{1}_{y \in S_\zeta}\}^{-1}\sum_{y \in S_Y}\mathbf{1}_{y \in S_\zeta}\, y$ under the convention that $\hat f_n^{-1}(y) = 0$ when $y \notin S_\zeta$. To see this, observe that for an arbitrary set of materialized sample values in $\zeta$, where $\zeta$ is temporarily treated as a set of constants, $\sum_{y \in \zeta}\hat f_n^{-1}(y) = n|S_\zeta|$ and $\sum_{y \in \zeta}\hat f_n^{-1}(y)\, y = n\sum_{y \in S_\zeta} y$. Therefore, the discrete plug-in for equation (2.1) is simply the arithmetic average of the unique values observed in the sample.
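This reduction is easy to confirm numerically; in the sketch below (with a hypothetical mass function), the reciprocal empirical-mass weights reproduce the mean of the unique observed values exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.choice([1, 2, 3, 4, 5], p=[.5, .2, .15, .1, .05], size=1000)

# Plug-in for (2.1) with weights 1 / f_hat_n(Y_i) ...
f_hat = {y: np.mean(sample == y) for y in np.unique(sample)}
w = np.array([1.0 / f_hat[y] for y in sample])
plug_in = (w * sample).sum() / w.sum()

# ... coincides with the arithmetic mean of the unique observed values.
print(plug_in, np.unique(sample).mean())   # both 3.0 once every value appears
```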

We now establish the statistical consistency of this plug-in under general dependency conditions. For the rest of this section, we omit notation for $\delta$ with the understanding that it is implicit whenever we are dealing with sampled outcomes. To this end, note that $\Pr(Y_1 \neq y, Y_2 \neq y, \ldots, Y_n \neq y) = \Pr(Y_n \neq y \mid Y_{n-1} \neq y, \ldots, Y_1 \neq y)\Pr(Y_{n-1} \neq y \mid Y_{n-2} \neq y, \ldots, Y_1 \neq y)\cdots\Pr(Y_1 \neq y)$ and define a corresponding sequence $\mathcal{P} = (\Pr(Y_i \neq y \mid A_i))_{i \in I}$ under the convention that $\Pr(Y_1 \neq y \mid A_1) = \Pr(Y_1 \neq y)$.

Proposition 2

Suppose a sample $\zeta = \{Y_i\}_{i \in I}$. Observe $\mathcal{P} = (\Pr(Y_i \neq y \mid A_i))_{i \in I}$ as previously defined and say $k(n) = |\{s \in \mathcal{P} : s < 1\}|$, where $s \in \mathcal{P}$ indicates here that $s$ is present in the sequence. If $k(n) \to \infty$ as $n \to \infty$, then $\widehat{Av}(Y) \overset{a.s.}{\longrightarrow} Av(Y)$ as $n \to \infty$, where $\overset{a.s.}{\longrightarrow}$ denotes almost sure convergence.

Proof

Let $y \in S_Y$ be arbitrary and denote $S_\zeta \subseteq S_Y$ as the set of observed values. Then $\Pr(y \in S_\zeta) = 1 - \Pr(y \notin S_\zeta) = 1 - \Pr(Y_1 \neq y, Y_2 \neq y, \ldots, Y_n \neq y) = 1 - \Pr(Y_n \neq y \mid Y_{n-1} \neq y, \ldots, Y_1 \neq y)\Pr(Y_{n-1} \neq y \mid Y_{n-2} \neq y, \ldots, Y_1 \neq y)\cdots\Pr(Y_1 \neq y)$. Now, suppose $k(n)$ probabilities in the sequence $\mathcal{P}$ are strictly less than one. Denote the maximum of these probabilities as $\Pr(Y_* \neq y)$ and note that since $\Pr(Y_* \neq y) < 1$, there exists some $\varepsilon > 0$ s.t. $\Pr(Y_* \neq y) = 1 - \varepsilon$. Then:

$$\Pr(y \notin S_\zeta) = \Pr(Y_n \neq y \mid Y_{n-1} \neq y, \ldots, Y_1 \neq y)\cdots\Pr(Y_1 \neq y) \leq 1^{n-k(n)}\{\Pr(Y_* \neq y)\}^{k(n)} = \{1 - \varepsilon\}^{k(n)}.$$

Hence:

$$0 \leq \lim_{n \to \infty}\Pr(y \notin S_\zeta) \leq \lim_{n \to \infty}\{1 - \varepsilon\}^{k(n)} = 0.$$

This of course implies that $\Pr(y \in S_\zeta) \to 1$ as $n \to \infty$. Next, define an indicator variable $\mathbf{1}_{y \in S_\zeta}$ and also $Z_{k^*} = \sup_{k > n}|\mathbf{1}_{y \in S_{\zeta_k}} - \Pr(y \in S_{\zeta_k})| = |\mathbf{1}_{y \in S_{\zeta_{k^*}}} - \Pr(y \in S_{\zeta_{k^*}})|$. Letting $\varepsilon > 0$ be arbitrary again:

$$\Pr(Z_{k^*} > \varepsilon) \leq \varepsilon^{-2}\{\Pr(y \in S_{\zeta_{k^*}})(1 - \Pr\{y \in S_{\zeta_{k^*}}\})\}.$$

This then implies that:

$$\lim_{n \to \infty}\Pr(Z_{k^*} > \varepsilon) \leq \lim_{n \to \infty}\varepsilon^{-2}\{\Pr(y \in S_{\zeta_{k^*}})(1 - \Pr\{y \in S_{\zeta_{k^*}}\})\} = 0.$$

Hence, $\mathbf{1}_{y \in S_\zeta} \overset{a.s.}{\longrightarrow} 1$. Thereby, since $y$ was arbitrary, it is then implied that $\widehat{Av}(Y) = \{\sum_{y \in S_Y}\mathbf{1}_{y \in S_\zeta}\}^{-1}\sum_{y \in S_Y}\mathbf{1}_{y \in S_\zeta}\, y \overset{a.s.}{\longrightarrow} R^{-1}\sum_{y \in S_Y} y = Av(Y)$ since $R$ is finite.□

The elementary nature of $\widehat{Av}(Y)$ makes it a reliable estimator for relatively simple outcome variables. When the scale of the outcome variable possesses a small number of unique values, $\widehat{Av}(Y)$ will converge almost surely at an unknown but, in all likelihood, very fast rate, even in the presence of stark probabilistic dependencies. This statement obviously applies to sample extremes in addition.

Unfortunately, however, quantifying the rate of convergence, or the uncertainty associated with finite sample estimates, is difficult. This is true even under mutual independence. To appreciate this, it is sufficient to observe $\sum_{y \in S_Y}\mathbf{1}_{y \in S_\zeta}\, y$. Since $E\mathbf{1}_{y \in S_\zeta} = p_{y,n}$ is unknown, so is $\mathrm{Var}\{\sum_{y \in S_Y}\mathbf{1}_{y \in S_\zeta}\, y\} = \sum_{y \in S_Y} p_{y,n}(1 - p_{y,n})y^2$. If $\zeta$ is a sample of identically distributed and mutually independent outcome variables, then we can attempt to estimate $p_{y,n}$ with $\hat p_{y,n} = 1 - \{1 - \hat f_n(y)\}^n$. However, $\hat p_{y,n} = 0$ when $y \notin S_\zeta$. Furthermore, when $y \notin S_\zeta$, $y$ is also unknown by definition. Hence, reasonably estimating $\mathrm{Var}\{\sum_{y \in S_Y}\mathbf{1}_{y \in S_\zeta}\, y\}$ requires knowledge that makes estimating $Av(Y)$ arguably redundant.

A recourse to the central limit theorem is also unavailable. This is because $S_Y$ is finite by construction. Therefore, $\widehat{Av}(Y)$ will always be a finite sum of random variables. In some circumstances, such as when $|S_Y|$ is reasonably large, a normal approximation might still function with an acceptable degree of accuracy. However, for reasons already explored, this strategy will still require a strong set of assumptions about sampling probabilities and potential values.

2.3.2 Continuous outcomes

Kernel density estimation is an intuitive choice to estimate equation (2.1) for the continuous case. However, the properties of this plug-in are also largely intractable and unknown. For example, although the properties of a kernel density estimator $\hat f_n(y)$ are well researched for a constant $y \in S_Y$ [27–29], the behavior of $\hat f_n(Y)$, i.e., the random variable defined and evaluated on the same random outcome that was utilized to construct it, is not as well studied. This is because kernel density estimation is often evaluated on a grid of deterministic points. Establishing the asymptotic properties of a statistic of the form $\{\sum_{i=1}^n \hat f_n^{-1}(Y_i)\}^{-1}\sum_{i=1}^n \hat f_n^{-1}(Y_i)\, Y_i$, where $\hat f_n^{-1}(Y_i) = nh\{\sum_{j=1}^n K\{h^{-1}(Y_i - Y_j)\}\}^{-1}$ for some $h > 0$ and kernel function $K(\cdot)$, although promising, is therefore also nontrivial. Such considerations also require a detailed consideration of possible kernel functions. Since, in general, we are interested in establishing statistical consistency under very general dependency conditions, we avoid this enterprise in this article.
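For concreteness, a sketch of this plug-in with a Gaussian kernel follows. Per the discussion above, its sampling properties are largely unstudied, so it is illustrative only; the Silverman-style bandwidth rule is an assumption.

```python
import numpy as np

def kde_av_plugin(sample, h=None):
    """Continuous plug-in for (2.1): weights are reciprocal Gaussian kernel
    density estimates evaluated at the observations themselves."""
    y = np.asarray(sample, dtype=float)
    n = y.size
    h = h or 1.06 * y.std() * n ** (-0.2)      # assumed bandwidth rule
    z = (y[:, None] - y[None, :]) / h
    f_hat = np.exp(-0.5 * z**2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))
    w = 1.0 / f_hat                            # f_hat_n^{-1}(Y_i)
    return (w * y).sum() / w.sum()
```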

We also avoid other options for density estimation since they arrive with similar challenges, some as yet undisclosed. For instance, estimators that use reciprocal estimated densities can possess unstable variances when the underlying distribution possesses a density that decays smoothly toward zero. Moreover, since each $\hat f_n(y)$ is typically a function of the entire sample, plug-ins for equation (2.1) will necessarily possess a myriad of complex dependencies. This will prevent any elementary citation of a central limit theorem. Just as importantly, it will also limit the applicability of concentration inequalities for finite sample inference. Hence, although this is a promising area of research that demands attention, no further consideration is offered here. Instead, we focus on other estimators for the functional averages of continuous outcomes that do not rely on density estimation.

2.3.3 Mid-range estimation

For a large special class of bounded random variables, the mid-range is a simple alternative for estimating functional averages, including those from continuous distributions. Let $Y_{(i)}$ for $i \in I_n = \{1, \ldots, n\}$ denote the $i$th order statistic of a sample s.t. $Y_{(1)} \leq Y_{(2)} \leq \cdots \leq Y_{(n)}$. The mid-range, $\widehat{MR}\{Y\}$, or simply $\widehat{MR}$ when convenient, is defined as follows: $\widehat{MR}\{Y\} = 2^{-1}\{Y_{(1)} + Y_{(n)}\}$.

Naturally, the sample mid-range estimates the population mid-range $MR = 2^{-1}(m + M)$, where $m$ and $M$ denote the minimum and maximum of the support. Linear combinations of order statistics are also well studied [30–33]. However, the sample mid-range is often ignored, especially in applied settings, because of its possible inefficiency and since its distribution also admits no closed-form expression in a majority of situations. Before offering an exposition on some of its properties, we offer a useful definition, which highlights our interest in it. We say a random variable is regular when it is supported on a single interval of real numbers or a complete subset of integers. This definition is helpful because $Av(Y) = MR\{Y\}$ when $Y$ is a regular random variable.

Definition 1

A random variable will be said to be regular if and only if its support $S$ is a single interval of real numbers or a complete subset of integers starting at some $m \in \mathbb{N}$ and ending with a maximum integer $M$ s.t. if integer $c \in S\setminus\{M\}$, then $c + 1 \in S$.
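A minimal sketch of the estimator, applied to a skewed but regular outcome for which the expected value and the functional average differ:

```python
import numpy as np

def mid_range(sample):
    # Consistent for Av(Y) = MR when Y is regular (Definition 1).
    return 0.5 * (np.min(sample) + np.max(sample))

rng = np.random.default_rng(5)
for n in (50, 500, 5000):
    y = rng.beta(1, 3, size=n)         # support [0, 1]: MR = 0.5, E(Y) = 0.25
    print(n, round(mid_range(y), 3))   # approaches 0.5 as n grows
```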

$\widehat{MR}$ is a statistically consistent estimator of $MR$ for outcome variables with finite support under the assumption of mutual independence. Barndorff-Nielsen [34] established sufficient and necessary conditions for the statistical consistency of extreme order statistics. Almost sure convergence, and therefore also convergence in probability ($\overset{p}{\to}$), of an extreme order statistic to its asymptotic target is trivially fulfilled when there exists a $y \in S_Y$ s.t. $F(y) = 1$ and $F(y - \varepsilon) < 1$ for all $\varepsilon > 0$. Hence, $\widehat{MR}$ also converges almost surely to its population value for all bounded distributions under this setup. Sparkes and Zhang [35] extended this result to much more general scenarios of statistical dependence. If we define a sequence of conditional cumulative distribution functions (CDFs) $\mathcal{P}$ in the same spirit as Proposition 2, it can be demonstrated that extreme order statistics converge in probability to their target values for bounded random variables insofar as the number of conditional CDFs in $\mathcal{P}$ that are strictly less than unity diverges as $n$ becomes arbitrarily large. This is once again a very mild assumption since the dependencies involved can induce arbitrary changes in the behaviors of the distribution functions otherwise. Since the random variables considered here are bounded, convergence in probability of the sample extremes also implies their almost sure convergence.

Nevertheless, as previously mentioned, when the distribution of $Y$ is unknown, no reliable expression for the distribution of $\widehat{MR}$ is accessible to use for inference [36–39]: a situation that is analogous to the discrete plug-in estimator for equation (2.1). Additional discussion on this topic is available in the supplementary material. Bootstrapping is a feasible option for inference, provided these challenges. However, it is also not without problems. Traditional bootstraps condition on the observed values of $\zeta$ and use the strong consistency of the empirical CDF $\hat F_n(y)$ to emulate the sampling distribution of a statistic of interest via a re-sampling procedure [40]. They require the estimand of interest, which is estimated by the statistic $T_0(Y_1, Y_2, \ldots, Y_n)$, to be a well-behaved functional of the marginal CDF. They also stipulate that the targeted parameter is not a boundary value of the support [41]. Overall, bootstrapping processes usually behave as intended under the same set of conditions that supply a central limit theorem. For these reasons, traditional bootstraps are problematic for functions of extreme order statistics. The $m$-out-of-$n$ bootstrap, however, has proven to be an effective procedure in this domain [42]. Essentially, a basic $m$-out-of-$n$ bootstrapping process re-samples $m$ observations from $\zeta$ with or without replacement s.t. $n^{-1}m \to 0$ as $m \to \infty$. It can provide approximately valid inference when traditional methods fail. See Bickel and Ren [43], Swanepoel [44], Beran and Ducharme [45], or Politis et al. [46] for additional background and resources on the topic. Pertinently, the $m$-out-of-$n$ bootstrap is also capable of handling situations with dependent observations insofar as an appropriate sub-sampling strategy is used.

Nevertheless, the $m$-out-of-$n$ bootstrap (and other forms of bootstraps for dependent observations) is still insufficient for the context and conditions of this article. Three reasons substantiate this claim. First, we require a version of the bootstrap that is capable of reliably capturing $\theta$ under fairly general but unknowable dependency conditions. This rules out approaches such as the $m$-out-of-$n$ bootstrap, or bootstrapping processes such as the block bootstrap, which require a re-sampling theory that corresponds adequately to the unknown dependency structure, and which typically exclude the existence of long-range dependencies [47–50].

Second, we are interested in reasoning about $\theta = Av(Y)$ and not $E\{\widehat{MR}\}$ for a particular $n$. In most circumstances where $\widehat{MR}$ will be used, i.e., those circumstances s.t. the marginal distributions are not symmetric, it will be a biased estimator [39]. It is likely that $\theta$ rests on or near the boundary of the support of $\widehat{MR}$ in these circumstances. Convergence to $\theta$ might be slow and characterized by an unknown rate in addition [38]. Consequently, any inferential procedure for $\widehat{MR}$ must be flexible enough to provide cogent statements about $\theta$, and not simply about $E\{\widehat{MR}\}$ at a particular value of $n$, even when sample sizes are modest. This necessitates conservative approaches for inference that allow for $\theta$ to sit in the extremes of the empirical distribution of the bootstrapped statistics. It also therefore rules out popular bootstrapping methodologies, which construct confidence sets that are strict subsets of the hull of the observed range. Finally, we wish to use a bootstrapping strategy that does not rely on the assumption that $\theta$ is a smooth functional of $F(y)$.

Further work to produce more efficient closed-form approximations is of course a preferable route. Since the mid-range is more efficient than the sample mean when nonnegligible probability rests in the extremes of the support [51], it can provide a more efficient estimator of expected causal effects in many circumstances: a fact that is often neglected. Overall, however, due to the complicated probabilistic character of order statistics, this is an onerous road that possesses no immediate destination, especially when outcome variables are dependent and their joint distribution is unknown. This ultimately necessitates a different type of bootstrapping strategy.

2.3.4 The Hoeffding bootstrap

With these prior facts in mind, we offer a novel bootstrapping solution that can address these challenges. In summary, we assert that the bootstrap can be re-purposed to construct conservative confidence sets under fairly general conditions of statistical dependence. Notably, this re-purposed bootstrap, which we call the Hoeffding bootstrap, can be applied to all functional average estimators previously explored since it does not require $\theta$ to be a well-behaved functional of $F(y)$.

We now provide a synopsis of the approach. Further details and proofs of its validity and robustness are provided in the supplementary materials. Essentially, we show that (1) if a statistician does not condition on the observed values of $\zeta$ and treats each re-sampled $Y_i$ as random, (2) if the outcome variables being re-sampled are not monotonic transformations of one another, say, or they do not partake in other forms of truly extreme statistical dependence, and (3) it is subsequently true that $T_0 \overset{a.s.}{\longrightarrow} ET_0$ and has finite support, then confidence sets of the form $T_0 \pm \phi(\hat M_B - \hat m_B)$ provide at least $1 - \alpha$ coverage asymptotically for $\theta$ provided any $\phi \geq 1$. Here, $\hat M_B$ and $\hat m_B$ represent the maximum and minimum order statistics of the bootstrap sample, respectively. Ultimately, we call this method the Hoeffding bootstrap because Hoeffding's inequality is used to specify $\phi$. The performance of confidence sets constructed with this strategy is evaluated in Section 4 and also in the supplementary materials. For clarity, we provide a schematic of the process, followed by an illustrative code sketch:

  1. Acquire a sample of random variables $\zeta = \{Y_i\}_{i \in I_n}$ and compute a statistic $T_0(Y_1, \ldots, Y_n)$

  2. Draw $m \leq n$ random variables from $\zeta$ with or without replacement via a simple random sample or a theoretically guided process that attempts to reproduce a dependency structure. Compute the new statistic $T_1$ from these variables

  3. Repeat step 2 $K(n) - 1$ times, where $K(n) = K$ is reasonably large, and construct $\{T_k\}_{k \in \mathcal{K}}$ for $\mathcal{K} = \{0, 1, 2, \ldots, K\}$

  4. Set $\hat M_B - \hat m_B = \max_{k \in \mathcal{K}}(T_k) - \min_{k \in \mathcal{K}}(T_k)$

  5. Construct an estimate of an at least $1 - \alpha$ confidence set with $T_0 \pm \{\hat M_B - \hat m_B\}\sqrt{2^{-1}\log(2/\alpha)}$
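The following Python sketch implements steps 1–5 with simple random re-sampling; the re-sampling scheme, re-sample size, and choice of statistic are user decisions rather than fixed parts of the method.

```python
import numpy as np

def hoeffding_bootstrap_ci(sample, stat, alpha=0.05, K=2000, m=None, seed=None):
    # Hoeffding bootstrap: T_0 +/- phi * (max_k T_k - min_k T_k),
    # with phi = sqrt(log(2 / alpha) / 2) from inverting Hoeffding's inequality.
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample)
    m = sample.size if m is None else m                  # re-sample size m <= n
    t = [stat(sample)]                                   # T_0 from the full sample
    t += [stat(rng.choice(sample, size=m)) for _ in range(K)]   # T_1, ..., T_K
    phi = np.sqrt(0.5 * np.log(2 / alpha))               # >= 1 when alpha <= 0.27
    half_width = phi * (max(t) - min(t))
    return t[0] - half_width, t[0] + half_width

# Example: a conservative confidence set for Av(Y) via the sample mid-range.
rng = np.random.default_rng(7)
y = rng.uniform(90, 250, size=300)                       # Av(Y) = 170
print(hoeffding_bootstrap_ci(y, lambda s: 0.5 * (s.min() + s.max())))
```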

Since gaining some intuition about how this method works is still important at this point, we offer some basic insights. Note that because $\hat m_B \leq T_0 \leq \hat M_B$ by construction, it is already implied that $ET_0 \in [E(\hat m_B), E(\hat M_B)]$ with probability one and hence that, for some sufficiently large $n$ and $K$, $ET_0 \in [\hat m_B, \hat M_B]$ with probability that is approximately one under very mild conditions. However, since we actually want to capture $\theta$ and, in practice, we often use only moderately sized $K$ with moderate $n$, we require some form of intuitive penalization to mitigate small-sample bias. The use of Hoeffding's inequality, although arbitrary, remains intuitive since its inversion results in the confidence set that would have resulted if one naively used the bootstrapped extremes to estimate the extremes of the support of $T_0$. In summary, insofar as $T_0 \overset{a.s.}{\longrightarrow} \theta$ and $\alpha \leq 0.27$ (this ensures that $\phi \geq 1$), employing Hoeffding's inequality to construct a penalty aligns with traditional statistical instincts while guaranteeing more robust small-sample coverage for $\theta$.

The Hoeffding bootstrap also has other benefits. First and foremost, although conservative, it avoids dependency modeling. Since the dependency structure of the outcome variables is invariably unknown, this is a boon. Secondly, as previously stated, it applies to a much larger class of statistics because it does not require $\theta$ to be a "smooth" functional of $F(y)$. Moreover, the Hoeffding bootstrap does not require the user to choose a re-sample size like the $m$-out-of-$n$ bootstrap. As we see in Section 4, the Hoeffding bootstrap can even be more efficient than the $m$-out-of-$n$ bootstrap for some re-sampling specifications. Perhaps more important, however, is the following statement: even if the conditions that validate this approach are not met, the Hoeffding bootstrap will always perform better than strategies that use bootstrapped $t$-statistics or normal approximations. This fact essentially flows from Popoviciu's inequality, which states that $\mathrm{Var}(T_0) \leq 4^{-1}(M_0 - m_0)^2$ when $T_0$ is bounded. This inequality also applies to empirical distributions. For instance, say that $\alpha = 0.05$. Then $1.96 S_{T_k} \leq 0.98(\hat M_B - \hat m_B) \leq (\hat M_B - \hat m_B)\sqrt{2^{-1}\log(2/\alpha)} \approx 1.36(\hat M_B - \hat m_B)$, where $S_{T_k}$ is the sample standard deviation of the bootstrap distribution. Therefore, even in worst-case scenarios, the Hoeffding bootstrap will always provide more robust and cogent statements about uncertainty than a large proportion of popular methodologies. By the same stroke, it will also provide more conservative coverage. Again, examples of its coverage are provided in Section 4 and the supplementary material.

3 U random variables and counterfactual linear regression

In this section, we discuss a class of random variables – the U class – that can help us avoid the difficulties encountered in Section 2. Recall: although we established that functional average causal effects can be identified and consistently estimated under very mild assumptions – and even without adjusting for confounders – establishing efficient methods of statistical inference for these estimators is challenging. Ultimately, this is because the plug-in estimators defined previously and the sample mid-range possess largely intractable properties, even asymptotically, in the absence of additional constraints that are in all likelihood inappropriate for applied settings. These difficulties are removed when working with U random variables as outcomes since they ultimately allow the functional average to be estimated by standard additive statistics with well-known properties. We also show that U random variables are important because they can imbue basic linear regressions and analyses of variance with counterfactual – and thus possibly causal – interpretations under conditions traditionally assumed for estimating associations. On a similar note, we also prove that properties of U variables can be used to establish sufficient conditions for mean exchangeability and that they can be used to defend an extension of the Hoeffding bootstrap. First, however, a definition of a U random variable is helpful. We assume that all integrals and mathematical objects exist when referenced, as per usual.

Definition 2

Let $g$ be a measurable function. A random variable $g(Y)$ will be said to be in the class of U random variables if and only if $\mathbb{E}\{g(Y)\} = Av\{g(Y)\}$. Similarly, the same will be said w.r.t. $\mathbf{X} \in \mathbb{R}^k$ for $Y = g(X_1, \ldots, X_k)$ if and only if $\mathbb{E}Y = Av_x(Y)$.

Definition 2 stipulates that a random variable is U class w.r.t. some space when its expected value is equal to its average functional value in that space. Stated in a probabilistic fashion, a variable $Y$ is U class if one can take its expectation w.r.t. a uniform measure without changing its value. This type of random variable is ubiquitous in practice. For instance, any continuous uniform distribution or a normal distribution that has been symmetrically truncated about its expected value is a U random variable. Say $Y \sim Bern(\pi)$ s.t. $\pi = 2^{-1}$. Then $Y$ is also in the U class, as is any $Y \sim Beta(\alpha, \beta)$ s.t. $\alpha = \beta$. Not all U random variables are symmetric, however. To illustrate this, construct a random variable $Y$ with the following probabilities on $S_Y = \{-5, 2, 3\}$: $\{14^{-1} \cdot 5, 7^{-1}, 2^{-1}\}$. Then $\mathbb{E}Y = 14^{-1}(-25 + 4 + 21) = 0 = 3^{-1}(-5 + 2 + 3) = Av(Y)$. However, $Y$ does not have a symmetric distribution.
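A numerical check of this example is immediate. The R lines below simply verify that the expectation of the asymmetric three-point distribution above coincides with the uniform average of its support:

```r
# Verify U class membership for the asymmetric discrete example above.
s <- c(-5, 2, 3)          # support S_Y
p <- c(5/14, 1/7, 1/2)    # asymmetric probabilities; sum(p) == 1
sum(s * p)                # E(Y): returns 0
mean(s)                   # Av(Y): also 0, so Y is U class despite asymmetry
```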

A host of U properties has been investigated elsewhere [35]. To familiarize the reader, we provide a list of important ones in Table 1. Essentially, U variables are closely related in concept to sum-symmetry of the CDF and to structured but uncorrelated deviations from uniformity. All bounded and symmetric random variables are in the U class, for instance, although symmetry is not a necessary condition, as previously demonstrated. Hence, all statistics that converge to a symmetric probability law behave more and more like U random variables as $n$ becomes large. In an abuse of notation, we will say $Y \in U$ if $Y$ is in this class of random variables.

Table 1

Basic properties of U variables

Property | Description | Variable type | Conditions and definitions
$\sigma_{Y, f^{-1}(Y)} = 0$ | No association with inverse density or mass function | A | $Y \sim f(y)$
$Y \in U \Rightarrow cY \in U,\ Y + c \in U$ | Algebraic closure | A | $c \in \mathbb{R}$
$\int_m^M F(y)\,dy = \int_m^M S(y)\,dy$ | CDF is sum-symmetric | R, C, D | $S(y) = 1 - F(y)$
$U = Y + \varepsilon$ s.t. $\mathbb{E}(\varepsilon \mid Y) = 0$ | Uniform variables are mean-preserving spreads of U variables | R, C | $U \sim Unif(m, M)$, $f(y) \geq R^{-1}$ in left tail and unimodal
$\Pr(\lvert S_n - \mathbb{E}S_n \rvert > \varepsilon) \leq 2\exp\{-6\varepsilon^2\{\sum_{i=1}^n R_i^2\}^{-1}\}$ | Concentration inequality for U variables | R, C | $S_n = \sum_{i=1}^n Y_i$, $Y_i \in U$, C*
$\sum_{i=1}^M F(i) = \mathbb{E}Y$ | Discrete CDF mean identity | R, D | $S_Y = \{1, 2, \ldots, M\}$
$\sum_{i=1}^M F(i) - \sum_{i=1}^M S(i) = 1$ | Discrete CDF sum-symmetry | R, D | $S_Y = \{1, 2, \ldots, M\}$

A = All, R = Regular, C = Continuous, D = Discrete, C* $= \max\{\mathbb{E}(\exp\{sS_n\}), \mathbb{E}(\exp\{-sS_n\})\} \leq Av_y(\exp\{sS_n\})$, $s > 0$.

Out of these properties, we draw special attention to the concentration inequality: $\Pr(\lvert S_n - \mathbb{E}S_n \rvert > \varepsilon) \leq 2\exp\{-6\varepsilon^2\{\sum_{i=1}^n R_i^2\}^{-1}\}$. The condition detailed in the table footnote is very mild and does not require independence. In fact, it can be true even when every single outcome variable in a sample is statistically dependent, insofar as the average correlation between those variables is mild, or $\mu_n$ is adequately bounded if this is not the case. Discussion of this assumption is also available elsewhere [35]. Put succinctly, a researcher can expect it to be fulfilled if each $Y_i$ is symmetric – or at least relatively symmetric – and the joint distribution of the sample is biased away from $n$-tuples in the joint support that inflate $S_n$. We note that this is useful since these conditions often apply to the error distributions of statistics of interest, including those of linear regressions.
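As an informal illustration, the R sketch below compares the empirical tail of a sum of i.i.d. $Unif(-1, 1)$ variables – a bounded symmetric case where the footnote condition is plausibly satisfied – against the concentration bound above; the bound should dominate the empirical frequencies at every threshold. The simulation settings are our own choices:

```r
# Empirical check of the U concentration inequality for an i.i.d. sum.
set.seed(2)
n <- 50                                   # Y_i ~ Unif(-1, 1); each range R_i = 2
sums <- replicate(1e4, sum(runif(n, -1, 1)))
eps <- seq(2, 12, by = 2)
empirical <- sapply(eps, function(e) mean(abs(sums) > e))  # E(S_n) = 0 here
bound <- 2 * exp(-6 * eps^2 / (n * 2^2))                   # table inequality
round(rbind(eps, empirical, bound), 4)    # bound dominates the empirical tail
```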

3.1 The Hoeffding bootstrap, continued

With U random variables introduced, we now define a slightly more conservative version of the Hoeffding bootstrap. This version is more justifiable for small samples, or when a larger body of statistical dependencies is expected to exist. It also presupposes that $T_0$ behaves more and more like a U class random variable as $n$ grows larger: a proposition that is reasonable for many statistics, especially random sums. In short, this strategy sets out to overpenalize to compensate for small-sample bias, but then tapers this penalization with the efficiency gained by incorporating the U properties into the reasoning.

To this end, we essentially set $\phi = 2\sqrt{6^{-1}\log(2/\alpha)}$. The increased penalization is present in the added factor of 2. However, the supposition of U status allows us to use the concentration inequality presented in Table 1 instead of Hoeffding's. All other things held equal, this strategy inflates the error around the estimate by an approximate factor of 1.15 in comparison to the first strategy when $\alpha = 0.05$. A longer justification for this construction is provided in the supplementary materials. We direct the reader there for more information.
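The two penalties are easy to compare directly. The R lines below compute both values of $\phi$ at $\alpha = 0.05$ and recover the approximate inflation factor of 1.15 mentioned above:

```r
# Compare the original and U class Hoeffding bootstrap penalties.
alpha <- 0.05
phi_hoeffding <- sqrt(log(2 / alpha) / 2)   # ~1.36; the Section 2 penalty
phi_u <- 2 * sqrt(log(2 / alpha) / 6)       # ~1.57; the more conservative version
c(phi_hoeffding, phi_u, ratio = phi_u / phi_hoeffding)   # ratio ~1.15
```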

Next, we introduce two simple propositions with direct practical or theoretical interest for causal inference. Proposition 3 follows directly from our main conditions and solves the problem of estimating $Av(Y_\delta \mid t)$ for variables in the U class since it allows for the replacement of the estimators of the previous section with the sample mean. Proposition 4 establishes an interesting sufficient condition for mean exchangeability. We only prove these statements for functional averages w.r.t. the range of $Y$. This is for conciseness. All results are easily extended to the excluded case.

Finally, recall that, although the notation is omitted, the following results also apply when the random variables are conditioned upon another vector of random variables L , perhaps to facilitate the fulfillment of A4 or U status.

Proposition 3

Suppose A1–A3 or A4. If $Y_\delta \mid t \in U$, then $\mathbb{E}(Y_\delta \mid t) = Av\{Y(t)\}$.

Proof

The proof is again one line under the premises: $\mathbb{E}(Y_\delta \mid t) = Av(Y_\delta \mid t) = Av\{Y(t)\}$.□

Again, the properties of plug-in estimators for equation (2.1) are not easy to discern in general, and citations of the central limit theorem are also questionable or implausible. However, the properties of $\bar{Y}_\delta \mid t$ are exceptionally well known. This largely solves the problem insofar as the sampling process secures a sample of U variables. Again, since Proposition 3 only requires the preservation of the support, it allows for an arbitrary distortion of the population distribution otherwise: a fact that is liberating w.r.t. study design and execution.

Note also that Proposition 3 is not as trivial as it seems. It is well known that the sample mean and mid-range estimate the same parameter when the underlying distribution is symmetric. However, it is false that all U random variables are symmetric. Hence, the U concept expands the universe where the sample mean can replace the mid-range. The next proposition establishes a new route to achieving mean exchangeability, as previously mentioned.

Proposition 4

Suppose both $Y(t)$ and $Y_\delta \mid t$ are U random variables under A1–A3 or A4. Then $\mathbb{E}\{Y(t)\} = \mathbb{E}(Y_\delta \mid t)$.

Proof

By our premises, the following string of equalities applies: $\mathbb{E}\{Y(t)\} = Av\{Y(t)\} = Av(Y_\delta \mid t) = \mathbb{E}(Y_\delta \mid t)$.□

Great care and energy of argument are often expended to establish that $\mathbb{E}\{Y(t)\} = \mathbb{E}(Y_\delta \mid t)$. Proposition 4 offers a new manner of doing so insofar as it is believed that the experimental distribution is sum-symmetric. Conditional or unconditional on some vector $\mathbf{L} = \mathbf{l}$, insofar as the researcher is willing to posit that the experimental distribution is in the U class, all that is actually required is a sufficiently executed sampling process that preserves the support and induces any form of U status. Then it is implied that (conditional) mean exchangeability is achieved. Once more, since it seems plausible that $Y(t)$ or $Y(t) \mid \mathbf{l}$ can be mapped into a great number of U distributions on the same support via different sampling designs or conditioning, this result is potentially very useful.

For instance, if it is believed that the counterfactual distribution is symmetric, then a nonrejection of a statistical hypothesis of symmetry in the observed distribution can be supporting evidence that mean exchangeability is achieved. More generally, if the distance between the mid-range and the sample mean is small – and here one must be diligent in deciding what precisely defines the quality of this distance – this can also be construed as evidence. A researcher can also observe the behavior of the empirical CDF for visual confirmation. For regular random variables, the areas below and above the curve should be approximately equal.
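These informal diagnostics are simple to compute. The R sketch below, with a helper function of our own naming, contrasts the sample mean with the mid-range and compares the areas below and above the empirical CDF for a U class example and a skewed example:

```r
# Informal U status diagnostics: mean vs. mid-range, and ECDF area balance.
u_diagnostics <- function(y, grid_size = 1000) {
  grid <- seq(min(y), max(y), length.out = grid_size)
  Fhat <- ecdf(y)(grid)
  step <- grid[2] - grid[1]
  c(mean = mean(y),
    midrange = (min(y) + max(y)) / 2,
    area_F = sum(Fhat) * step,       # approximate integral of F over the support
    area_S = sum(1 - Fhat) * step)   # approximate integral of S = 1 - F
}

set.seed(3)
u_diagnostics(runif(2000, -3, 3))   # U class: mean ~ mid-range, areas ~ equal
u_diagnostics(rbeta(2000, 2, 5))    # skewed: the diagnostics diverge
```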

Before moving forward, however, more attention is owed to some possible implications of Proposition 4. First, we draw attention to the fact that C1–C3 are missing from the premises, and that it appears that A4 alone could suffice in conjunction with the added U condition. Since A4 implies at least a restricted form of positivity, this means – insofar as the counterfactual distribution is sum-symmetric – that other routes to the identification of expected causal effects exist in the absence of even the possibility of consistency or strong ignorability. Revisiting the test theory model helps to consider this matter further. Per this thought, observe the statement: $Y \mid t, \mathbf{L} = Y(t) \mid \mathbf{L} + \varepsilon$. Now, if there exists a single triplet $\{Y(t) \mid \mathbf{L},\ Y \mid t, \mathbf{L},\ \varepsilon\}$ s.t. $\mathbb{E}\varepsilon = 0$ but $\Pr(\varepsilon = 0) \neq 1$, it subsequently follows that neither consistency nor strong ignorability is a necessary condition for mean exchangeability. One might expect $\mathbb{E}\varepsilon = 0$ to be true with a nondegenerate $\varepsilon$ distribution when unmeasured confounders impact the $T$–$Y$ relationship nonlinearly.

For a more meaningful insight, however, we abandon enforcing mean exchangeability and only add a new constraint: that $\varepsilon \perp Y(t) \mid \mathbf{L}$. Stating this seems to signify that a researcher specified an $\mathbf{L}$ that was almost sufficient for achieving C1 and C2, but that independent noise, via measurement error or interference, persists to some degree. Recall that we have also accepted A4 as true. For simplicity, also say that $Y(t) \mid \mathbf{L}$ and $\varepsilon$ are supported on compact intervals. From here, we can show that these suppositions jointly imply that $\varepsilon = 0$ almost surely. Since $S_{Y(t) \mid \mathbf{L}} = [m, M]$ and $S_\varepsilon = [A, B]$, say, are both compact and $\varepsilon \perp Y(t) \mid \mathbf{L}$, the support of the convolution of their densities is also compact. We also know by the Titchmarsh convolution theorem that this implies that $S_{Y \mid t, \mathbf{L}} = [m + A, M + B]$. However, under A4, $S_{Y \mid t, \mathbf{L}} = [m, M]$, which implies that $A = B = 0$ and that $\Pr(\varepsilon = 0) = 1$. Consequently, $Y \mid t, \mathbf{L} = Y(t) \mid \mathbf{L}$, and C1 and C2 are implied. Therefore, we can feel comfortable stating that A4 and a lack of omitted variable bias as defined are sufficient for deducing C1, C2, and at least a restricted form of positivity.

This thought experiment did not mention U status. However, we remark that the preservation of U status together with mean exchangeability might imply that omitted variable bias is absent, either as a standalone set of conditions or in conjunction with others that present themselves in relatively common circumstances. Such a state of affairs would supply C1 and C2 under the nice setup above. These facts have yet to be determined. Additional investigation into the relationship between these conditions will doubtlessly be fruitful.

It also stands to mention that if either $Y \mid t, \mathbf{L}$ or $Y(t) \mid \mathbf{L}$ is a U random variable and we assume that A4 and mean exchangeability are valid conditions, then the other outcome is also a U random variable and the status is preserved by construction. To see this, say the shared support is $S = [m, M]$ with range $R = M - m$, and note that A4 and mean exchangeability imply that $M - \mathbb{E}(Y \mid t, \mathbf{L}) = M - \mathbb{E}\{Y(t) \mid \mathbf{L}\} = \int_S F\{y(t) \mid \mathbf{L}\}\,dy = 2^{-1}R$. Hence, $\int_S F(y \mid t, \mathbf{L})\,dy = 2^{-1}R$, which implies that $\int_S F(y \mid t, \mathbf{L})\,dy = \int_S S(y \mid t, \mathbf{L})\,dy$. This achieves our result.

Finally, A4 is sufficient for establishing mean exchangeability when both the statistic being considered and its counterfactual partner converge in distribution to random variables with symmetric probability laws – or U probability laws more generally – as $n \to \infty$. For instance, observe some statistic $Z \mid t$. Since A4 applies, $Av(Z \mid t) = Av\{Z(t)\}$. However, since both $Z \mid t$ and $Z(t)$ converge to U random variables, this also implies that $\mathbb{E}\{Z \mid t\} \approx \mathbb{E}\{Z(t)\}$ for sufficiently large $n$. Once this situation is rendered more concrete by setting $Z \mid t = \bar{Y} \mid t$, it becomes apparent – in at least the context of additive statistics – that the presence of confounding, which negates mean exchangeability, also prevents the viability of A4 for the arithmetic means, or that it alters the dependency structure s.t. only one but not both statistics can converge in distribution to a U random variable. Again, more exploration is due. Since little scholarship has been dedicated to how confounding impacts the statistical dependencies between outcome variables, this is also an interesting route of investigation in its own right.

3.2 Counterfactual linear regression

Linear regression is a popular tool for causal and predictive inference. Inverse probability weighting (IPW) of the marginal structural model or standardization of the adjusted model are common methods for repurposing its framework for the former [6,14,52,53]. Doubly robust estimation, an example of which is the augmented inverse probability weighting (AIPW) estimator, is also commonly employed for this purpose [54–56]. This section offers a brief review of these approaches before moving on to novel methods. The marginal structural model w.r.t. an event $\{T = t\}$ is defined as follows:

(3.1) $\mathbb{E}\{Y(t)\} = \kappa_0 + \beta_1 t$.

The adjusted model is defined similarly, but in conjunction with a vector of adjusting variables L under the auspices that C1–C3 are true:

(3.2) $\mathbb{E}(Y \mid t, \mathbf{L}) = \mathbb{E}\{Y(t) \mid \mathbf{L}\} = \beta_0^* + \beta_1 t + \mathbf{L}'\boldsymbol{\beta}$.

Standardization then yields that $\mathbb{E}\{\mathbb{E}\{Y(t) \mid \mathbf{L}\}\} = \beta_0^* + \beta_1 t + \mathbb{E}\{\mathbf{L}\}'\boldsymbol{\beta} = \mathbb{E}\{Y(t)\}$. Under this model, the marginal expected causal effect is $\mathbb{E}\{Y(1)\} - \mathbb{E}\{Y(0)\} = \beta_1$. This effect is typically estimated using empirical plug-ins from the fitted multiple linear regression model and standardization [6,57]. As aforementioned, the identification of a vector $\mathbf{L}$ that achieves C1–C3 is nontrivial, as is the felicitous specification of equation (3.2).

These difficulties transport to IPW models. Transitioning to this topic, say $e(\mathbf{L}) = \Pr(T = 1 \mid \mathbf{L})$ is a propensity score model that captures the probabilistic mechanisms underlying treatment exposure. Provided this, it is subsequently true that $\mathbb{E}\{e^{-1}(\mathbf{L})(YT) - \{1 - e(\mathbf{L})\}^{-1}\{Y(1 - T)\}\} = \mathbb{E}\{Y(1)\} - \mathbb{E}\{Y(0)\}$, again under C1–C3. All things merry and well, an IPW estimator such as $\hat{\Delta}_{1,0;\mathrm{IPW}} = n^{-1}\sum_{i \in I}\{\hat{e}^{-1}(\mathbf{L}_i)(Y_iT_i) - \{1 - \hat{e}(\mathbf{L}_i)\}^{-1}Y_i(1 - T_i)\}$ is statistically consistent for the targeted estimand, where $\hat{e}(\mathbf{L})$ is usually estimated via logistic regression or some other semi-parametric method. The problem is that things are very rarely all merry and well. Like the specification of equation (3.2), approximating $e(\mathbf{L})$ is an arduous task that is burdened by doubt. This state of affairs is again exacerbated by the strictness of conditions C1–C3, and in particular C2.

Doubly robust estimation methods attempt to address the issue of incorrect model specification. In essence, they augment an existing estimator for causal effects with the features of another s.t. the resulting estimator is statistically consistent for the identified estimand as long as at least one of the underlying models is correctly specified. Say $\hat{Y}_i(t)$ is an estimator of a mean outcome from equation (3.2) that follows the plug-in parametric g-formula. Then one example of a doubly robust statistic is the AIPW estimator: $\hat{\Delta}_{1,0;\mathrm{AIPW}} = n^{-1}\sum_{i \in I}\{\hat{Y}_i(1) + \hat{e}^{-1}(\mathbf{L}_i)T_i\{Y_i - \hat{Y}_i(1)\}\} - n^{-1}\sum_{i \in I}\{\hat{Y}_i(0) + \{1 - \hat{e}(\mathbf{L}_i)\}^{-1}(1 - T_i)\{Y_i - \hat{Y}_i(0)\}\}$.
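For concreteness, the following R sketch implements toy versions of the three reviewed estimators – the plug-in g-formula, IPW, and AIPW – for a binary treatment and a single covariate. The model formulas and the simulated data-generating process are illustrative assumptions of ours, not specifications from this article:

```r
# Toy g-formula, IPW, and AIPW estimators; formulas are illustrative.
gform_ipw_aipw <- function(df) {
  out_mod <- lm(y ~ t + l, data = df)                  # outcome model, as in (3.2)
  ps_mod <- glm(t ~ l, family = binomial, data = df)   # propensity model e(L)
  e <- predict(ps_mod, type = "response")
  y1 <- predict(out_mod, newdata = transform(df, t = 1))   # Y-hat_i(1)
  y0 <- predict(out_mod, newdata = transform(df, t = 0))   # Y-hat_i(0)
  c(gform = mean(y1 - y0),
    ipw = mean(df$y * df$t / e - df$y * (1 - df$t) / (1 - e)),
    aipw = mean(y1 + df$t * (df$y - y1) / e) -
           mean(y0 + (1 - df$t) * (df$y - y0) / (1 - e)))
}

set.seed(4)
l <- rnorm(2000)
t <- rbinom(2000, 1, plogis(l))
y <- 1 + 2 * t + l + rnorm(2000)
gform_ipw_aipw(data.frame(y = y, t = t, l = l))   # each estimate is near 2
```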

Pertinently, all of these models face another shared difficulty: informative sampling. This is because they all revolve around expected values. Unfortunately, expected values are susceptible to alterations in the distribution(s) of the outcome variables that result from the particularities of a study design and its sampling mechanism. For instance, for equation (3.2), what is actually estimated is $\mathbb{E}_\delta(Y \mid t, \mathbf{L}) = \alpha_0 + \alpha_1 t + \mathbf{L}_\delta'\boldsymbol{\alpha}$. Hence, even when C1–C3 are met, it is $\mathbb{E}_\delta\{Y(t)\}$ that is identifiable in the absence of noninformative sampling or additional constraints. $\mathbb{E}_\delta\{Y(t)\}$, however, might not be the target of interest.

We now use concepts from the previous sections to demonstrate that the core assumptions of linear regression for predictive (associative) inference are sufficient for the identification of causal parameters in conjunction with A4 and a causal theory. In essence, we prove that functional average causal estimands remain identified and estimable using elementary methods alone, even when C1–C3 do not hold and informative sampling is present. To this end, we specify the data-generating mechanism for the linear model in equation (3.3) with $\mathbf{L} = \mathbf{l}$ fixed.

(3.3) $Y_{\delta i} = \alpha_0 + \alpha_1 t_i + \mathbf{l}_i'\boldsymbol{\alpha} + \varepsilon_{\delta i}$.

Traditionally, equation (3.3) requires that $\mathbb{E}_\delta(\varepsilon_i \mid t_i, \mathbf{l}_i) = 0$ for all $i$ when predictive inference is the goal and all covariate patterns of interest are fixed. This is weaker than strict exogeneity, which requires that $\mathbb{E}_\delta(\varepsilon_i \mid T_i, \mathbf{L}_i) = 0$ when $T_i$ and $\mathbf{L}_i$ are stochastic. Using standardization requires a slightly weaker form of strict exogeneity that is conditioned on all treatment contrasts of interest since the method averages over $\mathbf{L}_\delta$: $\mathbb{E}_\delta(\varepsilon_i \mid t_i, \mathbf{L}_i) = 0$ for all $i$. Otherwise, for finite sample inference, the second core assumption is that $\varepsilon_{\delta i} \sim N(0, \sigma_i^2)$ for all $i$.

The first core assumption is commonly evaluated using the predicted versus residual plot. Under the working proposition of valid specification, this plot should demonstrate an approximately symmetric scatter of the residuals about the horizontal zero line for any arbitrarily small neighborhood around any predicted point on the x -axis.

To make use of these traditional conditions, we must first make an inconsequential adjustment to the assumption of normality. We are working within a universe of bounded random variables. Consequently, the $\varepsilon_{\delta i}$ of equation (3.3) cannot be normally distributed. This is no great loss for four related reasons. First, in a grand majority of scientific investigations, $Y_i$ is bounded. For example, if each $Y_i$ is a measurement of a person's blood pressure, it is impossible for it to be less than zero or greater than some large but finite real number. Its distribution cannot be supported on a set that is equal to $\mathbb{R}$. This automatically implies that the $\varepsilon_i$ cannot truly be normally distributed. In these situations, when statisticians assume normality, it is intended as a feasible approximation that results in negligible error, and that is fecund mathematically.

The second reason is similar to the first. Even if someone wishes to insist that $Y_i$ is supported on the entire real line, $Y_{\delta i}$ often cannot be due to the intrinsic limitations of measurement and observation. Third, as hinted in the first reason, the assumption of normality can be replaced with the assumption that $\varepsilon_{\delta i}$ has the CDF $F_i(e_\delta)$ of a normal distribution that has been symmetrically truncated around zero. This is equivalent to positing that each $\varepsilon_{\delta i}$ is related to some variable $Z_i \sim N(0, \sigma_{Z_i}^2)$ s.t. for an (almost) arbitrarily small $\tau > 0$, $\Pr(-M_i \leq Z_i \leq M_i) = 1 - \tau$ and $F_i(e_\delta) = (1 - \tau)^{-1}\{\Phi_{Z_i}(e_\delta) - \Phi_{Z_i}(-M_i)\}$ on $[-M_i, M_i]$. Provided this setup, the bias that results from treating $\varepsilon_{\delta i}$ as strictly normally distributed for mathematical convenience is unimportant, especially since one does not need to identify $\sigma_{Z_i}^2$.

The fourth and last reason, which motivates the next proposition, is related to the requirement of symmetric scatter in the residual versus fitted plot. A symmetrically truncated normal distribution is a special case of a U random variable. Moreover, when a continuous random variable has an expected value of zero, all that is required for regular U status is for it to be supported on a symmetric interval $[-M_i, M_i]$.

Hence, the typical set of assumptions employed for fixed linear regression already requires that each $\varepsilon_{\delta i} \in U$. In addition, we note that positing that $\varepsilon_{\delta i} \in U$ for all $i$ is a fundamentally weaker assumption than (symmetrically truncated) normality. Under this milder condition, a researcher only needs to verify that the residual versus predicted plot is (approximately) symmetrically supported around zero about any neighborhood of predicted values. The behavior of the scatter within any neighborhood is otherwise unimportant, insofar as it reasonably justifies that the expected value is also zero. Nevertheless, it is apropos to state that, if only this milder condition is supposed, then the concentration inequality of Table 1 or a central limit theorem is required for the construction of confidence intervals. Of course, under copious amounts of dependencies, a central limit theorem will not necessarily apply.
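A rough check of this milder condition only requires the fitted model's residuals. The R lines below, using a built-in data set purely for illustration, examine whether the residual support is approximately symmetric about zero:

```r
# Check whether residual supports look symmetric about zero (U condition).
fit <- lm(dist ~ speed, data = cars)   # built-in data, purely for illustration
e <- residuals(fit)
c(min = min(e), max = max(e), midrange = (min(e) + max(e)) / 2)  # midrange ~ 0?
plot(fitted(fit), e, xlab = "Fitted", ylab = "Residual")
abline(h = 0, lty = 2)   # scatter should straddle zero with symmetric extremes
```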

We now present a useful main result in Proposition 5, although it is technically a more detailed case of Proposition 3. The extra assumption of regularity is not strictly necessary.

Proposition 5

Assume A4. Say $Y_\delta \mid t, \mathbf{l} = g(t, \mathbf{l}, \boldsymbol{\alpha}) + \varepsilon_\delta$ for some measurable (possibly monotonic) function $g$. Suppose each $\varepsilon_\delta$ is regular, $\mathbb{E}_\delta(\varepsilon \mid t, \mathbf{l}) = 0$, and $\varepsilon_\delta \in U$ for $t, \mathbf{l}$ fixed. Then $\mathbb{E}_\delta(Y \mid t, \mathbf{l}) = Av\{Y(t) \mid \mathbf{l}\}$.

Proof

Let $t, \mathbf{l}$ be arbitrary. Then $\mathbb{E}_\delta(Y \mid t, \mathbf{l}) = g(t, \mathbf{l}, \boldsymbol{\alpha})$ since $\mathbb{E}_\delta(\varepsilon \mid t, \mathbf{l}) = 0$.

For an arbitrary bounded random variable $Z$, say $\min(Z) = \min_{z \in S_Z}(z)$ and $\max(Z) = \max_{z \in S_Z}(z)$, the greatest lower and least upper bounds of the closed set $S_Z$, respectively. Since $g(t, \mathbf{l}, \boldsymbol{\alpha})$ is a constant, it follows that $\min(Y_\delta \mid t, \mathbf{l}) = g(t, \mathbf{l}, \boldsymbol{\alpha}) + \min(\varepsilon_\delta)$ and $\max(Y_\delta \mid t, \mathbf{l}) = g(t, \mathbf{l}, \boldsymbol{\alpha}) + \max(\varepsilon_\delta)$. Moreover, since each $\varepsilon_\delta$ is regular, each $Y_\delta \mid t, \mathbf{l}$ is also obviously regular. From here:

$\min(Y_\delta \mid t, \mathbf{l}) + \max(Y_\delta \mid t, \mathbf{l}) = 2g(t, \mathbf{l}, \boldsymbol{\alpha}) + \min(\varepsilon_\delta) + \max(\varepsilon_\delta)$

$2^{-1}\{\min(Y_\delta \mid t, \mathbf{l}) + \max(Y_\delta \mid t, \mathbf{l})\} = g(t, \mathbf{l}, \boldsymbol{\alpha}) + 2^{-1}\{\min(\varepsilon_\delta) + \max(\varepsilon_\delta)\}$

$Av(Y_\delta \mid t, \mathbf{l}) = g(t, \mathbf{l}, \boldsymbol{\alpha}) + 0$

$\mathbb{E}_\delta(Y \mid t, \mathbf{l}) = Av(Y_\delta \mid t, \mathbf{l}) = Av\{Y(t) \mid \mathbf{l}\}$

The third line follows from regularity and the fact that $\varepsilon_\delta \in U$ for $t, \mathbf{l}$ fixed. The last line follows from substitution and A4.□

Set $g\{t, \mathbf{l}, (\alpha_0, \alpha_1, \boldsymbol{\alpha})\} = \alpha_0 + \alpha_1 t + \mathbf{l}'\boldsymbol{\alpha}$ to recover equation (3.3). Under the assumption that $(t, \mathbf{l})$ and $(t', \mathbf{l})$ have both been fixed, where the values $t$ and $t'$ represent the treatment values to be contrasted, Proposition 5 implies that $\alpha_1(t - t') = Av\{Y(t) \mid \mathbf{l}\} - Av\{Y(t') \mid \mathbf{l}\}$: the difference in the average values of $Y(T) \mid \mathbf{l}$ when $T = t$ and $T = t'$. When $(t, \mathbf{l})$ and $(t', \mathbf{l})$ have not been fixed but at least all treatment values have been, i.e., when the researcher did not fix $\mathbf{l}$ for all contrasts of scientific interest, the previous statement still holds in general when $\mathbb{E}_\delta(\varepsilon \mid t, \mathbf{L}) = 0$ for each $t$ involved. The proof of Proposition 5 would only need to use this stronger statement with no further change. If the researcher wishes to reason about contrasts of $T$ that have not been fixed as well, then the (nontrivial) assumption that $\mathbb{E}_\delta(\varepsilon \mid T, \mathbf{L}) = 0$ suffices.

This proves that conditions weaker than those traditionally supposed for making inferences about associations are sufficient for inference w.r.t. causal parameters. A statistician can target these parameters using linear regression under (almost) arbitrary sampling bias, insofar as linearity and U status of the error distributions are feasibly defensible w.r.t. the sample measures, at least the supports have been preserved, and she can defend the causal assertions. Importantly, A4 can be weakened further since we truly only require the preservation of functional averages.

We choose to highlight five additional points of interest. The first concerns the interpretation of $\alpha_1$. It uses similar language to current interpretations of regression coefficients. However, care is due. Although the word "average" is often used when interpreting the coefficients of typical linear regressions, this is imprecise slang for the change in expected value. In distinction, the word "average" is precise for the functional average since it is a uniform averaging over all possible values that the outcome can materialize.

The second point is a caveat. Proposition 5 enables reasoning about counterfactual conditional functional averages in high-dimensional settings. Often, however, the researcher cares mostly about a marginal estimate, and – although they can serve a similar purpose – $\mathbb{E}\{Av\{Y(t) \mid L\}\} \neq Av\{Y(t)\}$ and $Av\{Av\{Y(t) \mid L\}\} \neq Av\{Y(t)\}$ in general. For clarity, we designate $L$ as a single discrete covariate WLOG to exemplify these estimands below:

$\mathbb{E}\{Av\{Y(t) \mid L\}\} = \sum_{l \in S_L}\left\{R_{Y \mid l}^{-1}\int_{S_{Y \mid l}} y\,\mathrm{d}y\right\}f(l), \quad Av\{Av\{Y(t) \mid L\}\} = \lvert S_L \rvert^{-1}\sum_{l \in S_L}\left\{R_{Y \mid l}^{-1}\int_{S_{Y \mid l}} y\,\mathrm{d}y\right\}.$

Although they might not reduce perfectly to an isolated marginal effect, this does not signify that these parameters do not possess scientific meaning or that they are not attractive in any way – and this is the core of our third point. For example, $\mathbb{E}\{Av\{Y(t) \mid L\}\} - \mathbb{E}\{Av\{Y(t') \mid L\}\}$ can still be interpreted as an expected functional average effect over $L$. Furthermore, unlike iterated expectations, $Av\{Av\{Y(t) \mid L\}\}$ does not strictly require full positivity since it is always a well-defined average over those $L = l$ that are observable when $T \in \{t, t'\}$, the set of contrasted treatment exposures. This is also why we observed that A4 at least implies restricted positivity, since it requires that $f(t \mid l) > 0$ for some variable $L^*$ that possibly represents a censored version of $L$. Once again, this censoring does not matter insofar as averaging over the $L = l$ of scientific interest is possible for the targeted treatment exposures. Nevertheless, one special circumstance when the identity $\mathbb{E}\{Av\{Y(t) \mid L\}\} = Av\{Y(t)\}$ does hold is when the premises of Proposition 4 hold: a point we return to shortly.

Expected or uniform averaging of functional averages over the strata of $L$ does possess a succinct meaning when the underlying model is linear. To see this, again appreciate the model, assuming it is validly specified in the spirit of Proposition 5: $Av\{Y(t) \mid \mathbf{L}\} = \alpha_0 + \alpha_1 t + \mathbf{L}'\boldsymbol{\alpha}$. If this is the case, then $Av\{Y(t) \mid \mathbf{L}\} - Av\{Y(t') \mid \mathbf{L}\} = \alpha_1(t - t')$. Since this is constant, $\mathbb{E}\{Av\{Y(t) \mid \mathbf{L}\} - Av\{Y(t') \mid \mathbf{L}\}\} = Av\{Av\{Y(t) \mid \mathbf{L}\} - Av\{Y(t') \mid \mathbf{L}\}\} = \alpha_1(t - t')$. In short, all averaged estimands are proportional to the regression coefficient.

Point four requires us to revisit the AIPW estimator. Although this estimator has historically been used to target a different estimand, it is a statistically consistent estimator of $\mathbb{E}\{Av\{Y(1) \mid L\} - Av\{Y(0) \mid L\}\}$ under mild dependency conditions if the outcome model is validly specified, C1–C3 hold, and the premises of Propositions 4 or 5 do as well. This is because – under these premises – $\mathbb{E}\{Av\{Y(1) \mid L\} - Av\{Y(0) \mid L\}\} = \mathbb{E}\{\mathbb{E}\{Y(1) \mid L\} - \mathbb{E}\{Y(0) \mid L\}\} = \mathbb{E}\{Y(1)\} - \mathbb{E}\{Y(0)\}$. Otherwise, the AIPW estimator and functional average estimators target different estimands.

That is not to assert, however, that the AIPW estimator is not useful for comparison or the testing of hypotheses. Recall: standard linear regressions already assume U errors. Since the AIPW estimator partially relies on a well-specified outcome model, the U assumption is often intrinsic to its use. Using the AIPW estimator, however, requires C1–C3, which guarantee A4. Therefore, in most contexts where it is employed, it is implicitly assumed that $Y(t) \mid L \in U$ and $Y \mid t, L \in U$. In conclusion, the AIPW estimator and estimators of $\mathbb{E}\{Av\{Y(1) \mid L\} - Av\{Y(0) \mid L\}\}$ do target the same estimand in typical practice settings. This is useful to know. Researchers can commence their analysis under A1–A3 or A4 to identify one level of causal estimands, including $\mathbb{E}\{Av\{Y(1) \mid L\} - Av\{Y(0) \mid L\}\}$. If the distance between the latter and the AIPW estimate is sufficiently small, then this is information that can at least be used to support the notion that C1–C3 have been achieved. This same point stands for the plug-in g-formula estimator. In this sense, one can justify stepping upwards on the ladder of assumptions to more defensibly assert the identification of traditional causal estimands.

Table 2 offers a summary of some of these ladders of assumptions w.r.t. a subset of estimators explored in this article. It does not exhaust all possible assumptions and estimands in the context of functional averages and their close relationship to expected values. Nevertheless, it covers much ground and requires a short exposition for clarity. Each subsection of rows offers information on a particular family of estimators and commences with two sets of assumptions labeled (1) and (2). The first set abbreviates a version of the "test theory" basis for functional average preservation while the second starts from the main assumptions of this article. From there, additional assumptions are added to either grouping. These assumptions allow for the transformation of the targeted estimand into one that is closer to the traditional one. In general, assumption groupings that rest on the ladder's lower rungs are easier to achieve, although this might not be true in particular research contexts. Also, the rows of Table 2 that correspond to probabilistic averaging over strata include C3 as an added assumption. Technically, this is unnecessary. However, C3 has been included since positivity is required for differences in counterfactual expected values to be well defined. Recall also that $\mathbb{E}_\delta(\cdot)$ denotes an expectation that is taken w.r.t. the sample distribution. When the table asserts that $\mathbb{E}_\delta(\cdot) = \mathbb{E}(\cdot)$ is required, it is stating that the expected value w.r.t. the sample distribution must be unbiased for the population expected value. This is a strictly weaker condition than noninformative sampling, as previously discussed. Furthermore, although all sample estimators technically require $\delta$ notation, we have only included it in a few places. This is for ease of reading and to draw attention to this fact in more critical places, such as where expected values are involved or a uniform averaging exists over the support of a conditioning variable.

Table 2

Estimators and ladders of assumptions

Estimator | Assumption ladder | Estimand
$\hat{\mathrm{MR}}_{t,l}$ | (1): $Y \mid t, l = Y(t) \mid l + \tau$, $S_{Y(t) \mid l + \tau} = S_{Y(t) \mid l} \times S_\tau$, $Av(\tau) = 0$ | $Av\{Y(t) \mid l\}$
 | (2): A1–A3 or A4 |
$\lvert S_{L_\delta}\rvert^{-1}\sum_{l \in S_{L_\delta}} \hat{\mathrm{MR}}_{t,l}$ | (1) or (2) | $Av\{Av\{Y(t) \mid L_\delta\}\}$
$\sum_{l \in S_{L_\delta}} \hat{\mathrm{MR}}_{t,l}\,\hat{f}(L_\delta = l)$ | (3): (1) or (2), $\mathbb{E}_\delta(\cdot) = \mathbb{E}(\cdot)$, C3 | $\mathbb{E}\{Av\{Y(t) \mid L\}\}$
 | (3) and $Y(t) \mid l \in U$ for $l \in S_{L_\delta}$ | $\mathbb{E}\{Y(t)\}$
$\bar{Y}_{t,l}$ | (1): $\bar{Y} \mid t, l = \bar{Y}(t) \mid l + \tau$, $S_{\bar{Y}(t) \mid l + \tau} = S_{\bar{Y}(t) \mid l} \times S_\tau$, $Av(\tau) = 0$ | $Av\{\bar{Y}(t) \mid l\}$
 | (2): A1–A3, A4 for $\bar{Y} \mid t, l$ and $\bar{Y} \mid t, l \in U$ for large $n$ |
 | (3): A1–A3 or A4 for $Y \mid t, l$ and $Y \mid t, l \in U$ | $Av\{Y(t) \mid l\}$
 | (3) and $Y(t) \mid l \in U$ | $\mathbb{E}\{Y(t) \mid l\}$
$\lvert S_{L_\delta}\rvert^{-1}\sum_{l \in S_{L_\delta}} \bar{Y}_{t,l}$ | (1) or (2) | $Av\{Av\{\bar{Y}(t) \mid L_\delta\}\}$
 | (4): (1) or (2) and $\bar{Y}(t) \mid l \in U$ for $l \in S_{L_\delta}$ | $Av\{\mathbb{E}\{\bar{Y}(t) \mid L_\delta\}\}$
 | (4) and $\mathbb{E}\{\bar{Y}(t) \mid L_\delta\} \in U$ | $\mathbb{E}\{\bar{Y}(t)\}$
$\sum_{l \in S_{L_\delta}} \bar{Y}_{t,l}\,\hat{f}(L_\delta = l)$ | (5): (1) or (2), $\mathbb{E}_\delta(\cdot) = \mathbb{E}(\cdot)$, C3 | $\mathbb{E}\{Av\{\bar{Y}(t) \mid L\}\}$
 | (5) and $\bar{Y}(t) \mid l \in U$ for $l \in S_L$ | $\mathbb{E}\{\bar{Y}(t)\}$
$\hat{Y}_{t,l}$ | (1): $Y \mid t, l = Y(t) \mid l + \tau$, $S_{Y(t) \mid l + \tau} = S_{Y(t) \mid l} \times S_\tau$, $Av(\tau) = 0$, $Y \mid t, l = g(t, l, \boldsymbol{\alpha}) + \varepsilon$, $\mathbb{E}(\varepsilon \mid t, l) = 0$, $\varepsilon \in U$ | $Av\{Y(t) \mid l\}$
 | (2): A1–A3 or A4, $Y \mid t, l = g(t, l, \boldsymbol{\alpha}) + \varepsilon$, $\mathbb{E}(\varepsilon \mid t, l) = 0$, $\varepsilon \in U$ |
 | (1) or (2) and $Y(t) \mid l \in U$ | $\mathbb{E}\{Y(t) \mid l\}$
$\lvert S_{L_\delta}\rvert^{-1}\sum_{l \in S_{L_\delta}} \hat{Y}_{t,l}$ | (1) or (2) | $Av\{Av\{Y(t) \mid L_\delta\}\}$
 | (1) or (2), $S_{L_\delta} = S_L$, $Y(t) \mid l \in U$ for $l \in S_L$, $\mathbb{E}\{Y(t) \mid L_\delta\} \in U$ | $\mathbb{E}\{Y(t)\}$
$\sum_{l \in S_{L_\delta}} \hat{Y}_{t,l}\,\hat{f}(L_\delta = l)$ | (3): (1) or (2), $\mathbb{E}_\delta(\cdot) = \mathbb{E}(\cdot)$, C3 | $\mathbb{E}\{Av\{Y(t) \mid L\}\}$
 | (3) and $Y(t) \mid l \in U$ for $l \in S_L$ | $\mathbb{E}\{Y(t)\}$

The last point is concise to state. The analysis of variance (ANOVA) is a special case of the linear regression model presented. It is an elementary tool that is ubiquitous in research. All prior discussion and Proposition 5 therefore apply to ANOVA procedures under conditions that are already stipulated. Hence, insofar as A1–A3 or A4 are defensible, this means that a plethora of prior work can be re-interpreted with a restricted causal lens in partnership with a much more general structural causal model that allows for unmeasured confounders.

4 Monte Carlo simulations

Before we apply our strategy to real data, we illustrate its utility with a set of Monte Carlo simulations that show how functional average and U concepts are useful for causal inference. For simplicity, we proceed with noninformative sampling conditions that presuppose mutual independence. All simulations use $M = 1{,}000$ iterations for sample sizes $n \in \{500, 2{,}500, 5{,}000, 10{,}000\}$. Furthermore, all constructed $1 - \alpha$ confidence sets use $\alpha = 0.05$. Three main simulation experiments are provided in total. The first examines the behavior of basic functional average estimators for symmetric and nonsymmetric distributions. The second and third simulations demonstrate that causal effects can be consistently estimated without controlling for confounding. All experiments examine the performance of Hoeffding style bootstrapping procedures.

4.1 Univariate functional average estimation

The first simulation of this experiment examines the performance of $\hat{\mathrm{MR}}$ for three truncated normal distributions: $Y_1 \sim TN(m = 0, M = 20, \mu = 10, \sigma = 5)$, $Y_2 \sim TN(m = 0, M = 15, \mu = 10, \sigma = 3)$, and $Y_3 \sim TN(m = 0, M = 15, \mu = 5, \sigma = 3)$. The first random variable $Y_1$ is in the U class and hence $\theta_1 = Av(Y_1) = \mathbb{E}Y_1 = 10$. However, $Y_2$ and $Y_3$ are not U variables. Their distributions are skewed with tails that impact convergence behavior. For these variables, $\theta = Av(Y) = 7.5$.

Hoeffding bootstrap style confidence sets ($\hat{CI}_H$) are constructed as described in Section 2. To contrast their performance, we also use an $m$-out-of-$n$ bootstrap. Again, an important requirement of the $m$-out-of-$n$ bootstrap is that $m \to \infty$, but $n^{-1}m \to 0$. To meet this criterion, we set $m = r(\sqrt{n})$, i.e., $\sqrt{n}$ rounded to the nearest integer, since this setting produces relatively conservative results. Hence, if it fails to perform well, this highlights the utility of the Hoeffding procedure. Percentile confidence sets ($\hat{CI}_{p,m}$) are constructed from the $m$-out-of-$n$ bootstraps. For reference, we also construct Hoeffding style confidence sets ($\hat{CI}_{H,m}$) from this process. Importantly, all bootstrap procedures make use of simple random sampling with replacement and only 500 bootstrap samples. Although suboptimal, a low number of bootstrap samples is used to limit computational burden. A decent performance of the Hoeffding bootstrap at $B = 500$ is still a good indicator. Empirical coverage rates are estimated with $EC = M^{-1}\sum_{i=1}^M 1_{\theta \in \hat{CI}_i}$ WLOG.

Table 3 presents the results of this experiment. Importantly, all values in the tables henceforth represent the arithmetic average of simulated objects, including the endpoints of confidence sets.

Table 3

Functional average estimation, continuous

$Av(Y)$ | $n$ | $\hat{\mathrm{MR}}$ | $\hat{CI}_{H,m}$ | $EC_{H,m}$ | $\hat{CI}_{p,m}$ | $EC_{p,m}$ | $\hat{CI}_H$ | $EC_H$
$\theta = 10$ | 500 | 10.01 | (1.85, 18.15) | 1 | (8.02, 11.98) | 1 | (8.87, 11.13) | 1
– | 2,500 | 10 | (4.5, 15.5) | – | (8.72, 11.27) | – | (9.72, 10.28) | –
– | 5,000 | 10 | (5.42, 14.58) | – | (8.97, 11.03) | – | (9.86, 10.14) | –
– | 10,000 | 10 | (6.26, 13.74) | – | (9.17, 10.83) | – | (9.93, 10.07) | –
$\theta = 7.5$ | 500 | 8.11 | (2.1, 14.12) | 1 | (7.58, 10.64) | 0.40 | (6.51, 9.7) | 0.88
– | 2,500 | 7.74 | (2.98, 12.5) | – | (7.6, 10.09) | 0.30 | (6.82, 8.65) | 0.94
– | 5,000 | 7.63 | (3.31, 11.96) | – | (7.59, 9.87) | 0.25 | (6.96, 8.31) | 0.97
– | 10,000 | 7.58 | (3.7, 11.46) | – | (7.58, 9.67) | 0.21 | (7.13, 8.03) | 0.98
$\theta = 7.5$ | 500 | 6.91 | (0.88, 12.94) | 1 | (4.37, 7.44) | 0.43 | (5.31, 8.51) | 0.90
– | 2,500 | 7.26 | (2.46, 12.06) | – | (4.91, 7.4) | 0.30 | (6.34, 8.18) | 0.95
– | 5,000 | 7.37 | (3.04, 11.69) | – | (5.13, 7.41) | 0.23 | (6.69, 8.04) | 0.97
– | 10,000 | 7.42 | (3.52, 11.33) | – | (5.33, 7.42) | 0.18 | (6.97, 7.87) | 0.97

Each $\theta$ corresponds to the functional average of the TN distributions introduced above; "–" indicates that the value is the same as in the first row of the block.

These results demonstrate that $\hat{\mathrm{MR}}$ behaves as intended. It is unbiased for the symmetric distribution, and convergence behavior – albeit slow – is observable w.r.t. the target parameters for the asymmetric distributions that exhibit more problematic tail behavior. Importantly, the Hoeffding-style bootstrap also appears to behave as intended. Although it failed to uphold nominal coverage values for the skewed distributions at $n = 500$, it quickly overcame this behavior to provide conservative empirical coverage for $\theta$. Also, the difficulties of efficiently estimating $Av(Y)$ are evident for non-U variables, highlighting the utility of this type of variable. Notably, the $m$-out-of-$n$ percentile interval did not perform well even when $m$ was $O(\sqrt{n})$.

The second experiment is for discrete variables. Like before, we use three different truncated normal distributions: $Y_1 \sim TN(m = 0, M = 40, \mu = 20, \sigma = 5)$, $Y_2 \sim TN(m = 0, M = 40, \mu = 25, \sigma = 8)$, and $Y_3 \sim TN(m = 0, M = 40, \mu = 15, \sigma = 8)$. These random variables are uniformly rounded to the nearest integer to induce discreteness. Here, we contrast $\hat{\mathrm{MR}}$ with $\hat{Av}$, the discrete plug-in for equation (2.1). Hoeffding style bootstraps are again employed to construct confidence sets. However, now we use a U approximation in accordance with Section 3. Even if U status does not hold exactly, insofar as convergence behavior to a constant holds, the method should remain robust.
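For intuition, the R sketch below pairs the sample mid-range with a simplified stand-in for the discrete plug-in – a uniform average of the distinct observed support values – applied to a rounded truncated normal sample. The stand-in is our assumption for illustration, not the exact plug-in defined for equation (2.1):

```r
# Simplified discrete plug-in (our stand-in) vs. the sample mid-range.
av_hat <- function(y) mean(unique(y))          # uniform average of observed support
mr_hat <- function(y) (min(y) + max(y)) / 2    # sample mid-range

set.seed(5)
# rounded TN(m = 0, M = 40, mu = 20, sigma = 5) via inverse-CDF sampling
u <- runif(5000, pnorm(0, 20, 5), pnorm(40, 20, 5))
y <- round(qnorm(u, 20, 5))
c(av_hat = av_hat(y), mr_hat = mr_hat(y))      # both near Av(Y) = 20
```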

Results for this simulation are available in Table 4. We no longer examine the performance of the $m$-out-of-$n$ bootstrap.

Table 4

Functional average estimation, discrete

$Av(Y) = 20$ | $n$ | $\hat{Av}$ | $\hat{CI}_H$ | $EC_H$ | $\hat{\mathrm{MR}}$ | $\hat{CI}_H$ | $EC_H$
$r(Y_1)$ | 500 | 20.012 | (15.21, 24.82) | 1 | 20.044 | (13.88, 26.21) | 1
– | 2,500 | 19.992 | (15.84, 24.14) | – | 19.969 | (14.93, 25.01) | –
– | 5,000 | 20.014 | (16.18, 23.85) | – | 20.023 | (15.58, 24.47) | –
– | 10,000 | 20.020 | (16.39, 23.65) | – | 20.016 | (16, 24.04) | –
$r(Y_2)$ | 500 | 21.921 | (17.62, 26.22) | 0.971 | 21.218 | (16.48, 25.96) | 0.955
– | 2,500 | 20.490 | (18.19, 22.79) | 0.980 | 20.365 | (18.17, 22.56) | 0.963
– | 5,000 | 20.194 | (18.55, 21.83) | 0.985 | 20.169 | (18.67, 21.67) | 0.975
– | 10,000 | 20.051 | (19, 21.1) | 0.972 | 20.050 | (19.06, 21.04) | 0.957
$r(Y_3)$ | 500 | 18.053 | (13.8, 22.31) | 0.963 | 18.788 | (14.04, 23.54) | 0.953
– | 2,500 | 19.518 | (17.22, 21.82) | 0.985 | 19.638 | (17.45, 21.83) | 0.967
– | 5,000 | 19.803 | (18.16, 21.44) | 0.987 | 19.833 | (18.34, 21.32) | 0.980
– | 10,000 | 19.941 | (18.87, 21.01) | 0.977 | 19.942 | (18.94, 20.94) | 0.963

The results of the discrete simulation largely match those of the preceding continuous one. The plug-in estimator performed more favorably than the sample mid-range only for $r(Y_1)$: the distribution with the lightest tails. Importantly, although many simulations demonstrated departures from U status, the coverage remained robust due to the conservative nature of the Hoeffding bootstrap. Further details pertaining to this fact are available in the supplementary material.

4.2 Causal inference with functional averages

Next, we demonstrate a classical case of confounding where the functional average treatment effect is identifiable and statistically consistently estimable without any adjustment. We use the following variables for this simulation: $C \sim Bern(0.5)$, $T \sim Bern(0.3 + 0.5C)$, and $e \sim TN(m = -50, M = 50, \mu = 0, \sigma = \tau)$. Moreover, we will say that $Y(1) = 110 + 50C + e$, $Y(0) = 100 + 50C + e$, and $Y = Y(1)T + (1 - T)Y(0)$.

The confounding variable is $C$, which is present in half of the theoretical population on average. When $C = 1$, the probability of allocation to treatment is larger. The structure of the confounding also preserves the support of the counterfactual distribution w.r.t. the observed one. We also set $\tau$ to 5 and then 25 to demonstrate the performance of a functional average estimator when the tail probabilities are thin or heavy. Notably, this experiment is also structured s.t. A1–A3 and therefore A4 all hold.

We contrast simple linear regression with $\hat{\mathrm{MR}}$ for estimating $\Delta = Av\{Y(1)\} - Av\{Y(0)\} = 10$. Here, the mid-range estimator of $\Delta$ is $\hat{\Delta}_{\mathrm{MR}} = 2^{-1}\{Y_{(1)}^1 + Y_{(n_1)}^1\} - 2^{-1}\{Y_{(1)}^0 + Y_{(n_0)}^0\}$, where $Y_{(1)}^t$ and $Y_{(n_t)}^t$ denote the minimum and maximum order statistics within treatment group $t$. This simulation is executed WLOG since it can be implicitly assumed that the researcher has stratified on some set of variables – such as propensity scores – to limit confounding bias or achieve the equality of supports. Confidence sets and estimators of the empirical coverage are all otherwise constructed as previously explored using the Hoeffding bootstraps of Section 2. Power is estimated by $EP_H = M^{-1}\sum_{i=1}^M 1_{0 \notin \hat{CI}_{H,i}}$. Table 5 has the results.
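One draw from this data-generating process makes the contrast plain. The R sketch below reproduces the continuous confounding setup at $\tau = 25$ and compares the OLS coefficient with the unadjusted mid-range difference; the inverse-CDF method for the truncated normal draw is an implementation choice of ours:

```r
# One draw from the continuous confounding simulation described above.
set.seed(6)
n <- 10000; tau <- 25
C <- rbinom(n, 1, 0.5)
T <- rbinom(n, 1, 0.3 + 0.5 * C)
e <- qnorm(runif(n, pnorm(-50, 0, tau), pnorm(50, 0, tau)), 0, tau)  # TN error
Y1 <- 110 + 50 * C + e
Y0 <- 100 + 50 * C + e
Y <- T * Y1 + (1 - T) * Y0
coef(lm(Y ~ T))["T"]               # confounded: near 35, not 10
mr <- function(z) (min(z) + max(z)) / 2
mr(Y[T == 1]) - mr(Y[T == 0])      # near the target Av effect of 10, unadjusted
```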

Table 5

Continuous effect estimators: Δ 1 , 0 = 10

$\tau = 5$ | $n$ | $\hat{\Delta}_{\mathrm{OLS}}$ | $\hat{\Delta}_{\mathrm{MR}}$ | $\hat{CI}_H$ | $EP_H$
 | 500 | 33.45 | 11.76 | (1.53, 22) | 0.71
 | 2,500 | 34.39 | 11.69 | (3.05, 20.33) | 0.85
 | 5,000 | 35.09 | 11.65 | (3.67, 19.63) | 0.90
 | 10,000 | 35.18 | 11.47 | (3.88, 19.07) | 0.92
$\tau = 25$ | $n$ | $\hat{\Delta}_{\mathrm{OLS}}$ | $\hat{\Delta}_{\mathrm{MR}}$ | $\hat{CI}_H$ | $EP_H$
 | 500 | 33.44 | 12.82 | ($-10.47$, 36.11) | 0.05
 | 2,500 | 34.40 | 10.84 | (2.73, 18.95) | 0.91
 | 5,000 | 35.11 | 10.47 | (5.79, 15.15) | 1
 | 10,000 | 35.20 | 10.25 | (7.68, 12.82) | –

All empirical coverage estimators returned 1 for $\hat{\Delta}_{\mathrm{MR}}$.

As expected, $\hat{\Delta}_{\mathrm{OLS}}$ is a confounded estimator of $\mathbb{E}\{Y(1)\} - \mathbb{E}\{Y(0)\} = \Delta_{1,0} = 10$. However, this is not the case for $\hat{\Delta}_{\mathrm{MR}}$, which demonstrates convergence behavior towards the true parameter value, albeit at a suboptimal rate. Reiteration of a poignant fact here is valuable: $\hat{\mathrm{MR}}$ demonstrates this behavior without adjustment. Moreover, although the sampling conditions supposed here were noninformative, this was unnecessary. Insofar as the supports are preserved, informative sampling conditions are unimportant w.r.t. statistical consistency, although they might impact convergence rates. These results are also consistent with prior discussions on the mid-range. We expect the mid-range to possess more favorable properties when nonnegligible mass or density rests in the tails, which is the case when $\tau = 25$ in these simulations.

Next, we present a restricted discrete analogue to the last experiment. $C$ and $T$ retain their definitions, but $e \sim Binom(\tau, 2^{-1})$ for $\tau \in \{30, 50\}$. Otherwise, $Y(0) = 10C + e$, $Y(1) = Y(0) + 5$, and $Y = Y(1)T + (1 - T)Y(0)$ as before. For this simulation, we employ the Hoeffding bootstrap for U statistics once more. Table 6 below presents the results.

Table 6

Discrete effect estimators: Δ 1 , 0 = 5

 | $n$ | $\hat{\Delta}_{\mathrm{OLS}}$ | $\hat{\Delta}_{\hat{Av}}$ | $\hat{CI}_H$ | $EP_H$ | $\hat{\Delta}_{\hat{\mathrm{MR}}}$ | $\hat{CI}_H$ | $EP_H$
$\tau = 30$ | 500 | 9.694 | 6.059 | (0.55, 11.57) | 0.681 | 5.899 | ($-0.47$, 12.27) | 0.447
 | 2,500 | 9.875 | 5.91 | (1.34, 10.48) | 0.871 | 5.861 | (0.63, 11.09) | 0.680
 | 5,000 | 10.017 | 5.8 | (1.45, 10.15) | 0.890 | 5.757 | (0.83, 10.69) | 0.728
 | 10,000 | 10.037 | 5.765 | (1.68, 9.85) | 0.921 | 5.729 | (1.19, 10.27) | 0.805
$\tau = 50$ | 500 | 9.696 | 6.443 | ($-0.16$, 13.04) | 0.458 | 6.192 | ($-2.11$, 14.49) | 0.236
 | 2,500 | 9.875 | 6.268 | (0.75, 11.79) | 0.728 | 6.150 | ($-0.69$, 12.99) | 0.419
 | 5,000 | 10.017 | 6.147 | (0.87, 11.42) | 0.734 | 6.052 | ($-0.48$, 12.59) | 0.457
 | 10,000 | 10.036 | 6.074 | (1.08, 11.07) | 0.804 | 5.995 | ($-0.03$, 12.02) | 0.551

All empirical coverage estimators were > 0.998 .

The discrete estimators demonstrate finite sample bias. Nonetheless, they also demonstrate convergence toward $\Delta_{1,0}$ as $n$ becomes large, while $\hat{\Delta}_{\mathrm{OLS}}$ remains confounded. Although $\Delta_{1,0} = 5$ is a small to moderate effect size, $\hat{\Delta}_{\hat{Av}}$ shows ample power to detect it under the null hypothesis that $\Delta_{1,0} = 0$ for reasonable sample sizes when $\tau$ is set to 30. Traditional levels of acceptable power are only achieved by the $\hat{\Delta}_{\hat{Av}}$ estimator at $n = 10{,}000$ when $\tau = 50$. The increase in variance and the lowered likelihood of observing some values of the support hinder both estimators' performance. Although the mid-range estimator seems to possess less bias in these simulations, the plug-in estimator seems to possess more power.

Our last experiment demonstrates how linear regression can still be used for causal inference when mean exchangeability does not hold, but each error term is a U variable. To this end, we use a new setup for any measurable function $g$: $T \sim Bern(0.3)$, $U_1 \sim TN(m = -10, M = 10, \mu = 0, \sigma = 2)$, and $\mu_T = 100 + 20T$. Then we will say $U_2 \sim TN(m = \mu_T - g(T) - 10, M = \mu_T - g(T) + 10, \mu = 0, \sigma = 2)$, $Y \mid T = \mu_T + U_1$, and $Y(T) = g(T) + U_2$.

For simplicity, we set $g(T) = 90 + 10T$. The underlying model that results from these specifications can also be written as follows: $Y \mid T = Y(T) + 10 + 10T + \gamma$, where $\gamma = U_1 - U_2$. This model structure can be amended to include more covariates. However, this is unnecessary for our demonstration. What is important is that, conditional on a design matrix $\mathbf{x}$, the functional averages are preserved and linearity holds for $Y \mid \mathbf{x}$. The true generating process for the counterfactual distribution can be otherwise unknown. Here, $\Delta_{1,0} = 20$. However, confounding is present (in addition to other violations) since $\mathbb{E}\{Y(1)\} - \mathbb{E}\{Y(0)\} = 10 \neq \mathbb{E}(Y \mid 1) - \mathbb{E}(Y \mid 0) = 20$.

The results are provided in Table 7. The $\hat{CI}_t$ column provides the arithmetic average of the standard $t$-distribution confidence set endpoints. For reference, we also include $\hat{CI}_U$ for confidence sets constructed from the concentration inequality in Table 1 for U errors.

Table 7

Linear regression for functional averages

$n$ | $\hat{\Delta}_{\mathrm{OLS}}$ | $\hat{CI}_t$ | $\hat{CI}_U$ | $\hat{CI}_H$
500 | 20 | (19.62, 20.37) | (19.09, 20.91) | (19.9, 21.58)
2,500 | – | (19.83, 20.17) | (19.52, 20.48) | (19.28, 20.71)
5,000 | – | (19.88, 20.12) | (19.64, 20.35) | (19.49, 20.51)
10,000 | – | (19.91, 20.09) | (19.74, 20.26) | (19.64, 20.36)

$\hat{CI}_U$ is constructed as follows. Say $\mathbf{x}$ is the design matrix for a regression to estimate $\boldsymbol{\beta}$ and $\mathbf{w} = (\mathbf{x}'\mathbf{x})^{-1}\mathbf{x}'$. Then $\hat{\boldsymbol{\beta}} - \boldsymbol{\beta} = \mathbf{w}\boldsymbol{\varepsilon}$. Therefore, $\hat{\beta}_T - \beta_T = \sum_{i=1}^n w_{s,i}\varepsilon_i$, where $s$ is the row of $\mathbf{w}$ corresponding to the treatment feature. Algebraic rearrangement of the concentration inequality then yields confidence sets of the form $\hat{\beta}_T \pm \sqrt{\sum_{i=1}^n R_i^2 \cdot 6^{-1}\log(2/\alpha)}$, where $R_i$ is the population range of $w_{s,i}\varepsilon_i$. Under the assumption of a valid mean model specification, the maximum of the supports of the $\varepsilon_i$ is feasibly estimable with $\hat{e}_{(n)}$ WLOG, where $\hat{e}$ denotes a typical residual. Hence, we can use an approximate confidence set of the following form when the extremes of the support of $Y$ are unknown: $\hat{\beta}_T \pm \{\hat{e}_{(n)} - \hat{e}_{(1)}\}\sqrt{\sum_{i=1}^n w_{s,i}^2 \cdot 6^{-1}\log(2/\alpha)}$. We contrast these confidence sets with those constructed with the Hoeffding bootstrap of Section 2.
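A direct translation of this construction into R is short. The sketch below, with a helper function named by us, computes the approximate $\hat{CI}_U$ for a treatment coefficient from a fitted lm object, using the residual range in place of the unknown error support:

```r
# Approximate CI_U for one regression coefficient, per the construction above.
ci_u <- function(fit, term, alpha = 0.05) {
  x <- model.matrix(fit)
  w <- solve(t(x) %*% x, t(x))     # w = (x'x)^{-1} x'; rows follow coef names
  ws <- w[term, ]                  # row corresponding to the treatment feature
  e <- residuals(fit)              # residual range approximates the error support
  half <- (max(e) - min(e)) * sqrt(sum(ws^2) * log(2 / alpha) / 6)
  coef(fit)[term] + c(lower = -half, upper = half)
}

set.seed(7)
T <- rbinom(2000, 1, 0.3)
Y <- 100 + 20 * T + runif(2000, -10, 10)   # U errors by construction
ci_u(lm(Y ~ T), term = "T")                # should cover the true effect, 20
```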

Since the properties of linear regression are well understood, a thorough discussion is unnecessary. The results again substantiate the utility of U random variables. Under their framework, efficient estimation of causal parameters is more readily achievable, especially if mutual independence is a feasible assumption.

5 A data application

In this section, we employ NHEFS (National Health and Nutrition Examination Survey I Epidemiologic Followup Study) data to demonstrate our concepts. The NHEFS conducted medical examinations from 1971–1975 on noninstitutionalized civilian adults aged 24–74 ($N = 14{,}407$) in the United States as part of a national probability sample. Follow-up surveys were then administered in 1982, 1984, and subsequent years to collect measurements for behavioral, nutritional, and clinical variables. Further documentation is available elsewhere [58]. The subset of data we use here ($n = 1{,}479$) originates from the original 1971 medical examination and the follow-up in 1982.

Exercise (0: moderate to much; 1: little to none) is the treatment variable ($T$) of interest. Age, chronic bronchitis/emphysema diagnosis (1: yes; 0: never), education attained in 1971 (1: < 8th grade; 2: HS dropout; 3: HS; 4: college dropout; 5: college), income, race (1: non-white; 0: white), sex (1: female; 0: male), years smoking, alcohol frequency, and weight (kilograms) are utilized as adjusting covariates (henceforth denoted as $\mathbf{L}$). Although systolic blood pressure (SBP) was measured with integer values, it is still treated as continuous in most instances. Pertinently, $T$ and most covariates were all measured in 1971. Only SBP and weight were measured in 1982. All continuous covariates are centered on their observed sample means for this analysis.

Here, we are interested in seeing if a history of exercise exerted a causal effect on SBP in smokers. We aim to estimate $\mathbb{E}\{Av\{Y(1) \mid L\} - Av\{Y(0) \mid L\}\}$ and $\mathbb{E}\{Av\{Y(1) \mid S\} - Av\{Y(0) \mid S\}\}$ as summary causal effects, where $S = s$ represents a stratum that has been constructed from the quintiles of $\hat{e}(L)$, the estimated propensity scores. Recall that these estimands represent causal effects that have been averaged over all strata, either probabilistically or uniformly. Since we are presupposing linear models, these estimands are simply the treatment coefficients of each respective specification. Our goal is also to estimate the marginal functional average causal effect, $Av\{Y(1)\} - Av\{Y(0)\}$. Although we do not target expected causal effects intentionally, recall that, under the suppositions of Proposition 5, if C1–C3 do hold in addition, then we are implicitly targeting expected causal effects as well since this setup implies that $Av\{Y(T) \mid L\} = \mathbb{E}\{Y(T) \mid L\}$ and hence that $Y(T) \mid L \in U$.

For this reason, and because attempting to account for the salient forces determining a phenomenon is good practice scientifically, we still try to achieve C1-C3 in this analysis. To this end, race is considered a confounder since it represents both genetic information and socio-historical constructions [59]. In a similar vein, we choose to adjust for sex to account for possible biological influences, and since it is also an imperfect proxy for social institutions that can impact exercise habits, other health behaviors, and therefore blood pressure. All other variates mentioned are adjusted for since they are either known to affect both SBP and exercise habits directly or to act as conduits for more general institutional or ecological influences that can transcend time. Theoretically, adjusting for them can help to block backdoor paths from a subset of unknown confounders, which, although mostly irrelevant here in terms of their probabilistic effects insofar as functional averages are concerned, can still impact supports. Furthermore, it seems intuitive to assert that fewer omitted variables can translate into a heightened likelihood that the probabilistic errors remaining follow a U pattern, although this is not guaranteed.

5.1 Methods

The following models are employed to estimate possible causal effects: simple linear regression (SLR) to estimate $Av\{Y(1)\} - Av\{Y(0)\}$ under the working supposition that each outcome distribution is U class, multiple linear regression (MLR) to estimate $\mathbb{E}\{Av\{Y(1) \mid L\} - Av\{Y(0) \mid L\}\}$, and linear regression adjusted for propensity score strata (PS) to estimate $\mathbb{E}\{Av\{Y(1) \mid S\} - Av\{Y(0) \mid S\}\}$. For contrast, we also estimate $Av\{Y(1)\} - Av\{Y(0)\}$ with the discrete plug-in (Av) from Section 2.1. In a similar vein, we also estimate $\mathbb{E}\{Av\{Y(1) \mid L\} - Av\{Y(0) \mid L\}\}$ with the AIPW estimator defined in Section 3. This is accomplished using the multiple linear regression model, an application of the parametric g-formula, and estimated propensity scores. Propensity scores are estimated with logistic regression using the same covariates as the regression model. No covariate transformations are used for the logit model, although we introduce higher-order terms to the regression if doing so appears to improve linearity.

Standard $t$-statistic-based confidence sets are employed for the coefficient estimators of the regression models. We use the Hoeffding bootstrap of Section 3 for the equation (2.1) plug-in and a standard bootstrapping procedure for the AIPW estimator, both with $B = 1{,}000$ replications. The null hypothesis that $\mathbb{E}\{Av\{Y(1) \mid L\} - Av\{Y(0) \mid L\}\} = \mathbb{E}\{Y(1)\} - \mathbb{E}\{Y(0)\}$ is also tested via a standard bootstrapping procedure applied to the difference between the AIPW and MLR estimators. All tests are conducted at the $\alpha = 0.05$ level using R version 4.2.2 statistical software [60]. The assumptions of U status and valid mean model specification are substantiated by inspecting residual versus fitted plots and empirical CDF plots.

5.2 Results

The coefficient estimate for the simple linear regression is $\hat{\beta}_t = -3.87$ (95% CI: $-5.821$, $-1.916$), while the plug-in functional average estimate for $Av\{Y(1)\} - Av\{Y(0)\}$ is 3.81 (95% CI: $-24.17$, 17.42). The latter estimates that, on average, the value of an adult smoker's SBP is 3.81 mmHg less when they possess a history of exercise, although this estimate was not statistically significant. Additional model results are available in Table 8.

Table 8

Summary of model results

Parameter | Method | Estimate | 95% Confidence interval
$Av\{Y(1)\} - Av\{Y(0)\}$ | Av | 3.81 | [$-24.17$, 17.42]
$\mathbb{E}\{Av\{Y(1) \mid L\} - Av\{Y(0) \mid L\}\}$ | MLR | $-0.736$ | [$-2.56$, 1.09]
$\mathbb{E}\{Av\{Y(1) \mid L\} - Av\{Y(0) \mid L\}\}$ | AIPW | $-0.69$ | [$-2.47$, 1.09]
$\mathbb{E}\{Av\{Y(1) \mid S\} - Av\{Y(0) \mid S\}\}$ | PS | $-1.014$ | [$-2.96$, 0.94]

Initial fitting procedures for the multiple linear regression model showed mild departures from linearity. The addition of a quadratic term for age appeared to improve model fit. From this reformed model, the estimate for the expected functional average change in SBP is $\hat{\mathbb{E}}\{Av\{Y(1) \mid L\} - Av\{Y(0) \mid L\}\} = -0.736$ (95% CI: $-2.56$, 1.09), meaning that adult smokers with a history of exercise were measured to have 0.736 mmHg less SBP on average than those who did not exercise, with all other covariates held equal. This estimate is not statistically significantly different from the AIPW estimate $\{$Difference $= -0.043$ (95% CI: $-0.414$, 0.329)$\}$, which estimates that the SBP of adult smokers is 0.69 less on average when they report a history of exercise (95% CI: $-2.47$, 1.09). Finally, the estimate of the expected functional average change in SBP w.r.t. the propensity score stratified model is $\hat{\mathbb{E}}\{Av\{Y(1) \mid S\} - Av\{Y(0) \mid S\}\} = -1.014$ (95% CI: $-2.96$, 0.94).

5.3 Model checking

Empirical CDF plots for the SBP distributions are presented by propensity score strata and by treatment status in Figure 1. The residual versus fitted plot for the MLR model is also included.

Figure 1: Model validation plots.

We note no remarkable departures from linearity in the residual versus fitted plot for the multiple linear regression model. Moreover, the residuals appear to possess an approximately symmetric spread around the zero horizontal line, although some outlying points appear to violate the critical assumption defining U status in this context: that the errors are supported around symmetric extremes. Altogether, although departures from strict U status are observable, the assumption of U status appears to be feasibly met. The stratified empirical CDF plots in Figure 1 do not appear to corroborate approximate sum-symmetric behavior since they show more area below the functional lines than above in most cases. This is also true for the empirical CDF plots of the nonadjusted conditional distributions. Hence, the results of the simple t -test cannot be afforded a causal interpretation w.r.t. a change in functional average.

5.4 Discussion

Since the NHEFS was a national probability sample of noninstitutionalized adults, there is little reason to believe that A4 was unfulfilled, conditional on our adjusting covariates. Positing the opposite in this context is to affirm that, for noninstitutionalized adults who smoked, there were possible values of SBP in each treatment population that had zero probability of being observed in the sample of the observational design. Insofar as the NHEFS survey was truly a probability sample, and hence noninformative, rejecting A4 also means that a subset of potential SBP values could, in all likelihood, never have been observed in the real world. It would also mean that no set of adjusting variables existed under which expected causal effects could have been identified and estimated. Given the feasible validity of A4 and the fact that U status appeared to be approximately verified for the basic multiple regression model, we are confident that, conditional on our covariates, the expected (functional average) causal effects were sufficiently identified, and that the average effect of a history of exercise on SBP in adult smokers plausibly lies within a neighborhood of zero. This conclusion is further corroborated by the unadjusted estimate of the difference in functional averages between treatment groups, which was also not statistically significantly different from zero. That test, however, was hampered by the relatively light right tail of each conditional SBP population and by the small sample size. In these circumstances, the mid-range and the discrete plug-in estimator are inefficient statistics. This fact, combined with the intrinsic conservatism of the Hoeffding bootstrap procedure, produced a confidence set that was substantially wider than the others.

Importantly, we have little reason to believe that we successfully adjusted for all confounding variables. Hence, we do not purport to interpret the effect estimates as expected treatment effects. Nevertheless, since the AIPW estimator was not statistically significantly different from its functional average counterpart, and the outcome model seemed reasonably well specified, there is at least some evidence that bias was substantially attenuated. Of course, if mean exchangeability and consistency did hold, the estimates would also identify expected treatment effects.

It is also apropos to note that the consistency assumption might be violated in this analysis. Respondents were asked whether they exercised little to none, moderately, or much; the meanings of these categories are not absolute. Hence, multiple versions of the exercise treatment may have existed under a single coding. This does not undermine what is formally specified in A4, at least conditional on the variables observed, although, if present, it does complicate the generalizability of the results.

6 Conclusion

We demonstrated that causal inference is achievable in the absence of mean exchangeability, provided the support of the counterfactual distribution is preserved. Moreover, we offered exposition on the possible utility and scientific meaningfulness of functional average change. To overcome some of the difficulties of functional average estimation, we introduced a simple class of random variables, the U class, that possesses a host of practical properties. Using the U random variable framework, we showed that ubiquitously employed statistical procedures produce estimates with causal interpretations under exceptionally mild conditions, many of which are already supposed in most applied settings to investigate associations. Hence, even if a researcher fails to control for all confounding variables, she might still be left with a second prize of sorts, and one that possesses salient causal meaning. Since uncontrolled confounding can safely be assumed to be nearly omnipresent outside of toy examples, we believe that this framework provides a strong defense of elementary methods. Further work is of course due. For instance, this article did not explore how sensitive the estimation of functional averages is to departures from U status in the regression errors. This is a critical question. We also did not pursue the sample minimum, maximum, or any other function of the supports as causal estimators in their own right, although the set of assumptions employed here establishes their utility in this regard as well. Developing this area of theory will almost certainly be advantageous; much of extreme value theory, for instance, can be immediately transformed into theory for causal inference. Furthermore, we restricted much of our analysis to basic linear models. However, the results of this article apply equally to nonlinear models, semi-parametric models, and causal estimands estimated via machine learning algorithms more generally.

Finally, we presented a new approach to the bootstrapping process, which we called the Hoeffding bootstrap. Ultimately, we showed that a simple pair of probability inequalities can be wielded to construct feasible and mostly conservative confidence sets for a very general class of statistics. Pertinently, although we excluded some dependency scenarios, these were only of the most extreme type; otherwise, we invoked no particular theory of dependence. Nevertheless, much of our justification for this procedure was asymptotic. This is a good start, but it cannot be the end. A defensible set of sufficient conditions that ensures the validity of the approach in small samples would do much to lighten the burden of uncertainty. We conjecture that a body of literature on the behavior of extreme order statistics and their expected values, restricted to the universe of bounded random variables, would do much to improve this method.

Acknowledgements

We thank Dr. Miguel Hernán for making the NHEFS data accessible. Furthermore, we would like to thank the editor and reviewers for their diligent and insightful commentary, which substantially improved the quality of this manuscript.

  1. Funding information: We have no funding information to declare.

  2. Author contributions: All authors have accepted responsibility for the entire content of this manuscript and consented to its submission to the journal, reviewed all the results, and approved the final version. SS developed the theoretical content, proofs, and simulation experiments and prepared the manuscript. EG supervised the preparation of the manuscript, reviewed its contents, contributed to the data application, and participated in manuscript editing. LZ also supervised the manuscript preparation and participated in editing. Additionally, LZ reviewed and commented on the manuscript’s proofs and contributed to its mathematical content.

  3. Conflict of interest: The authors state no conflicts of interest.

  4. Data availability statement: The NHEFS data were acquired from Dr. Miguel Hernán’s faculty website (https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/), where they are publicly accessible.

References

[1] Rubin DB. Essential concepts of causal inference: a remarkable history and an intriguing future. Biostat Epidemiol. 2019;3(1):140–55. doi:10.1080/24709360.2019.1670513.

[2] Pearl J. Causal inference. In: Causality: objectives and assessment. Cambridge, MA: MIT Press; 2010. p. 39–58.

[3] Holland PW. Statistics and causal inference. J Amer Stat Assoc. 1986;81(396):945–60. doi:10.1080/01621459.1986.10478354.

[4] Imbens GW, Rubin DB. Causal inference in statistics, social, and biomedical sciences. Cambridge: Cambridge University Press; 2015. doi:10.1017/CBO9781139025751.

[5] Ding P, Li F. Causal inference. Stat Sci. 2018;33(2):214–37. doi:10.1214/18-STS645.

[6] Hernán MA, Robins JM. Causal inference. Boca Raton, FL: CRC Press; 2010.

[7] Imbens GW, Rubin DB. Rubin causal model. In: Microeconometrics. London: Palgrave Macmillan; 2010. p. 229–41. doi:10.1057/9780230280816_28.

[8] Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70(1):41–55. doi:10.1093/biomet/70.1.41.

[9] Holland PW, Rubin DB. Causal inference in retrospective studies. ETS Res Report Series. 1987;1987(1):203–31. doi:10.1002/j.2330-8516.1987.tb00211.x.

[10] Jin H, Rubin DB. Principal stratification for causal inference with extended partial compliance. J Amer Stat Assoc. 2008;103(481):101–11. doi:10.1198/016214507000000347.

[11] Belloni A, Chernozhukov V, Fernandez-Val I, Hansen C. Program evaluation and causal inference with high-dimensional data. Econometrica. 2017;85(1):233–98. doi:10.3982/ECTA12723.

[12] Gangl M. Causal inference in sociological research. Ann Rev Soc. 2010;36:21–47. doi:10.1146/annurev.soc.012809.102702.

[13] Cole SR, Frangakis CE. The consistency statement in causal inference: a definition or an assumption? Epidemiology. 2009;20(1):3–5. doi:10.1097/EDE.0b013e31818ef366.

[14] Hernán MA, Robins JM. Estimating causal effects from epidemiological data. J Epidemiol Commun Health. 2006;60(7):578–86. doi:10.1136/jech.2004.029496.

[15] Greenland S, Pearl J, Robins JM. Causal diagrams for epidemiologic research. Epidemiology. 1999;10:37–48. doi:10.1097/00001648-199901000-00008.

[16] Greenland S, Pearl J, Robins JM. Confounding and collapsibility in causal inference. Stat Sci. 1999;14(1):29–46. doi:10.1214/ss/1009211805.

[17] Glantz SA, Parmley WW. Passive smoking and heart disease. Epidemiology, physiology, and biochemistry. Circulation. 1991;83(1):1–12. doi:10.1161/01.CIR.83.1.1.

[18] Stallones RA. The association between tobacco smoking and coronary heart disease. Int J Epidemiol. 2015;44(3):735–43. doi:10.1093/ije/dyv124.

[19] Narkiewicz K, Kjeldsen SE, Hedner T. Is smoking a causative factor of hypertension? Blood Pressure. 2005;14(2):69–71. doi:10.1080/08037050510034202.

[20] Elley CR, Arroll B. Aerobic exercise reduces systolic and diastolic blood pressure in adults. Evidence Based Med. 2002;7(6):170. doi:10.1136/ebm.7.6.170.

[21] Park W, Miyachi M, Tanaka H. Does aerobic exercise mitigate the effects of cigarette smoking on arterial stiffness? J Clin Hypertension. 2014;16(9):640–4. doi:10.1111/jch.12385.

[22] Pfeffermann D, Sverchkov M. Inference under informative sampling. In: Handbook of statistics. Vol. 29. Elsevier; 2009. p. 455–87. doi:10.1016/S0169-7161(09)00239-9.

[23] Pfeffermann D, Krieger AM, Rinott Y. Parametric distributions of complex survey data under informative probability sampling. Statistica Sinica. 1998;8:1087–114.

[24] Patil GP, Rao CR, Zelen M. Weighted distributions. New York: Wiley; 1987.

[25] Patil GP, Rao CR. Weighted distributions and size-biased sampling with applications to wildlife populations and human families. Biometrics. 1978;38:179–89. doi:10.2307/2530008.

[26] Pearl J. Statistics and causal inference: a review. Test. 2003;12:281–345. doi:10.1007/BF02595718.

[27] Hansen BE. Uniform convergence rates for kernel estimation with dependent data. Econ Theory. 2008;24(3):726–48. doi:10.1017/S0266466608080304.

[28] Chen YC. A tutorial on kernel density estimation and recent advances. Biostat Epidemiol. 2017;1(1):161–87. doi:10.1080/24709360.2017.1396742.

[29] Zambom AZ, Ronaldo D. A review of kernel density estimation with applications to econometrics. Int Econ Rev. 2013;5(1):20–42.

[30] Chernoff H, Gastwirth JL, Johns MV. Asymptotic distribution of linear combinations of functions of order statistics with applications to estimation. Ann Math Stat. 1967;38(1):52–72. doi:10.1214/aoms/1177699058.

[31] Hosking JR. L-moments: analysis and estimation of distributions using linear combinations of order statistics. J R Stat Soc Ser B (Methodological). 1990;52(1):105–24. doi:10.1111/j.2517-6161.1990.tb01775.x.

[32] Bickel PJ. On some analogues to linear combinations of order statistics in the linear model. Ann Stat. 1973;1:597–616. doi:10.1214/aos/1176342457.

[33] David HA, Nagaraja HN. Order statistics. Hoboken, NJ: John Wiley & Sons; 2004. doi:10.1002/0471722162.

[34] Barndorff-Nielsen O. On the limit behaviour of extreme order statistics. Ann Math Stat. 1963;34(3):992–1002. doi:10.1214/aoms/1177704022.

[35] Sparkes S, Zhang L. Properties and deviations of random sums of densely dependent random variables; 2023. https://arxiv.org/abs/2310.11554.

[36] Bingham N. The sample mid-range and symmetrized extremal laws. Stat Probabil Lett. 1995;23(3):281–8. doi:10.1016/0167-7152(94)00126-S.

[37] Bingham N. The sample mid-range and interquartiles. Stat Probabil Lett. 1996;27(2):131–6. doi:10.1016/0167-7152(95)00054-2.

[38] Broffitt JD. An example of the large sample behavior of the midrange. Amer Stat. 1974;28(2):69–70. doi:10.1080/00031305.1974.10479075.

[39] Arce GR, Fontana SA. On the midrange estimator. IEEE Trans Acoustics Speech Signal Proces. 1988;36(6):920–2. doi:10.1109/29.1603.

[40] Efron B, Tibshirani RJ. An introduction to the bootstrap. United States: CRC Press; 1994. doi:10.1201/9780429246593.

[41] Bickel PJ, Freedman DA. Some asymptotic theory for the bootstrap. Ann Stat. 1981;9(6):1196–217. doi:10.1214/aos/1176345637.

[42] Bickel PJ, Sakov A. On the choice of m in the m out of n bootstrap and confidence bounds for extrema. Statistica Sinica. 2008;18:967–85.

[43] Bickel PJ, Ren JJ. The bootstrap in hypothesis testing. Lecture Notes-Monograph Series. 2001:91–112. doi:10.1214/lnms/1215090064.

[44] Swanepoel JW. A note on proving that the (modified) bootstrap works. Commun Stat-Theory Methods. 1986;15(11):3193–203. doi:10.1080/03610928608829303.

[45] Beran R, Ducharme GR. Asymptotic theory for bootstrap methods in statistics. Montréal, Québec: Les Publications CRM; 1991.

[46] Politis DN, Romano JP, Wolf M. On the asymptotic theory of subsampling. Statistica Sinica. 2001;11:1105–24.

[47] Shao X. The dependent wild bootstrap. J Amer Stat Assoc. 2010;105(489):218–35. doi:10.1198/jasa.2009.tm08744.

[48] Hall P, Horowitz JL, Jing BY. On blocking rules for the bootstrap with dependent data. Biometrika. 1995;82(3):561–74. doi:10.1093/biomet/82.3.561.

[49] Kreiss JP, Paparoditis E. Bootstrap methods for dependent data: a review. J Korean Stat Soc. 2011;40(4):357–78. doi:10.1016/j.jkss.2011.08.009.

[50] Lahiri SN. Resampling methods for dependent data. New York: Springer Science & Business Media; 2003. doi:10.1007/978-1-4757-3803-2.

[51] Rider PR. The midrange of a sample as an estimator of the population midrange. J Amer Stat Assoc. 1957;52(280):537–42. doi:10.1080/01621459.1957.10501410.

[52] Imbens GW. Nonparametric estimation of average treatment effects under exogeneity: a review. Rev Econ Stat. 2004;86(1):4–29. doi:10.1162/003465304323023651.

[53] Mansournia MA, Altman DG. Inverse probability weighting. BMJ. 2016;352:i189. doi:10.1136/bmj.i189.

[54] Funk MJ, Westreich D, Wiesen C, Stürmer T, Brookhart MA, Davidian M. Doubly robust estimation of causal effects. Amer J Epidemiol. 2011;173(7):761–7. doi:10.1093/aje/kwq439.

[55] Kang JD, Schafer JL. Demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data. Stat Sci. 2007;22(4):523–39. doi:10.1214/07-STS227.

[56] Glynn AN, Quinn KM. An introduction to the augmented inverse propensity weighted estimator. Politic Anal. 2010;18(1):36–56. doi:10.1093/pan/mpp036.

[57] Naimi AI, Cole SR, Kennedy EH. An introduction to g methods. Int J Epidemiol. 2017;46(2):756–62.

[58] Madans JH, Kleinman JC, Cox CS, Barbano HE, Feldman JJ, Cohen B, et al. 10 years after NHANES I: report of initial followup, 1982–84. Public Health Reports. 1986;101(5):465.

[59] Witzig R. The medicalization of race: scientific legitimization of a flawed social construct. Ann Internal Med. 1996;125(8):675–9. doi:10.7326/0003-4819-125-8-199610150-00008.

[60] R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2022. https://www.R-project.org/.

Received: 2023-11-20
Revised: 2024-09-08
Accepted: 2024-09-27
Published Online: 2024-11-09

© 2024 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
