Article Open Access

An approach to nonparametric inference on the causal dose–response function

  • Aaron Hudson, Elvin H. Geng, Thomas A. Odeny, Elizabeth A. Bukusi, Maya L. Petersen and Mark J. van der Laan
Published/Copyright: December 21, 2024

Abstract

The causal dose–response curve is commonly selected as the statistical parameter of interest in studies where the goal is to understand the effect of a continuous exposure on an outcome. Most of the available methodology for statistical inference on the dose-response function in the continuous exposure setting requires strong parametric assumptions on the probability distribution. Such parametric assumptions are typically untenable in practice and lead to invalid inference. It is often preferable to instead use nonparametric methods for inference, which only make mild assumptions about the data-generating mechanism. We propose a nonparametric test of the null hypothesis that the dose-response function is equal to a constant function. We argue that when the null hypothesis holds, the dose-response function has zero variance. Thus, one can test the null hypothesis by assessing whether there is sufficient evidence to claim that the variance is positive. We construct a novel estimator for the variance of the dose-response function, for which we can fully characterize the null limiting distribution and thus perform well-calibrated tests of the null hypothesis. We also present an approach for constructing simultaneous confidence bands for the dose-response function by inverting our proposed hypothesis test. We assess the validity of our proposal in a simulation study. In a data example, we study, in a population of patients who have initiated treatment for HIV, how the distance required to travel to an HIV clinic affects retention in care.

MSC 2010: 62G05; 62G10

1 Introduction

In many scientific studies, one of the main objectives is to use observational data to make inferences about the causal relationship between a treatment or exposure variable and some outcome. Many commonly used methods for causal inference make strong, and often untenable, parametric assumptions about the data’s probability distribution. Consequently, using such methods in practice can result in invalid inference under model misspecification. Interest has grown in instead using more robust nonparametric and semiparametric approaches for causal inference, which only make mild assumptions about the data-generating mechanism.

Our objective is to make nonparametric inference about the probability distribution of the counterfactual outcomes, or the collection of potential outcomes that would have been observed if a subject had received any possible level of the exposure [1]. In many conventional analyses, it is of primary interest to make comparisons about the mean of the counterfactual outcome for different levels of the exposure. The target estimand is the function that maps any given exposure level to the corresponding mean counterfactual outcome. We commonly refer to this estimand as the causal dose–response function.

The literature on nonparametric inference on mean counterfactual outcomes most commonly focuses on the setting in which the exposure is binary. In this case, the mean counterfactual outcomes can, under standard causal assumptions, be characterized as pathwise differentiable estimands, or smooth functionals of the unknown probability distribution. For estimands that have this characterization, there exist many strategies for constructing nonparametric estimators that converge to the true quantity at the parametric rate and achieve a tractable sampling distribution. Commonly used estimators for the mean counterfactual outcomes with binary exposures that satisfy these properties include the augmented inverse probability weighted estimator [2] and the targeted maximum likelihood estimator [3]. When such estimators are available, it is straightforward to perform statistical inference on the mean counterfactual outcomes by constructing confidence sets or performing hypothesis tests. With minor modifications, these approaches for inference can be readily extended to the more general setting in which the exposure is a discrete random variable.

Nonparametric inference on the causal dose–response function is more challenging when the exposure is a continuous random variable. In this setting, the mean counterfactual outcomes are nonsmooth functions of the probability distribution, and consequently, nonparametric estimation is not possible at the parametric rate. There nonetheless exist proposals for consistent and rate-optimal estimation of the dose–response function in a nonparametric model. For instance, Díaz et al. [4] introduced a cross-validated targeted minimum loss (TML)-based estimator for the dose–response function, Kennedy et al. [5] and Colangelo and Lee [6] proposed nonparametric kernel smoothing estimators, and Westling et al. [7] proposed an estimator based on isotonic regression. Rate-optimal nonparametric estimators, such as those described above, typically retain a non-negligible asymptotic bias. Due to this bias retention, these estimators attain nonstandard limiting distributions, and constructing hypothesis tests or confidence sets based on the estimators can be challenging. In order to obtain an estimator for the dose–response curve that converges at the parametric rate, it is necessary to make parametric assumptions about either the conditional distribution of the exposure given the covariates [8–10] or the conditional mean of the outcome given exposure and covariates [11,12]. These approaches may not be preferred because specifying a parametric model correctly can be challenging in practice.

Nonparametric inference in the continuous treatment setting is a growing research area, and there have been many recent developments. For instance, Doss et al. [13] presented an omnibus and doubly robust test of the null of no treatment effect based on local linear regression, and an approach to confidence band construction using debiased local linear regression was proposed by Takatsu and Westling [14]. There also exist alternative approaches to nonparametric causal inference with continuous exposures which are based on the study of a finite-dimensional parameter that summarizes the effect of the exposure on the outcome [15–17]. While such summaries can be informative and carry meaningful interpretations, they may not suffice in settings where an investigator is primarily interested in learning about the dose–response function in its entirety.

Our work draws upon some recent advances in nonparametric inference on the dose–response function and inference on nonpathwise differentiable estimands more broadly. Westling [18] demonstrated that one can construct an omnibus nonparametric test of the hypothesis that the dose–response function is flat by simultaneously estimating primitive functions, or integrals, of the dose–response function. Hudson et al. [19] use a generalization of this idea to develop a framework for testing hypotheses about nonpathwise differentiable estimands by estimating a suitably large collection of pathwise differentiable estimands that can effectively summarize the target. Hudson et al. [19] also propose a method for constructing simultaneous confidence sets for the target estimand by inverting the proposed hypothesis test.

In this article, we propose a novel method for inference on the dose–response function using the framework of Hudson et al. [19]. We develop a test of the null hypothesis that a mean-centered version of the dose–response function is equal to any given null function. Of particular importance is the instance where the null function is zero, in which case we perform a test of the null hypothesis that the dose–response function is flat. We also discuss the construction of simultaneous confidence sets for the centered dose–response function, and we describe a method for summarizing the confidence sets.

The remainder of the article is organized as follows. In Section 2, we provide a high-level overview of our proposed methodology, and in Section 3, we discuss our inferential procedure in more detail and describe its theoretical properties. In Section 4, we discuss implementation of our method. In Section 5, we study the behavior of our proposal in a simulation study. In Section 6, we apply our method to data from the Adaptive Strategies for Preventing and Treating Lapses of Retention in HIV Care (ADAPT-R) trial [20], which studies retention in care for people who initiated treatment for HIV. We conclude with a brief discussion in Section 7.

2 Overview

2.1 Identification of the dose–response function

Let $O_1, \ldots, O_n$ be i.i.d. random vectors drawn from a probability distribution $P_0$, which resides in a model $\mathcal{M}$. We allow $\mathcal{M}$ to be a rich nonparametric model that is essentially unrestricted and must only satisfy some mild regularity conditions. We express our data as $O = (W, A, Y)$, where $Y$ is a bounded real-valued outcome with sample space $\mathcal{Y} \subseteq \mathbb{R}$, $A$ is a bounded exposure variable with sample space $\mathcal{A} \subseteq \mathbb{R}$, and $W$ is a $q$-dimensional vector of covariates, with sample space $\mathcal{W} \subseteq \mathbb{R}^q$. Throughout, we write $E_P$ to denote the expectation under any probability distribution $P$, and we use the shorthand notation $E_0 \equiv E_{P_0}$.

Let $Y(a)$ denote the counterfactual outcome under exposure level $a$, or the potential outcome that would have been observed if an individual had been observed with exposure level $A = a$. We define the counterfactual mean outcome as the mapping $\theta_P : a \mapsto E_P[Y(a)]$ for $a \in \mathcal{A}$. Our objective is to perform inference on $\theta_0 \equiv \theta_{P_0}$, which is commonly called the causal dose–response function.

The stochastic process of counterfactual outcomes for subject $i$, $\{Y_i(a) : a \in \mathcal{A}\}$, is not observable, as we only observe the outcome under a single exposure level $A_i$. However, under some standard causal assumptions, the counterfactual mean outcomes can nonetheless be estimated from the data. Let $Q_P(w, a) \equiv E_P[Y \mid W = w, A = a]$ denote the conditional mean of $Y$ given the exposure and covariates, and let $g_P(a \mid w) \equiv \frac{d}{da} P(A \leq a \mid W = w)$ be the conditional density of the exposure given the covariates. Similarly as above, we use the shorthand notation $Q_0 \equiv Q_{P_0}$ and $g_0 \equiv g_{P_0}$. The dose–response function can be estimated from the data under the following assumptions:

  • Assumption A1 (Consistency). $A = a$ implies $Y = Y(a)$.

  • Assumption A2 (No unmeasured confounding). $Y(a)$ is conditionally independent of $A$, given $W$.

  • Assumption A3 (Positivity). There exists $g_{\min} > 0$ such that $\inf_{a \in \mathcal{A}} g_0(a \mid w) > g_{\min}$ for all $w \in \mathcal{W}$.

The consistency assumption states that the observed outcome for each subject is the potential outcome under their observed exposure level. The no unmeasured confounding assumption says that all confounding variables are contained within $W$. The positivity assumption means that it is possible for subjects with any covariate value to receive any treatment $a \in \mathcal{A}$. When Assumptions A1–A3 hold, the counterfactual mean outcome at any exposure level $a$ can be expressed as

$\theta_0(a) = E_{0,W}[Q_0(W, a)],$

where $E_{0,W}$ denotes the expectation with respect to the marginal distribution of $W$ under $P_0$.

Assumptions A1–A3 typically hold in randomized trials, but they are generally unverifiable in observational studies. However, even when the causal assumptions fail, θ 0 retains an interpretation as a mean regression function, which can be used to study the conditional association between the exposure and the outcome, given the covariates.
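For concreteness, the identification result $\theta_0(a) = E_{0,W}[Q_0(W, a)]$ suggests a simple plug-in (g-computation) estimate: fit a regression of $Y$ on $(W, A)$, then average its predictions over the observed covariates at each exposure level. The sketch below uses a scalar covariate and ordinary least squares as a stand-in for a flexible machine-learning regression; both are simplifying assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def fit_outcome_regression(W, A, Y):
    """Stand-in for a flexible estimator Q_n of Q_0(w, a) = E[Y | W = w, A = a]:
    least squares on a few hand-picked features of (W, A)."""
    X = np.column_stack([np.ones_like(A), W, A, A**2, W * A])
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)

    def Q_n(w, a):
        x = np.column_stack([np.ones_like(a), w, a, a**2, w * a])
        return x @ coef

    return Q_n

def theta_plugin(Q_n, W, a):
    """G-computation plug-in for theta_0(a) = E_{0,W}[Q_0(W, a)]:
    average the fitted regression over the observed covariates."""
    a_rep = np.full(W.shape[0], a, dtype=float)
    return float(np.mean(Q_n(W, a_rep)))
```

Evaluating `theta_plugin` on a grid of exposure values traces out an estimate of the dose–response curve; the later sections explain why this naïve plug-in requires a bias correction before it can support valid inference.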

2.2 Summary of proposed methodology

We first derive a test of the null hypothesis that the dose–response function, centered about its mean, is equal to a given candidate null function. Let $\bar{\theta}_P : a \mapsto \theta_P(a) - E_P[\theta_P(A)]$ be the mean-centered dose–response function. For an arbitrary function $\theta^*$ from $\mathcal{A}$ to $\mathbb{R}$, let $\bar{\theta}^*_P = \theta^* - E_P[\theta^*(A)]$, and let $\bar{\theta}^*_0 \equiv \bar{\theta}^*_{P_0}$. We are interested in testing the null hypothesis that $\bar{\theta}_0$ is equal to the candidate null parameter $\bar{\theta}^*_0$,

(1) $H_0 : \bar{\theta}_0(a) = \bar{\theta}^*_0(a)$ for all $a \in \mathcal{A}$.

Of particular interest in many applications is the case where $\bar{\theta}^* = 0$, which corresponds to the null hypothesis that the dose–response function is flat, or that the counterfactual mean outcomes are the same at every level of the exposure.

We also propose an approach for constructing confidence sets for the centered dose–response function. That is, for any $\alpha \in (0, 1)$, we construct a set $C_n^\alpha$ that contains $\bar{\theta}_0$ with probability at least $1 - \alpha$ as $n$ tends to infinity, i.e.,

$\liminf_{n \to \infty} P_0(\bar{\theta}_0 \in C_n^\alpha) \geq 1 - \alpha.$

When it is possible to obtain an estimator for $\theta_0$ that has negligible asymptotic bias and a characterizable limiting distribution, it is straightforward to derive hypothesis tests and construct confidence sets based on the estimator. When the dose–response function is pathwise differentiable, meaning that $\theta_P(a)$ changes smoothly in $P$ with respect to local fluctuations around $P_0$, there is a well-established body of theory and methodology for constructing $n^{1/2}$-consistent and asymptotically efficient estimators [21]. However, $\theta_P(a)$ is typically only pathwise differentiable in a nonparametric model when the event $\{A = a\}$ occurs with probability greater than zero. While this condition can be satisfied when $A$ has discrete support under $P_0$, it is generally not met when the exposure variable is continuous.

While the dose–response function is not pathwise differentiable when the exposure is continuous, one can perform inference by instead estimating a collection of pathwise differentiable functionals of the dose–response function, as suggested by, e.g., Westling [18] and Hudson et al. [19]. In this work, we present an approach based on the estimation of linear transformations of the causal dose–response function. In what follows, we briefly summarize our approach to inference.

For any function $h$ from $\mathcal{A}$ to $\mathbb{R}$, we define

$\psi_{P,\theta^*}(h) \equiv E_P[\{\bar{\theta}_P(A) - \bar{\theta}^*_P(A)\} h(A)]$

as the $L_2(P)$ inner product of $\bar{\theta}_P - \bar{\theta}^*_P$ and $h$, and let $\psi_{0,\theta^*}(h) \equiv \psi_{P_0,\theta^*}(h)$. Observe that if $\bar{\theta}_0$ is equal to $\bar{\theta}^*_0$ almost everywhere, then $\psi_{0,\theta^*}(h) = 0$ for all functions $h$, and if $\bar{\theta}_0$ is not almost everywhere equal to $\bar{\theta}^*_0$, there exists a function $h^*$ such that $\psi_{0,\theta^*}(h^*) \neq 0$. Therefore, one could test the null hypothesis in (1) by assessing whether, for a large class $\mathcal{H}$ of bounded functions from $\mathcal{A}$ to $\mathbb{R}$,

(2) $\Psi_{0,\theta^*}(\mathcal{H}) \equiv \sup_{h \in \mathcal{H}} |\psi_{0,\theta^*}(h)| = 0,$

or in other words, whether there exists a linear transformation of $\bar{\theta}_0 - \bar{\theta}^*_0$ that is nonzero.

The manner in which the supremum in (2) is calculated will depend on the construction of $\mathcal{H}$. For instance, if $\mathcal{H}$ is countable, one may approximate the supremum as the maximum over a finite subset of $\{\psi_{0,\theta^*}(h) : h \in \mathcal{H}\}$. We also consider the setting in which $\mathcal{H}$ is constructed using a basis expansion, and the supremum corresponds to an optimal weighting of the coefficients for the basis functions.
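When $\mathcal{H}$ is approximated by a finite dictionary $h_1, \ldots, h_m$, the supremum reduces to a maximum over empirical inner products. A minimal sketch of this finite approximation follows; the function and variable names are ours, introduced only for illustration.

```python
import numpy as np

def sup_inner_product(diff_A, H_values):
    """Approximate Psi = sup_h |psi(h)| over a finite dictionary.

    diff_A:   length-n array of (theta_bar - theta_bar_star)(A_i), i.e. the
              difference between curves evaluated at the observed exposures.
    H_values: (m, n) array whose j-th row is h_j evaluated at A_1, ..., A_n.
    Returns the largest |empirical inner product| and the index attaining it."""
    centered = diff_A - diff_A.mean()  # empirical mean-centering of the difference
    psi = H_values @ centered / centered.shape[0]
    j = int(np.argmax(np.abs(psi)))
    return float(np.abs(psi[j])), j
```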

We develop a test of the null hypothesis based on the estimation of $\Psi_{0,\theta^*}(\mathcal{H})$. We later show that $\psi_{0,\theta^*}(h)$ is pathwise differentiable for any fixed $h$ and can therefore be estimated at the parametric rate of $n^{-1/2}$. If, in addition, $\mathcal{H}$ is not too complex, one can construct an estimator $\psi_{n,\theta^*}$ such that the standardized process $\{n^{1/2}[\psi_{n,\theta^*}(h) - \psi_{0,\theta^*}(h)] : h \in \mathcal{H}\}$ converges weakly to a Gaussian process $\mathbb{G} \equiv \{\mathbb{G}(h) : h \in \mathcal{H}\}$. As a consequence, we have that $\Psi_{n,\theta^*}(\mathcal{H}) \equiv \sup_{h \in \mathcal{H}} |\psi_{n,\theta^*}(h)|$ is a consistent estimator for $\Psi_{0,\theta^*}(\mathcal{H})$, and under the null, when $\psi_{0,\theta^*}(h) = 0$ for all $h$, $n^{1/2} \Psi_{n,\theta^*}(\mathcal{H})$ converges weakly to $\sup_{h \in \mathcal{H}} |\mathbb{G}(h)|$. Because $\Psi_{n,\theta^*}(\mathcal{H})$ has a tractable limiting distribution, it can be used as a test statistic. A hypothesis test that rejects the null when $n^{1/2} \Psi_{n,\theta^*}(\mathcal{H})$ is larger than the $1 - \alpha$ quantile of the null limiting distribution achieves the type-1 error level $\alpha$ asymptotically.

We can invert our proposed hypothesis test to obtain a confidence set for $\bar{\theta}_0$. Let $\Theta$ be a large nonparametric class of functions from $\mathcal{A}$ to $\mathbb{R}$ that serve as candidate values for the dose–response function. Consider the set

(3) $C_n^\alpha \equiv \{\bar{\theta}^*_0 : \theta^* \in \Theta$, and we fail to reject $\bar{\theta}_0 = \bar{\theta}^*_0$ at the level $\alpha$ based on $O_1, \ldots, O_n\}$.

We can interpret $C_n^\alpha$ as the set of functions in $\Theta$ that are in accordance with the observed data. If $\theta_0$ belongs to $\Theta$, then $\bar{\theta}_0$ belongs to $C_n^\alpha$ with probability tending to $1 - \alpha$, and $C_n^\alpha$ is an asymptotically valid confidence set for $\bar{\theta}_0$. Even when $\theta_0$ does not belong to $\Theta$, $C_n^\alpha$ retains a nice interpretation as long as $\Theta$ contains a good approximation for $\theta_0$.

The confidence set $C_n^\alpha$ can be difficult to visualize, as it is a complex set of infinite-dimensional objects. We propose to summarize $C_n^\alpha$ simply by displaying the smallest band that contains all functions in $C_n^\alpha$, which can be defined point-wise as

(4) $\mathcal{C}_n^\alpha(a) \equiv \left[ \inf\{\bar{\theta}^*_0(a) : \bar{\theta}^*_0 \in C_n^\alpha\}, \; \sup\{\bar{\theta}^*_0(a) : \bar{\theta}^*_0 \in C_n^\alpha\} \right].$

As we discuss later in Section 3.2, this confidence band is generally conservative, even for a correctly specified $\Theta$.

The inferential procedure described above requires selection of the class of functions $\mathcal{H}$, the elements of which index the set of linear transformations of $\bar{\theta}_0 - \bar{\theta}^*_0$ that we need to estimate. Asymptotic type-1 error control is preserved under any choice that satisfies some mild regularity conditions, to be discussed later. The choice of $\mathcal{H}$ does, however, influence the test’s statistical power. We use the following intuition for constructing an $\mathcal{H}$ so that our test is well-powered. Because the inner product $\psi_{0,\theta^*}(h)$ is a measure of orthogonality of $\bar{\theta}_0 - \bar{\theta}^*_0$ to a given $h$, the amount of evidence $\psi_{0,\theta^*}(h)$ provides against the null depends on the shape of $h$, rather than its scale. It is therefore sensible to construct $\mathcal{H}$ as a class of equally scaled functions that contains the function that is least orthogonal to $\bar{\theta}_0 - \bar{\theta}^*_0$. We scale each $h$ to have unit variance; while other measures of scale could be chosen, we scale by the standard deviation for simplicity. It can be shown by an application of the Cauchy–Schwarz inequality that the maximizer of $\psi_{0,\theta^*}(h)$, among all $h$ with unit standard deviation, can be expressed as

(5) $h_0 \equiv \mathrm{Var}_0(\theta_0(A) - \theta^*(A))^{-1/2} (\bar{\theta}_0 - \bar{\theta}^*_0),$

where $\mathrm{Var}_0$ is the population variance under $P_0$. The maximizer $h_0$ is a scaled difference between the true dose–response function and the candidate null parameter, and the maximal inner product $\psi_{0,\theta^*}(h_0)$ is equal to the standard deviation of $\theta_0(A) - \theta^*(A)$. We construct $\mathcal{H}$ as a class of smooth functions that contains a good approximation of $h_0$, so that our target estimand $\Psi_{0,\theta^*}(\mathcal{H})$ can be interpreted as an approximation of the standard deviation of the difference between the truth and the null.
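One way to implement this intuition in practice is to take a flexible dictionary of shapes and rescale every element to unit empirical standard deviation, so that the supremum over $\mathcal{H}$ compares shapes rather than scales. The sketch below uses sinusoids as the dictionary; this is one arbitrary choice among many, and the names are ours.

```python
import numpy as np

def standardize_rows(H_raw):
    """Center each candidate h and scale it to unit empirical standard
    deviation: only the shape of h carries evidence against the null."""
    Hc = H_raw - H_raw.mean(axis=1, keepdims=True)
    return Hc / Hc.std(axis=1, keepdims=True)

def sinusoid_dictionary(A, num_freqs=5):
    """Candidate shapes: sines and cosines on the rescaled exposure range."""
    u = (A - A.min()) / (A.max() - A.min())
    rows = [np.sin(np.pi * k * u) for k in range(1, num_freqs + 1)]
    rows += [np.cos(np.pi * k * u) for k in range(1, num_freqs + 1)]
    return standardize_rows(np.vstack(rows))
```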

The hypothesis testing procedure proposed by Westling [18] can be viewed as a special case of our method, with $\mathcal{H}$ taken as the class of binary indicators of whether an input exposure level $a$ is greater than a given cutoff $a_0 \in \mathcal{A}$. While Westling [18] showed that this choice results in an omnibus test of the null hypothesis, we later show that in finite samples, performance can be improved by considering a class of functions that contains the function that is least orthogonal to the difference between the true dose–response function and the null.

Our proposal is closely related to the nonparametric score test presented by Hudson et al. [19]. The authors use the representation of function-valued nonpathwise differentiable parameters as minimizers of a population risk functional to derive a set of estimating equations, indexed by functions $h$ in a large class $\mathcal{H}$, that the true population parameter must satisfy. They then construct hypothesis tests that assess whether the candidate null parameter also satisfies these estimating equations. Their proposal rejects the null hypothesis when the data provide contrary evidence. Our proposal can be viewed as a special case of this approach, in which we assess whether $\theta^*$ satisfies the estimating equations $\psi_{0,\theta^*}(h) = 0$ for all $h \in \mathcal{H}$.

3 Inferential procedure

Having provided a brief overview of our inferential procedure in the previous section, we now provide theoretical details. In this section, we first discuss the estimation of $\psi_{0,\theta^*}(h)$, and we subsequently describe an approach for determining whether there is sufficient evidence to conclude that $\Psi_{0,\theta^*}(\mathcal{H})$ is nonzero.

3.1 Estimation of $\psi_{0,\theta^*}(h)$

Recall that our proposal requires us to have an estimator $\psi_{n,\theta^*}(h)$ of $\psi_{0,\theta^*}(h)$ and a class of functions $\mathcal{H}$ such that the process $\{n^{1/2}[\psi_{n,\theta^*}(h) - \psi_{0,\theta^*}(h)] : h \in \mathcal{H}\}$ converges weakly to a Gaussian process. In this subsection, we specify conditions on $\psi_{n,\theta^*}$ and $\mathcal{H}$ under which the weak convergence property is satisfied, and we discuss the construction of a weakly convergent estimator.

As noted in Section 2, for any estimand that is pathwise differentiable in the sense of Bickel et al. [21], one can construct an estimator that, when centered around the true target estimand and scaled by $n^{1/2}$, converges weakly to a Gaussian distribution. Constructing such an estimator and establishing efficiency typically require knowledge of the efficient influence function of the estimand of interest. The following lemma states that $\psi_{P,\theta^*}(h)$ is pathwise differentiable in a nonparametric model and provides the form of the efficient influence function.

Lemma 1

The parameter $\psi_{P,\theta^*}(h)$ is pathwise differentiable in a nonparametric model, and its nonparametric efficient influence function is

$\phi_{P,\theta^*}(w, a, y; h) \equiv \left[ \bar{\theta}_P(a) - \bar{\theta}^*_P(a) + \frac{E_P[g_P(a \mid W)]}{g_P(a \mid w)} \{y - Q_P(w, a)\} \right] \{h(a) - E_P[h(A)]\} + E_P[Q_P(w, A)\{h(A) - E_P[h(A)]\}] - \{2 E_P[\bar{\theta}_P(A) h(A)] - E_P[\bar{\theta}^*_P(A) h(A)]\}.$

As before, we use the shorthand notation $\phi_{0,\theta^*} \equiv \phi_{P_0,\theta^*}$ to denote the efficient influence function at $P_0$. Lemma 1 generalizes a result presented in [18] that characterizes the efficient influence function for the special case in which $h$ is an indicator of whether the observed treatment level is greater than a specified cutoff, and $\bar{\theta}^*_0 = 0$.

Because $\psi_{0,\theta^*}(h)$ is pathwise differentiable, it is possible to construct an estimator $\psi_{n,\theta^*}(h)$ that is asymptotically linear in the sense that

(6) $\psi_{n,\theta^*}(h) - \psi_{0,\theta^*}(h) = \frac{1}{n} \sum_{i=1}^n \phi_{0,\theta^*}(O_i; h) + r_n(h),$

where $r_n(h) = o_P(n^{-1/2})$. In view of the central limit theorem and the fact that $\phi_{0,\theta^*}$ has zero mean and finite variance, an asymptotically linear estimator $\psi_{n,\theta^*}(h)$ is asymptotically Gaussian for any fixed $h$. However, our proposal requires a stronger notion of uniform convergence of $\psi_{n,\theta^*}(h)$ over $h$ in a large collection $\mathcal{H}$, so some additional conditions are needed. The following lemma, which follows from standard empirical process arguments, provides conditions under which the desired uniform convergence holds.

Lemma 2

Let $\psi_{n,\theta^*}(h)$ be an asymptotically linear estimator of $\psi_{0,\theta^*}(h)$ that has the representation in (6) for any $h \in \mathcal{H}$, and let $\ell^\infty(\mathcal{H})$ denote the vector space of bounded real-valued functionals on $\mathcal{H}$. Assume the following conditions hold:

  • $\{\phi_{0,\theta^*}(\cdot\,; h) : h \in \mathcal{H}\}$ is a $P_0$-Donsker class,

  • $\sup_{h \in \mathcal{H}} |r_n(h)| = o_P(n^{-1/2})$.

Then, $\{n^{1/2}[\psi_{n,\theta^*}(h) - \psi_{0,\theta^*}(h)] : h \in \mathcal{H}\}$ converges weakly to a tight Gaussian process $\mathbb{G} \equiv \{\mathbb{G}(h) : h \in \mathcal{H}\}$ as an element of $\ell^\infty(\mathcal{H})$, where $\mathbb{G}$ has mean zero and covariance function $\Sigma : (h_1, h_2) \mapsto E_0[\phi_{0,\theta^*}(O; h_1) \phi_{0,\theta^*}(O; h_2)]$.

The first condition of Lemma 2 is a constraint on the complexity of $\mathcal{H}$ and typically holds when $\mathcal{H}$ is itself a $P_0$-Donsker class. The second condition directly involves the estimator $\psi_{n,\theta^*}$ and requires that the remainder term is asymptotically negligible in a uniform sense.

In what follows, we present two strategies for constructing weakly convergent estimators for $\{\psi_{0,\theta^*}(h) : h \in \mathcal{H}\}$. We begin by considering a naïve plug-in estimator of $\psi_{0,\theta^*}(h)$. Suppose we have a consistent estimator $\hat{P}_n$ for $P_0$. In practice, we do not need to estimate the entire distribution $P_0$; we must only estimate the nuisance parameters upon which $\psi_{0,\theta^*}$ and $\phi_{0,\theta^*}$ depend. In our setting, there are four nuisance parameters: (i) $F_{0,W}$, the marginal cumulative distribution function for $W$; (ii) $F_{0,A}$, the marginal cumulative distribution function for $A$; (iii) $Q_0$, the conditional mean of $Y$ given $A$ and $W$; and (iv) $g_0$, the conditional density of $A$ given $W$. One can obtain estimators $F_{n,W}$ and $F_{n,A}$ for $F_{0,W}$ and $F_{0,A}$ nonparametrically using the empirical distribution function, and one typically requires machine learning to construct consistent nonparametric estimators $Q_n$ and $g_n$ for $Q_0$ and $g_0$. Given the initial estimator $\hat{P}_n$, one can obtain the naïve plug-in estimator $\psi_{\hat{P}_n,\theta^*}(h)$, which can be expressed as

$\psi_{\hat{P}_n,\theta^*}(h) = n^{-1} \sum_{i=1}^n \left[ \theta_n(A_i) - \theta^*(A_i) - n^{-1} \sum_{j=1}^n \{\theta_n(A_j) - \theta^*(A_j)\} \right] h(A_i),$

where $\theta_n : a \mapsto n^{-1} \sum_{i=1}^n Q_n(W_i, a)$ is the plug-in estimator for the dose–response function.
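In code, the naïve plug-in above is a centered empirical inner product between the estimated curve difference and $h$, each evaluated at the observed exposures. A minimal sketch (names ours):

```python
import numpy as np

def psi_plugin(theta_n_A, theta_star_A, h_A):
    """Naive plug-in for psi(h): n^{-1} sum_i [d_i - mean(d)] h(A_i),
    where d_i = theta_n(A_i) - theta_star(A_i)."""
    d = theta_n_A - theta_star_A
    return float(np.mean((d - d.mean()) * h_A))
```

Note that adding a constant to the fitted curve leaves the value unchanged, reflecting that the estimand only involves the mean-centered dose–response function.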

The plug-in estimator $\psi_{\hat{P}_n,\theta^*}(h)$ will typically retain non-negligible asymptotic bias for $\psi_{0,\theta^*}(h)$ and consequently will not be asymptotically linear. This bias is attributable to the fact that nonparametric estimators $Q_n$ of $Q_0$ are usually obtained by balancing a bias–variance tradeoff that is suboptimal for the objective of estimating $\psi_{0,\theta^*}(h)$. We discuss two widely used strategies for correcting the bias of the naïve plug-in: one-step estimation [22] and TML-based estimation [23,24].

The estimation strategies we discuss require the following assumptions:

Assumption B1. There exists a $P_0$-Donsker class $\Phi$ that contains $\phi_{0,\theta^*}(\cdot\,; h)$ and $\phi_{\hat{P}_n,\theta^*}(\cdot\,; h)$ for each $h \in \mathcal{H}$ with probability tending to one.

Assumption B2. The nuisance parameter estimators satisfy

$\int \{Q_n(w, a) - Q_0(w, a)\}^2 \, dP_0(w, a) = o_P(1), \quad \int \{g_n(a \mid w) - g_0(a \mid w)\}^2 \, dP_0(w, a) = o_P(1),$
$\int \{Q_n(w, a) - Q_0(w, a)\} \{g_n(a \mid w) - g_0(a \mid w)\} \, dP_0(w, a) = o_P(n^{-1/2}).$

Assumptions B1 and B2 impose conditions on the estimators for the nuisance parameters upon which the target estimand and the efficient influence function depend. Assumption B1 places a constraint on the complexity of the family of candidate estimators for the nuisance components. Many flexible nonparametric estimators, e.g., those constructed via the highly adaptive LASSO [25], satisfy this condition. Assumption B2 places a requirement on the rates of convergence that the conditional mean and conditional density estimators must achieve. This condition holds when both estimators are consistent and the product of their rates of convergence is faster than $n^{-1/2}$. In view of Assumption B2, the estimators of $\psi_{0,\theta^*}(h)$ we develop are doubly robust in the sense that consistency and asymptotic normality are achieved even if one of the nuisance parameters is estimated at a slow rate, as long as the other nuisance parameter is estimated at a fast enough rate to compensate.

We first construct a one-step estimator for $\psi_{0,\theta^*}(h)$. Given estimators of the nuisance parameters upon which $\phi_{0,\theta^*}(\cdot\,; h)$ depends, we can obtain an estimator $\phi_{\hat{P}_n,\theta^*}(\cdot\,; h)$ for the efficient influence function $\phi_{0,\theta^*}(\cdot\,; h)$. The empirical average of the estimated influence function $\phi_{\hat{P}_n,\theta^*}$ can be shown to serve as a first-order approximation to the bias of the naïve plug-in [22]. This allows one to perform a so-called one-step bias correction. We define the one-step estimator as

(7) $\psi^{\mathrm{I}}_{n,\theta^*}(h) = \psi_{\hat{P}_n,\theta^*}(h) + n^{-1} \sum_{i=1}^n \phi_{\hat{P}_n,\theta^*}(O_i; h) = \psi_{\hat{P}_n,\theta^*}(h) + n^{-1} \sum_{i=1}^n \frac{n^{-1} \sum_{j=1}^n g_n(A_i \mid W_j)}{g_n(A_i \mid W_i)} \{Y_i - Q_n(W_i, A_i)\} \left\{ h(A_i) - n^{-1} \sum_{j=1}^n h(A_j) \right\}.$
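Equation (7) can be computed directly from the fitted nuisances. In the sketch below, `g_matrix[i, j]` holds $g_n(A_i \mid W_j)$, so its diagonal is $g_n(A_i \mid W_i)$ and its row means estimate the marginal density of $A$ at $A_i$; all names are ours, introduced for illustration.

```python
import numpy as np

def psi_one_step(theta_n_A, theta_star_A, h_A, Y, Q_n_A, g_matrix):
    """One-step estimator of eq. (7): naive plug-in plus the empirical
    average of the estimated efficient influence function."""
    d = theta_n_A - theta_star_A
    plug_in = np.mean((d - d.mean()) * h_A)
    # weights: E_n[g_n(A_i | W)] / g_n(A_i | W_i), a density-ratio correction
    weights = g_matrix.mean(axis=1) / np.diag(g_matrix)
    correction = np.mean(weights * (Y - Q_n_A) * (h_A - h_A.mean()))
    return float(plug_in + correction)
```

When the outcome regression fits the data perfectly, the correction term vanishes and the one-step estimator reduces to the plug-in.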

The following theorem states that, under mild regularity conditions, the one-step estimator is asymptotically linear and hence asymptotically Gaussian.

Theorem 1

Under Assumptions B1 and B2, the one-step estimator $\psi^{\mathrm{I}}_{n,\theta^*}(h)$ has the asymptotically linear representation in (6), with $\sup_{h \in \mathcal{H}} |r_n(h)| = o_P(n^{-1/2})$.

While one-step estimators are asymptotically efficient, they are not guaranteed to be compatible, in the sense that there may exist no probability distribution $P$ in the model $\mathcal{M}$ such that $\psi_{P,\theta^*}(h) = \psi^{\mathrm{I}}_{n,\theta^*}(h)$ for all $h$ in $\mathcal{H}$. TML-based estimation is an appealing alternative strategy that can be used to construct an estimator for $\{\psi_{0,\theta^*}(h) : h \in \mathcal{H}\}$ that is compatible in this sense. TML estimators correct the bias of the naïve plug-in by updating the initial estimator $\hat{P}_n$ of $P_0$ to obtain a new estimator $\tilde{P}_n$ such that the updated plug-in $\{\psi_{\tilde{P}_n,\theta^*}(h) : h \in \mathcal{H}\}$ has reduced bias for the target estimand $\{\psi_{0,\theta^*}(h) : h \in \mathcal{H}\}$. Therefore, as long as $\tilde{P}_n$ resides within $\mathcal{M}$, the TML estimator is compatible. In our presentation, we only briefly summarize some of the main principles of targeted learning, and we refer readers to previous studies [23,24] for a comprehensive discussion.

The main idea behind TML-based estimation is to construct an updated estimator $\tilde{P}_n$ based on the initial estimator $\hat{P}_n$ so that the following efficient influence function estimating equations are satisfied:

$\sup_{h \in \mathcal{H}} \left| n^{-1} \sum_{i=1}^n \phi_{\tilde{P}_n,\theta^*}(O_i; h) \right| = o_P(n^{-1/2}),$

and such that $\tilde{P}_n$ remains sufficiently close to $\hat{P}_n$, so as to remain a good estimator for $P_0$. Because estimating a marginal distribution function using the empirical distribution function does not generate bias for the target estimand, we do not need to update the initial estimators $F_{n,A}$ and $F_{n,W}$ for $F_{0,A}$ and $F_{0,W}$. We only need to obtain an updated estimator $\tilde{Q}_n$ for the conditional mean $Q_0$, since the initial estimator makes a bias–variance trade-off that is suboptimal for estimation of the target parameter.

It can be verified algebraically that for any choice of $\tilde{Q}_n$, the empirical average of the efficient influence function evaluated at $\tilde{P}_n$ can be expressed as

$n^{-1} \sum_{i=1}^n \phi_{\tilde{P}_n,\theta^*}(O_i; h) = n^{-1} \sum_{i=1}^n \{Y_i - \tilde{Q}_n(W_i, A_i)\} Z_n(W_i, A_i; h),$

where we define $Z_n(w, a; h)$ as

(8) $Z_n(w, a; h) \equiv \frac{n^{-1} \sum_{i=1}^n g_n(a \mid W_i)}{g_n(a \mid w)} \left\{ h(a) - n^{-1} \sum_{i=1}^n h(A_i) \right\}.$

Thus, $\tilde{Q}_n$ satisfies the efficient influence function estimating equations at an adequate level if

(9) $\sup_{h \in \mathcal{H}} \left| n^{-1} \sum_{i=1}^n \{Y_i - \tilde{Q}_n(W_i, A_i)\} Z_n(W_i, A_i; h) \right| = o_P(n^{-1/2}).$
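The quantity $Z_n$ in (8) acts as a "clever covariate": a density-ratio weight times the empirically centered $h$. The sketch below evaluates it at the observations for a finite dictionary of $h$'s and measures how far a candidate $\tilde{Q}_n$ is from solving (9); the names and finite-dictionary simplification are ours.

```python
import numpy as np

def clever_covariates(g_matrix, H_values):
    """Z_n(W_i, A_i; h_j) for each dictionary element h_j, eq. (8).
    g_matrix[i, j] = g_n(A_i | W_j); its row means estimate the marginal
    density of A at A_i, and its diagonal gives g_n(A_i | W_i)."""
    weights = g_matrix.mean(axis=1) / np.diag(g_matrix)
    Hc = H_values - H_values.mean(axis=1, keepdims=True)
    return Hc * weights  # (m, n): row j is Z_n( . , . ; h_j) at the data

def sup_score(Y, Q_tilde_A, Z):
    """sup_h |n^{-1} sum_i {Y_i - Q_tilde(W_i, A_i)} Z_n(W_i, A_i; h)|, eq. (9)."""
    scores = Z @ (Y - Q_tilde_A) / Y.shape[0]
    return float(np.max(np.abs(scores)))
```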

We now discuss how to construct a $\tilde{Q}_n$ that satisfies (9). Let $Q_{n,\beta}$ be a parametric working model indexed by a scalar parameter $\beta \in \mathbb{R}$, for which $Q_{n,\beta} = Q_n$ when $\beta = 0$. We construct the working model so that the derivative of the squared error loss is equal in magnitude to the supremum over $\mathcal{H}$ of the empirical average of $\phi_{\tilde{P}_n,\theta^*}(\cdot\,; h)$ with $\tilde{Q}_n = Q_{n,\beta}$, i.e.,

(10) $-\frac{d}{d\beta} (2n)^{-1} \sum_{i=1}^n \{Y_i - Q_{n,\beta}(W_i, A_i)\}^2 = \sup_{h \in \mathcal{H}} \left| n^{-1} \sum_{i=1}^n \{Y_i - Q_{n,\beta}(W_i, A_i)\} Z_n(W_i, A_i; h) \right|.$

We then take $\tilde{Q}_n = Q_{n,\beta_n}$, where $\beta_n$ is a near minimizer of the squared error loss and satisfies

$n^{-1} \sum_{i=1}^n \{Y_i - Q_{n,\beta_n}(W_i, A_i)\}^2 \leq \inf_{\beta \in \mathbb{R}} n^{-1} \sum_{i=1}^n \{Y_i - Q_{n,\beta}(W_i, A_i)\}^2 + \delta_n,$

for a small positive sequence $\delta_n \to 0$. Because $\beta_n$ is a near minimizer of the loss, we can see that this choice of $\tilde{Q}_n$ satisfies (9) for $\delta_n$ sufficiently small. We note that a submodel satisfying (10) is referred to as a universal least favorable submodel and provides the maximal reduction of the bias of $\{\psi_{\tilde{P}_n,\theta^*}(h) : h \in \mathcal{H}\}$ as $Q_{n,\beta}$ moves away from $Q_n$ toward $Q_{n,\beta_n}$. Strategies based on a locally least favorable submodel, which would satisfy (10) only when $\beta = 0$, could alternatively have been considered, but such approaches tend to perform worse in small samples, in particular when the target estimand is multidimensional or infinite-dimensional [26]. We recursively define the universal least favorable submodel $Q_{n,\beta}$ point-wise as

(11) Q n , β ( w , a ) = Q n ( w , a ) + 0 β Z n ( a , w ; h n , b ) d b , β 0 Q n ( w , a ) β 0 Z n ( w , a ; h n , b ) d b , β < 0 ,

where h n , β is a solution to

n 1 i = 1 n { Y i Q n , β ( W i , A i ) } Z n ( W i , A i ; h n , β ) = sup h n 1 i = 1 n { Y i Q n , β ( W i , A i ) } Z n ( W i , A i ; h ) .

It can be verified using the fundamental theorem of calculus that the above working model satisfies (10).

After obtaining the updated estimator Q ˜ n , we can construct the TML estimator as

(12) ψ n , θ * II ( h ) = ψ P ˜ n , θ * ( h ) = n 1 i = 1 n θ P ˜ n ( A i ) θ * ( A i ) n 1 j = 1 n { θ P ˜ n ( A j ) θ * ( A j ) } h ( A i ) ,

where θ P ˜ n : a n 1 i = 1 n Q ˜ n ( W i , a ) is the updated TML estimator for the dose–response function. The following theorem states that the TML estimator satisfies the conditions of Lemma 2.

Theorem 2

Under Assumptions B1 and B2, the TML-based estimator ψ n , θ * II ( h ) has the asymptotically linear representation in (6) with sup h r n ( h ) = o P ( n 1 2 ) .

3.2 Inference on Ψ 0 , θ * ( )

We are now prepared to discuss inference on Ψ 0 , θ * ( ℋ ) = sup h ψ 0 , θ * ( h ) . Suppose that we have an estimator { ψ n , θ * ( h ) : h ∈ ℋ } for { ψ 0 , θ * ( h ) : h ∈ ℋ } and a function class ℋ that satisfy the conditions of Lemma 2. Such an estimator could be obtained using either of the strategies presented in Section 3.1. Consider the plug-in estimator Ψ n , θ * ( ℋ ) ≡ sup h ψ n , θ * ( h ) . The continuous mapping theorem implies that n 1 2 sup h ψ n , θ * ( h ) ψ 0 , θ * ( h ) converges weakly to sup h G ( h ) , where G is the Gaussian process in Lemma 2. This, in combination with the reverse triangle inequality, implies that

Ψ n , θ * ( ) Ψ 0 , θ * ( ) = O P ( n 1 2 ) .

Furthermore, because ψ 0 , θ * ( h ) = 0 for all h ∈ ℋ when the null hypothesis holds, n 1 2 Ψ n , θ * ( ℋ ) converges weakly to sup h G ( h ) under the null. Thus, Ψ n , θ * ( ℋ ) is a consistent estimator for Ψ 0 , θ * ( ℋ ) that has a fully characterizable null limiting distribution. This makes it possible to construct an asymptotically valid hypothesis test based on the estimator.

We make this explicit by providing the form of our proposed test below. Letting t 1 − α * denote the 1 − α quantile of the null limiting distribution of n 1 2 Ψ n , θ * ( ℋ ) , we consider the hypothesis test

T ( O 1 , … , O n ) = “Reject” if n 1 2 Ψ n , θ * ( ℋ ) > t 1 − α * , “Fail to reject” if n 1 2 Ψ n , θ * ( ℋ ) ≤ t 1 − α * .

This test achieves type-1 error control at the level α when the conditions of Lemma 2 are satisfied. Additionally, when ℋ contains a function that is nonorthogonal to θ ¯ 0 − θ ¯ * (so that Ψ 0 , θ * ( ℋ ) > 0 ), the above test is consistent in the sense that the probability of rejecting the null tends to one. To see this, we observe that n 1 2 Ψ n , θ * ( ℋ ) tends to infinity because Ψ n , θ * ( ℋ ) converges to Ψ 0 , θ * ( ℋ ) > 0 in probability.

Moreover, the fact that our proposed test achieves nominal type-1 error control asymptotically implies that the confidence band in (4) is a uniform confidence band. This follows from the facts that

P 0 ( θ ¯ 0 ( a ) C n α ( a ) for all a A ) P 0 ( θ ¯ 0 C n α ) ,

and that P 0 ( θ ¯ 0 C n α ) approaches 1 α in the limit of large n when Θ contains θ 0 . This confidence band will in general achieve a coverage rate not lower than 1 α , though this lower bound is not tight. This type of confidence band construction has been shown to result in over-coverage by Hudson et al. [19].

While the null limiting distribution of Ψ n , θ * ( ) can indeed be characterized, a closed form expression may not be available. It may therefore be necessary to use an approximation. We use the multiplier bootstrap method presented in the study of Hudson et al. [19], which makes use of the asymptotic linearity of ψ n , θ * ( h ) . We first note that due to the uniform asymptotic linearity of ψ n , θ * , under the null hypothesis, Ψ n , θ * ( ) can be expressed as the sum of the supremum of an empirical process and an asymptotically negligible remainder. That is,

(13) Ψ n , θ * ( ) = sup h n 1 i = 1 n ϕ 0 , θ * ( O i ; h ) + o P ( n 1 2 ) .

We can approximate the null distribution of Ψ n , θ * ( ℋ ) as the supremum of a bootstrapped empirical process that attains the same limiting distribution as the empirical process in (13), conditional on the observed data. For m = 1 , … , M and M large, let ξ 1 m , … , ξ n m be i.i.d. random variables, independent of O 1 , … , O n , with mean zero, unit variance, and E 0 [ ∣ ξ i m ∣ 2 + u ] < ∞ for some u > 0 , and let ϕ n , θ * be an estimator for the efficient influence function. We define the m th bootstrap sample of Ψ n , θ * ( ℋ ) as

(14) Ψ n , θ * m ( ) = sup h n 1 i = 1 n ξ i m ϕ n , θ * ( O i ; h ) .

It is shown in the study of Hudson et al. [19] that if ϕ n , θ * is a consistent estimator for ϕ 0 , θ * , and if is not overly complex, the multiplier bootstrap statistic converges weakly to sup h G ( h ) , conditional on O 1 , , O n . Thus, the distribution of the multiplier bootstrap samples closely approximates the null limiting distribution of Ψ n , θ * ( ) in the limit of large n .

To estimate the efficient influence function, one approach is to use the plug-in estimator

(15) ϕ n , θ * ( w , a , y ; h ) = θ n ( a ) θ * ( a ) n 1 i = 1 n { θ n ( A i ) θ * ( A i ) } h ( a ) n 1 i = 1 n h ( A i ) + n 1 i = 1 n g n ( a W i ) g n ( a w ) { y Q n ( w , a ) } h ( a ) n 1 i = 1 n h ( A i ) + n 1 i = 1 n Q n ( w , A i ) h ( A i ) n 1 j = 1 n h ( A j ) n 1 i = 1 n { Y i Q n ( W i , A i ) } Z n ( W i , A i ; h ) ,

where we recall that θ n : a n 1 i = 1 n Q n ( W i , a ) is the plug-in estimator for θ 0 , and Z n is as defined in (8). Alternatively, we can observe that because θ ¯ 0 = θ ¯ * under the null, in (15), we can replace θ * with an estimator of θ 0 , such as the plug-in θ n . We note that when we substitute θ * by θ n in (15), some cancellation occurs, and the first line in the above expression vanishes. This strategy of replacing θ * with an estimator for θ 0 is appealing because the bootstrap approximation of the limiting distribution no longer depends on θ * , so one can test any hypothesis of the form (1) using the same bootstrap sample. This is particularly useful when we are interested in constructing a confidence set for θ ¯ 0 by inverting our proposed test, as this requires us to test a large collection of hypotheses.

In our presentation so far, we have assumed that the class ℋ is fixed. We acknowledge that in practice, fixing a class ℋ a priori may be challenging, so data-adaptive approaches for selecting ℋ may be preferable. It has been shown by Hudson et al. [19] that data-adaptive selection of ℋ does not affect the type-1 error rate of our proposed test as long as the data-adaptive choice converges to a fixed class. In Section 4, we propose an approach for data-adaptive selection of ℋ , and we later show in simulations that our approach is asymptotically valid.

4 Implementation

4.1 Construction of ℋ

Recall that it is our objective to construct ℋ as a model for h 0 in (5) so that Ψ 0 , θ * ( ℋ ) can be interpreted as the standard deviation of the difference between the true dose–response curve and the candidate null parameter. As discussed in Section 3.2, we can choose a flexible nonparametric model, so long as the model is not overly complex and the conditions of Lemma 2 are satisfied. In what follows, we describe a practical approach for constructing ℋ as a subspace of a reproducing kernel Hilbert space (RKHS), similar to the approach used by Hudson et al. [19] to implement their proposed nonparametric score test. This construction of ℋ is justified because RKHSs are known to be Donsker classes [27].

For a positive-semidefinite kernel function K from A × A to R + , let S K denote its unique RKHS, endowed with the inner product , S K . The kernel function K has the eigen-decomposition

( a 1 , a 2 ) K ( a 1 , a 2 ) = d = 1 γ d η d ( a 1 ) η d ( a 2 ) ,

where the eigenfunctions { η 1 , η 2 , … } are orthogonal with respect to the RKHS inner product ⋅ , ⋅ S K , and γ 1 ≥ γ 2 ≥ ⋯ ≥ 0 are the eigenvalues. Any function s in the RKHS can be expressed as a linear combination of the eigenfunctions. That is, there exist coefficients c 1 , c 2 , … such that s ( a ) = d = 1 c d η d ( a ) for all a . The roughness of s can be measured by the RKHS norm as

J ( s ) s , s S K = d = 1 c d 2 γ d ,

with higher values of J ( s ) corresponding to greater roughness. We construct ℋ as a subset of functions in S K with bounded roughness and unit variance. That is,

ℋ κ ≡ { h = d = 1 c d η d : c 1 , c 2 , … ∈ R , J ( h ) ≤ κ , Var n ( h ( A ) ) = 1 } ,

where Var n ( h ( A ) ) is the empirical variance of h ( A ) , and κ > 0 is a tuning parameter. To facilitate computation, we truncate the eigenbasis at some large level D .

In our implementation, we select S K as the second-order Sobolev space on [ 0 , 1 ] , which can be defined as an RKHS endowed with the inner product ( s 1 , s 2 ) s 1 , s 2 S K = 0 1 s ¨ 1 ( u ) s ¨ 2 ( u ) d u , where s ¨ denotes the second derivative of any given function s . In this case, the eigenfunctions and eigenvalues are available in closed form and can be expressed as

η 2 d 1 : a 2 cos ( 2 π d a ) , η 2 d : a 2 sin ( 2 π d a ) , γ 2 d 1 = γ 2 d = ( 2 π d ) 4 ,

for d = 1 , 2 , [28].
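The eigenbasis above is straightforward to compute. The following is a minimal sketch in Python (the paper's own implementation is in R); `sobolev_basis` and `roughness` are our hypothetical helper names, and the basis is truncated at level D as described in the text.

```python
import numpy as np

def sobolev_basis(a, D):
    """Evaluate the first D trigonometric eigenfunctions of the second-order
    Sobolev kernel on [0, 1] at the points in `a`, with their eigenvalues."""
    a = np.asarray(a, dtype=float)
    H = np.empty((a.shape[0], D))
    gamma = np.empty(D)
    for idx in range(D):
        d = idx // 2 + 1                         # frequency index
        if idx % 2 == 0:                         # eta_{2d-1}: cosine
            H[:, idx] = np.sqrt(2.0) * np.cos(2.0 * np.pi * d * a)
        else:                                    # eta_{2d}: sine
            H[:, idx] = np.sqrt(2.0) * np.sin(2.0 * np.pi * d * a)
        gamma[idx] = (2.0 * np.pi * d) ** (-4)
    return H, gamma

def roughness(c, gamma):
    """RKHS roughness J(s) = sum_d c_d^2 / gamma_d of s = sum_d c_d eta_d."""
    c = np.asarray(c, dtype=float)
    return float(np.sum(c ** 2 / gamma))
```

On a fine grid, the columns of the basis matrix are (numerically) orthonormal in L2[0, 1], which makes the roughness penalty diagonal in the coefficients.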

We conclude by discussing selection of the tuning parameter κ . Our goal is to select κ large enough so that ℋ κ contains a good approximation of h 0 . Suppose h 0 belongs to the RKHS S K . If prior knowledge of h 0 were available, a natural choice would be to set κ = κ 0 , where we define

κ 0 = J ( h 0 ) = J ( θ 0 θ * ) Var 0 ( θ 0 ( A ) θ * ( A ) ) .

Because κ 0 is typically unknown as it depends on θ 0 θ * , we may in practice rely upon an estimate.

We propose to use a simple plug-in estimator for κ 0 . Consider the following transformation of the observed data:

f n ( O i ) = n 1 j = 1 n Q n ( W j , A i ) + n 1 j = 1 n g n ( A i W j ) g n ( A i W i ) { Y i Q n ( W i , A i ) } θ * ( A i ) .

It is shown in [5] that one can consistently estimate θ 0 θ * by regressing f n ( O i ) on A . We estimate θ 0 θ * as c 0 , n + d = 1 D c d , n η d , where the coefficients are the minimizers of a penalized least squares loss, namely

(16) c 0 , n , c 1 , n , … , c D , n = arg min c 0 , c 1 , … , c D ∈ R n 1 i = 1 n f n ( O i ) c 0 d = 1 D c d η d ( A i ) 2 + λ d = 1 D c d 2 γ d ,

where λ > 0 is a tuning parameter. The penalty term in (16) controls the RKHS norm of the resulting estimate, with smaller values of λ corresponding to a less smooth estimate. To select λ , we perform cross-validation for a large set of candidate values, and we choose the largest candidate for which the cross-validation error is within one standard error of the minimum cross-validation error. This strategy provides a parsimonious estimate of θ 0 θ * that fits the observed data well. In practice, the resulting estimate will often be less rough than θ 0 θ * [29]. Finally, we estimate h 0 as

h n = d = 1 D c d , n η d n 1 i = 1 n η d ( A i ) Var n 1 2 d = 1 D c d , n η d ( A ) ,

and we estimate κ 0 as

κ n = J ( h n ) = d = 1 D c d , n 2 γ d Var n d = 1 D c d , n η d ( A ) .
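The penalized least squares problem in (16) has a closed form once the intercept is profiled out, which makes the plug-in estimate of κ 0 cheap to compute. Below is a sketch in Python (the paper uses R); `penalized_fit` and `kappa_hat` are hypothetical names, `H` is the n × D matrix of basis evaluations η d ( A i ) , and `f` holds the pseudo-outcomes f n ( O i ) .

```python
import numpy as np

def penalized_fit(f, H, gamma, lam):
    """Closed-form minimizer of the penalized least squares criterion (16).
    Profiling out the intercept reduces (16) to a ridge-type problem in the
    centered basis, with penalty matrix diag(1/gamma)."""
    n, D = H.shape
    Hc = H - H.mean(axis=0)          # centering absorbs the intercept c_0
    fc = f - f.mean()
    A = Hc.T @ Hc / n + lam * np.diag(1.0 / gamma)
    c = np.linalg.solve(A, Hc.T @ fc / n)
    c0 = f.mean() - H.mean(axis=0) @ c
    return c0, c

def kappa_hat(c, gamma, H):
    """Plug-in estimate of kappa_0: roughness of the fitted curve divided by
    the empirical variance of its fitted values at the observed exposures."""
    fitted = H @ c
    return float(np.sum(c ** 2 / gamma) / fitted.var())
```

As λ grows, the fitted coefficients shrink toward zero, producing the smoother (more parsimonious) estimates favored by the one-standard-error rule described above.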

4.2 Calculation of Ψ n , θ * ( )

We now describe how to calculate ψ n , θ * ( h ) and Ψ n , θ * ( ) . It is first necessary to estimate the nuisance parameters Q 0 and g 0 . The conditional mean Q 0 can be estimated using any of a wide variety of flexible nonparametric estimators, such as artificial neural networks [30], the highly adaptive LASSO [25], or the Super Learner [31]. In this work, we use the highly adaptive LASSO, which is implemented in the publicly available R package hal9001.

To construct a nonparametric estimator for the conditional density function g 0 , we first observe that g 0 can be approximated by a conditional mean function. Let π be a non-negative and symmetric function from R to R + for which R π ( u ) d u = 1 . We define g 0 , r as

g 0 , r ( a w ) = E 0 [ r 1 π ( r 1 { A a } ) W = w ] ,

where r > 0 is a bandwidth. It can be shown that g 0 , r ( a w ) tends to g 0 ( a w ) as r tends to zero. It is therefore sensible to estimate g 0 ( a w ) using an estimator for the conditional mean g 0 , r ( a w ) for sufficiently small r . We treat the bandwidth r as a tuning parameter that modulates the smoothness of the conditional density estimate in a , with smaller r corresponding to lesser smoothness. This estimator can be viewed as a generalization of the kernel density estimator for learning a marginal density function.

For a given bandwidth r and a fine grid of fixed points a 1 < a 2 < , one can estimate each g 0 , r ( a j ) using a flexible nonparametric estimator for the conditional mean. One can then obtain estimates at intermediate points via linear interpolation. To ensure that the conditional density estimate is non-negative, we fit a flexible nonparametric model for log g 0 , r using the highly adaptive LASSO and transform the resulting model fit, similarly as one would estimate a conditional mean in a generalized linear model with a log link. The bandwidth r can be selected by performing cross-validation using the log loss function. We note that performing cross-validation can be very slow as estimating g 0 , r ( a j ) at a large number of grid points for several different choices of bandwidth can be computationally intensive. An alternative strategy that we suggest is to consider a kernel density estimator for the marginal density of A and to select r as the bandwidth for the kernel density estimator that minimizes the cross-validation error for the marginal density. We expect this approach to perform reasonably well as long as the conditional density of A at any given w is not much less smooth than the marginal density.
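To make the smoothed-density construction concrete, the sketch below (Python, rather than the paper's R) implements the covariate-free special case, in which regressing the kernel-transformed pseudo-outcome on nothing reduces g 0 , r to the ordinary kernel density estimator; `gaussian_pi` and `g_r_hat` are hypothetical names. In the actual procedure, the empirical mean would be replaced by a flexible regression on W, e.g., the highly adaptive LASSO.

```python
import numpy as np

def gaussian_pi(u):
    """A symmetric, nonnegative kernel integrating to one."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def g_r_hat(a_grid, A, r):
    """Estimate g_{0,r}(a) = E[ r^{-1} pi(r^{-1}(A - a)) ] with no
    covariates: the empirical mean of the pseudo-outcome, i.e. the usual
    kernel density estimator evaluated on a grid."""
    a_grid = np.asarray(a_grid, dtype=float)
    pseudo = gaussian_pi((A[None, :] - a_grid[:, None]) / r) / r
    return pseudo.mean(axis=1)
```

The resulting estimate is nonnegative by construction and integrates to approximately one, illustrating why the conditional version is a sensible generalization of the marginal kernel density estimator.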

We now discuss calculation of the one-step and TML estimators for ψ 0 , θ * ( h ) , given that estimators Q n and g n for Q 0 and g 0 are available. One can calculate the one-step estimator at h = η d as

ψ n , θ * I ( η d ) = n 1 i = 1 n θ n ( A i ) θ * ( A i ) n 1 k = 1 n { θ n ( A k ) θ * ( A k ) } η d ( A i ) n 1 k = 1 n η d ( A k ) + n 1 i = 1 n n 1 k = 1 n g n ( A i W k ) g n ( A i W i ) { Y i Q n ( W i , A i ) } η d ( A i ) n 1 k = 1 n η d ( A k ) ,

where, as in Section 3.1, θ n : a n 1 i = 1 n Q n ( W i , a ) is the plug-in estimator for the dose–response function. Because the one-step estimator is linear in h , it is easy to see that for any h = d = 1 D c d η d , ψ n , θ * I ( h ) can be expressed as

ψ n , θ * I ( h ) = d = 1 D c d ψ n , θ * I ( η d ) .

Computing the TML estimator is more involved than computing the one-step estimator as we need to calculate the TML update Q ˜ n of the initial conditional mean estimator Q n . Recall from Section 3.1 that we take Q ˜ n as the minimizer of the squared error loss along the parametric working model Q n , β in (11). For a small ε > 0 and a positive integer B , we approximate Q n , β at β = B ε as

Q n , B ε ( w , a ) = Q n ( w , a ) + ε b = 1 B Z n ( a , w ; h n , b ε ) ,

where we define h n , b ε as

(17) h n , b ε arg max h κ n 1 i = 1 n { Y i Q n , ε ( b 1 ) ( A i , W i ) } Z n ( A i , W i ; h ) .

The optimization problem in (17) is a quadratically constrained quadratic program and can be solved using publicly available software such as the CVXR package in R [32]. One can observe that because ℋ κ is symmetric, in the sense that h ∈ ℋ κ implies − h ∈ ℋ κ , the derivative of the squared error loss,

(18) D n ( β ) n 1 i = 1 n { Y i Q n , β ( W i , A i ) } Z n ( W i , A i ; h n , β ) ,

is necessarily nonpositive. One can therefore find a near-minimizer of the squared error loss by calculating Q n , B ε for incrementally increasing B until D n ( B ε ) is sufficiently small in magnitude. We take Q ˜ n = Q n , B n ε , where B n satisfies

D n ( B n ε ) = { n log ( n ) } 1 2 Var n 1 2 ( { Y Q n ( W , A ) } Z n ( W , A ; h n ) ) ,

where h n is the estimator for h 0 described in Section 4.1. This choice of B n ensures that D n ( B n ε ) approaches zero at a rate faster than n 1 2 , which is a key condition for establishing asymptotic linearity of the TML estimator. Additionally, choosing B n so that D n ( B n ε ) is not much smaller than necessary and tends to zero at only a slightly faster rate than n 1 2 helps to prevent Q ˜ n from being an overfitted estimator of Q 0 .
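The incremental update just described can be sketched as a simple loop. The Python sketch below (the paper's implementation is in R) is a deliberately simplified illustration: the density ratio appearing in Z n is taken to be one (as when the exposure density is known and does not depend on W), so Z n ( ⋅ ; h ) reduces to the empirically centered h, and the class ℋ κ is replaced by a small symmetric dictionary of candidate functions; `tmle_update` and the dictionary are our hypothetical constructions.

```python
import numpy as np

def tmle_update(Y, A, Q, h_list, eps=1e-3, max_steps=20000):
    """Greedy small-step TML update along the universal least favorable
    submodel.  Q holds the initial fitted values Q_n(W_i, A_i); at each step
    we find the dictionary direction maximizing the empirical score and take
    a small step, stopping once the score is below (n log n)^{-1/2}."""
    n = len(Y)
    Q = np.asarray(Q, dtype=float).copy()
    Z = np.column_stack([h(A) - h(A).mean() for h in h_list])
    Z = np.column_stack([Z, -Z])                  # symmetrize the dictionary
    tol = (n * np.log(n)) ** (-0.5)               # stopping threshold
    for _ in range(max_steps):
        scores = (Y - Q) @ Z / n                  # score in each direction
        k = int(np.argmax(scores))
        if scores[k] <= tol:                      # score equation solved
            break
        Q = Q + eps * Z[:, k]                     # one small submodel step
    return Q

# usage sketch on synthetic data with a deliberately poor initial fit
rng = np.random.default_rng(2)
A = rng.uniform(-1.0, 1.0, size=500)
Y = A ** 2 + 0.3 * rng.normal(size=500)
Q0 = np.zeros(500)
h_list = [lambda a: a, lambda a: a ** 2, lambda a: np.cos(np.pi * a)]
Qt = tmle_update(Y, A, Q0, h_list)
```

Each step both reduces the squared error loss and shrinks the largest score, so the loop terminates with the estimating equations solved to the stated tolerance.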

Now, for any η d , the TML estimator of ψ 0 , θ * ( η d ) can be expressed as

ψ n , θ * II ( η d ) = n 1 i = 1 n θ P ˜ n ( A i ) θ * ( A i ) n 1 k = 1 n { θ P ˜ n ( A k ) θ * ( A k ) } η d ( A i ) ,

where we recall that θ P ˜ n : a n 1 i = 1 n Q ˜ n ( W i , a ) is the updated TML estimator for the dose–response function. Because the TML estimator is linear in h , for any h = d = 1 D c d η d , we have

ψ n , θ * II ( h ) = d = 1 D c d ψ n , θ * II ( η d ) .

Having described how to compute the one-step and TML estimators for ψ 0 , θ * ( h ) , we now discuss how to calculate Ψ n , θ * ( κ ) . Observe that for h = d = 1 D c d η d , the one-step and TML estimators are linear in the coefficient vector c = ( c 1 , … , c D ) and can be expressed as U ( θ * ) c , where U ( θ * ) is a D -dimensional vector for which the d th element contains an estimator for ψ 0 , θ * ( η d ) . Let V be a D × D matrix where element ( d 1 , d 2 ) is

V d 1 , d 2 = n 1 i = 1 n η d 1 ( A i ) n 1 j = 1 n η d 1 ( A j ) η d 2 ( A i ) n 1 j = 1 n η d 2 ( A j )

so that the empirical variance of h ( A ) is Var n ( h ( A ) ) = c V c , and let Γ = diag ( γ 1 1 , … , γ D 1 ) , where γ 1 , … , γ D are the eigenvalues for the kernel K . We can express Ψ n , θ * ( κ ) as U ( θ * ) c n , where c n is defined as

(19) c n ∈ arg max { U ( θ * ) c : c V c = 1 , c Γ c ≤ κ } .

The Karush-Kuhn-Tucker conditions for the optimization problem in (19) imply that c n is the solution to

U ( θ * ) λ 1 ( V + λ 2 Γ ) c = 0 ,

where λ 1 > 0 and λ 2 > 0 are chosen so that the constraints are satisfied. With some algebra, one can show that

c n = λ 1 , n 1 ( V + λ 2 , n Γ ) 1 U ( θ * ) ,

where λ 2 , n satisfies

(20) U ( θ * ) ( V + λ 2 , n Γ ) 1 Γ ( V + λ 2 , n Γ ) 1 U ( θ * ) U ( θ * ) ( V + λ 2 , n Γ ) 1 V ( V + λ 2 , n Γ ) 1 U ( θ * ) = κ

and

(21) λ 1 , n = { U ( θ * ) ( V + λ 2 , n Γ ) 1 V ( V + λ 2 , n Γ ) 1 U ( θ * ) } 1 2 .

Finally, we can express Ψ n , θ * ( κ ) as

Ψ n , θ * ( κ ) = λ 1 , n 1 U ( θ * ) ( V + λ 2 , n Γ ) 1 U ( θ * ) .
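The computation in (19)–(21) can be carried out with a one-dimensional search for λ 2 , n . The Python sketch below (the paper's implementation is in R) locates λ 2 by bisection on the roughness ratio in (20); `sup_stat` is a hypothetical name, and the sketch assumes, as suggested by the Tikhonov path, that the ratio is decreasing in λ 2 and that κ exceeds its large-λ 2 limit so that a binding value exists.

```python
import numpy as np

def sup_stat(U, V, Gamma, kappa):
    """Compute sup{ U'c : c'Vc = 1, c'Gamma c <= kappa } via the KKT
    characterization c proportional to (V + lam2 Gamma)^{-1} U."""
    def c_of(lam2):
        return np.linalg.solve(V + lam2 * Gamma, U)
    def ratio(lam2):                     # (c'Gamma c)/(c'Vc), scale-free
        c = c_of(lam2)
        return (c @ Gamma @ c) / (c @ V @ c)
    if ratio(0.0) <= kappa:              # roughness constraint is slack
        lam2 = 0.0
    else:
        hi = 1.0
        for _ in range(100):             # expand until constraint reachable
            if ratio(hi) <= kappa:
                break
            hi *= 2.0
        lo = 0.0
        for _ in range(200):             # bisect to the binding value
            mid = 0.5 * (lo + hi)
            if ratio(mid) > kappa:
                lo = mid
            else:
                hi = mid
        lam2 = hi
    c = c_of(lam2)
    c = c / np.sqrt(c @ V @ c)           # rescale so that c'Vc = 1
    return float(U @ c), c
```

When the roughness constraint is slack, the solution reduces to the V-normalized direction of V^{-1}U; otherwise both constraints bind at the returned c.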

4.3 Bootstrap approximation of the null limiting distribution

We now discuss how to draw multiplier bootstrap samples to estimate the null limiting distribution of Ψ n , θ * ( κ ) . For m = 1 , , M and M large, we draw ξ 1 ( m ) , , ξ n ( m ) as independent standard normal random variables. Let ξ ¯ ( m ) = n 1 i = 1 n ξ i ( m ) , and let U ( m ) be a D -dimensional vector with d th element

U d ( m ) = n 1 i = 1 n ( ξ i ( m ) ξ ¯ ( m ) ) ϕ n , θ * ( O i ; η d ) ,

where we recall that ϕ n , θ * is the plug-in estimator for the efficient influence function in (15) (and as noted in Section 3.2, we may wish to replace θ * with an estimator for θ 0 ). The m th sample from the multiplier bootstrap estimate of the null distribution (see (14)) can be calculated as

Ψ n , θ * ( m ) ( κ ) = sup c ∈ R D { ( U ( m ) ) c : c Γ c ≤ κ , c V c = 1 } .

The above optimization problem can be solved using the same routine described in Section 4.2, simply replacing U with U ( m ) . Finally, for a realization t of Ψ n , θ * ( κ ) , a bootstrap p-value can be calculated as

ρ M ( t ) = M 1 m = 1 M 1 ( Ψ n , θ * ( m ) ( κ ) > t ) .
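The bootstrap loop itself is simple once a routine for the constrained supremum is available. The Python sketch below (the paper uses R) takes the matrix of estimated influence-function values and any supremum routine as inputs; `bootstrap_pvalue`, `phi`, and `sup_fn` are hypothetical names standing in for the quantities defined above.

```python
import numpy as np

def bootstrap_pvalue(phi, t, sup_fn, M=1000, rng=None):
    """Multiplier-bootstrap p-value for an observed statistic t.  phi is an
    n x D matrix with entries phi_n(O_i; eta_d); sup_fn maps a D-vector of
    bootstrapped scores U to the constrained supremum sup{U'c} (e.g. the
    routine of Section 4.2), supplied here as a callable."""
    rng = np.random.default_rng(rng)
    n, D = phi.shape
    exceed = 0
    for _ in range(M):
        xi = rng.standard_normal(n)                 # multipliers
        U_m = (xi - xi.mean()) @ phi / n            # bootstrapped scores
        if sup_fn(U_m) > t:
            exceed += 1
    return exceed / M
```

Because the same multiplier draws can be reused for any θ * (when θ * is replaced by an estimate of θ 0 , as noted in Section 3.2), one set of bootstrap samples suffices for inverting the test over many candidate nulls.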

4.4 Confidence band construction

In this section, we discuss how to visualize the confidence set for θ ¯ 0 obtained by inverting our proposed hypothesis test. Recall from Section 2.2 that we propose to report the smallest band that contains all functions belonging to the confidence set. This confidence band is defined as C n α , and its form is provided in (4).

We first need to construct a function class Θ that contains a collection of candidate values for θ 0 . While Θ can be a rich class, it cannot be entirely unrestricted. In fact, if Θ is too large, the confidence band C n α can possibly have infinite width. To see this, note that the confidence set C n α in (3) contains a set of functions θ for which Ψ n , θ ( ) is close to zero. It is possible to construct θ so that at any given a , θ ( a ) takes an arbitrarily large positive or negative value, but Ψ n , θ ( ) = 0 . For instance, if Ψ n , θ ( ) is constructed using TML-based estimation, this could be achieved by setting θ ( A i ) = θ P ˜ n ( A i ) for i = 1 , … , n and allowing θ to take any value at points where no data are observed. We would encounter the same issue if we instead used the one-step estimator. By selecting Θ as, e.g., a class of smooth functions, we are able to avoid this problem. On the other hand, we note that if Θ is not large enough to contain θ 0 , C n α is not guaranteed to achieve the nominal coverage rate. Given these considerations, we suggest selecting Θ as a class of functions that is no less smooth than a reasonable approximation of θ 0 . In our implementation, we construct Θ as a subset of functions belonging to an RKHS for which the RKHS norm is bounded above by a constant. That is, we take Θ = Θ ν , where

Θ ν θ = d = 1 D c d η d : d = 1 D c d 2 γ d < ν

for ν > 0 . We propose to set ν as the RKHS norm of a consistent estimate of θ 0 , which could be obtained using the method described in Section 4.1.

Let t 1 − α * be the 1 − α quantile of the null limiting distribution of n 1 2 Ψ n , θ * ( κ ) for a fixed κ . Given the above construction of Θ , the confidence band C n α takes the following form at any given point a 0 :

(22) C n α ( a 0 ) = inf d c d η d ( a 0 ) n 1 i = 1 n η d ( A i ) : d = 1 D c d 2 γ d ν , Ψ n , d c d η d ( κ ) n 1 2 t 1 α * , sup d c d η d ( a 0 ) n 1 i = 1 n η d ( A i ) : d = 1 D c d 2 γ d ν , Ψ n , d c d η d ( κ ) n 1 2 t 1 α * .

The optimization problems in (22) are challenging to solve because Ψ n , d c d η d ( ) does not have a closed form expression in the coefficients c 1 , , c D . Recall from Section 4.2 that we can write

Ψ n , d c d η d ( κ ) = λ 1 , n 1 U d = 1 D c d η d ( V + λ 2 , n Γ ) 1 U d = 1 D c d η d ,

where λ 1 , n and λ 2 , n are constants that depend on U d = 1 D c d η d . By instead treating λ 1 , n and λ 2 , n as fixed, we are able to obtain a closed form approximation of Ψ n , θ * ( κ ) . When Ψ n , θ * ( κ ) is constructed using either TML or one-step estimation, U d = 1 D c d η d is linear in the coefficients. As a result, when the closed form approximation of the test statistic is used, the optimization problem in (22) becomes a quadratically constrained quadratic program. As noted previously, this type of problem can be solved using publicly available software such as the CVXR package in R.

We conclude by discussing how to select the tuning parameters κ , λ 1 , and λ 2 . The choice of κ should have no bearing on the asymptotic coverage of the confidence set, though it may affect the confidence band’s width. In order for the confidence band to have optimal width, we need to select κ so as to maximize the power to reject any null hypothesis H 0 : θ ¯ 0 = θ ¯ 0 * . Although the optimal choice of κ generally depends on the specific null hypothesis being tested, the optimization problem in (22) would become complicated if κ were not fixed. For computational ease, we fix κ as a single value that is large enough so that we have reasonable power to reject a large set of null hypotheses. We set κ as an estimate for J ( θ 0 ) Var 0 ( θ 0 ( A ) ) , which can be obtained using the approach described in Section 4.1, so that we are well powered against nearly flat nulls when θ 0 is not very flat. Finally, we pick λ 1 , n and λ 2 , n to satisfy (20) and (21) with θ * = 0 . For this choice of λ 1 , n and λ 2 , n , the closed form approximation of Ψ n , θ * ( ) will be fairly accurate when θ * is nearly flat. Although λ 1 , n and λ 2 , n are data-dependent, asymptotic coverage will be unaffected if λ 1 , n and λ 2 , n converge to fixed constants.

5 Simulation study

5.1 Simulation setting

We begin by describing our approach for generating synthetic data sets. We first generate W 1 , … , W n as independent bivariate normal random vectors with mean zero, unit variance, and correlation 1 / 2 . Given W , we then draw A 1 , … , A n from a conditional distribution with density function

g 0 ( a ∣ w ) = expit ( ζ ( w ) a ) / ∫ − 1 1 expit ( ζ ( w ) u ) d u ⋅ 1 ( − 1 ≤ a ≤ 1 ) ,

where we define ζ as

ζ ( w ) = 3 { expit ( w 1 + w 2 ) − 1 / 2 } .

Random variables with the above conditional density can be generated via the inverse cumulative distribution function method.
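The inverse-CDF step can be sketched as follows in Python (the paper's simulations are in R). We read the tilt function as ζ ( w ) = 3 { expit ( w 1 + w 2 ) − 1 / 2 } , consistent with the mean-zero property used below; `expit` and `draw_exposures` are our hypothetical names, and the CDF is inverted on a fixed grid.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def draw_exposures(W, rng=None, grid_size=2001):
    """Draw A_i from the conditional density proportional to
    expit(zeta(w) * a) on [-1, 1] by numerically inverting the CDF."""
    rng = np.random.default_rng(rng)
    zeta = 3.0 * (expit(W[:, 0] + W[:, 1]) - 0.5)
    grid = np.linspace(-1.0, 1.0, grid_size)
    u = rng.uniform(size=W.shape[0])
    A = np.empty(W.shape[0])
    for i in range(W.shape[0]):
        dens = expit(zeta[i] * grid)               # unnormalized density
        cdf = np.cumsum(dens)
        cdf = (cdf - cdf[0]) / (cdf[-1] - cdf[0])  # normalize to [0, 1]
        A[i] = np.interp(u[i], cdf, grid)          # invert the CDF
    return A

# covariates: bivariate normal with mean zero, unit variance, correlation 1/2
rng = np.random.default_rng(4)
L = np.linalg.cholesky(np.array([[1.0, 0.5], [0.5, 1.0]]))
W = rng.standard_normal((1000, 2)) @ L.T
A = draw_exposures(W, rng=5)
```

Participants with larger ζ ( w ) receive exposures tilted toward the right end of [ − 1 , 1 ] , which is the confounding structure the estimators must adjust for.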

We generate the outcome Y under the following settings.

Setting 1:

In the first setting, we construct the conditional distribution of the outcome given the exposure and covariates so that the centered dose–response curve is zero. We draw Y from the model

Y = ζ ( W ) 1 A 2 + ε ,

where ε is a uniform random variable on [ − 2 , 2 ] . Because ζ ( W ) has mean zero, it can be seen that E 0 [ Q 0 ( W , a ) ] = 0 , as desired.

Setting 2:

In the second setting, we consider the case where θ ¯ 0 is nonzero. We construct a model for Y so that the dose–response function is

θ 0 : a 2 a + a 2 3 a 3 2 .

A plot of the dose–response function is provided in Figure 1. We generate Y as

Y = θ 0 ( A ) ζ ( W ) 1 A 2 + ε ,

where ε is again a uniform random variable on [ − 2 , 2 ] .

Figure 1: Plot of the dose–response function under the data-generating mechanism used in the simulation study.

Under each of the above settings, we use our proposed methodology to perform a test of the null hypothesis that the dose–response function is flat, i.e., H 0 : θ ¯ 0 = 0 . In the first setting, the null holds, and we would expect our approach to achieve nominal type-1 error control in the limit of large n . Under the second setting, the alternative holds, allowing us to assess the power of our proposed test.

We also examine the behavior of the proposed confidence bands when θ ¯ 0 ≠ 0 . We compute the interval C n α in (22) on an equally spaced grid of 50 points on the interval [ − 1 , 1 ] . We assess whether the bands are appropriate in the sense that they roughly capture the shape of the unknown dose–response function, and we evaluate the confidence bands’ coverage. We calculate both the simultaneous coverage probability and the average of the pointwise coverage probabilities over the 50 evaluation points.

We study the behavior of the following four variations of our proposal:

  • A one-step estimator for Ψ 0 , θ * is used, and we set κ as the oracle κ 0 .

  • A TML estimator for Ψ 0 , θ * is used, and we set κ as the oracle κ 0 .

  • A one-step estimator for Ψ 0 , θ * is used, and we set κ as the data-adaptive choice κ n .

  • A TML estimator for Ψ 0 , θ * is used, and we set κ as the data-adaptive choice κ n .

In each case, we use D = 20 basis functions.

We compare our proposed hypothesis test with an approach similar to that described in the study of Westling [18], which is based on estimating primitive functions of the dose–response function. As noted above, their approach can be viewed as a variation of our proposal in which we set ℋ = { a ↦ 1 ( a ≤ a 0 ) : a 0 ∈ A } . We use our own implementation of this procedure, which differs slightly in that we estimate ψ 0 , θ * ( h ) using a one-step estimator, whereas Westling [18] uses a cross-fitted estimator. We apply each of the above methods to 500 synthetic data sets for n ∈ { 100 , 200 , … , 500 } .

5.2 Simulation results

Figure 2 shows the Monte Carlo estimate of the distribution function for the p-values produced from each method under Setting 1, where the flat null holds. When the type-1 error rate is well-controlled for any significance level α , the distribution function should be linear. We find that our approach achieves type-1 error control near the nominal error level when the oracle choice of κ is provided, and the data-adaptive choice results in some modest anti-conservatism. The approach based on estimation of primitive functions also achieves nearly nominal type-1 error control.

Figure 2: Monte Carlo estimates of the empirical distribution of the p-values under the flat null.

Figure 3 shows the Monte Carlo estimate of the distribution functions for the p-values under Setting 2, where the alternative holds. We find that our proposal has high power when the oracle choice of κ is supplied, and power declines when a data-adaptive choice is used. The approach proposed in the study of Westling [18] outperforms our approach when κ is chosen data-adaptively but performs worse than our approach when the oracle choice is used. This suggests that making use of known structure on θ 0 can help us improve power to reject some alternative hypotheses, though there is a notable decline in performance when we attempt to learn the structure from the data.

Figure 3: Monte Carlo estimates of the empirical distribution of the p-values for a test against the flat null, under the alternative hypothesis.

Figure 4 shows the median upper and lower limits of the confidence bands that were constructed using our proposal. We find that the confidence bands are able to capture the shape of the dose–response curve, and the width of the bands decreases as the sample size grows, as expected. Figure 5 shows the uniform and average coverage probabilities of the proposed confidence bands. We find that when Θ is known, the confidence bands achieve simultaneous and average coverage at or above the nominal level. However, when Θ is selected data-adaptively, the uniform coverage rate falls far below the nominal level, though the average coverage rate remains near the nominal level. That the confidence bands have poorer coverage with data-adaptive selection of Θ is unsurprising as we only have theoretical coverage guarantees when Θ is known to contain θ 0 , and our approach only assures that the class contains a close approximation for θ 0 . This is evidently insufficient, in particular for uniform coverage.

Figure 4: Monte Carlo estimates of the median upper and lower limits of the 95% confidence bands obtained using our proposed methodology.

Figure 5: Monte Carlo estimates of the uniform coverage probability and average coverage probability of the 95% confidence bands obtained using our proposed methodology. The dashed gray line indicates the nominal coverage rate 0.95.

6 Data example

As an example, we use our method to analyze data from the Adaptive Strategies for Preventing and Treating Lapses of Retention in HIV Care (ADAPT-R) trial [20]. ADAPT-R was a sequential multiple assignment randomized trial run in Kenya that studied the effectiveness of interventions for optimizing retention in HIV treatment in a population of people living with HIV who initiated care. In this study, a question of secondary interest is whether the distance a participant must travel to reach the nearest HIV clinic affects their retention in care. We perform an analysis to address this secondary aim, pooling across the trial’s randomized arms.

We conduct our analysis using a sample of 1,815 participants from the ADAPT-R trial. We treat the distance from the nearest clinic as the exposure of interest. The distribution of the exposure variable is highly skewed: approximately 95% of study participants lived within 20 km of the nearest clinic, and among the remaining participants, distance ranged between 20 and 500 km. Because these extreme values are fairly rare, there is concern about potential violation of the positivity assumption. To avoid this issue, we exclude from this analysis participants who lived more than 20 km from the nearest clinic, obtaining a final sample size of 1,600. Our outcome is a binary variable equal to one if a patient had neither experienced a lapse in care (defined as missing a scheduled clinic visit by at least 14 days) nor had unsuppressed HIV viral load 1 year after initiating care, and zero otherwise. A total of 446 study participants experienced a lapse in care or had an unsuppressed HIV viral load within 1 year. As our exposure of interest is not randomized, we adjust for the following measured baseline variables, which may either confound the exposure–outcome relationship or predict the outcome and thereby improve efficiency: age, sex assigned at birth, and a wealth index.

In Figure 6, we display the marginal distribution of distance to clinic by retention status. There does not appear to be a strong association between distance and retention, as the marginal distribution is nearly the same in both groups. To assess the presence of an effect more formally, we apply our method to perform a test of the flat null. We use the data-adaptive choice of κ described in Section 4.1, and we use a one-step estimator for Ψ_{0,θ*}(⋅). Figure 6 shows a plug-in estimate of the centered dose–response function, along with 95% confidence bands and a p-value for a test of the flat null. The dose–response function appears to be nearly flat, and we are unable to reject the flat null hypothesis, as our p-value is quite large. These results suggest there is little evidence that distance from the clinic has a strong effect on retention in care among people living with HIV in this setting.
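The flat-null test rests on the observation that the dose–response function has zero variance exactly when it is constant. As a rough illustration of the plug-in idea on simulated data (this is not the authors' one-step/TML estimator, and a linear outcome regression stands in for a flexible learner), the sketch below fits an outcome regression, forms the g-computation curve θ̂(a) = n⁻¹Σᵢ Q̂(Wᵢ, a), and reports the variance of θ̂ over the observed exposures:

```python
import random

random.seed(7)

def solve3(M, b):
    # Gaussian elimination with partial pivoting for a 3x3 linear system.
    A = [row[:] + [bi] for row, bi in zip(M, b)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, 3):
            f = A[r][col] / A[col][col]
            for c in range(col, 4):
                A[r][c] -= f * A[col][c]
    x = [0.0, 0.0, 0.0]
    for r in range(2, -1, -1):
        x[r] = (A[r][3] - sum(A[r][c] * x[c] for c in range(r + 1, 3))) / A[r][r]
    return x

def flatness_statistic(W, A, Y):
    # OLS of Y on (1, A, W), then g-computation: theta_hat(a) = mean_i Qhat(W_i, a).
    # Returns the empirical variance of theta_hat over the observed exposures,
    # which is zero (up to noise) under a flat dose-response curve.
    n = len(Y)
    X = [[1.0, a, w] for a, w in zip(A, W)]
    XtX = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(3)] for r in range(3)]
    Xty = [sum(X[i][r] * Y[i] for i in range(n)) for r in range(3)]
    b0, b1, b2 = solve3(XtX, Xty)
    wbar = sum(W) / n
    theta = [b0 + b1 * a + b2 * wbar for a in A]  # theta_hat evaluated at each A_i
    tbar = sum(theta) / n
    return sum((t - tbar) ** 2 for t in theta) / n

def simulate(n, effect):
    # W confounds A; Y depends on A only through `effect`.
    W = [random.gauss(0, 1) for _ in range(n)]
    A = [0.5 * w + random.random() for w in W]
    Y = [1.0 + effect * a + 0.5 * w + random.gauss(0, 0.5) for a, w in zip(A, W)]
    return W, A, Y

flat_stat = flatness_statistic(*simulate(2000, effect=0.0))
eff_stat = flatness_statistic(*simulate(2000, effect=2.0))
```

Under the flat simulation the statistic is near zero, while a genuine exposure effect drives it well away from zero; the paper's contribution is precisely a calibrated reference distribution for a statistic of this flavor.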

Figure 6: (Top) Violin plot of the distribution of distance to the nearest clinic, by status of retention in care. (Bottom) Plug-in estimate of the centered dose–response function and 95% confidence bands.

7 Conclusion

This work provides a novel approach to inference on the causal dose–response function. We show that, under mild regularity conditions, our nonparametric test achieves type I error control near the nominal level and is well-powered against alternatives to the null. We also present a computationally tractable method for visualizing confidence sets constructed by inverting our proposed test. The recent proposal by Westling [18] also performs well under weak assumptions, though that work does not present a method for constructing or visualizing confidence sets. The introduction of a novel approach for constructing confidence sets is thus a key strength of our work.

The strategy for inference on the dose-response function we describe in this article can be adapted to address other problems of interest in the causal inference literature. For instance, one could use our approach to assess for treatment effect heterogeneity by testing the null hypothesis that a conditional average treatment effect curve is flat.

One of the main limitations of this work is that our test requires the specification of a function class to be operational, and this class can be challenging to select in practice. When the class is specified a priori and contains h_0, our proposal performs very well, but when we attempt to select the class data-adaptively, we suffer a loss in performance. We note that in some settings, an a priori selection may be possible. For instance, if an independent data set is available (e.g., from a closely related study), one could use it to construct the class without examining the data set of primary interest. Alternatively, in many settings it is sensible to assume, without looking at the data, that the dose–response function is monotone; one could then implement a version of our procedure in which the class consists of bounded monotone functions. In future work, we plan to develop improved strategies for tuning parameter selection.
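The monotone-class suggestion can be made concrete. As an illustrative building block (not part of the authors' procedure), the pooled adjacent violators algorithm (PAVA) computes the least-squares projection of a sequence onto nondecreasing sequences, which is the basic operation behind working with monotone function classes:

```python
def pava(y):
    # Pooled adjacent violators: least-squares projection of y onto
    # nondecreasing sequences. Maintains a stack of [block mean, block size]
    # and merges adjacent blocks that violate monotonicity.
    out = []
    for v in y:
        out.append([float(v), 1])
        while len(out) > 1 and out[-2][0] > out[-1][0]:
            m2, w2 = out.pop()
            m1, w1 = out.pop()
            out.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2])
    res = []
    for mean, wt in out:
        res.extend([mean] * wt)
    return res
```

For example, `pava([1.0, 3.0, 2.0])` pools the violating pair (3, 2) into their mean, yielding a nondecreasing fit.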

Our proposal also requires that the nuisance parameter estimators not be overly complex. This condition is somewhat prohibitive and prevents us from using more flexible estimators, such as gradient-boosted trees [33]. To avoid this assumption, one could develop a slightly modified version of our procedure in which ψ_{0,θ*} is estimated using cross-fitting [34,35].
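The cross-fitting idea referenced above [34,35] amounts to out-of-fold nuisance estimation: each observation's nuisance prediction comes from a model fit on folds that exclude it, which relaxes Donsker-type complexity restrictions on the nuisance estimators. A minimal generic sketch (function names are illustrative; a real implementation would fit the outcome regression and propensity on each training split):

```python
import random

def cross_fit(data, fit, predict, k=5, seed=0):
    # Return out-of-fold predictions: each observation is predicted by a
    # model trained only on the folds that exclude it.
    n = len(data)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    preds = [None] * n
    for fold in folds:
        held_out = set(fold)
        model = fit([data[i] for i in idx if i not in held_out])
        for i in fold:
            preds[i] = predict(model, data[i])
    return preds

# Toy nuisance estimator: predict the training-sample mean everywhere.
fit_mean = lambda train: sum(train) / len(train)
predict_mean = lambda model, x: model

y = [float(v) for v in range(10)]
preds = cross_fit(y, fit_mean, predict_mean, k=2, seed=1)
```

With two equal folds and a mean fit, the two fold means simply swap roles, so the out-of-fold predictions average to the overall mean while no observation influences its own prediction.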

Acknowledgements

We thank the Family AIDS Care and Education Services program operating HIV services in western Kenya, as well as the patients in these communities and the front-line health care workers in the region.

  1. Funding information: This research was supported by grants (R01 MH104123, K24 AI134413, and R01 AI074345) from the National Institutes of Health.

  2. Author contributions: AH developed this manuscript's theoretical contribution, conducted simulation experiments, led the data analysis, and prepared the manuscript. EHG, TAO, and EAB supported the data analysis. MLP and MJvdL jointly supervised this project.

  3. Conflict of interest: Maya L. Petersen and Mark J. van der Laan serve on the Editorial Advisory Board for the Journal of Causal Inference, but neither was involved in the peer review process for this article.

  4. Data availability statement: A de-identified data set from ADAPT-R can be shared upon request to the authors.

Appendix A Proofs of theoretical results

Proof of Lemma 1

Let $P$ be a distribution that satisfies conditions A1–A3, and let $p$ denote the density of $P$ with respect to a dominating measure $\mu$. Let $\omega : \mathcal{W} \times \mathcal{A} \times \mathcal{Y} \to \mathbb{R}$ be an arbitrary function with zero mean and finite variance under $P$. We define $P_\varepsilon$ as the parametric submodel with density function $p_\varepsilon : (w,a,y) \mapsto p(w,a,y)\{1 + \varepsilon\omega(w,a,y)\}$ for $\varepsilon$ small. Observe that $P_\varepsilon$ equals $P$ at $\varepsilon = 0$ and has score function $\omega$. Every regular parametric model that passes through $P$ with score function $\omega$ can be closely approximated by such a submodel. If there exists a function $\tilde{\phi}_P$ that satisfies

\[
(\mathrm{A1})\qquad \frac{d}{d\varepsilon}\psi_{P_\varepsilon,\theta^*}(h)\Big|_{\varepsilon=0} = E_P[\tilde{\phi}_P(O;h)\,\omega(O)],
\]

for any choice of $\omega$, then $\psi_{P,\theta^*}(h)$ is pathwise differentiable in a nonparametric model, and $\tilde{\phi}_P(\cdot\,;h) - E_P[\tilde{\phi}_P(O;h)]$ is the nonparametric efficient influence function. We show that such a function exists and that, when centered about its mean, it is equal to $\phi_{P,\theta^*}(\cdot\,;h)$.

We first evaluate the derivative of ψ P ε at ε = 0 . We express ψ P ε , θ * ( h ) as the sum of three components:

\[
\psi_{P_\varepsilon,\theta^*}(h) = G^{\mathrm{I}}_{P_\varepsilon}(h) - G^{\mathrm{II}}_{P_\varepsilon}(h) - G^{\mathrm{III}}_{P_\varepsilon}(h),
\]

where for any distribution P , we define

\[
\begin{aligned}
G^{\mathrm{I}}_{P}(h) &\equiv E_P[\theta_P(A)h(A)], \\
G^{\mathrm{II}}_{P}(h) &\equiv E_P[\theta_P(A)]\,E_P[h(A)], \\
G^{\mathrm{III}}_{P}(h) &\equiv E_P[\theta^*(A)h(A)] - E_P[\theta^*(A)]\,E_P[h(A)].
\end{aligned}
\]

To make calculation of the derivative of ψ P ε , θ * more manageable, we evaluate the derivative of the three additive components at ε = 0 .

We first calculate $\frac{d}{d\varepsilon}G^{\mathrm{I}}_{P_\varepsilon}(h)\big|_{\varepsilon=0}$. We can write $G^{\mathrm{I}}_{P_\varepsilon}(h)$ as

\[
G^{\mathrm{I}}_{P_\varepsilon}(h) = \int \theta_{P_\varepsilon}(a)\,h(a)\,p(w,a,y)\{1+\varepsilon\omega(w,a,y)\}\,d\mu(w,a,y),
\]

where $\theta_{P_\varepsilon}$ is the dose–response function under $P_\varepsilon$. To proceed, we need an expression for the derivative of $\theta_{P_\varepsilon}(a)$ with respect to $\varepsilon$. We can write

\[
\theta_{P_\varepsilon}(a) = \int Q_{P_\varepsilon}(w,a)\left[\int p(w,a,y)\{1+\varepsilon\omega(w,a,y)\}\,\mu(dy,da)\right]\mu(dw),
\]

where $Q_{P_\varepsilon}$ is the conditional mean of $Y$ given $(A,W)$ under $P_\varepsilon$. Letting $p_{\mathrm{i}}(y\mid a,w)$ denote the conditional density of $Y$ given $(A,W)$, it can be shown that the derivative of $Q_{P_\varepsilon}$ at $\varepsilon=0$ is

\[
\frac{d}{d\varepsilon}Q_{P_\varepsilon}(w,a)\Big|_{\varepsilon=0} = \int \{y - Q_P(w,a)\}\,\omega(w,a,y)\,p_{\mathrm{i}}(y\mid a,w)\,d\mu(y).
\]

Now, let the conditional density of $(A,Y)$ given $W$ be $p_{\mathrm{ii}}(a,y\mid w)$, and let the marginal density of $W$ be $p_{\mathrm{iii}}(w)$. We now calculate the derivative of $\theta_{P_\varepsilon}(a)$ at $\varepsilon = 0$:

\[
\begin{aligned}
(\mathrm{A2})\qquad \frac{d}{d\varepsilon}\theta_{P_\varepsilon}(a)\Big|_{\varepsilon=0}
&= \int \frac{d}{d\varepsilon}Q_{P_\varepsilon}(w,a)\Big|_{\varepsilon=0}\, p(w,a,y)\,\mu(dy,da)\,\mu(dw) + \int Q_P(w,a)\,p(w,a,y)\,\omega(w,a,y)\,\mu(dy,da)\,\mu(dw) \\
&= \int\left[\int \{y-Q_P(w,a)\}\,\omega(w,a,y)\,p_{\mathrm{i}}(y\mid a,w)\,d\mu(y)\right] p_{\mathrm{iii}}(w)\,\mu(dw) \\
&\quad + \int Q_P(w,a)\,p_{\mathrm{iii}}(w)\,p_{\mathrm{ii}}(a,y\mid w)\,\omega(w,a,y)\,\mu(dy,da)\,\mu(dw).
\end{aligned}
\]

We can now express the derivative of $G^{\mathrm{I}}_{P_\varepsilon}(h)$ as

\[
\begin{aligned}
\frac{d}{d\varepsilon}G^{\mathrm{I}}_{P_\varepsilon}(h)\Big|_{\varepsilon=0}
&= \int \frac{d}{d\varepsilon}\theta_{P_\varepsilon}(a)\Big|_{\varepsilon=0}\, h(a)\,p(w,a,y)\,d\mu(w,a,y) + \int \theta_P(a)\,h(a)\,p(w,a,y)\,\omega(w,a,y)\,d\mu(w,a,y) \\
&= \int\left[\int \{y-Q_P(w,a)\}\,\omega(w,a,y)\,p_{\mathrm{i}}(y\mid a,w)\,d\mu(y)\,p_{\mathrm{iii}}(w)\,\mu(dw)\right] h(a)\,p(w,a,y)\,d\mu(w,a,y) \\
&\quad + \int\left[\int Q_P(w,a)\,p_{\mathrm{ii}}(a,y\mid w)\,\omega(w,a,y)\,\mu(dy,da)\,p_{\mathrm{iii}}(w)\,\mu(dw)\right] h(a)\,p(w,a,y)\,d\mu(w,a,y) \\
&\quad + \int \theta_P(a)\,h(a)\,p(w,a,y)\,\omega(w,a,y)\,d\mu(w,a,y).
\end{aligned}
\]

Let the marginal density of $A$ be $p_{\mathrm{iv}}(a)$, and let the conditional density of $W$ given $A$ be $p_{\mathrm{v}}(w\mid a)$. Observe that we can express the marginal density of $W$ as

\[
p_{\mathrm{iii}}(w) = \frac{p_{\mathrm{v}}(w\mid a)\,p_{\mathrm{iv}}(a)}{g_P(a\mid w)}.
\]

Using this fact, we can write

\[
\begin{aligned}
(\mathrm{A3})\qquad \frac{d}{d\varepsilon}G^{\mathrm{I}}_{P_\varepsilon}(h)\Big|_{\varepsilon=0}
&= E_P\!\left[E_P\!\left[\frac{p_{\mathrm{iv}}(A)}{g_P(A\mid W)}\,E_P[\{Y - Q_P(W,A)\}\,\omega(W,A,Y)\mid W,A]\;\middle|\;A\right] h(A)\right] \\
&\quad + E_P\!\left[E_P\!\left[\frac{p_{\mathrm{iv}}(A)}{g_P(A\mid W)}\,Q_P(W,A)\,E_P[\omega(W,A,Y)\mid W]\;\middle|\;A\right] h(A)\right] \\
&\quad + E_P[\theta_P(A)\,h(A)\,\omega(W,A,Y)].
\end{aligned}
\]

By applying the law of total expectation in the first two lines, we obtain

\[
\begin{aligned}
\frac{d}{d\varepsilon}G^{\mathrm{I}}_{P_\varepsilon}(h)\Big|_{\varepsilon=0}
&= E_P\!\left[\frac{p_{\mathrm{iv}}(A)}{g_P(A\mid W)}\{Y - Q_P(W,A)\}\,h(A)\,\omega(W,A,Y)\right] \\
&\quad + E_P\!\left[E_P\!\left[\frac{p_{\mathrm{iv}}(A)}{g_P(A\mid W)}\,Q_P(W,A)\,h(A)\;\middle|\;W\right]\omega(W,A,Y)\right] \\
&\quad + E_P[\theta_P(A)\,h(A)\,\omega(W,A,Y)].
\end{aligned}
\]

Finally, by observing that in the second line above, the conditional densities of A given W cancel, we have

\[
\begin{aligned}
\frac{d}{d\varepsilon}G^{\mathrm{I}}_{P_\varepsilon}(h)\Big|_{\varepsilon=0}
&= E_P\!\left[\frac{p_{\mathrm{iv}}(A)}{g_P(A\mid W)}\{Y - Q_P(W,A)\}\,h(A)\,\omega(W,A,Y)\right] \\
&\quad + E_P[E_{P,A}[Q_P(W,A)\,h(A)]\,\omega(W,A,Y)] + E_P[\theta_P(A)\,h(A)\,\omega(W,A,Y)],
\end{aligned}
\]

where $E_{P,A}$ denotes expectation over the marginal distribution of $A$ under $P$.

We now evaluate the derivative of $G^{\mathrm{II}}_{P_\varepsilon}(h)$, which we can express as

\[
G^{\mathrm{II}}_{P_\varepsilon}(h) = \left[\int \theta_{P_\varepsilon}(a)\,p(w,a,y)\{1+\varepsilon\omega(w,a,y)\}\,d\mu(w,a,y)\right]\left[\int h(a)\,p(w,a,y)\{1+\varepsilon\omega(w,a,y)\}\,d\mu(w,a,y)\right].
\]

Evaluating its derivative at $\varepsilon = 0$ gives

\[
\begin{aligned}
\frac{d}{d\varepsilon}G^{\mathrm{II}}_{P_\varepsilon}(h)\Big|_{\varepsilon=0}
&= \left[\int h(a)\,p(w,a,y)\,d\mu(w,a,y)\right]\left[\int \frac{d}{d\varepsilon}\theta_{P_\varepsilon}(a)\Big|_{\varepsilon=0}\, p(w,a,y)\,d\mu(w,a,y)\right] \\
&\quad + \left[\int h(a)\,p(w,a,y)\,d\mu(w,a,y)\right]\left[\int \theta_P(a)\,p(w,a,y)\,\omega(w,a,y)\,d\mu(w,a,y)\right] \\
&\quad + \left[\int h(a)\,\omega(w,a,y)\,p(w,a,y)\,d\mu(w,a,y)\right]\left[\int \theta_P(a)\,p(w,a,y)\,d\mu(w,a,y)\right].
\end{aligned}
\]

Performing steps similar to those used to calculate $\frac{d}{d\varepsilon}G^{\mathrm{I}}_{P_\varepsilon}(h)\big|_{\varepsilon=0}$, it can be shown that

\[
\begin{aligned}
(\mathrm{A4})\qquad \frac{d}{d\varepsilon}G^{\mathrm{II}}_{P_\varepsilon}(h)\Big|_{\varepsilon=0}
&= E_P[h(A)]\,E_P\!\left[\frac{p_{\mathrm{iv}}(A)}{g_P(A\mid W)}\{Y - Q_P(W,A)\}\,\omega(W,A,Y)\right] \\
&\quad + E_P[E_P[h(A)]\,E_{P,A}[Q_P(W,A)]\,\omega(W,A,Y)] \\
&\quad + E_P[E_P[h(A)]\,\theta_P(A)\,\omega(W,A,Y)] + E_P[E_P[\theta_P(A)]\,h(A)\,\omega(W,A,Y)].
\end{aligned}
\]

Now, we take the derivative of the remaining term $G^{\mathrm{III}}_{P_\varepsilon}(h)$. Rather than perform this calculation directly, we recognize that $G^{\mathrm{III}}_{P}(h)$ is simply the covariance between $h(A)$ and $\theta^*(A)$ under $P$, and it is well known that the derivative can be expressed as

\[
(\mathrm{A5})\qquad \frac{d}{d\varepsilon}G^{\mathrm{III}}_{P_\varepsilon}(h)\Big|_{\varepsilon=0} = E_P[\{\theta^*(A)h(A) - E_P[h(A)]\,\theta^*(A) - E_P[\theta^*(A)]\,h(A)\}\,\omega(W,A,Y)].
\]

Now, by (A3), (A4), and (A5), we can express $\frac{d}{d\varepsilon}\psi_{P_\varepsilon,\theta^*}(h)\big|_{\varepsilon=0}$ as

\[
\frac{d}{d\varepsilon}\psi_{P_\varepsilon,\theta^*}(h)\Big|_{\varepsilon=0} = \frac{d}{d\varepsilon}\left[G^{\mathrm{I}}_{P_\varepsilon}(h) - G^{\mathrm{II}}_{P_\varepsilon}(h) - G^{\mathrm{III}}_{P_\varepsilon}(h)\right]\Big|_{\varepsilon=0} = E_P[\phi^*_P(W,A,Y;h)\,\omega(W,A,Y)],
\]

where we define

\[
\begin{aligned}
\phi^*_P(w,a,y;h) &\equiv \{\bar\theta_P(a) - \bar\theta^*_P(a)\}\{h(a) - E_P[h(A)]\} + \frac{p_{\mathrm{iv}}(a)}{g_P(a\mid w)}\{y - Q_P(w,a)\}\{h(a) - E_P[h(A)]\} \\
&\quad + E_P[Q_P(w,A)\{h(A) - E_P[h(A)]\}].
\end{aligned}
\]

The proof is completed by observing that $p_{\mathrm{iv}}(a) = E_P[g_P(a\mid W)]$ and

\[
\phi_{P,\theta^*}(\cdot\,;h) = \phi^*_P(\cdot\,;h) - E_P[\phi^*_P(W,A,Y;h)].\qquad\Box
\]
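The covariance derivative (A5) can be checked numerically on a discrete toy distribution: along the submodel $p_\varepsilon = p(1+\varepsilon\omega)$, a finite-difference derivative of $\mathrm{Cov}_{P_\varepsilon}(\theta^*(A), h(A))$ should match $E_P[\{\theta^*(A)h(A) - E_P[h(A)]\theta^*(A) - E_P[\theta^*(A)]h(A)\}\,\omega]$, since the remaining constant terms vanish when $\omega$ has mean zero. All values below are illustrative, and $\omega$ is taken to depend on $a$ only, a special case of a score $\omega(w,a,y)$:

```python
# Discrete toy check of (A5): pathwise derivative of the covariance term
# versus the inner product of the claimed influence function with the score.
support = [0.0, 1.0, 2.0, 3.0]
probs = [0.1, 0.2, 0.3, 0.4]
theta_star = lambda a: 2.0 * a + 1.0
h = lambda a: a * a - 0.5

raw = [1.0, -2.0, 0.5, 3.0]
m = sum(p * r for p, r in zip(probs, raw))
omega = [r - m for r in raw]  # centered so that E_P[omega] = 0

def cov(eps):
    # Covariance of theta*(A) and h(A) under the perturbed weights p(1 + eps*omega).
    wts = [p * (1.0 + eps * o) for p, o in zip(probs, omega)]
    Et = sum(w * theta_star(a) for w, a in zip(wts, support))
    Eh = sum(w * h(a) for w, a in zip(wts, support))
    Eth = sum(w * theta_star(a) * h(a) for w, a in zip(wts, support))
    return Eth - Et * Eh

d = 1e-5
fd = (cov(d) - cov(-d)) / (2 * d)  # central difference; cov is quadratic in eps

Et0 = sum(p * theta_star(a) for p, a in zip(probs, support))
Eh0 = sum(p * h(a) for p, a in zip(probs, support))
eif = sum(p * o * (theta_star(a) * h(a) - Eh0 * theta_star(a) - Et0 * h(a))
          for p, o, a in zip(probs, omega, support))
```

Because the perturbed covariance is a quadratic polynomial in $\varepsilon$, the central difference recovers the derivative at zero exactly up to floating-point error, and the two quantities agree.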

Proof of Lemma 2

This result is an immediate consequence of Slutsky's theorem (see, e.g., Theorem 7.15 of [36]).□

Proof of Theorem 1

The one-step estimator can be expressed as

\[
\psi^{\mathrm{I}}_{n,\theta^*}(h) - \psi_{0,\theta^*}(h) = n^{-1}\sum_{i=1}^n \phi_{P_0}(O_i;h) + R_{1,n}(h) + R_{2,n}(h),
\]

where we define

\[
\begin{aligned}
R_{1,n}(h) &\equiv n^{-1}\sum_{i=1}^n\{\phi_{\hat P,\theta^*}(O_i;h) - \phi_{P_0,\theta^*}(O_i;h)\} - \int \{\phi_{\hat P}(o;h) - \phi_{P_0}(o;h)\}\,dP_0(o), \\
R_{2,n}(h) &\equiv \psi_{n,\theta^*}(h) - \psi_{0,\theta^*}(h) + \int \phi_{\hat P,\theta^*}(o;h)\,dP_0(o).
\end{aligned}
\]

By the triangle inequality, it is sufficient to show that $\sup_h |R_{1,n}(h)|$ and $\sup_h |R_{2,n}(h)|$ are both $o_P(n^{-1/2})$.

To see that $\sup_h |R_{1,n}(h)| = o_P(n^{-1/2})$, we note that by Assumption B1 and Theorem 2.10.6 of [37], the function class over which the supremum is taken is $P_0$-Donsker, and therefore $\sup_h |n^{-1}\sum_{i=1}^n h(A_i) - E_0[h(A)]| = O_P(n^{-1/2})$. In view of this fact and Assumption B2, it can be concluded that the estimator $\phi_{\hat P_n,\theta^*}$ of the efficient influence function is uniformly consistent over the class, that is,

\[
\sup_h \int \{\phi_{\hat P_n}(o;h) - \phi_{0,\theta^*}(o;h)\}^2\,dP_0(o) = o_P(1).
\]

It is shown in the proof of Lemma 19.26 in [38] that when this condition and Assumption B2 hold, $\sup_h |R_{1,n}(h)| = o_P(n^{-1/2})$.

It remains to show that $R_{2,n}(h)$ is asymptotically negligible. With some algebra, the second remainder term can be expressed as

\[
R_{2,n}(h) = R^{\mathrm{i}}_{2,n}(h) + R^{\mathrm{ii}}_{2,n}(h) + R^{\mathrm{iii}}_{2,n}(h) + R^{\mathrm{iv}}_{2,n}(h) + R^{\mathrm{v}}_{2,n}(h),
\]

where we define

\[
\begin{aligned}
R^{\mathrm{i}}_{2,n}(h) &\equiv \int \left\{\frac{E_0[g_0(a\mid W)]}{g_0(a\mid w)} - \frac{n^{-1}\sum_{i=1}^n g_n(a\mid W_i)}{g_n(a\mid w)}\right\}\{Q_0(w,a) - Q_n(w,a)\}\{h(a) - E_0[h(A)]\}\,dP_0(w,a), \\
R^{\mathrm{ii}}_{2,n}(h) &\equiv n^{-1}\sum_{i=1}^n\left[n^{-1}\sum_{j=1}^n Q_n(W_i,A_j)\{h(A_j) - E_0[h(A)]\} - \int Q_n(W_i,a)\{h(a) - E_0[h(A)]\}\,dP_0(a)\right] \\
&\quad - \int\left[n^{-1}\sum_{j=1}^n Q_n(w,A_j)\{h(A_j) - E_0[h(A)]\} - \int Q_n(w,a)\{h(a) - E_0[h(A)]\}\,dP_0(a)\right]dP_0(w), \\
R^{\mathrm{iii}}_{2,n}(h) &\equiv \left\{n^{-1}\sum_{i=1}^n h(A_i) - E_0[h(A)]\right\}\int \frac{n^{-1}\sum_{i=1}^n g_n(a\mid W_i)}{g_n(a\mid w)}\{Q_0(w,a) - Q_n(w,a)\}\,dP_0(w,a), \\
R^{\mathrm{iv}}_{2,n}(h) &\equiv \left\{E_0[h(A)] - n^{-1}\sum_{i=1}^n h(A_i)\right\}\left[n^{-1}\sum_{i=1}^n\left\{n^{-1}\sum_{j=1}^n Q_n(W_i,A_j) - \int Q_n(W_i,a)\,dP_0(a)\right\}\right. \\
&\qquad \left. - \int\left\{n^{-1}\sum_{j=1}^n Q_n(w,A_j) - \int Q_n(w,a)\,dP_0(a)\right\}dP_0(w)\right], \\
R^{\mathrm{v}}_{2,n}(h) &\equiv \left\{n^{-1}\sum_{i=1}^n h(A_i) - E_0[h(A)]\right\}\left\{n^{-1}\sum_{i=1}^n \theta^*(A_i) - E_0[\theta^*(A)]\right\}.
\end{aligned}
\]

By the triangle inequality, it suffices to argue that each of the above components converges to zero in probability at a rate faster than $n^{-1/2}$, uniformly in $h$.

It follows directly from Assumption B1 that $\sup_h |R^{\mathrm{i}}_{2,n}(h)| = o_P(n^{-1/2})$. We now argue that $\sup_h |R^{\mathrm{ii}}_{2,n}(h)| = o_P(n^{-1/2})$. Observe that

\[
\begin{aligned}
&\sup_h \int \left[n^{-1}\sum_{i=1}^n Q_n(w,A_i)h(A_i) - \int Q_n(w,a)h(a)\,dP_0(a)\right]^2 dP_0(w) \\
&\quad\le 2\sup_h \int \left[n^{-1}\sum_{i=1}^n \{Q_n(w,A_i)-Q_0(w,A_i)\}h(A_i) - \int \{Q_n(w,a)-Q_0(w,a)\}h(a)\,dP_0(a)\right]^2 dP_0(w) \\
&\qquad + 2\sup_h \int \left[n^{-1}\sum_{i=1}^n Q_0(w,A_i)h(A_i) - \int Q_0(w,a)h(a)\,dP_0(a)\right]^2 dP_0(w) = o_P(1),
\end{aligned}
\]

where the convergence follows from the rate conditions and the Donsker assumption. It can thus be concluded, using the argument in the proof of Lemma 19.26 of [38], that $\sup_h |R^{\mathrm{ii}}_{2,n}(h)| = o_P(n^{-1/2})$. That $\sup_h |R^{\mathrm{iii}}_{2,n}(h)| = o_P(n^{-1/2})$ and $\sup_h |R^{\mathrm{iv}}_{2,n}(h)| = o_P(n^{-1/2})$ follows by recalling that $\sup_h |n^{-1}\sum_{i=1}^n h(A_i) - E_0[h(A)]| = O_P(n^{-1/2})$ and noting that $Q_n$ is a consistent estimator of $Q_0$. Similarly, $\sup_h |R^{\mathrm{v}}_{2,n}(h)| = o_P(n^{-1/2})$ follows from the uniform consistency of $n^{-1}\sum_{i=1}^n h(A_i)$. This completes the argument that the one-step estimator is uniformly asymptotically linear.□

Proof of Theorem 2

The proof of Theorem 2 is nearly the same as the proof of Theorem 1, so we only provide a brief outline of the argument. The TML estimator can be expressed as

\[
\psi^{\mathrm{II}}_{n,\theta^*}(h) - \psi_{0,\theta^*}(h) = n^{-1}\sum_{i=1}^n \phi_{P_0}(O_i;h) + R_{1,n}(h) + R_{2,n}(h) + R_{3,n}(h),
\]

where we define

\[
\begin{aligned}
R_{1,n}(h) &\equiv n^{-1}\sum_{i=1}^n\{\phi_{\tilde P,\theta^*}(O_i;h) - \phi_{P_0,\theta^*}(O_i;h)\} - \int \{\phi_{\tilde P}(o;h) - \phi_{P_0}(o;h)\}\,dP_0(o), \\
R_{2,n}(h) &\equiv \psi_{n,\theta^*}(h) - \psi_{0,\theta^*}(h) + \int \phi_{\tilde P,\theta^*}(o;h)\,dP_0(o), \\
R_{3,n}(h) &\equiv -\,n^{-1}\sum_{i=1}^n \phi_{\tilde P_n}(O_i;h).
\end{aligned}
\]

Uniform asymptotic negligibility of $R_{1,n}(h)$ and $R_{2,n}(h)$ follows from the same arguments we presented to establish uniform asymptotic linearity of the one-step estimator. Additionally, we have by construction that $\sup_h |R_{3,n}(h)| = o_P(n^{-1/2})$. From this, we can conclude that the TML estimator is uniformly asymptotically linear.□

References

[1] Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol. 1974;66(5):688. 10.1037/h0037350.

[2] Robins JM, Rotnitzky A. Semiparametric efficiency in multivariate regression models with missing data. J Am Stat Assoc. 1995;90(429):122–9. 10.1080/01621459.1995.10476494.

[3] van der Laan MJ, Rubin D. Targeted maximum likelihood learning. Int J Biostat. 2006;2(1). 10.2202/1557-4679.1043.

[4] Díaz I, van der Laan MJ. Targeted data adaptive estimation of the causal dose–response curve. J Causal Inference. 2013;1(2):171–92. 10.1515/jci-2012-0005.

[5] Kennedy EH, Ma Z, McHugh MD, Small DS. Non-parametric methods for doubly robust estimation of continuous treatment effects. J R Stat Soc Ser B (Stat Methodol). 2017;79(4):1229–45. 10.1111/rssb.12212.

[6] Colangelo K, Lee Y-Y. Double debiased machine learning nonparametric inference with continuous treatments. 2020. arXiv:2004.03036.

[7] Westling T, Gilbert P, Carone M. Causal isotonic regression. J R Stat Soc Ser B (Stat Methodol). 2020;82(3):719–47. 10.1111/rssb.12372.

[8] Hirano K, Imbens GW. The propensity score with continuous treatments. In: Applied Bayesian modeling and causal inference from incomplete-data perspectives. 2004. p. 73–84. 10.1002/0470090456.ch7.

[9] Imai K, van Dyk DA. Causal inference with general treatment regimes: generalizing the propensity score. J Am Stat Assoc. 2004;99(467):854–66. 10.1198/016214504000001187.

[10] Galvao AF, Wang L. Uniformly semiparametric efficient estimation of treatment effects with a continuous treatment. J Am Stat Assoc. 2015;110(512):1528–42. 10.1080/01621459.2014.978005.

[11] Robins JM, Hernán MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000;11(5):550–60. 10.1097/00001648-200009000-00011.

[12] Zhang Z, Zhou J, Cao W, Zhang J. Causal inference with a quantitative exposure. Stat Methods Med Res. 2016;25(1):315–35. 10.1177/0962280212452333.

[13] Doss CR, Weng G, Wang L, Moscovice I, Chantarat T. A nonparametric doubly robust test for a continuous treatment effect. Ann Statist. 2024;52(4):1592–615. 10.1214/24-AOS2405.

[14] Takatsu K, Westling T. Debiased inference for a covariate-adjusted regression function. J R Stat Soc Ser B (Stat Methodol). 2024;qkae041. 10.1093/jrsssb/qkae041.

[15] Neugebauer R, van der Laan M. Nonparametric causal effects based on marginal structural models. J Stat Plan Inference. 2007;137(2):419–34. 10.1016/j.jspi.2005.12.008.

[16] Muñoz ID, van der Laan M. Population intervention causal effects based on stochastic interventions. Biometrics. 2012;68(2):541–9. 10.1111/j.1541-0420.2011.01685.x.

[17] Hines O, Diaz-Ordaz K, Vansteelandt S. Parameterising the effect of a continuous exposure using average derivative effects. 2021. arXiv:2109.13124.

[18] Westling T. Nonparametric tests of the causal null with nondiscrete exposures. J Am Stat Assoc. 2021;117:1–12. 10.1080/01621459.2020.1865168.

[19] Hudson A, Carone M, Shojaie A. Inference on function-valued parameters using a restricted score test. 2021. arXiv:2105.06646.

[20] Geng EH, Odeny TA, Montoya LM, Iguna S, Kulzer JL, Adhiambo HF, et al. Adaptive strategies for retention in care among persons living with HIV. NEJM Evid. 2023;2(4):EVIDoa2200076. 10.1056/EVIDoa2200076.

[21] Bickel PJ, Klaassen CAJ, Ritov Y, Wellner JA. Efficient and adaptive estimation for semiparametric models. New York, NY: Springer; 1998.

[22] Pfanzagl J. Contributions to a general asymptotic statistical theory. New York, NY: Springer; 1982. 10.1007/978-1-4612-5769-1.

[23] van der Laan MJ, Rose S. Targeted learning: causal inference for observational and experimental data. New York, NY: Springer; 2011. 10.1007/978-1-4419-9782-1.

[24] van der Laan MJ, Rose S. Targeted learning in data science. Cham: Springer International Publishing; 2018. 10.1007/978-3-319-65304-4.

[25] Benkeser D, van der Laan M. The highly adaptive lasso estimator. In: 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA). IEEE; 2016. p. 689–96. 10.1109/DSAA.2016.93.

[26] van der Laan M, Gruber S. One-step targeted minimum loss-based estimation based on universal least favorable one-dimensional submodels. Int J Biostat. 2016;12(1):351–78. 10.1515/ijb-2015-0054.

[27] Hable R. Asymptotic normality of support vector machine variants and other regularized kernel methods. J Multivariate Anal. 2012;106:92–117. 10.1016/j.jmva.2011.11.004.

[28] Wahba G. Spline models for observational data. Philadelphia, PA: SIAM; 1990. 10.1137/1.9781611970128.

[29] Hastie T, Tibshirani R, Friedman JH. The elements of statistical learning: data mining, inference, and prediction. 2nd ed. New York, NY: Springer; 2009. 10.1007/978-0-387-84858-7.

[30] Barron AR. Statistical properties of artificial neural networks. In: Proceedings of the 28th IEEE Conference on Decision and Control. IEEE; 1989. p. 280–5. 10.1109/CDC.1989.70117.

[31] van der Laan MJ, Polley EC, Hubbard AE. Super learner. Stat Appl Genet Mol Biol. 2007;6(1). 10.2202/1544-6115.1309.

[32] Fu A, Narasimhan B, Boyd S. CVXR: an R package for disciplined convex optimization. J Stat Softw. 2020;94(14):1–34. 10.18637/jss.v094.i14.

[33] Friedman JH. Stochastic gradient boosting. Comput Stat Data Anal. 2002;38(4):367–78. 10.1016/S0167-9473(01)00065-2.

[34] Zheng W, van der Laan MJ. Cross-validated targeted minimum-loss-based estimation. In: Targeted learning. New York, NY: Springer; 2011. p. 459–74. 10.1007/978-1-4419-9782-1_27.

[35] Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey W, et al. Double/debiased machine learning for treatment and structural parameters. Econom J. 2018;21(1):C1–C68. 10.1111/ectj.12097.

[36] Kosorok MR. Introduction to empirical processes and semiparametric inference. New York, NY: Springer; 2008. 10.1007/978-0-387-74978-5.

[37] van der Vaart AW, Wellner JA. Weak convergence and empirical processes. New York, NY: Springer; 1996. 10.1007/978-1-4757-2545-2.

[38] van der Vaart AW. Asymptotic statistics. Cambridge: Cambridge University Press; 2000.

Received: 2024-01-08
Revised: 2024-05-29
Accepted: 2024-07-01
Published Online: 2024-12-21

© 2024 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
