
Efficient estimation of pathwise differentiable target parameters with the undersmoothed highly adaptive lasso

  • Mark J. van der Laan, David Benkeser and Weixin Cai
Published/Copyright: July 15, 2022

Abstract

We consider estimation of a functional parameter of a realistically modeled data distribution based on observing independent and identically distributed observations. The highly adaptive lasso (HAL) estimator of the functional parameter is defined as the minimizer of the empirical risk over a class of cadlag functions with finite sectional variation norm, where the functional parameter is parametrized in terms of such a class of functions. In this article we establish that this HAL estimator yields an asymptotically efficient estimator of any smooth feature of the functional parameter under a global undersmoothing condition. It is formally shown that the L_1-restriction in HAL does not obstruct it from solving the score equations along paths that do not enforce this restriction. Therefore, from an asymptotic point of view, the only reason for undersmoothing is that the true target function might not be complex, so that the HAL-fit leaves out key basis functions that are needed to span the desired efficient influence curve of the smooth target parameter. Nonetheless, in practice undersmoothing appears to be beneficial, and a simple targeted method is proposed and practically verified to perform well. We demonstrate our general result for the HAL-estimator of a treatment-specific mean and of the integrated square density. We also present simulations for these two examples confirming the theory.

1 Introduction

We consider the estimation problem in which we observe n independent and identically distributed copies of a random variable with probability distribution known to be an element of an infinite-dimensional statistical model, while the goal is to estimate a particular smooth functional of the data distribution. It is assumed that the target parameter is a pathwise differentiable functional of the data distribution so that its derivative is characterized by its canonical gradient.

A regular asymptotically linear estimator is asymptotically efficient if and only if it is asymptotically linear with influence curve the canonical gradient [1], and a number of general methods for efficient estimation have been developed in the literature. If the model is not too large, then a regularized or sieve maximum likelihood estimator or minimum loss estimator (MLE) generally results in an efficient substitution estimator [2–4]. For a general theory on sieve estimation that also demonstrates sieve-based maximum likelihood estimators that are asymptotically efficient in large models, we refer to [5, 6]. These results generally require a sieve-based MLE that overfits the data (or, equivalently, undersmooths the estimated functional parameter) and are only applicable for certain types of sieves [7–10].

An alternative to undersmoothing is to use a targeted estimator based on the canonical gradient, such as: the one-step estimator, which adds to an initial plug-in estimator the empirical mean of the canonical gradient at the estimated data distribution [1]; an estimating equations-based estimator, which defines the estimator of the target parameter as the solution of an estimating equation with the estimated canonical gradient as estimating function [11, 12]; and targeted minimum loss-estimation, which updates an initial estimator of the data distribution with an MLE of a least favorable parametric submodel through the initial estimator [13–16]. By using an initial estimator of the relevant parts of the data distribution that converges with respect to an L_2-type norm to the truth at a rate faster than n^{-1/4}, such as achieved with the HAL-estimator [17, 18], these three procedures will generally result in an efficient estimator.

In this article we focus on a HAL-MLE, a particular sieve MLE described in [17, 18]. The HAL-MLE is defined as the minimizer of an empirical mean of the loss function (e.g., log-likelihood loss) over a particular class of functions. As such, these estimators could also be referred to as empirical risk minimizers. The particular class over which the HAL-MLE minimizes the risk consists of functions that can be arbitrarily well approximated by linear combinations of tensor products of univariate zero-order spline basis functions, but where the L_1-norm of the coefficient vector is constrained. The L_1-norm of the coefficients equals the sectional variation norm of the function [18, 19], so that the HAL-MLE corresponds with minimizing the empirical risk over all cadlag functions with a bound on their sectional variation norm.

The class of k_1-dimensional real-valued cadlag functions with finite sectional variation norm differs from typical smoothness classes that assume pointwise derivatives (e.g., Hölder classes) by assuming a global rather than local constraint. This finite sectional variation norm constraint allows for functions that are discontinuous, but puts a bound on the total variation of the measures generated by the sections of the function that set some of the coordinates equal to the left-origin of its support (a cube). In spite of the constraint, the class of functions with finite sectional variation norm is reasonably large, including for example any function whose first-order cross derivatives are uniformly bounded. In spite of its size, this class turns out to be a uniform Donsker class with a well-behaved entropy integral. In turn, this Donsker property affords appealing properties of the estimator, such as an n^{-1/3}(log n)^{k_1/2}-rate of convergence in loss-based dissimilarity (i.e., L_2-norm), as well as control over certain key empirical process conditions that are useful for proving asymptotic efficiency.

The target parameter is defined as a particular smooth real- or Euclidean-valued function of the functional parameter estimated by HAL-MLE, so that the HAL-MLE results in a plug-in estimator of the target parameter. In this case the sieve is indexed by a bound on the L 1-norm. By increasing this bound up to a large, finite value, the sieve includes the total parameter space for the true functional parameter. If the goal is to estimate the functional itself, then the constraint on the L 1-norm is optimally chosen with cross-validation.

In this article we investigate whether and how an appropriately undersmoothed HAL-MLE can be used to produce an efficient plug-in estimator of smooth functions of the functional parameter. There are essentially three key ingredients to establishing efficiency of a plug-in estimator:

  1. negligibility of the empirical mean of the canonical gradient;

  2. control of the second-order remainder; and

  3. asymptotic equicontinuity.

For (i), we argue that since the canonical gradient is a score, we essentially require that the HAL-MLE solves a particular score equation. Because the HAL-MLE is an MLE, it solves a large class of score equations, and we investigate whether these score equations might also approximate the particular score equation implied by the canonical gradient of the smooth target feature. We find that the larger the L_1-norm of the HAL-MLE, the more such score equations are solved by the HAL-MLE. We also find that the HAL-MLE solves the score equations of paths that ignore the L_1-norm constraint at rate O_P(n^{-2/3}), thereby better than the desired o_P(n^{-1/2}). Nonetheless, one might need to select a larger L_1-norm than the cross-validation selector to make sure that the basis functions selected by HAL generate enough scores to approximate the desired canonical gradient at the desired precision for the given sample. Either way, by increasing the L_1-norm of the HAL-MLE, the linear span of equations solved by the HAL-MLE will approximate any canonical gradient score equation at the desired precision.

However, in order to satisfy (ii), we must preserve the n^{-1/4}-rate of convergence achieved by the HAL-MLE, which is naturally achieved when the L_1-norm is selected with cross-validation. Fortunately, the rate of the HAL-MLE is not affected by the size of the L_1-norm as long as it remains bounded and, for n large enough, exceeds the sectional variation norm of the true function. Similarly, the asymptotic equicontinuity condition (iii) will also be satisfied for any bounded L_1-norm, since the class of cadlag functions with a finite sectional variation norm is a Donsker class. In fact, one can prove that this L_1-norm is allowed to converge slowly to infinity as sample size increases without affecting the asymptotic equicontinuity condition and the n^{-1/4}-rate of convergence of the HAL-MLE.

Taken together, our analysis highlights that when selecting the level of undersmoothing of a HAL-MLE, one wants to undersmooth enough to solve the efficient score equation up to an appropriate level of approximation, but in order to retain reasonable finite-sample performance one should not undersmooth beyond that level. This discussion highlights the need to establish an empirical criterion by which the level of undersmoothing may be chosen to appropriately satisfy the conditions required of an efficient plug-in estimator. For that purpose we propose to simply increase the L_1-norm until the empirical mean of the canonical gradient is solved at the desired level.

This article is organized as follows. In Section 2 we define the HAL-MLE. In Section 3 we establish our main theorem providing the undersmoothing conditions under which the HAL-MLE is asymptotically efficient for any pathwise differentiable parameter. In Section 4 we apply our theorem to the treatment-specific mean example, providing a theorem for this particular nonparametric estimation problem. In Section 5 we apply our theorem to a nonparametric estimation problem with the integrated square of the data density as target parameter. In Section 6 we present a simulation study for both examples, providing a practical verification of our theoretical results. We conclude with a discussion in Section 7. Some of the proofs are presented in Appendices A and B.

2 Defining the functional estimation problem and HAL-MLE

2.1 Functional estimation problem

Suppose we observe O_1, …, O_n i.i.d. ~ P_0 ∈ M, where O is a Euclidean random variable of dimension k_1 with support contained in [0, τ_o] ⊂ IR^{k_1}. Let Q: M → Q(M) = {Q(P): P ∈ M} be a functional parameter. It is assumed that there exists a loss function L(Q) so that P_0 L(Q(P_0)) = min_{P∈M} P_0 L(Q(P)), where we use the notation Pf ≡ ∫ f(o) dP(o). Thus, Q(P_0) can be defined as the minimizer of the risk function Q ↦ P_0 L(Q) over all Q in the parameter space. Let d_0(Q, Q_0) ≡ P_0 L(Q) − P_0 L(Q_0) be the loss-based dissimilarity. We assume that M_{20} ≡ sup_{P∈M} P_0{L(Q(P)) − L(Q_0)}^2 / d_0(Q(P), Q_0) < ∞, and M_1 ≡ sup_{o, P∈M} |L(Q(P))(o)| < ∞, thereby guaranteeing good behavior of the cross-validation selector [20–24].

Parameter space for functional parameter Q: Cadlag and uniform bound on sectional variation norm. We assume that the parameter space Q(M) is a collection of multivariate real-valued cadlag functions on a cube [0, τ] ⊂ IR^k with finite sectional variation norm ‖Q(P)‖_v^* < C^u for some C^u < ∞. That is, for all P, Q(P) is a k-variate real-valued cadlag function on [0, τ] ⊂ IR_{≥0}^k with ‖Q(P)‖_v^* < C^u, where the sectional variation norm is defined by

‖Q‖_v^* ≡ |Q(0)| + ∑_{s⊂{1,…,k}} ∫_{(0_s, τ_s]} |dQ_s(u_s)|.

For a given subset s ⊂ {1, …, k}, Q_s: (0_s, τ_s] → IR is defined by Q_s(x_s) = Q(x_s, 0_{s^c}). That is, Q_s is the s-specific section of Q, which sets the coordinates in the complement of the subset s ⊂ {1, …, k} equal to 0. Since Q_s is right-continuous with left-hand limits and has a finite variation norm over (0_s, τ_s], it generates a finite measure, so that the integrals with respect to Q_s are indeed well defined. For a given vector x ∈ [0, τ], we define x_s = (x(j): j ∈ s). Sometimes, we will also use the notation x(s) for x_s.

Note also that [0, τ] = {0} ∪ (∪_s (0_s, τ_s] × {0_{s^c}}) is partitioned into the singleton {0}, the s-specific left-edges (0_s, τ_s] × {0_{s^c}} of the cube [0, τ], and, in particular, the full-dimensional inner set (0, τ] (corresponding with s = {1, …, k}). Therefore, the above sectional variation norm equals the sum over all subsets s of the variation norm of the s-specific section over its s-specific edge. An important result is that any cadlag function Q with finite sectional variation norm can be represented as

Q(x) = Q(0) + ∑_{s⊂{1,…,k}} ∫_{(0_s, x_s]} dQ_s(u_s).

That is, Q(x) is a sum of integrals up to x_s over all the s-specific edges with respect to the measure generated by the corresponding s-specific section Q_s. We will refer to Q_s as a cadlag function as well as a measure. We note that this representation represents Q as an infinitesimal linear combination of indicator basis functions x ↦ φ_{s,u_s}(x) ≡ I(x_s ≥ u_s), indexed by knot-point u_s, with coefficient dQ_s(u_s):

Q(x) = Q(0) + ∑_{s⊂{1,…,k}} ∫_{(0_s, τ_s]} φ_{s,u_s}(x) dQ_s(u_s).

Note that these basis functions are tensor products over the coordinates j ∈ s of univariate indicator basis functions I(x(j) ≥ u(j)), which are also known as zero-order splines. Note that the L_1-norm of the coefficients in this representation is precisely the sectional variation norm ‖Q‖_v^*.
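As an illustration of this representation, the following minimal sketch (not from the article; the helper name hal_design is hypothetical, numpy only) builds the tensor-product zero-order spline basis with knot-points placed at the observed data points and evaluates the corresponding design matrix.

import itertools
import numpy as np

def hal_design(X, knots):
    """Columns are phi_{s,u}(x) = prod_{j in s} I(x(j) >= u(j)) for every
    nonempty subset s of the coordinates and every knot-point u (rows of knots)."""
    n, k = X.shape
    cols = []
    for r in range(1, k + 1):
        for s in itertools.combinations(range(k), r):
            idx = list(s)
            block = np.all(X[:, idx][:, None, :] >= knots[:, idx][None, :, :], axis=2)
            cols.append(block)
    return np.concatenate(cols, axis=1).astype(float)

# Example: with knots at the data points the number of columns is of order
# n * (2^k - 1); the L1-norm of fitted coefficients (plus |Q(0)|) recovers the
# sectional variation norm of the fitted cadlag function.
X = np.random.uniform(size=(50, 2))
Phi = hal_design(X, knots=X)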

2.2 Definition of HAL-MLE

Let Q(C^u) = {Q ∈ D[0, τ]: ‖Q‖_v^* < C^u} be the class of cadlag functions with sectional variation norm bounded by C^u. Let C_0 ≡ ‖Q_0‖_v^* be the sectional variation norm of Q_0, and let C^u be an upper bound guaranteeing that C_0 < C^u. For a constant C < C^u, consider the class Q(C) ≡ {Q ∈ D[0, τ]: ‖Q‖_v^* < C} ⊂ Q(C^u). For a data adaptive selector C_n, we define

(1) Q_n ≡ arg min_{Q ∈ Q(C_n)} P_n L(Q)

to be the HAL-MLE. We will restrict the minimization to Q for which, for all subsets s, dQ_s(u) is a discrete measure with a finite support {z_{s,j}: j = 1, …, n_s}. That is, for each s, dQ_s is absolutely continuous with respect to a discrete counting measure μ_{n,s}. We will denote this form of absolute continuity with Q ≪* μ_n. In that case, the HAL-MLE is supported by J(μ_n) = {z_{s,j}: s ⊂ {1, …, d}, j = 1, …, n_s}. Thus, the HAL-MLE then becomes

Q_n ≡ arg min_{Q ∈ Q(C_n), Q ≪* μ_n} P_n L(Q).

In this case the HAL-MLE can be represented as Q_n = ∑_{j∈J(μ_n)} β_n(j) φ_j, where

β_n ≡ arg min_{β: ‖β‖_1 ≤ C_n} P_n L(∑_{j∈J(μ_n)} β(j) φ_j),

and φ_j corresponds with one of the indicator basis functions I(x_s ≥ u_{s,j_1}), indexed by a subset s ⊂ {1, …, d} and knot-point u_{s,j_1}, for some s and j_1. Note that Q_n = Q̂(P_n) is the realization of a mapping from the empirical probability measure to the parameter space.

As noted earlier, the data adaptive selector C_n might be selected larger than or equal to the cross-validation selector C_{n,cv} = arg min_C E_{B_n} P_{n,B_n}^1 L(Q̂_C(P_{n,B_n}^0)), where B_n ∈ {0,1}^n represents a random sample split (e.g., V-fold cross-validation) into a training sample {i: B_n(i) = 0} and validation sample {i: B_n(i) = 1}, while P_{n,B_n}^0 and P_{n,B_n}^1 are the corresponding empirical probability measures. One wants that C_n ≥ C_0 for n large enough, so that Q_0 ∈ Q(C_n).

Typically, one is able to prove that the unrestricted MLE (1) will be discrete on a support, in which case our μ_n-discretization does not restrict the definition of the HAL-MLE. Generally, if O includes observing X, where L(Q)(O) depends on Q through Q(X), we recommend selecting the support of dQ_s as a subset (or the whole set) of the observed values X_i(s), i = 1, …, n.
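To make definition (1) concrete, here is a minimal sketch (not from the article) of the HAL-MLE as an L_1-restricted empirical risk minimizer over the basis above, using scikit-learn's Lasso in its penalized (Lagrangian) form: a smaller penalty alpha corresponds with a larger bound C_n. The helper hal_design is the hypothetical function from the sketch in Section 2.1, and the squared-error loss is an assumption for illustration.

import numpy as np
from sklearn.linear_model import Lasso

def fit_hal(X, Y, alpha):
    Phi = hal_design(X, knots=X)                  # zero-order spline basis at the data
    fit = Lasso(alpha=alpha, max_iter=100000).fit(Phi, Y)
    # |intercept| + sum |coef| approximates the sectional variation norm of Q_n
    sv_norm = abs(fit.intercept_) + np.abs(fit.coef_).sum()
    return fit, sv_norm

# alpha can first be chosen by cross-validation (the analogue of C_{n,cv}) and
# then decreased to undersmooth, as discussed in Section 3.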

3 Efficiency of the HAL MLE for pathwise differentiable target parameters

3.1 Defining the efficient estimation problem and plug-in HAL-MLE

Let Ψ: M → IR^d be the d-dimensional statistical target parameter of interest of the data distribution. We assume that Ψ is pathwise differentiable at any P ∈ M with canonical gradient D*(P). That is, for a class of paths {P_ϵ^h: ϵ ∈ (−δ, δ)} through P at ϵ = 0 with score h ∈ L_0^2(P), the pathwise derivative (d/dϵ) Ψ(P_ϵ^h)|_{ϵ=0} is a bounded linear operator on the tangent space T_P ⊂ L_0^2(P) spanned by all the scores h. As a consequence, the pathwise derivative can be represented as an inner product P D*(P)h for an element D*(P) in the tangent space T_P, which is called the canonical gradient. From efficiency theory we know that an estimator ψ_n is asymptotically efficient among the class of all regular estimators if and only if ψ_n − Ψ(P_0) = P_n D*(P_0) + o_P(n^{-1/2}). For a pair P, P_0 ∈ M, we define the exact second order remainder by

R_2(P, P_0) ≡ Ψ(P) − Ψ(P_0) + P_0 D*(P).

Relevant functional parameter and its loss function: Let Q: M → Q(M) = {Q(P): P ∈ M} be a functional parameter such that Ψ(P) = Ψ_1(Q(P)) for some Ψ_1. It is assumed that Q is a functional parameter with parameter space Q(M) ⊂ Q(C^u) = D_{C^u}[0, τ] as defined above in Section 2, so that the model M does not make any smoothness assumptions on Q beyond that it is a cadlag function with sectional variation norm bounded by C^u. In particular, the HAL-MLE has a rate of convergence d_0^{1/2}(Q_n, Q_0) = O_P(n^{-1/3}(log n)^{d/2}) [25].

Nuisance parameter for canonical gradient: Let G : M G be a functional nuisance parameter so that D*(P) only depends on P through (Q(P), G(P)), and the remainder R 2(P, P 0) only involves differences between (Q, G) and (Q 0, G 0):

D * ( P ) = D * ( Q ( P ) , G ( P ) ) , while  R 2 ( P , P 0 ) = R 20 ( ( Q , G ) , ( Q 0 , G 0 ) ) .

Here R 20 could have some remaining dependence on P 0 and P, and G = G ( M ) is the parameter space for G.

Canonical gradient of target parameter in tangent space of loss function: We also assume that this loss function L(Q) is such that there exists a class of submodels {Q_ϵ^h: ϵ} ⊂ Q(M), indexed by a choice h ∈ H_1, through Q at ϵ = 0, so that for any G ∈ G, one of these directions h generates a score that equals the canonical gradient D*(Q, G) at (Q, G):

(d/dϵ) L(Q_ϵ^h)|_{ϵ=0} = D*(Q, G).

Since the canonical gradient is an element of the tangent space and thereby typically a score of a submodel, this generally holds for Q defined as the density of P and the log-likelihood loss L(Q) = − log  Q. However, for any Q there are typically more direct loss functions L(Q), so that the loss-based dissimilarity d 0(Q, Q 0) = P 0 L(Q) − P 0 L(Q 0) directly measures a dissimilarity between Q and Q 0, for which this condition holds as well.

Plug-in HAL-MLE: In this section, we are concerned with analyzing the plug-in estimator Ψ(Q_n) of Ψ(Q_0), where Q_n is the C_n-tuned HAL-MLE Q_n = Q̂_{C_n}(P_n), which minimizes the empirical risk over Q(C_n). We assume that Q is defined such that Q_n is in the interior of the model-based parameter space Q (so that there are submodels through Q_n that generate the tangent space and the canonical gradient), even though Q_n is typically on the edge of the parameter subspace Q(C_n) ⊂ Q = Q(M) over which the estimator minimizes the empirical risk. It is understood that verification of our conditions might require using a C_n different from the cross-validation selector.

Remark: Target parameter could be component of real target parameter. In many situations the real target parameter is a P → Ψ(Q 1(P), Q 2(P)) for two (or more) functional parameters Q 1 and Q 2. One could apply our efficiency theorem below to the target parameter Ψ Q 10 ( Q 2 ) = Ψ ( Q 10 , Q 2 ) and Ψ Q 20 ( Q 1 ) = Ψ ( Q 1 , Q 20 ) treating the indices Q 10 and Q 20 as known, and HAL-MLEs Q 1n and Q 2n of Q 10 and Q 20, respectively. Application of our theorem to these two cases then proves that Ψ(Q 10, Q 2n ) and Ψ(Q 1n , Q 20) are both asymptotically efficient, if both HAL-MLEs are appropriately tuned with respect to sectional variation norm bound. Since

Ψ ( Q 1 n , Q 2 n ) Ψ ( Q 10 , Q 20 ) = Ψ ( Q 1 n , Q 2 n ) Ψ ( Q 10 , Q 2 n ) + Ψ ( Q 10 , Q 2 n ) Ψ ( Q 10 , Q 20 ) ,

this then also establishes asymptotic efficiency of Ψ(Q 1n , Q 2n ) as estimator of Ψ(Q 10, Q 20), under the condition that

Ψ ( Q 1 n , Q 2 n ) Ψ ( Q 10 , Q 2 n ) ( Ψ ( Q 1 n , Q 20 ) Ψ ( Q 10 , Q 20 ) ) = o P ( n 1 / 2 ) .

This latter term can be viewed as a second order difference of (Q 1n , Q 2n ) and (Q 10, Q 20) so that the latter condition will generally hold by using the already established rates of convergence O P ( n 2 / 3 ( log n ) k 1 ) with respect to risk based dissimilarity for Q 1n and Q 2n . The above immediately generalizes to the case that the target parameter is a function of more than two Q-components.

3.2 The HAL MLE solves the unconstrained score-approximation of the efficient influence curve equation by including sparse basis functions

Let Q_n = arg min_{Q∈Q(C_n)} P_n L(Q) be the HAL-MLE. Theorem 2 establishes that Ψ(Q_n) is asymptotically efficient for Ψ(Q_0) for large enough C_n, under some weak conditions specific to the target parameter. The key property is P_n D*(Q_n, G_0) = o_P(n^{-1/2}). This is addressed in the two steps we describe now.

Main idea of result: The HAL-MLE minimizes the empirical risk over a class of functions, so that it solves a class of score equations P_n S_h(Q_n) = 0 corresponding with paths {Q_{n,ϵ}^h: ϵ} ⊂ Q(C_n) through the HAL-MLE Q_n that keep the L_1-norm constant, which happens to be arranged by a simple linear real-valued constraint r(h, Q_n) = 0. The directions h will be vectors with one component h(j) for each coefficient β_n(j) in the representation Q_n = ∑_j β_n(j) φ_j as a linear combination of spline basis functions, while the paths through β_n are of the form (1 + ϵh(j))β_n(j) with r(h, β_n) = 0, which implies the path through Q_n. The canonical gradient D*(Q_n, G_0) can be well approximated by the class of all scores {S_h(Q_n): h} that ignore the L_1-norm constraint. We will refer to this best approximation within the linear span of such scores as D_n*(Q_n, G_0) = S_{h*(Q_n, G_0)}(Q_n). Indeed, the first key condition is that P_0{D_n*(Q_n, G_0) − D*(Q_n, G_0)} = o_P(n^{-1/2}), or, equivalently, that the second order difference P_0{D_n*(Q_n, G_0) − D*(Q_n, G_0)} − P_0{D_n*(Q_0, G_0) − D*(Q_0, G_0)} = o_P(n^{-1/2}), which behaves as a product of d_0^{1/2}(Q_n, Q_0) = O_P(n^{-1/3}(log n)^{k_1/2}) times the L_2-norm of the difference {D*(Q_n, G_0) − D*(Q_0, G_0)}/d_0^{1/2}(Q_n, Q_0) (normalized so as not to converge to zero) minus its projection onto the linear span of scores {S_h(Q_n): h}. This itself corresponds with an undersmoothing condition, since the more one undersmooths the better the approximation by the linear span of scores will be. This latter condition will be captured by Theorem 2 in the next subsection.

It then remains to approximate the latter D_n*(Q_n, G_0) with the scores {S_h(Q_n): h, r(h, Q_n) = 0} of the paths that enforce the L_1-norm constraint. This requires approximating h*(Q_n, G_0) by a choice h with r(h, Q_n) = 0. For that purpose, we select h equal to h*(Q_n, G_0) in all of its components except one, j*. The key is then to select that j* so that it minimizes the difference P_n{S_{h*(Q_n, G_0)}(Q_n) − S_h(Q_n)} over all choices h that are equal to h*(Q_n, G_0) except at j* (a difference which equals P_n S_{h*(Q_n, G_0)}(Q_n), since P_n S_h(Q_n) = 0). We then want this minimizer to be o_P(n^{-1/2}), so that it establishes the desired P_n S_{h*(Q_n, G_0)}(Q_n) = o_P(n^{-1/2}). This will correspond with having a basis function j* with non-zero coefficient β_n(j*) for which its score equation is small enough, which in turn is implied by P_n φ_{j*} being small enough. As a result, our condition will correspond with undersmoothing enough to include a sparsely supported basis function. This result is addressed by Theorem 1 below.

Both of these conditions needed for P_n D*(Q_n, G_0) = o_P(n^{-1/2}) might easily (asymptotically) hold without any undersmoothing, but either way both will be guaranteed by enough undersmoothing. In practice we find that undersmoothing is important.

The statement of Theorem 1 relies on the following definitions that also provide the basis of the proof of the theorem as outlined above.

Definitions:

  1. Recall that we can represent Q_n = arg min_{Q∈Q(C_n)} P_n L(Q) as follows:

    Q_n(x) = Q_n(0) + ∑_{s⊂{1,…,d}} ∫_{(0_s, x_s]} dQ_{n,s}(u_s).

    For notational convenience, we define the extended measure dQ_n(u) = ∑_{s⊂{1,…,d}} I_{E_s}(u) dQ_{n,s}(u) on the full cube [0, τ], not just (0, τ], where [0, τ] = ∪_s E_s, E_∅ = {0}, E_s = (0_s, τ_s] × {0_{s^c}} is the s-specific left-edge of [0, τ] for subsets s ⊂ {1, …, d}, and dQ_{n,s}(u) is the measure on E_s defined by the section Q_{n,s} of Q_n. Note that E_s is defined by having the coordinates in the complement of s equal to zero. In this manner, we can use the compact representation:

    Q_n(x) = ∫_{[0,τ]} φ_x(u) dQ_n(u),

    where we note that φ_x(u) ≡ I(x ≥ u) reduces to I(x_s ≥ u_s) when u is on the edge E_s of [0, τ].

  2. Consider the family of paths {Q_{n,ϵ}^h: ϵ ∈ (−δ, δ)} through Q_n at ϵ = 0, for arbitrarily small δ > 0, indexed by any uniformly bounded h ∈ D[0, τ], defined by

    (2) Q_{n,ϵ}^h(x) = ∫_{[0,τ]} φ_x(u)(1 + ϵ h(u)) dQ_n(u).

  3. Let

    r(h, Q_n) ≡ ∫_{[0,τ]} h(u) |dQ_n(u)|.

  4. For any uniformly bounded h with r(h, Q_n) = 0 we have that, for a small enough δ > 0, {Q_{n,ϵ}^h: ϵ ∈ (−δ, δ)} ⊂ Q(C_n).

  5. Let S_h(Q_n) = (d/dϵ) L(Q_{n,ϵ}^h)|_{ϵ=0} be the score of this h-specific submodel.

  6. Consider the set of scores

    (3) S(Q_n) = {S_h(Q_n) = (d/dQ_n) L(Q_n)(f(h, Q_n)): ‖h‖_∞ < ∞},

    where

    f(h, Q_n)(x) ≡ (d/dϵ) Q_{n,ϵ}^h(x)|_{ϵ=0} = ∫_{[0,τ]} φ_x(u) h(u) dQ_n(u).

    This is the set of scores generated by the above class of paths if we do not enforce the constraint r(h, Q_n) = 0.

  7. We have that Q n solves the score equations P n S h (Q n ) = 0 for any uniformly bounded h satisfying r(h, Q n ) = 0.

  8. Let D n * ( Q n , G 0 ) S ( Q n ) be an approximation of D*(Q n , G 0) that is contained in this set of scores S ( Q n ) .

  9. We also consider a special case in which D_n*(Q_n, G_0) = D*(Q_n, G_{0n}) for an approximation G_{0n} ∈ G of G_0. Let

    G_n = {G ∈ G: D*(Q_n, G) ∈ S(Q_n)}

    be the set of G's for which D*(Q_n, G) equals a score S_h(Q_n) for some uniformly bounded h. One can then define G_{0n} ∈ G_n as an approximation of G_0.

  10. Let h*(Q n , G 0) be the index so that D n * ( Q n , G 0 ) = S h * ( Q n , G 0 ) ( Q n ) .

Remark: Understanding G_n. It might seem that the class of paths {Q_{n,ϵ}^h: ϵ}, for any bounded h above, is rich enough to generate the full tangent space at Q_n, and thereby D*(Q_n, G_0), even for finite n. However, a special property of this class of paths is that it is contained in the linear span of the (order n) basis functions φ_j that have non-zero coefficients β_n(j) in Q_n. On the other hand, as n increases, and thereby the number of basis functions converges to infinity, this class of paths will indeed be able to approximate any function in the tangent space. Since the true G_0, or the relevant function of G_0, is generally not contained in this linear span of basis functions that make up Q_n, D*(Q_n, G_0) is not contained in the set S(Q_n) of scores. For example, in the treatment-specific mean example, we would need that 1/Ḡ_0(W) is approximated by this linear span of spline basis functions that are present in the fit Q_n. Therefore, there will indeed be G ∈ G whose shape is such that 1/G(W) is in the linear span, which can then be used to define a G_{0n} so that D*(Q_n, G_{0n}) ∈ S(Q_n). Alternatively, one directly approximates 1/Ḡ_0(W) with the linear span, without being concerned whether it results in a representation 1/Ḡ_{0n}, thereby determining an approximation D_n*(Q_n, G_0). Since in this example Ḡ_0 can be any function of W with values in (0, 1), both methods are equivalent: i.e., if 1/Ḡ_0 is approximated by ∑_j α_j φ_j, then we can solve for Ḡ_{0n} by setting 1/Ḡ_{0n} = ∑_j α_j φ_j, giving Ḡ_{0n} = 1/∑_j α_j φ_j. This explains that this set G_n will indeed approximate G as n converges to infinity, so that G_{0n} will approximate G_0, typically as fast as Q_n approximates Q_0 (although that will also depend on the undersmoothing of Q_n, in the case that G_0 requires basis functions that are not needed for approximating Q_0). By increasing C_n, the number of selected basis functions in Q_n with non-zero coefficients will increase, thereby making the approximation G_{0n} better and better.

As is evident from Theorem 2 below, this approximation G_{0n} should aim to approximate G_0 in the sense that R_20(Q_n, G_{0n}, Q_0, G_0) = o_P(n^{-1/2}), while also arranging that P_0{D*(Q_n, G_{0n}) − D*(Q_0, G_0)}^2 →_p 0 (the latter being trivial by not requiring any rate).

Convenient notation for the finite dimensional spline-representation of Q_n: Due to the finite support condition Q ≪* μ_n in the definition of the HAL-MLE, we have

(4) Q_n(x) = ∑_{j∈J(μ_n)} β_n(j) φ_j(x),

where φ_j(x) = I(x ≥ u_j) for the set of indices of all the knot-points {u_j: j} ⊂ [0, τ], varying over the s-specific edges E_s of [0, τ] and across the different subsets s ⊂ {1, …, d}. Note that β_n(j) = dQ_n(u_j), j ∈ J(μ_n). Let J(Q_n) = {j: β_n(j) ≠ 0} ⊂ J(μ_n) be the indices for the basis functions that have non-zero coefficients.

The following theorem establishes an undersmoothing condition (5) on C_n that guarantees P_n D_n*(Q_n, G_0) = o_P(n^{-1/2}). We remind the reader of the definition of the directional derivative (d/dQ) L(Q)(h) ≡ (d/dδ_0) L(Q + δ_0 h)|_{δ_0=0} in the direction h.

Theorem 1

Consider an approximation D_n*(Q_n, G_0) ∈ S(Q_n) (i.e., a score of a submodel not enforcing the constant L_1-norm of the HAL-MLE) of D*(Q_n, G_0) as defined above, and let h_n* be such that D_n*(Q_n, G_0) = S_{h_n*}(Q_n). Consider the representation (4) of Q_n. Note that β_n minimizes β ↦ P_n L(∑_{j∈J(μ_n)} β(j) φ_j) over all β = (β(j): j ∈ J(μ_n)) with ∑_{j∈J(μ_n)} |β(j)| ≤ C_n. This theorem applies to any Q_n = ∑_{j∈J(μ_n)} β_n(j) φ_j with β_n a minimizer of the latter empirical risk. Let S_j(Q) = (d/dQ) L(Q)(φ_j).

Assume ‖h_n*‖_∞ = O_P(1), and

(5) min_{j∈J(Q_n)} | P_n (d/dQ_n) L(Q_n)(φ_j) | = o_P(n^{-1/2}).

Then,

P_n D_n*(Q_n, G_0) = o_P(n^{-1/2}).

Let j* = arg min_{j∈J(Q_n)} P_0 φ_j. We can replace (5) by the following: P_0 S_{j*}(Q_n)^2 →_p 0 (which will generally hold whenever P_0 φ_{j*} = o_P(1)); {S_j(Q): Q ∈ Q, j ∈ J(μ_n)} is contained in a Donsker class (e.g., the class of cadlag functions with uniformly bounded sectional variation norm);

(6) P_0 { (d/dQ_n) L(Q_n)(φ_{j*}) − (d/dQ_0) L(Q_0)(φ_{j*}) } = o_P(n^{-1/2}),

and P_0 { (d/dQ_n) L(Q_n)(φ_{j*}) }^2 →_p 0.

Regarding (6), if we have

P_0 { (d/dQ_n) L(Q_n)(φ_{j*}) − (d/dQ_0) L(Q_0)(φ_{j*}) } = O_P( (P_0 φ_{j*})^{1/2} d_0^{1/2}(Q_n, Q_0) );

and d_0(Q_n, Q_0) = O_P(n^{-2/3}(log n)^{k_1}) (as we showed for the HAL-MLE), then (5) is implied by

(7) min_{j∈J(Q_n)} P_0 φ_j = o_P(n^{-1/3}(log n)^{-k_1}).

Condition (5) is directly verifiable on the data and can thus be used to select the sectional variation norm bound C_n for the HAL-MLE. For example, one could select a constant a and set C_n equal to the smallest value (larger than the cross-validation selector) for which the left-hand side of (5) is smaller than a/(n^{1/2} log n). The sufficient assumption (6) provides understanding of what it requires in terms of Q_n and P_0. We note that P_0{(d/dQ_n) L(Q_n)(φ_{j*})}^2 →_p 0 is a relatively weak condition and is generally implied by the support of φ_{j*} converging to zero, and is thereby a non-condition, given our undersmoothing condition (6). In the following lemma we consider a common structure on the loss function and demonstrate that, if we know that ‖Q_n − Q_0‖_∞ converges to zero at a rate close to n^{-1/3}(log n)^{k_1/2}, then condition (7) can be significantly weakened.
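As an illustration of this selector, the following minimal sketch (not from the article) assumes the squared-error loss, so that (d/dQ_n)L(Q_n)(φ_j) = −2 φ_j (Y − Q_n); it decreases the lasso penalty, starting from its cross-validated value, until criterion (5) falls below the cutoff a/(n^{1/2} log n). The constant a = 1 and the step factor are illustrative choices.

import numpy as np
from sklearn.linear_model import Lasso

def min_score_criterion(Phi, Y, fit):
    # min_{j in J(Q_n)} | P_n (d/dQ_n) L(Q_n)(phi_j) | for the squared-error loss
    resid = Y - fit.predict(Phi)
    active = np.flatnonzero(fit.coef_)        # J(Q_n): basis functions with non-zero coefficient
    if active.size == 0:
        return np.inf
    return np.min(2.0 * np.abs(Phi[:, active].T @ resid) / len(Y))

def undersmoothed_hal(Phi, Y, alpha_cv, a=1.0, factor=0.8, max_steps=25):
    cutoff = a / (np.sqrt(len(Y)) * np.log(len(Y)))
    alpha, fit = alpha_cv, None
    for _ in range(max_steps):
        fit = Lasso(alpha=alpha, max_iter=100000).fit(Phi, Y)
        if min_score_criterion(Phi, Y, fit) <= cutoff:
            break
        alpha *= factor                       # enlarge the L1-norm of the fit
    return fit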

Lemma 1

Consider the special case that O = (Z, X), L(Q)(O) depends on Q through Q(X) only, and (d/dQ) L(Q)(φ) = (d/dQ) L(Q) × φ, i.e., the directional derivative (d/dϵ) L(Q + ϵφ)|_{ϵ=0} of L at Q in the direction φ is just multiplication of a function (d/dQ) L(Q) of O with φ(X). Assume lim sup_n ‖(d/dQ_n) L(Q_n)‖_∞ < ∞. Let j* = arg min_{j∈J(Q_n)} P_0 φ_j. Assume P_0 φ_{j*} = o_P(1). Then, a sufficient condition for P_n D_n*(Q_n, G_0) = o_P(n^{-1/2}) is given by (6).

Assume

‖(d/dQ_n) L(Q_n) − (d/dQ_0) L(Q_0)‖_∞ = O(‖Q_n − Q_0‖_∞).

Then, P_n D_n*(Q_n, G_0) = o_P(n^{-1/2}) if

(8) ‖Q_n − Q_0‖_∞ min_{j∈J(Q_n)} P_0 φ_j = o_P(n^{-1/2}).

The condition (8) can be replaced by

‖Q_n − Q_0‖_∞ min_{j∈J(Q_n)} P_n φ_j = o_P(n^{-1/2}).

Here P_0 φ_j and P_n φ_j can be bounded by P_0(X ≥ u_j) and P_n(X ≥ u_j), respectively.

Alternatively, we apply Theorem 1 above, so that (6) holds if min_{j∈J(Q_n)} P_n φ_j = o_P(n^{-1/3}(log n)^{-k_1}).

In [26] we proved that ‖Q_n − Q_0‖_∞ →_p 0 under an absolute continuity condition. However, we expect the rate of convergence with respect to the supremum norm to be close to the rate n^{-1/3}(log n)^{k_1/2} that holds with respect to d_0^{1/2}(Q_n, Q_0), in which case (8) would only require that min_{j∈J(Q_n)} P_n φ_j = o_P(n^{-1/6}).

3.3 Condition for solving the unconstrained score approximation of the efficient influence curve in terms of number of non-zero coefficients in HAL-MLE fit

Let Q_n be the HAL-MLE. It solves the scores along paths Q_{n,ϵ}^h(x) = ∫ φ_u(x)(1 + ϵ h(u)) dQ_n(u) with r(h, Q_n) = ∫ h(u) |dQ_n(u)| = 0. This corresponds with paths Q_n(x) + ϵ ∫ φ_u(x) h(u) dQ_n(u). Let dZ_{Q_n}(u) = h(u) dQ_n(u). Note that the constraint

r(h, Q_n) = ∫ h (|dQ_n|/dQ_n) dQ_n = ∫ (|dQ_n|/dQ_n) dZ_{Q_n}(u).

Let s_n ≡ |dQ_n|/dQ_n, which is a vector with elements in {−1, 1} representing the sign of dQ_n(u). So in terms of Z_{Q_n} the constraint r(h, Q_n) = 0 corresponds with ∫ s_n dZ_{Q_n} = 0. This shows that we can view the paths as Q_n + ϵZ for any Z ≪* Q_n with ∫ s_n dZ = 0. Since dQ_n is discrete we can use the notation β_n(u) = dQ_n(u). Then the paths correspond with β_{n,ϵ}^z = β_n + ϵz, with z any vector with ‖z/β_n‖_∞ < ∞ such that ⟨z, s_n⟩ = ∑_{j∈J(Q_n)} z(j) s_n(j) = 0. The HAL-MLE Q_n satisfies, for any z ⊥ s_n with ‖z/β_n‖_∞ < ∞:

(9) 0 = P_n (d/dQ_n) L(Q_n)( ∑_{j∈J(Q_n)} z(j) φ_j ).

The following lemma establishes a bound for the latter score equation for z without the orthogonality constraint z ⊥ s_n, where this bound is in terms of the number J_n of knot-points in the fit Q_n with non-zero coefficient. Specifically, the bound is given by J_n^{-1} d_0^{1/2}(Q_n, Q_0). One then wonders what the approximate rate is for J_n, specifically when using the cross-validation selector. For this purpose, we note that Q_n is also an MLE for the parametric model ∑_{j∈J(Q_n)} β(j) φ_j, and one can show that the rate of convergence of Q_n to the best approximation Q_{0,n} in this J_n-dimensional parametric model is O_P((J_n/n)^{1/2}) (to be addressed in detail in future research). Given that we know that the rate of convergence of Q_n to Q_0, using the cross-validation selector C_{n,cv}, is given by n^{-1/3}(log n)^{k_1/2}, this suggests that J_n ≈ n^{1/3}(log n)^{k_1/2}, which then implies the rate O_P(n^{-2/3}) for the score Eq. (9) without the constraint z ⊥ s_n. Therefore, this result appears to formally establish that even without undersmoothing the score equation P_n D_n*(Q_n, G_0) = O_P(n^{-2/3}) is already solved at the desired error (asymptotically). However, undersmoothing might still be needed, even asymptotically, for achieving the desired approximation of D*(Q_n, G_0) by an element D_n*(Q_n, G_0) in the linear span of the scores S_h(Q_n) for uniformly bounded h (i.e., the selected basis functions in Q_n, even though sufficient to fit Q_0 at a good rate, might not generate enough scores to approximate the possibly more complex D*(Q_n, G_0), due to the complexity of G_0).

Lemma 2

Let J n be the number of elements in support J ( Q n ) of Q n . Assume

P_0 { (d/dQ_n) L(Q_n) − (d/dQ_0) L(Q_0) }( ∑_{j∈J(Q_n)} z(j) φ_j ) = O_P( ‖z‖_1 d_0^{1/2}(Q_n, Q_0) );

and that, uniformly in z with ‖z‖_1 < M for some M < ∞, the random function (d/dQ_n) L(Q_n)( ∑_{j∈J(Q_n)} z(j) φ_j ) falls in a fixed P_0-Donsker class (e.g., cadlag functions with a universal bound on the sectional variation norm).

We have that for any z with ‖z/β_n‖_∞ < ∞,

P_n (d/dQ_n) L(Q_n)( ∑_{j∈J(Q_n)} z(j) φ_j ) = O_P( ‖z‖_1 J_n^{-1} n^{-1/2} + ‖z‖_1 J_n^{-1} d_0^{1/2}(Q_n, Q_0) ).

Clearly, the first term is o_P(n^{-1/2}) as long as J_n → ∞. For example, if J_n = n^{1/3}(log n)^{k_1/2} and d_0^{1/2}(Q_n, Q_0) = O_P(n^{-1/3}(log n)^{k_1/2}), then this becomes

P_n (d/dQ_n) L(Q_n)( ∑_{j∈J(Q_n)} z(j) φ_j ) = O_P( ‖z‖_1 n^{-2/3} ).

Moreover, then

sup_{‖z‖_1 < M, ‖z/β_n‖_∞ < ∞} | P_n (d/dQ_n) L(Q_n)( ∑_{j∈J(Q_n)} z(j) φ_j ) | = O_P( M n^{-2/3} ).

Finally, if for an M < ∞, D_n*(Q_n, G_0) = (d/dQ_n) L(Q_n)( ∑_{j∈J(Q_n)} z(j) φ_j ) for a z = z(Q_n, G_0), and ‖z(Q_n, G_0)‖_1 < M with probability tending to 1, then this implies P_n D_n*(Q_n, G_0) = O_P( J_n^{-1} n^{-1/3}(log n)^{k_1/2} ) (and thus O_P(n^{-2/3}) if J_n = n^{1/3}(log n)^{k_1/2}).

Proof

Let z with ‖z‖_1 < M and ‖z/β_n‖_∞ < ∞ be given. We define z̃ = z − Π(z ∣ s_n). Note that ⟨s_n, s_n⟩ = J_n^2. So

z̃ = z − ( ∑_{j∈J(Q_n)} z(j) s_n(j) / J_n^2 ) s_n.

In short notation we write z̃ = z − π_n(z) with π_n(z) = Π(z ∣ s_n). The above shows that

π_n(z) = (1/J_n) ( ∑_{j∈J(Q_n)} z(j) s_n(j) ) s_n / J_n ≡ J_n^{-1} π_n*(z),

where

‖π_n*(z)‖_1 = | ∑_{j∈J(Q_n)} z(j) s_n(j) | ≤ ‖z‖_1 ≤ M.

We have

0 = P_n (d/dQ_n) L(Q_n)( ∑_{j∈J(Q_n)} z̃(j) φ_j ) = P_n (d/dQ_n) L(Q_n)( ∑_{j∈J(Q_n)} z(j) φ_j ) − P_n (d/dQ_n) L(Q_n)( ∑_{j∈J(Q_n)} π_n(z)(j) φ_j ).

Therefore, using that π_n(z) = J_n^{-1} π_n*(z) with ‖π_n*(z)‖_1 < M, the Donsker class and bounding conditions of the lemma, it follows that

P_n (d/dQ_n) L(Q_n)( ∑_{j∈J(Q_n)} z(j) φ_j ) = P_n (d/dQ_n) L(Q_n)( ∑_{j∈J(Q_n)} π_n(z)(j) φ_j ) = J_n^{-1} (P_n − P_0) (d/dQ_n) L(Q_n)( ∑_{j∈J(Q_n)} π_n*(z)(j) φ_j ) + J_n^{-1} P_0 (d/dQ_n) L(Q_n)( ∑_{j∈J(Q_n)} π_n*(z)(j) φ_j ) = O_P( J_n^{-1} n^{-1/2} ) + J_n^{-1} P_0 { (d/dQ_n) L(Q_n) − (d/dQ_0) L(Q_0) }( ∑_{j∈J(Q_n)} π_n*(z)(j) φ_j ) = O_P( J_n^{-1} n^{-1/2} + J_n^{-1} M d_0^{1/2}(Q_n, Q_0) ).

This completes the proof. □

3.4 Efficiency of the plug-in HAL MLE

The typical general efficiency proof used to analyze the TMLE (e.g., [18]) can be easily generalized to the condition that P n D n * ( Q n , G 0 ) = o P ( n 1 / 2 ) for some approximation D n * ( Q , G ) of the actual canonical gradient D*(Q, G 0). This results in the following theorem.

Theorem 2

Assume M_1, M_{20} < ∞. We have d_0(Q_n, Q_0) = O_P(n^{-2/3}(log n)^{k_1}). Assume condition (5) or the conditions of Lemma 2, so that P_n D_n*(Q_n, G_0) = o_P(n^{-1/2}).

If D n * ( Q n , G 0 ) = D * ( Q n , G 0 n ) , then we assume

  1. R_2((Q_n, G_{0n}), (Q_0, G_0)) = o_P(n^{-1/2}) and P_0{D*(Q_n, G_{0n}) − D*(Q_0, G_0)}^2 →_p 0.

  2. {D*(Q, G): Q ∈ Q, G ∈ G} is contained in the class of k_1-variate cadlag functions on a cube [0, τ_o] ⊂ IR^{k_1} in a Euclidean space, and sup_{Q∈Q, G∈G} ‖D*(Q, G)‖_v^* < ∞.

Otherwise, we assume

  1. R_2((Q_n, G_0), (Q_0, G_0)) = o_P(n^{-1/2}), P_0{D_n*(Q_n, G_0) − D*(Q_n, G_0)} = o_P(n^{-1/2}), and P_0{D_n*(Q_n, G_0) − D*(Q_0, G_0)}^2 →_p 0.

  2. {D_n*(Q, G_0), D*(Q, G_0): Q ∈ Q} is contained in the class of k_1-variate cadlag functions on a cube [0, τ_o] ⊂ IR^{k_1} in a Euclidean space, and sup_{Q∈Q} max(‖D*(Q, G_0)‖_v^*, ‖D_n*(Q, G_0)‖_v^*) < ∞.

Then, Ψ(Q n ) is asymptotically efficient.

The proof is straightforward, analogous to the typical efficiency proof for the TMLE, and is presented in the Appendix. Regarding the condition P_0{D_n*(Q_n, G_0) − D*(Q_n, G_0)} = o_P(n^{-1/2}), we note the following. For a typical choice D_n*(Q_n, G_0) in the set of scores S(Q_n), we have P_0 D_n*(Q_0, G_0) = 0, so that

P_0{D_n*(Q_n, G_0) − D*(Q_n, G_0)} = P_0{D_n*(Q_n, G_0) − D_n*(Q_0, G_0)} − P_0{D*(Q_n, G_0) − D*(Q_0, G_0)}.

However, this equals, up to sign, the P_0-mean of the function D*(Q_n, G_0) − D*(Q_0, G_0) = O_P(n^{-1/3}(log n)^{k_1/2}) minus its projection onto the linear span of scores {S_h(Q_n): h}. This will generally behave as a second order remainder involving a product of the differences Q_n − Q_0 and D_n* − D*. The latter is addressed in detail in our integrated square density example.

Remark about general impact of undersmoothing on behavior of plug-in HAL-MLE. Though undersmoothing is beneficial for controlling P n D*(Q n , G 0) = o P (n −1/2), it might harm the degree to which the Donsker class condition holds. This puts a clear restriction on the amount of undersmoothing allowed for asymptotic efficiency. The impact of undersmoothing on the second order remainder R 2((Q n , G 0n ), (Q 0, G 0)) is beneficial with respect to the approximation error of G 0G 0,n but might make Q n a poor approximation of Q 0. However, in many estimation problems, it appears that undersmoothing reduces the second order remainder as well in the sense that the second order remainder R 2(Q n , G n , Q 0, G 0) can itself be represented as a score P 0 S h (Q n ) for some h, so that undersmoothing reduces the size of the second order remainder. Therefore, undersmoothing may be generally beneficial for the behavior of the estimator as long as its variation norm stays bounded by a universal constant (or a slowly converging constant) as sample size increases.

3.5 Inference for the plug-in undersmoothed HAL-MLE

The undersmoothed HAL-MLE Ψ(Q_n) is asymptotically linear with influence curve D*(Q_0, G_0), so that it is approximately N(Ψ(Q_0), σ_0^2/n) with σ_0^2 = P_0 D*(Q_0, G_0)^2. There are various possible methods for estimating this normal limit distribution with corresponding confidence intervals. Let σ_n^2 be an estimator of this asymptotic variance. Then, an asymptotic 0.95-confidence interval is given by Ψ(Q_n) ± 1.96 σ_n/n^{1/2}. Let Ĝ: M_{np} → G be an estimator of G_0. Then we can estimate σ_0^2 with σ_n^2 = P_n D*(Q_n, G_n)^2, or with a cross-validated estimator σ_{n,cv}^2 = E_{B_n} P_{n,B_n}^1 D*(Q̂(P_{n,B_n}^0), Ĝ(P_{n,B_n}^0))^2, based on a cross-validation scheme B_n ∈ {0,1}^n. The cross-validated estimator is generally more accurate by not suffering from overfitting, just as a cross-validated MSE is a better estimator than the empirical plug-in MSE. In both of these plug-in estimators σ_n^2 and σ_{n,cv}^2 there is no argument for preferring an undersmoothed Q_n over the HAL-MLE based on the cross-validation selector of the L_1-norm. Therefore, we recommend the latter.
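For instance, in the treatment-specific mean example of Section 4, a minimal sketch (not from the article) of the Wald-type interval based on the plug-in variance estimator σ_n^2 = P_n D*(Q_n, G_n)^2 could look as follows; Qbar_W and Gbar_W stand for fitted values Q̄_n(1, W_i) and an estimator Ḡ_n(W_i), both assumed to be supplied by the user.

import numpy as np

def eif(A, Y, Qbar_W, Gbar_W, psi):
    # D*(Q, G)(O) = A / Gbar(W) * (Y - Qbar(W)) + Qbar(W) - psi
    return A / Gbar_W * (Y - Qbar_W) + Qbar_W - psi

def wald_ci(A, Y, Qbar_W, Gbar_W, psi):
    # 0.95 interval: psi_n +/- 1.96 * sigma_n / sqrt(n)
    n = len(Y)
    sigma_n = np.sqrt(np.mean(eif(A, Y, Qbar_W, Gbar_W, psi) ** 2))
    return psi - 1.96 * sigma_n / np.sqrt(n), psi + 1.96 * sigma_n / np.sqrt(n)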

This approach to obtaining inference would require the construction of an estimator of G_0, even though the HAL-MLE Q_n does not require this. To avoid such reliance on an estimator of G_0 and to improve finite sample coverage, we can use the nonparametric bootstrap, in which the sampling distribution of n^{1/2}(Ψ(Q_n) − Ψ(Q_0)) is estimated with the distribution of n^{1/2}(Ψ(Q_n^#) − Ψ(Q_n)), conditional on P_n, where Q_n^# is the undersmoothed HAL-MLE based on an i.i.d. sample from P_n. This method was proposed and analyzed in [27], which showed that the nonparametric bootstrap is a valid method for estimating the limit distribution of the plug-in HAL-MLE. In this bootstrap one can fix the L_1-norm of the HAL-MLE at the L_1-norm selected by the undersmoothed HAL-MLE, thereby making it computationally feasible. In [27] we also proposed a more conservative version of this bootstrap method by carrying out the bootstrap distribution for each L_1-norm and selecting the L_1-norm at which the width of the bootstrap confidence intervals reaches a plateau. In this manner, we guarantee that we sample from a maximally complex bootstrap distribution, which was shown to yield robust finite sample coverage.
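A minimal sketch (not from the article) of this bootstrap, holding the selected L_1-norm (here represented by a fixed lasso penalty) constant across bootstrap samples; fit_hal_plugin is a hypothetical function that refits the HAL-MLE on a data set and returns the plug-in estimate Ψ(Q_n).

import numpy as np

def bootstrap_ci(data, alpha_selected, fit_hal_plugin, B=500, seed=0):
    # data: array with one row per observation; alpha_selected: penalty fixed
    # at the value chosen by the undersmoothed HAL-MLE on the original sample.
    rng = np.random.default_rng(seed)
    n = len(data)
    psi_n = fit_hal_plugin(data, alpha_selected)
    roots = np.array([
        fit_hal_plugin(data[rng.integers(0, n, n)], alpha_selected) - psi_n
        for _ in range(B)
    ])
    lo, hi = np.quantile(roots, [0.025, 0.975])
    return psi_n - hi, psi_n - lo      # interval based on the bootstrapped root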

4 Example: HAL-MLE of treatment-specific mean

4.1 Formulation and relevant quantities for statistical estimation problem

Data and statistical model: Let O = (W, A, Y) ∼ P_0, where Y ∈ {0, 1} and A ∈ {0, 1} are binary random variables. Let (A, W) have support [0, τ] ⊂ IR^k, where A ∈ [0, 1] has support only on the edges {0, 1}. Similarly, certain components of W might be discrete, so that they only have a finite set of support points in their interval. Note O ∈ [0, τ_o] = [0, τ] × [0, 1], where [0, τ_o] is a cube in a Euclidean space of the same dimension as (W, A, Y). Let Ḡ(W) = E_P(A ∣ W) and Q̄(W) = E_P(Y ∣ A = 1, W). Assume the positivity assumption Ḡ_0(W) > δ > 0 for some δ > 0; that Q̄_0, Ḡ_0 are cadlag functions with ‖Q̄_0‖_v^* ≤ C^u and ‖Ḡ_0‖_v^* ≤ C_2^u for some finite constants C^u, C_2^u; and that δ < Q̄_0 < 1 − δ for some δ > 0. This defines the statistical model M for P_0.

Target parameter, canonical gradient and exact second order remainder: Let Ψ: M → IR denote the treatment-specific mean, defined by Ψ(P) = E_P E_P(Y ∣ W, A = 1). If an alternative quantity, such as the average treatment effect E_P{E_P(Y ∣ W, A = 1) − E_P(Y ∣ W, A = 0)}, is of interest, the following strategy could be employed separately in each treatment group. Let Q̃ = (Q_W, Q̄), where Q_W is the probability distribution of W. Note that Ψ(P) = Ψ(Q̃) = Q_W Q̄(·, 1). We have that Ψ is pathwise differentiable at P with canonical gradient given by D*(Q̃, G) = A/Ḡ(W)(Y − Q̄(W, A)) + Q̄(1, W) − Ψ(Q̃). Let L(Q̄)(O) = −{Y log Q̄(W, A) + (1 − Y) log(1 − Q̄(W, A))} be the log-likelihood loss for Q̄, and note that by the above bounding assumptions on Q̄ this loss function has finite universal bounds M_1 < ∞ and M_{20} < ∞. Let D_1*(Q̄, Ḡ) = A/Ḡ(Y − Q̄) be the Q̄-component of the canonical gradient, D_2*(Q̃) = Q̄(1, W) − Ψ(Q̃) the Q_W-component, and note that D*(Q̃, G) = D_1*(Q̄, Ḡ) + D_2*(Q̃). We have Ψ(Q̃) − Ψ(Q̃_0) = −P_0 D*(Q̃, G) + R_20(Q̄, Ḡ, Q̄_0, Ḡ_0), where

R_20(Q̄, Ḡ, Q̄_0, Ḡ_0) = P_0 [(Ḡ − Ḡ_0)/Ḡ] (Q̄ − Q̄_0).

Bounds on sectional variation norm and exact second order remainder: We have sup_{P∈M} ‖D*(Q̃(P), G(P))‖_v^* < C(C^u, C_2^u) for some finite constant C(C^u, C_2^u) implied by the universal bounds C^u, C_2^u on the sectional variation norms of Q̄, Ḡ. We also note that, using the Cauchy–Schwarz inequality, R_20(Q̄, Ḡ, Q̄_0, Ḡ_0) ≤ (1/δ) ‖Q̄ − Q̄_0‖_{P_0} ‖Ḡ − Ḡ_0‖_{P_0}, where ‖f‖_{P_0}^2 = ∫ f^2(o) dP_0(o).

4.2 HAL-MLE

Let Q = logit Q̄ and let L(Q)(O) = −A{Y log Q̄(W) + (1 − Y) log(1 − Q̄(W))} be the log-likelihood loss restricted to the observations with A = 1. Let Q_{C,n} = arg min_{Q: ‖Q‖_v^* < C} P_n L(Q) be the C-specific zero-order spline HAL-MLE for a given bound C on the sectional variation norm. Let C_n ≤ C^u be a data adaptive selector that is larger than or equal to the cross-validation selector, so that P(C_{n,cv} ≤ C_n ≤ C^u) = 1. Let Q_n = Q_{C_n,n}, and let Q_{W,n} be the empirical probability measure of W_1, …, W_n. We can represent Q_n = ∑_{j∈J(μ_n)} β_n(j) φ_j, where φ_j = I(W ≥ w_j) for knot-points w_j running over all observations {(W_{s,i}, 0_{s^c}): i = 1, …, n} across all subsets s ⊂ {1, …, k_1}. By our rate of convergence results on the HAL-MLE we have that ‖Q_n − Q_0‖_{P_0} = O_P(n^{-1/3}(log n)^{k_1/2}). The HAL-MLE of Ψ(Q̃_0) is the plug-in estimator Ψ(Q̃_n) = Q_{W,n} Q̄_n. Note that P_n D_2*(Q̃_n) = 0 for any Q_n. Thus, we are only concerned with showing that P_n D_1*(Q_n, G_0) = o_P(n^{-1/2}).
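The following minimal sketch (not from the article) illustrates this plug-in estimator: an L_1-penalized logistic regression over the zero-order spline basis, fit among the A = 1 observations, plays the role of the HAL-MLE, and the function also returns the empirical mean of D_1*(Q̄_n, Ḡ) at a user-supplied estimate of Ḡ_0, which is the quantity one undersmooths to make small. The helper hal_design is the hypothetical function from Section 2.

import numpy as np
from sklearn.linear_model import LogisticRegression

def treatment_specific_mean(W, A, Y, Gbar_hat, C=1.0):
    Phi = hal_design(W, knots=W)                      # zero-order spline basis
    fit = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    fit.fit(Phi[A == 1], Y[A == 1])                   # loss restricted to A = 1
    Qbar_n = fit.predict_proba(Phi)[:, 1]             # Qbar_n(1, W_i) for all i
    psi_n = Qbar_n.mean()                             # plug-in Psi(Q_n) = Q_{W,n} Qbar_n
    pn_d1 = np.mean(A / Gbar_hat * (Y - Qbar_n))      # P_n D_1*(Qbar_n, Gbar_hat)
    return psi_n, pn_d1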

Class of paths absolutely continuous with respect to Q_n: Consider the following class of paths

Q_{n,ϵ}^h(x) = ∫_{[0,τ]} φ_x(u)(1 + ϵ h(u)) dQ_n(u),

where the right-hand side can also be written as Q_n(x) + ϵ f(h, Q_n)(x), where

f(h, Q_n)(x) = h(0) Q_n(0) + ∑_{s⊂{1,…,m}} ∫_{(0_s, x_s]} h(s, u_s) dQ_{n,s}(u_s).

This defines a path Q n , ϵ h : ϵ ( δ , δ ) for each uniformly bounded function h, as in our general representation.

Set of scores generated by class of paths: The scores generated by this family of paths are given by:

S_h(Q_n) ≡ (d/dϵ) L(Q_{n,ϵ}^h)|_{ϵ=0} = A f(h, Q_n)(W)(Y − Q̄_n(W)).

This defines a set of scores S(Q_n) = {S_h(Q̄_n): ‖h‖_∞ < ∞}. Note that solving for an h so that S_h(Q̄_n) = D_1*(Q̄_n, Ḡ_0) would require f(h, Q̄_n)(W) = 1/Ḡ_0(W). However, since Ḡ_0 is not sectionally absolutely continuous with respect to Q_n (i.e., Q_{n,s} is discrete for all subsets s, while Ḡ_{0,s} is, say, continuous), there does not exist an h for which f(h, Q_n) = 1/Ḡ_0. Thus, D*(Q_n, Ḡ_0) ∉ {S_h(Q_n): ‖h‖_∞ < ∞}.

Score equations solved by HAL-MLE:

Let

r(h, Q_n) = ∫_{[0,τ]} h(u) |dQ_n(u)|,

which can also be written as

r(h, Q_n) ≡ h(0) |Q_n(0)| + ∑_{s⊂{1,…,m}} ∫_{(0_s, τ_s]} h(s, u_s) |dQ_{n,s}(u_s)|.

The HAL-MLE solves

P n S h ( Q n ) = 0  for all h with r ( h , Q n ) = 0 .

4.3 Defining approximation G 0n

We define

G_n ≡ {Ḡ ∈ G: Ḡ ≪* Q̄_n}.

We note that if Ḡ_s ≪ Q̄_{n,s}, then we also have 1/Ḡ_s ≪ Q̄_{n,s}. Here we use that if g(x) = 1/f(x), then g_s(dx_s) = −(1/f_s^2(x_s)) f_s(dx_s). Therefore, if Ḡ ≪* Q̄_n, then we can find an h so that f(h, Q_n)(A, W) = A/Ḡ(W), and thereby that D_1*(Q_n, Ḡ) = S_h(Q_n).

Let

Ḡ_{0n} = arg min_{Ḡ ∈ G_n} ‖Ḡ − Ḡ_0‖_{P_0},

where ‖Ḡ − Ḡ_0‖_{P_0} is the L_2(P_0)-norm of Ḡ − Ḡ_0. Then, D_1*(Q_n, Ḡ_{0n}) ∈ {S_h(Q_n): h}, so that we can find an h*(Q_n, Ḡ_0) so that

D_1*(Q_n, Ḡ_{0n}) = S_{h*(Q_n, Ḡ_0)}(Q_n).

4.4 Application of Theorem 2

We need to assume R_2((Q_n, G_{0n}), (Q_0, G_0)) = o_P(n^{-1/2}) and P_0{D*(Q_n, G_{0n}) − D*(Q_0, G_0)}^2 →_p 0. The latter already holds if ‖Ḡ_{0n} − Ḡ_0‖_{P_0} →_p 0. However, the first condition relies on a rate of convergence. Given the rate of convergence for the HAL-MLE Q_n, it thus suffices that ‖Ḡ_{0n} − Ḡ_0‖_{P_0} = o_P(n^{-1/6}(log n)^{-k_1/2}). This appears to be a reasonable condition, since Ḡ_{0n} is the L_2(P_0)-projection of Ḡ_0 onto G_n, so that the only concern would be that the set G_n does not approximate G fast enough as n converges to infinity. However, if the set of basis functions is rich enough for Q̄_n to converge at a rate n^{-1/3}(log n)^{k_1/2} to Q̄_0 (not allowing the coefficients to be chosen based on P_0), then the resulting linear combination of indicator basis functions should generally also be rich enough for approximating the true Ḡ_0 at a rate n^{-1/6}(log n)^{-k_1/2} (now allowing the coefficients of the basis functions to be selected in terms of Ḡ_0).

Verification of assumption (5) of Theorem 2: Assumption (5) states that

min_{j∈J(Q_n)} | P_n (d/dQ_n) L(Q_n)(φ_j) | = min_{j∈J(Q_n)} | (1/n) ∑_i I(A_i = 1) I(W_i ≥ w_j)(Y_i − Q̄_n(1, W_i)) | = o_P(n^{-1/2}).

We apply the last part of Theorem 1. Since (d/dQ) L(Q)(φ) = φ(A, W)(Y − Q̄(A, W)), it follows that

(10) ‖(d/dQ_n) L(Q_n) − (d/dQ_0) L(Q_0)‖_{P_0} = O(‖Q_n − Q_0‖_{P_0}).

Given that we have d 0 ( Q n , Q 0 ) = O P ( n 2 / 3 ( log n ) k 1 ) , it follows that the remaining condition is (7), or, equivalently,

min_{j∈J(Q_n)} P_n φ_j = o_P(n^{-1/3}(log n)^{-k_1}).

This reduces to the assumption that min_{j∈J(Q_n)} P_n(W ≥ w_j) = o_P(n^{-1/3}(log n)^{-k_1}). We arrange this assumption to hold by selecting C_n accordingly.

Verification of assumptions of Lemma 2: Given our assumptions, it is straightforward to verify the conditions of Lemma 2. This lemma provides the bound J_n^{-1} d_0^{1/2}(Q_n, Q_0) for P_n D_n*(Q_n, G_0). This then provides the alternative condition for choosing C_n (and thereby J_n) to establish P_n D_n*(Q_n, G_0) = o_P(n^{-1/2}).

This proves the following efficiency theorem for the HAL-MLE in this particular estimation problem.

Theorem 3

Consider the formulation above of the statistical estimation problem. Let

G_n = {Ḡ ∈ G: Ḡ ≪* Q̄_n},

and

Ḡ_{0n} = arg min_{Ḡ ∈ G_n} ‖Ḡ − Ḡ_0‖_{P_0}.

Assumptions:

  1. ‖Ḡ_{0n} − Ḡ_0‖_{P_0} = o_P(n^{-1/6}(log n)^{-k_1/2}), where we can use that ‖Q_n − Q_0‖_{P_0} = O_P(n^{-1/3}(log n)^{k_1/2}).

  2. Given the fit Q_n = ∑_{j∈J(Q_n)} β_n(j) φ_j, with knot-points at the observations {W_j(s): j = 1, …, n, s} and indicator basis functions φ_j(W) = I(W ≥ W_j), we assume that C_n < C^u for some finite C^u is chosen so that

    min_{j∈J(Q_n)} P_n φ_j = o_P(n^{-1/3}(log n)^{-k_1}).

    Alternatively, select the number of knot-points J_n (i.e., C_n) so that J_n^{-1} d_0^{1/2}(Q_n, Q_0) = o_P(n^{-1/2}) (e.g., J_n = n^{1/3}(log n)^{k_1/2}).

Then, Ψ(Q n ) is an asymptotically efficient estimator of Ψ(Q 0).

5 Example: HAL-MLE for the integrated square of the data density

Let O ∼ P_0 be a k_1-variate random variable with Lebesgue density p_0 that is assumed to be bounded from below by a δ > 0 and from above by an M < ∞. Let {P_Q: Q ∈ Q} be a parametrization of the probability measure of O in terms of a functional parameter Q that varies over a class of multivariate real-valued cadlag functions on [0, τ] with finite sectional variation norm. Below we will focus on the particular parameterization given by p_Q = c(Q){δ + (M − δ)expit(Q)}, where expit(x) = 1/(1 + exp(−x)), and c(Q) is the normalizing constant defined by ∫ p_Q do = 1. Note that in this parameterization Q can be any cadlag function with finite sectional variation norm, thereby allowing that the densities p_Q are discontinuous (but cadlag). Another possible parametrization is obtained through the following steps: (1) modeling the density p(x) as a product ∏_{j=1}^{k_1} p_j(x_j ∣ x̄(j−1)) of univariate conditional densities of x_j, given x̄(j−1); (2) modeling each univariate conditional density p_j in terms of its univariate conditional hazard λ_j; (3) modeling this hazard as λ_j(x_j ∣ x̄(j−1)) = exp(Q_j(x_j, x̄(j−1))) (or discretizing it and modeling it with a logistic function in Q_j); and (4) setting Q = (Q_1, …, Q_{k_1}). With this latter parametrization each Q_j varies over a parameter space of cadlag functions with finite sectional variation norm.

Let the statistical model M = {P_Q: Q ∈ Q(C^u)} for P_0 be nonparametric beyond that each probability distribution is dominated by the Lebesgue measure and Q varies over cadlag functions with sectional variation norm bounded by C^u. The statistical target parameter Ψ: M → IR is defined by Ψ(P) = ∫ p^2(o) do. The canonical gradient of Ψ at P is given by D*(P)(O) = 2(p(O) − Ψ(P)), and the exact second order remainder R_2(P, P_0) = Ψ(P) − Ψ(P_0) + P_0 D*(P) is given by R_2(P, P_0) = −∫ (p − p_0)^2(o) do.
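To fix ideas, a minimal sketch (not from the article) for a one-dimensional O on [0, 1], with a simple histogram density standing in for p_{Q_n}: it returns the plug-in Ψ(p_n) = ∫ p_n^2 do and the empirical mean of the canonical gradient, P_n D*(p_n) = (1/n) ∑_i 2(p_n(O_i) − Ψ(p_n)), whose smallness the undersmoothing conditions below are designed to ensure.

import numpy as np

def plugin_and_score(O, n_bins=20):
    counts, edges = np.histogram(O, bins=n_bins, range=(0.0, 1.0))
    width = edges[1] - edges[0]
    p_hat = counts / (len(O) * width)                   # histogram density estimate
    psi_n = np.sum(p_hat ** 2) * width                  # plug-in estimate of \int p^2 do
    bin_idx = np.clip(np.digitize(O, edges) - 1, 0, n_bins - 1)
    pn_score = np.mean(2.0 * (p_hat[bin_idx] - psi_n))  # empirical mean of D*(p_n)
    return psi_n, pn_score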

Let L(Q) = −log p_Q be the log-likelihood loss function for Q. Let Q_n be an HAL-MLE bounding the sectional variation norm by a C_n < C^u. We wish to establish conditions on C_n so that Ψ(Q_n) = ∫ p_{Q_n}^2 do is an asymptotically efficient estimator of Ψ(Q_0) = ∫ p_{Q_0}^2 do. We assume this HAL-MLE is discrete, so that we can use the finite dimensional representation Q_n = ∑_{j∈J(Q_n)} β_n(j) φ_j with ‖β_n‖_1 ≤ C_n, as in our general presentation. Let Q_{n,ϵ}^h(x) = ∫_{[0,τ]} φ_x(u)(1 + ϵ h(u)) dQ_n(u), indexed by any bounded function h, be the paths as defined in our general presentation (and the previous section). Let S_h(Q_n) = (d/dϵ) L(Q_{n,ϵ}^h)|_{ϵ=0} be the score of this path under the log-likelihood loss. These scores are given by

S_h(Q_n) = (d/dQ_n) L(Q_n)(f(h, Q_n)),

where f(h, Q_n)(x) = ∫_{[0,τ]} φ_x(u) h(u) dQ_n(u), which also equals Q_n(0) h(0) + ∑_s ∫_{(0_s, x_s]} h(s, u_s) dQ_{n,s}(u_s). Let S(Q_n) = {S_h(Q_n): h} be the collection of scores. In order to apply Theorem 2 we need to determine an approximation D_n*(Q_n) ∈ S(Q_n) of the canonical gradient D*(Q_n) = 2(p_{Q_n} − Ψ(Q_n)). We have

S_h(Q) = A(f(h, Q))/C(Q) + [M exp(Q) / ((1 + exp(Q))(δ + δ exp(Q) + M))] f(h, Q),

where

A(f) = ∫ [exp(Q)/(1 + exp(Q))^2] f do / ( ∫ (δ + M/(1 + exp(Q))) do )^2.

Let G(Q) = M exp(Q)/[(1 + exp(Q))(δ + δ exp(Q) + M)], so that the equation S_h(Q) = D*(Q) corresponds with G(Q) f(h, Q) + C(Q)^{-1} A(f(h, Q)) = D*(Q), which can be rewritten as f(h, Q) + G_1(Q) A(f(h, Q)) = D*(Q)/G(Q), where G_1(Q) = 1/(C(Q) G(Q)). Let D_1(Q) = D*(Q)/G(Q), so that the equation becomes f + G_1(Q) A(f) = D_1(Q). Once we have solved for f, whose solution we will denote by f(Q), it remains to solve for h in f(h, Q) = f(Q), or find a closest solution. It is important to note that f ↦ A(f) is a linear real-valued operator. Applying this operator to both sides yields A(f) + A(f) A(G_1(Q)) = A(D_1(Q)), so that we obtain the solution

A ( f ) = A ( D 1 ( Q ) ) 1 + A ( G 1 ( Q ) ) .

Plugging this back into the equation, we obtain $f(Q) \equiv D_1(Q) - G_1(Q)\frac{A(D_1(Q))}{1 + A(G_1(Q))}$. Thus, we have shown that if we can set $f(h, Q_n) = f(Q_n)$, then we have $S_h(Q_n) = D^*(Q_n)$. It remains to determine a choice $h(Q_n)$ so that $f(h, Q_n) \approx f(Q_n)$. The space $\{f(h, Q_n): h\}$ equals $\{\sum_{j \in J(Q_n)} \alpha(j)\phi_j: \alpha\}$, the linear span of the basis functions $\{\phi_j: j \in J(Q_n)\}$. Let $f_n(Q_n)$ be the projection of $f(Q_n)$ onto this linear space, for example, with respect to the $L_2(P_0)$-norm. Let $h_n(Q_n)$ be the solution of $f(h, Q_n) = f_n(Q_n)$, and let $D_n^*(Q_n) = S_{h_n(Q_n)}(Q_n)$ be our desired approximation of $D^*(Q_n)$, which is an element of the set of scores $\{S_h(Q_n): h\}$. We note that
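As a hedged illustration of the projection step just described (hypothetical names, not the authors' implementation), the following sketch projects a function $f(Q_n)$, evaluated on a grid with weights standing in for integration against $P_0$, onto the linear span of the active basis functions by weighted least squares.

```python
import numpy as np

def project_onto_basis(f_values, basis_matrix, p0_weights):
    """Weighted least-squares projection of f onto span{phi_j}, approximating L2(P_0).

    f_values     : (n_grid,) evaluations of f(Q_n)
    basis_matrix : (n_grid, n_basis) evaluations of the active basis functions
    p0_weights   : (n_grid,) weights approximating integration against p_0
    Returns the coefficients alpha and the projected function f_n = basis @ alpha.
    """
    W = np.diag(p0_weights)
    gram = basis_matrix.T @ W @ basis_matrix
    rhs = basis_matrix.T @ W @ f_values
    # small ridge term only for numerical stability of the solve
    alpha = np.linalg.solve(gram + 1e-10 * np.eye(gram.shape[0]), rhs)
    return alpha, basis_matrix @ alpha
```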

$$D_n^*(Q_n) - D^*(Q_n) = S_{h_n(Q_n)}(Q_n) - D^*(Q_n) = G(Q_n)f_n(Q_n) + C(Q_n)^{-1}A(f_n(Q_n)) - D^*(Q_n) = G(Q_n)f_n(Q_n) + C(Q_n)^{-1}A(f_n(Q_n)) - G(Q_n)f(Q_n) - C(Q_n)^{-1}A(f(Q_n)) = G(Q_n)\big(f_n(Q_n) - f(Q_n)\big) + C(Q_n)^{-1}A\big(f_n(Q_n) - f(Q_n)\big).$$

We will assume that $\|f_n(Q_n) - f(Q_n)\|_{P_0} = o_P(n^{-1/4})$. The main condition beyond (5) of Theorem 2 is that $P_0\{D_n^*(Q_n) - D^*(Q_n)\} = o_P(n^{-1/2})$. Note that $P_0 D_n^*(Q_0) = 0 = P_0 D^*(Q_0)$. Therefore,

$$P_0\{D_n^*(Q_n) - D^*(Q_n)\} = P_0\{D_n^*(Q_n) - D_n^*(Q_0)\} - P_0\{D^*(Q_n) - D^*(Q_0)\} = P_0\{G(Q_n)(f_n(Q_n) - f(Q_n))\} + P_0\{C(Q_n)^{-1}A(f_n(Q_n) - f(Q_n))\} - P_0\{G(Q_0)(f_n(Q_0) - f(Q_0))\} - P_0\{C(Q_0)^{-1}A(f_n(Q_0) - f(Q_0))\}.$$

Let $\Pi_n$ be the projection operator onto the linear span generated by the basis functions of $Q_n$, which is of the same dimension as the number of basis functions. The latter difference can also be represented as

$$P_0\left\{D^*(Q_n) - D^*(Q_0) - \Pi_n\big(D^*(Q_n) - D^*(Q_0)\big)\right\},$$

or, if we define $\Pi_n^{\perp} = (I - \Pi_n)$ as the projection operator onto the orthogonal complement of the linear space spanned by the basis functions in $Q_n$, then this term can be denoted as

(11) $$P_0\, \Pi_n^{\perp}\big(D^*(Q_n) - D^*(Q_0)\big),$$

which can, in particular, be bounded by the operator norm $\|\Pi_n^{\perp}\|$ of $\Pi_n^{\perp}$ times the $L_2(P_0)$-norm of $D^*(Q_n) - D^*(Q_0)$. Thus, if we assume that $\|\Pi_n^{\perp}\| = o_P(n^{-1/6}(\log n)^{-k_1/2})$, then it follows that this term is $o_P(n^{-1/2})$. We will simply assume (11) to be $o_P(n^{-1/2})$. The other conditions beyond (5) of Theorem 2 hold by the fact that $\|Q_n - Q_0\|_{P_0} = O_P(n^{-1/3}(\log n)^{k_1/2})$ and that $D^*(Q_n)$, $D_n^*(Q_n)$ fall in a $P_0$-Donsker class of cadlag functions with a universal bound on the sectional variation norm.

Verification of assumption (5) of Theorem 2: Assumption (5) states that

$$\min_{j \in J(Q_n)} \left| P_n \frac{d}{dQ_n} L(Q_n)(\phi_j) \right| = o_P(n^{-1/2}).$$

We apply the last part of Theorem 1. We have

(12) $$\left\| \frac{d}{dQ_n} L(Q_n) - \frac{d}{dQ_0} L(Q_0) \right\|_{P_0} = O\big(\|Q_n - Q_0\|_{P_0}\big).$$

Given that we have $d_0(Q_n, Q_0) = O_P(n^{-2/3}(\log n)^{k_1})$, it follows that the remaining condition is (7), or, equivalently,

$$\min_{j \in J(Q_n)} P_n \phi_j = O_P(n^{-1/3}(\log n)^{k_1}).$$

This reduces to the assumption that $\min_{j \in J(Q_n)} P_n(O \geq u_j) = O_P(n^{-1/3}(\log n)^{k_1})$, where the $u_j$ are the knot-points making up the support of $\mu_n$. We arrange this assumption to hold by selecting $C_n$ accordingly.

Alternatively, as in the previous example, by applying Lemma 2, we select the number $J_n$ of knot-points with non-zero coefficients so that $J_n d_0^{1/2}(Q_n, Q_0) = o_P(n^{-1/2})$.
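A minimal diagnostic for these two selection rules could look as follows; the function name and inputs are hypothetical, and the comparison against the stated rates is left to the user.

```python
import numpy as np

def undersmoothing_diagnostics(basis_matrix, beta):
    """basis_matrix: (n, J) 0/1 evaluations phi_j(O_i); beta: (J,) fitted HAL coefficients."""
    active = np.abs(beta) > 0
    support_fractions = basis_matrix[:, active].mean(axis=0)  # P_n phi_j for active j
    return {
        "min_support": float(support_fractions.min()),  # min_j P_n phi_j (sparsest basis function)
        "n_active": int(active.sum()),                   # J_n, number of non-zero coefficients
    }
```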

This proves the following efficiency theorem for the HAL-MLE in this particular estimation problem.

Theorem 4

Let $O \sim P_0$ be a $k_1$-variate random variable with Lebesgue density $p_0$ that is assumed to be bounded from below by a $\delta > 0$ and from above by an $M < \infty$. Let $p_Q = c(Q)\{\delta + (M - \delta)\mathrm{expit}(Q)\}$, where $\mathrm{expit}(x) = 1/(1 + \exp(-x))$ and $c(Q)$ is the normalizing constant defined by $\int p_Q\, do = 1$, where $Q \in \mathcal{Q}(C^u)$ can be any cadlag function with finite sectional variation norm bounded by $C^u$. Let the statistical model $\mathcal{M} = \{P_Q: Q \in \mathcal{Q}(C^u)\}$ for $P_0$ be nonparametric beyond the restriction that each probability distribution is dominated by the Lebesgue measure and that $Q$ varies over cadlag functions with sectional variation norm bounded by $C^u$. The statistical target parameter $\Psi: \mathcal{M} \to \mathbb{R}$ is defined by $\Psi(P) = \int p^2(o)\, do$, which we also denote by $\Psi(Q)$. The canonical gradient of $\Psi$ at $P$ is given by $D^*(P)(O) = 2(p(O) - \Psi(P))$, and the exact second-order remainder $R_2(P, P_0) \equiv \Psi(P) - \Psi(P_0) + P_0 D^*(P)$ is given by $R_2(P, P_0) = -\int (p - p_0)^2\, do$.

Consider the formulation above of the statistical estimation problem. We have $\|Q_n - Q_0\|_{P_0} = O_P(n^{-1/3}(\log n)^{k_1})$.

Assumptions :

  1. Given the fit $Q_n = \sum_{j \in J(\mu_n)} \beta_n(j)\phi_j$, with support points given by the observations, $J(\mu_n) = \{(O_j(s), 0(s^c)): j = 1, \ldots, n,\ s\}$, and indicator basis functions $\phi_j(O) = I(O \geq u_j)$ with $u_j \in J(\mu_n)$, we assume that $C_n < C^u$ for some finite $C^u$ is chosen so that

    $$\min_{j \in J(Q_n)} P_n \phi_j = O_P(n^{-1/3}(\log n)^{k_1}).$$

    Alternatively, select the number $J_n$ of knot-points with non-zero coefficients so that $J_n d_0^{1/2}(Q_n, Q_0) = o_P(n^{-1/2})$.

  2. Let $\Pi_n$ be the projection operator in $L_2(P_0)$ onto the orthogonal complement of the linear span of the basis functions $\{\phi_j: j \in J(Q_n)\}$ in the fit of $Q_n$. Assume

    (13) $$P_0 \Pi_n\big(D^*(Q_n) - D^*(Q_0)\big) = o_P(n^{-1/2}).$$

    A sufficient condition is that the operator norm $\|\Pi_n\|$ of $\Pi_n$ is $o_P(n^{-1/6}(\log n)^{-k_1/2})$.

Then, Ψ(Q n ) is an asymptotically efficient estimator of Ψ(Q 0).
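Assuming a univariate observation and a density fit evaluated on a grid (a hypothetical setup, not the authors' software), the plug-in step $\Psi(Q_n) = \int p_{Q_n}^2(o)\, do$ can be approximated by a Riemann sum, as sketched below.

```python
import numpy as np

def plugin_integrated_square_density(p_n_on_grid, grid_cell_volume):
    """Riemann-sum approximation of the plug-in estimate int p_n^2(o) do."""
    return float(np.sum(p_n_on_grid ** 2) * grid_cell_volume)
```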

6 Simulation study

Our global undersmoothing conditions only specify a sufficient rate at which the sparsest selected basis function should converge to zero, or at which the number of basis functions selected should converge to infinity, but they do not provide a constant in front of this rate. Thus, they do not immediately yield a practical method for tuning the level of undersmoothing. In our simulation studies, we investigate a targeted $L_1$-norm selector chosen so that the empirical mean of the canonical gradient at the HAL-MLE (indexed by the $L_1$-norm), and possibly at a HAL-MLE of the nuisance parameter in the canonical gradient, is $o_P(n^{-1/2})$. In extensive simulations, this method appears to give better practical results than several direct implementations of our global undersmoothing criterion (i.e., the choice of constant matters for practical performance). More research will be needed to investigate whether one can construct a global undersmoothing selector (according to our theorem) that results in well-behaved efficient plug-in estimators across a large class of target parameters. Our simulations also demonstrate that our targeted selection method for undersmoothing controls the sectional variation norm of the fit, which is a crucial part of the Donsker class or asymptotic equicontinuity condition.

6.1 Simulations for the treatment-specific mean

We simulated a vector $W = (W_1, W_2)$, with $W_1$ created by drawing $Z \sim \mathrm{Beta}(0.85, 0.85)$ and setting $W_1 = 4Z - 2$. $W_2$ was drawn independently from a Bernoulli(0.5) distribution. Given $W = w$, a binary random variable $A$ was drawn with probability of $A = 1$ equal to $\bar{G}_0(w) = \mathrm{logit}^{-1}\{w_1 - 2 w_1 w_2\}$. Thus, the required positivity condition holds by design, with $P_0\{0.119 < \bar{G}_0(W) < 0.881\} = 1$. Given $W = w$, we set $Y = \bar{Q}_0(w) + \epsilon$, where $\bar{Q}_0(w) = \mathrm{logit}^{-1}\{w_1 - 2 w_1 w_2\}$ and $\epsilon \sim \mathrm{Normal}(0, 0.25)$. The true value of the treatment-specific mean is $\Psi(P_0) = E_{P_0} E_{P_0}(Y \mid W, A = 1) = 0.5$. We refer readers back to Section 4 for the form of the canonical gradient.
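The following sketch reproduces this data-generating process under our reading of the garbled linear predictor as $w_1 - 2 w_1 w_2$ (which matches the reported positivity bounds 0.119 and 0.881); it is not the authors' simulation code, and the noise scale assumes Normal(0, 0.25) denotes the variance.

```python
import numpy as np

rng = np.random.default_rng(1)

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def simulate(n):
    Z = rng.beta(0.85, 0.85, size=n)
    W1 = 4.0 * Z - 2.0
    W2 = rng.binomial(1, 0.5, size=n)
    lin = W1 - 2.0 * W1 * W2                 # ranges over [-2, 2] by construction
    A = rng.binomial(1, expit(lin))          # treatment mechanism Gbar_0
    Y = expit(lin) + rng.normal(0.0, np.sqrt(0.25), size=n)  # outcome regression plus noise
    return W1, W2, A, Y
```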

We built our undersmoothed estimator of $\Psi(P_0)$ as follows. We estimate $\bar{Q}_0$ using a HAL regression estimator and select the regularization of the estimator by choosing the smallest value $C$ of the $L_1$-norm bound such that

$$\left| P_n D^*(Q_{C,n}, \bar{G}_n) \right| < \frac{\left\{P_n D^*(Q_{C,n}, \bar{G}_n)^2\right\}^{1/2}}{\log(n)\, n^{1/2}},$$

where $\bar{G}_n$ is the HAL-MLE estimate of $\bar{G}_0$ (i.e., a HAL regression that uses a cross-validated choice of $C$). We then computed the plug-in estimator as described in Section 4.
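A schematic version of this selector is sketched below. The fitting routine `fit_Q` (returning predictions $\bar{Q}_{C,n}(W, A=1)$ for a given $L_1$ bound $C$) and the plug-in step are hypothetical stand-ins; the canonical gradient used is the standard one for the treatment-specific mean.

```python
import numpy as np

def eif_treatment_specific_mean(Y, A, Qbar1, Gbar, psi):
    """Canonical gradient D*(Q, G)(O) = A/G(W) * (Y - Qbar1(W)) + Qbar1(W) - psi."""
    return A / Gbar * (Y - Qbar1) + Qbar1 - psi

def select_L1_bound(C_grid, fit_Q, Y, A, Gbar):
    """Pick the smallest C whose fit makes |P_n D*| small relative to its standard error."""
    n = len(Y)
    for C in sorted(C_grid):                 # smallest C first
        Qbar1 = fit_Q(C)                     # predictions Qbar_{C,n}(W, A = 1)
        psi = Qbar1.mean()                   # plug-in estimate (stand-in)
        D = eif_treatment_specific_mean(Y, A, Qbar1, Gbar, psi)
        if abs(D.mean()) <= np.sqrt(np.mean(D ** 2)) / (np.log(n) * np.sqrt(n)):
            return C, psi
    # fall back to the largest bound if no C satisfied the criterion
    C = max(C_grid)
    Qbar1 = fit_Q(C)
    return C, Qbar1.mean()
```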

We generated 3000 data sets in this way and computed the undersmoothed HAL estimate. We report the estimator's bias (scaled by $n^{1/2}$), Monte Carlo variance (scaled by $n$), mean squared error (scaled by $n$), and the sampling distribution of $n^{1/2}\{\Psi(\tilde{Q}_n) - \Psi(P_0)\}$. We additionally report on the behavior of $n^{1/2} P_n D^*(Q_{C_n,n}, \bar{G}_0)$ and

$$n^{1/2} \min_{s,\, j \in J_n(s),\, \beta_n(s,j) \neq 0} \left| P_n \frac{d}{dQ_n} L(Q_n)(\phi_{s,j}) \right|.$$

As predicted by theory, the bias of the estimator diminishes faster than $n^{-1/2}$ and the variance of the estimator approaches the efficiency bound in larger samples (Figures 1 and 2). The empirical average of the canonical gradient is appropriately controlled (top right), and our selection criterion for the HAL tuning parameter appears to also satisfy the global criterion stipulated by Eq. (5). At all sample sizes, the sampling distribution of the scaled and centered estimator is well approximated by the efficient asymptotic distribution.

Figure 1: Simulation results for the treatment-specific mean parameter: (a) bias in absolute value, (b) variance, (c) mean-squared error (all scaled by $n^{1/2}$), (d) sampling distribution of the scaled and centered estimator, (e) sectional variation norm of the nuisance parameter, (f) empirical average of the quantity given in Eq. (5), (g) sample average of the efficient influence function, evaluated at the sample estimate.

Figure 2: Simulation results for the average density value parameter: (a) bias in absolute value, (b) variance, (c) mean-squared error (all scaled by $n^{1/2}$); (d) sampling distribution of the scaled and centered estimator, (e) sectional variation norm of the nuisance parameter, (f) empirical average of the quantity given in Eq. (5), (g) sample average of the efficient influence function, evaluated at the sample estimate, (h) $\sqrt{n}(P_n - P_0)(D_n - D_0)$. The dashed lines in the mean-squared error plots denote the efficiency bound. The reference sampling distribution for the estimators is a mean-zero normal distribution with this variance.

6.2 Simulations for the integral of the square of the density

We simulated a univariate variable $O \sim N(-4, 5/3)$ and evaluated the performance of undersmoothed HAL for estimating the integral of the square of the density of $O$ (Section 5). We implemented a HAL-based estimator of the density using an approach similar to the one described in [28]. This approach entails estimating a discrete hazard function with HAL over a pre-specified binning of the real line. For this simulation, we used 320 equidistant bins; the HAL density estimator is robust to this choice so long as a large enough number of bins is chosen. We sampled 1000 data sets for each of several sample sizes ranging from n = 100 to 100,000. We compare the results for undersmoothed HAL to those obtained by using a typical implementation of HAL that selects the level of smoothing based on cross-validation. We compared these estimators on the same criteria described in the previous subsection.
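The hazard-to-density conversion underlying this construction can be sketched as follows (hypothetical names; the hazard estimates themselves would come from a HAL, or any, binary regression on the long-format binned data).

```python
import numpy as np

def discrete_hazard_to_density(hazards, bin_edges):
    """hazards: (K,) estimated P(O in bin k | O >= left edge of bin k); bin_edges: (K+1,).

    For bin k, P(bin k) = h_k * prod_{j < k} (1 - h_j); dividing by the bin width gives
    a piecewise-constant density estimate.
    """
    survival = np.concatenate([[1.0], np.cumprod(1.0 - hazards)[:-1]])  # prod_{j<k}(1 - h_j)
    bin_probs = hazards * survival
    bin_widths = np.diff(bin_edges)
    return bin_probs / bin_widths
```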

The simulation results reflect what is expected based on theory. In particular, the undersmoothed HAL achieves the efficiency bound in large samples and the scaled and centered sampling distribution of the estimator is well approximated by the efficient asymptotic distribution. We found that our selection criterion for the level of undersmoothing based on the EIF led to control of the variation norm of the resultant fit. On the other hand, results for the HAL estimator with the level of smoothing selected via cross-validation demonstrated that this estimator does not have bias that decreases faster than $n^{-1/2}$. Thus, this estimator performs worse in terms of all criteria that we considered. The estimated cross-validated and undersmoothed function paths as well as the true function are illustrated in Figure 3.

Figure 3: A random realization of the simulation in Section 6.2 (n = 500).

7 Discussion

In this article we established that, for realistic and nonparametric statistical models, an overfitted zero-order spline HAL-MLE of a functional parameter of the data distribution results in efficient plug-in estimators of pathwise differentiable functionals of this functional parameter. The statistical model can be any model for which the parameter space of the functional parameter is a (cartesian product of a) subset of the set of multivariate cadlag functions with a universal bound on the sectional variation norm. The undersmoothing condition serves two purposes. Firstly, one wants to undersmooth so that solving the $L_1$-constrained scores $P_n S_h(Q_n)$ with $r(h, Q_n) = 0$ implies solving $P_n S_h(Q_n) = o_P(n^{-1/2})$, and thereby $P_n D_n^*(Q_n, G_0) = o_P(n^{-1/2})$ for the best score approximation $D_n^*(Q_n, G_0)$ of $D^*(Q_n, G_0)$. For that purpose we showed that it suffices to select the $L_1$-norm in the HAL-MLE large enough so that the set of basis functions with non-zero coefficients includes "sparse enough" basis functions, where "sparse enough" corresponds with assuming that the proportion of non-zero elements of the basis function (among the $n$ observations of this basis function) converges to zero at a rate faster than $n^{-1/3}(\log n)^{k_1}$. Alternatively, one controls the number $J_n$ of non-zero coefficients so that $J_n d_0(Q_n, Q_0) = o_P(n^{-1/2})$. The latter establishes that, from an asymptotic perspective, this condition will even be satisfied by the cross-validation selector. The second purpose is to undersmooth enough so that the approximation $D_n^*(Q_n, G_0)$ becomes a good enough approximation of $D^*(Q_n, G_0)$. If there is no nuisance parameter $G_0$, then one generally expects this to hold for the cross-validation selector. However, if there is a nuisance parameter $G_0$, then undersmoothing might be needed, since capturing the dependence on $G_0$ might require a more complex fit than is needed for $Q_0$ itself, so that the fit $Q_n$ needs to select extra basis functions beyond the ones needed to approximate $Q_0$. This shows that from an asymptotic point of view the need for undersmoothing appears minimal, but in practice (where the constant matters) we have observed that it is important.

The undersmoothing conditions presented are not parameter specific, so that such an undersmoothed HAL-MLE will be efficient for any of its smooth functionals. In addition, the undersmoothing of the HAL-MLE does not change its rate of convergence relative to the HAL-MLE optimally tuned with cross-validation, as long as the selected L 1-norm remains uniformly bounded, suggesting that it is still a good estimator of the true functional parameter.

On the other hand, a typical TMLE targeting one particular target parameter will generally only be asymptotically efficient for that particular target parameter, and not even asymptotically linear for other smooth functionals, even if it uses as initial estimator the HAL-MLE tuned with cross-validation. Therefore it appears to be an interesting topic to better understand the sampling distribution of the undersmoothed HAL-MLE in an asymptotic sense and in relation to the sampling distribution of a TMLE using an optimally smoothed (i.e., cross-validated) HAL-MLE as initial estimator. Note, however, that if the TMLE uses an undersmoothed HAL-MLE as initial estimator, then the TMLE step should result in small changes, thereby mostly preserving the behavior of the undersmoothed HAL-MLE. Therefore, the latter type of TMLE might be recommended, inheriting the good global behavior of the HAL-MLE, also in light of the recent work on higher order TMLE [29].

It is also of interest to observe that the second order remainder of the HAL-MLE for a pathwise differentiable functional appears either to be driven by the square of the $L_2(P_0)$-norm of the difference between the HAL-MLE and the true functional parameter, or, in the case that the efficient influence curve has a nuisance parameter $G$, to also (or only) involve a product of the difference of the HAL-MLE $Q_n$ from its true counterpart $Q_0$ and the difference of $G_0$ from its projection $G_{0,n}$ onto the linear space of basis functions selected by the undersmoothed HAL-MLE $Q_n$. Since $G_{0,n}$ is a type of oracle estimator of $G_0$, this suggests that in a model in which our knowledge of $G_0$ is not any better than our knowledge of $Q_0$, this HAL-MLE has a good second order remainder that might generally be smaller than what it would be for a TMLE that estimates $G_0$ with an actual estimator such as the HAL-MLE.

On the other hand, if the statistical model involves particularly strong knowledge about the nuisance parameter $G_0$, then a TMLE can fully utilize this model for $G_0$ and thereby obtain a better behaved second order remainder than the one for the overfitted HAL-MLE. One also suspects that a TMLE will be more sensitive to lack of positivity for the target parameter than the undersmoothed HAL-MLE. Therefore, we conjecture that an undersmoothed HAL-MLE might be the preferred estimator in models in which estimation of $G_0$ is as hard as estimation of $Q_0$ and lack of positivity is a serious issue, while an HAL-TMLE might be the preferred estimator when estimation of $G_0$ is easier than estimation of $Q_0$. These are not formal statements, but they indicate a qualitative comparison between the undersmoothed HAL-MLE and a HAL-TMLE using an estimator (HAL-MLE) $G_n$ of $G_0$.

However, the above comparison has an additional twist of interest in favor of the HAL-MLE. That is, if $G_0$ happens to be a function with relatively small variation norm, unknown to the analyst, then we will have much faster convergence of $G_{0,n}$ to $G_0$ than if the true $G_0$ is very complex. As such, the undersmoothed HAL-MLE will have a remainder involving a very fast converging $G_{0,n}$, possibly faster than the estimator $G_n$ used by a TMLE utilizing this simple model. Thus, the HAL-MLE is able to adapt to underlying (unknown) smoothness of $G_0$, making it even less obvious that a TMLE utilizing knowledge about $G_0$ will do any better. All of this strongly suggests that the TMLE should use an undersmoothed HAL-MLE as initial estimator and make sure that the targeting step does not destroy the score equations already solved by the HAL-MLE. We will address the latter in a future article.

In future research we will address the comparison between the undersmoothed HAL-MLE and the HAL-TMLE in realistic simulations and by formal comparison of their second order remainders (some of which is already shown in [29]). Specifically, in a subsequent article we will marry the TMLE with the HAL-MLE by defining a targeted HAL-MLE that minimizes the empirical risk over the linear span of basis functions (approximating the true cadlag function with finite sectional variation norm) under the $L_1$-constraint and under the constraint that the Euclidean norm of the empirical mean of the efficient influence curve at the HAL-MLE (as well as at an estimator $G_n$) is $o_P(n^{-1/2})$. We will show that undersmoothing this targeted HAL-MLE results in an estimator that is still efficient across all smooth functionals, while it is able to fully exploit all knowledge about $G_0$ for the sake of the specific target parameter. Moving forward, it will also be critically important to compare approaches for building confidence intervals and performing hypothesis tests.

A key advantage of a TMLE is that it can utilize any super-learner, so that its library can include many other powerful machine learning algorithms, including many variations of the HAL-MLE. In this manner a TMLE using a powerful super-learner might compensate for the favorable property of an undersmoothed HAL-MLE with respect to the size of the second order remainder. In another future article we will provide a method that marries a powerful super-learner with the HAL-MLE, by using the super-learner as a dimension reduction and applying the HAL-MLE as the meta-learning step in an ensemble learner. We will show that an undersmoothed HAL-MLE in this meta-learning step will again result in an estimator that is efficient, and possibly super-efficient, for any of its smooth functionals. By actually using a targeted HAL-MLE as the meta-learning step, we might end up with an estimator that is able to fully exploit the strengths of super-learning, the undersmoothed HAL-MLE, and TMLE using a good estimator of $G_0$, combined in one method.

Undersmoothing of HAL-MLE can also be applied to nuisance parameters such as an HAL-MLE of the censoring and treatment mechanism in an inverse probability of treatment and censoring weighted (IPTCW) estimator. By undersmoothing the HAL-MLE G n of the censoring and treatment mechanism G 0, smooth functionals of G n become asymptotically efficient just as shown in this article for undersmoothed Q n . An analysis of an IPTCW estimator precisely relies on showing that a smooth functional of G n is asymptotically linear. Therefore, in this manner we can show that an IPTCW-estimator that uses an undersmoothed HAL-MLE for estimation of the censoring and treatment mechanism is regular and asymptotically linear and even efficient if the full data model is saturated [30].
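For concreteness, an IPTW estimator of the treatment-specific mean based on an estimate $\bar{G}_n$ of the treatment mechanism has the simple form sketched below (names hypothetical; shown only to make the estimator referenced in the paragraph above concrete).

```python
import numpy as np

def iptw_treatment_specific_mean(Y, A, Gbar_n):
    """IPTW estimate P_n[A Y / Gbar_n(W)] of the treatment-specific mean."""
    return float(np.mean(A * Y / Gbar_n))
```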

Finally, we refer to our accompanying technical report [31] that presents a generalization of this highly adaptive lasso estimator to minimizers of empirical risk over smoothness classes that are spanned by the higher order spline basis functions. The current HAL-MLE corresponds with zero-order spline basis functions. The order of the spline can be selected with cross-validation resulting in an HAL-MLE that also adapts to underlying smoothness. We plan to publish this part in a later article.


Corresponding author: Mark J. van der Laan, Division of Biostatistics, University of California, Berkeley, USA, E-mail:

Award Identifier / Grant number: 5R01AI074345-09

  1. Author contribution: All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.

  2. Research funding: This work was supported by the National Institute of Allergy and Infectious Diseases (grant number 5R01AI074345-09).

  3. Conflict of interest statement: The authors declare no conflicts of interest regarding this article.

Appendix A: Proof of Theorem 1 and Lemma 1

The HAL-MLE has the form $Q_n = \sum_{j \in J(\mu_n)} \beta_n(j)\phi_j$ for a finite collection of basis functions. A basis function is of the form $\phi_j(X) = I(X \geq x_j)$ for a knot point $x_j \in [0, \tau]$, and if the components of $x_j$ in the complement of a subset $s \subset \{1, \ldots, k\}$ are equal to zero, then this indicator reduces to an indicator $I(X(s) \geq x_j(s))$. We also know that $\sum_j |\beta_n(j)| \leq C_n$ for the selected $L_1$-bound $C_n$ (typically the $L_1$-norm will be equal to $C_n$). We have that

$$\beta_n = \arg\min_{\beta:\, \sum_j |\beta(j)| \leq C_n} P_n L\left(\sum_{j \in J(\mu_n)} \beta(j)\phi_j\right).$$
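As a hedged illustration of this representation (not the authors' software), the sketch below constructs the zero-order spline (indicator) basis at a set of knot points and evaluates the $L_1$-norm that the bound $C_n$ constrains.

```python
import numpy as np

def zero_order_basis(X, knots):
    """X: (n, k) data; knots: (J, k) knot points. Returns the (n, J) indicator basis
    phi_j(X_i) = I(X_i >= x_j) (componentwise)."""
    return np.all(X[:, None, :] >= knots[None, :, :], axis=2).astype(float)

def l1_norm(beta):
    """L1 norm of the coefficient vector, which the HAL bound C_n constrains."""
    return float(np.sum(np.abs(beta)))
```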

Consider paths (1 + ϵh(j))β n (j) for a bounded vector h, which yields a collection of scores

$$S_h(Q_n) = \frac{d}{dQ_n} L(Q_n)\left(\sum_{j \in J(\mu_n)} h(j)\beta_n(j)\phi_j\right).$$

Let $r(h, Q_n) = \sum_{j \in J(\mu_n)} h(j)|\beta_n(j)|$. If $r(h, Q_n) = 0$, then for $\epsilon$ small enough,

$$\sum_{j \in J(\mu_n)} |(1 + \epsilon h(j))\beta_n(j)| = \sum_{j \in J(\mu_n)} (1 + \epsilon h(j))|\beta_n(j)| = \sum_{j \in J(\mu_n)} |\beta_n(j)| + \epsilon\, r(h, Q_n) = \sum_{j \in J(\mu_n)} |\beta_n(j)|.$$

Thus, by $\beta_n$ being an MLE, $P_n S_h(Q_n) = 0$ for any $h$ satisfying $r(h, Q_n) = 0$. Let $h^* = h_n^*$ be chosen so that $P_n S_{h_n^*}(Q_n) = P_n D_n^*(Q_n, G_0)$ for the approximation $D_n^*(Q_n, G_0)$ of $D^*(Q_n, G_0)$ specified in the theorem. We want to show that $P_n D_n^*(Q_n, G_0) = o_P(n^{-1/2})$, i.e., $P_n S_{h_n^*}(Q_n) = o_P(n^{-1/2})$. Let $j^*$ be a particular choice in our finite index set $J_n$ satisfying $\beta_n(j^*) \neq 0$, which we can specify later to minimize the bound. Let $\tilde{h}$ be defined by $\tilde{h}(j) = h^*(j)$ for $j \neq j^*$, and let $\tilde{h}(j^*)$ be defined by $r(\tilde{h}, Q_n) = \sum_{j \in J(\mu_n)} \tilde{h}(j)|\beta_n(j)| = 0$, so that we know $P_n S_{\tilde{h}}(Q_n) = 0$. Thus,

$$\sum_{j \neq j^*} h^*(j)|\beta_n(j)| + \tilde{h}(j^*)|\beta_n(j^*)| = 0.$$

This gives

$$\tilde{h}(j^*) = -\frac{\sum_{j \neq j^*} h^*(j)|\beta_n(j)|}{|\beta_n(j^*)|}.$$

So

$$\sum_j (\tilde{h} - h^*)(j)\beta_n(j)\phi_j = (\tilde{h} - h^*)(j^*)\beta_n(j^*)\phi_{j^*} = \left(-\frac{\sum_{j \neq j^*} h^*(j)|\beta_n(j)|}{|\beta_n(j^*)|}\,\beta_n(j^*) - h^*(j^*)\beta_n(j^*)\right)\phi_{j^*} \equiv c_n(j^*)\phi_{j^*},$$

where

$$c_n(j^*) = -\frac{\sum_{j \neq j^*} h^*(j)|\beta_n(j)|}{|\beta_n(j^*)|}\,\beta_n(j^*) - h^*(j^*)\beta_n(j^*).$$

We note that $|c_n(j^*)|$ is bounded by $\sum_j |h^*(j)||\beta_n(j)|$, so we can bound it by $\|h^*\|_\infty C_n$. Thus, under the assumption that $\|h_n^*\|_\infty = O_P(1)$, we have that $c_n(j^*) = O_P(1)$.

For this choice $\tilde{h}$, let us compute $P_n S_{\tilde{h}}(Q_n) - P_n S_{h^*}(Q_n)$ (which equals $-P_n S_{h^*}(Q_n)$):

$$P_n S_{\tilde{h}}(Q_n) - P_n S_{h^*}(Q_n) = P_n \frac{d}{dQ_n} L(Q_n)\left(\sum_j (\tilde{h} - h^*)(j)\beta_n(j)\phi_j\right) = P_n \frac{d}{dQ_n} L(Q_n)\big(c_n(j^*)\phi_{j^*}\big) = c_n(j^*)\, P_n \frac{d}{dQ_n} L(Q_n)(\phi_{j^*}) = O_P\left(P_n \frac{d}{dQ_n} L(Q_n)(\phi_{j^*})\right).$$

Therefore, our undersmoothing condition is that

(14) $$\min_{j \in J(\mu_n):\, \beta_n(j) \neq 0} \left| P_n \frac{d}{dQ_n} L(Q_n)(\phi_j) \right| = o_P(n^{-1/2}).$$

Under this condition we have $P_n S_{\tilde{h}}(Q_n) - P_n D_n^*(Q_n, G_0) = o_P(n^{-1/2})$, but, since $P_n S_{\tilde{h}}(Q_n) = 0$, this implies the desired conclusion $P_n D_n^*(Q_n, G_0) = o_P(n^{-1/2})$. This proves the first statement of Theorem 1.
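To make condition (14) concrete, the following sketch evaluates it for the special case of the squared-error loss $L(Q)(O) = (Y - Q(X))^2$, for which the directional derivative is $\frac{d}{dQ} L(Q)(\phi) = -2(Y - Q(X))\phi(X)$; the names are hypothetical.

```python
import numpy as np

def min_abs_empirical_score(Y, Q_X, basis_matrix, beta):
    """Smallest |P_n d/dQ L(Q_n)(phi_j)| over active basis functions, squared-error loss."""
    residual = Y - Q_X                                   # (n,)
    active = np.abs(beta) > 0
    scores = -2.0 * (basis_matrix[:, active] * residual[:, None]).mean(axis=0)
    return float(np.min(np.abs(scores)))
```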

Let now $j^* = \arg\min_{j \in J(\mu_n)} P_0 \phi_j$. To understand $P_n \frac{d}{dQ_n} L(Q_n)(\phi_{j^*})$ we can proceed as follows:

$$P_n \frac{d}{dQ_n} L(Q_n)(\phi_{j^*}) = (P_n - P_0)\frac{d}{dQ_n} L(Q_n)(\phi_{j^*}) + P_0 \frac{d}{dQ_n} L(Q_n)(\phi_{j^*}).$$

Let $S_j(Q_n) \equiv \frac{d}{dQ_n} L(Q_n)(\phi_j)$. Suppose that $P_0 S_{j^*}(Q_n)^2 \rightarrow_p 0$, which will generally hold whenever $P_0 \phi_{j^*} = o_P(1)$. We also have that $\{S_j(Q): Q \in \mathcal{Q}, j\}$ is contained in the class of cadlag functions with uniformly bounded sectional variation norm, which is a Donsker class. Thereby, by asymptotic equicontinuity of the empirical process indexed by a Donsker class, we have $(P_n - P_0) S_{j^*}(Q_n) = o_P(n^{-1/2})$. Thus, it remains to show that $P_0 S_{j^*}(Q_n) = o_P(n^{-1/2})$. We now note that

$$P_0 S_{j^*}(Q_n) = P_0\{S_{j^*}(Q_n) - S_{j^*}(Q_0)\} + P_0 S_{j^*}(Q_0),$$

but $P_0 S_j(Q_0) = 0$ for all $j$, since $Q_0 = \arg\min_Q P_0 L(Q)$. Therefore, $P_n \frac{d}{dQ_n} L(Q_n)(\phi_{j^*}) = o_P(n^{-1/2})$ if

(15) $$P_0\{S_{j^*}(Q_n) - S_{j^*}(Q_0)\} = o_P(n^{-1/2}).$$

This proves the second statement of Theorem 1. The third statement is a trivial implication, which completes the proof of Theorem 1. □

Proof of Lemma 1

Consider the special case that $O = (Z, X)$, $L(Q)(O)$ depends on $Q$ through $Q(X)$ only, and $\frac{d}{dQ} L(Q)(\phi) = \frac{d}{dQ} L(Q) \times \phi$, i.e., the directional derivative $\frac{d}{d\epsilon} L(Q + \epsilon\phi)\big|_{\epsilon=0}$ of $L(\cdot)$ at $Q$ in the direction $\phi$ is just multiplication of a function $\frac{d}{dQ} L(Q)$ of $O$ with $\phi(X)$. In that case, we have that (15) reduces to

(16) $$P_0\left\{\left(\frac{d}{dQ_n} L(Q_n) - \frac{d}{dQ_0} L(Q_0)\right)\phi_{j^*}\right\} = o_P(n^{-1/2}).$$

We assume

$$\left\| \frac{d}{dQ_n} L(Q_n) - \frac{d}{dQ_0} L(Q_0) \right\| = O\big(\|Q_n - Q_0\|\big).$$

Then, (16) reduces to

$$\|Q_n - Q_0\|\, P_0 \phi_{j^*} = o_P(n^{-1/2}).$$

This teaches us that the critical condition (14) holds if

$$\min_{j \in J(\mu_n):\, \beta_n(j) \neq 0} P_0 \phi_j = O_P(n^{-1/2}),$$

and that for this choice of $j^*$ we have $P_0 S_{j^*}(Q_n)^2 \rightarrow_p 0$. The latter holds if $\min_j P_0 \phi_j = o_P(1)$, since $\frac{d}{dQ_n} L(Q_n)$ is uniformly bounded. Finally, since $P_0 \phi_j = O(P_0(X \geq x_j))$ and $\sup_j |(P_n - P_0)\phi_j| = O_P(n^{-1/2})$, we can replace $P_0 \phi_j$ by $P_n \phi_j$ in the condition. This proves Lemma 1. □

Appendix B: Proof of Theorem 2

Let $G_{0n}$ be an approximation of $G_0$, and let $D^*(Q_n, G_{0n})$ be the approximation of $D^*(Q_n, G_0)$ in the space of scores $\mathcal{S}(Q_n)$. We have the following general theorem, which proves the first part of Theorem 2.

Theorem 5

Consider the HAL-MLE $Q_n$ with $C = C^u$ or $C = C_n$. Assume $M_1, M_{20} < \infty$. We have $d_0(Q_n, Q_0) = O_P(n^{-1/2 - \alpha(k_1)})$. Assume also that for a given approximation $G_{0n} \in \mathcal{G}$ of $G_0$ we have

(17) $$P_n D^*(Q_n, G_{0n}) = o_P(n^{-1/2}),$$

and that:

  1. $R_2((Q_n, G_{0n}), (Q_0, G_0)) = o_P(n^{-1/2})$ and $P_0\{D^*(Q_n, G_{0n}) - D^*(Q_0, G_0)\}^2 \rightarrow_p 0$.

  2. $\{D^*(Q, G): Q \in \mathcal{Q}, G \in \mathcal{G}\}$ is contained in the class of $k_1$-variate cadlag functions on a cube $[0, \tau_o] \subset \mathbb{R}^{k_1}$ in a Euclidean space, and $\sup_{Q \in \mathcal{Q}, G \in \mathcal{G}} \|D^*(Q, G)\|_v^* < \infty$.

Then Ψ(Q n ) is asymptotically efficient at P 0.

Proof

The exact second order expansion at G 0n of the target parameter Ψ yields

$$\Psi(Q_n) - \Psi(Q_0) = (P_n - P_0)D^*(Q_n, G_{0n}) - P_n D^*(Q_n, G_{0n}) + R_2((Q_n, G_{0n}), (Q_0, G_0)).$$

Given that $d_0(Q_n, Q_0) = O_P(n^{-1/2 - \alpha(k_1)})$, and that $G_{0n}$ is presumably at least as good an approximation of $G_0$, it is reasonable to assume $R_2((Q_n, G_{0n}), (Q_0, G_0)) = o_P(n^{-1/2})$ and $P_0\{D^*(Q_n, G_{0n}) - D^*(Q_0, G_0)\}^2 \rightarrow_p 0$. We also assume that $\{D^*(Q, G): Q \in \mathcal{Q}, G \in \mathcal{G}\}$ is contained in the class of $k_1$-variate cadlag functions on a cube $[0, \tau_o] \subset \mathbb{R}^{k_1}$ in a Euclidean space and that $\sup_{Q \in \mathcal{Q}, G \in \mathcal{G}} \|D^*(Q, G)\|_v^* < \infty$. This essentially states that the sectional variation norm of $D^*(Q, G)$ can be bounded in terms of the sectional variation norms of $Q$ and $G$, which will naturally hold under a strong positivity assumption that bounds denominators away from zero. Since the class of cadlag functions on $[0, \tau_o]$ with sectional variation norm bounded by a universal constant is a Donsker class, empirical process theory yields:

$$\Psi(Q_n) - \Psi(Q_0) = (P_n - P_0)D^*(Q_0, G_0) - P_n D^*(Q_n, G_{0n}) + o_P(n^{-1/2}).\ \square$$

This theorem can easily be generalized to a more general approximation $D_n^*(Q_n, G_0) \in \mathcal{S}(Q_n)$ of $D^*(Q_n, G_0)$ (not necessarily of the form $D_n^*(Q_n, G_0) = D^*(Q_n, G_{0n})$ for some $G_{0n}$).

Theorem 6

Consider the HAL-MLE $Q_n$ with $C = C^u$ or $C = C_n$. Assume $M_1, M_{20} < \infty$. We have $d_0(Q_n, Q_0) = O_P(n^{-1/2 - \alpha(k_1)})$. Assume also that for a given approximation $D_n^*(Q_n, G_0)$ we have $P_n D_n^*(Q_n, G_0) = o_P(n^{-1/2})$. In addition, assume:

  1. $R_2((Q_n, G_0), (Q_0, G_0)) = o_P(n^{-1/2})$, $P_0\{D_n^*(Q_n, G_0) - D^*(Q_n, G_0)\} = o_P(n^{-1/2})$, and $P_0\{D_n^*(Q_n, G_0) - D^*(Q_0, G_0)\}^2 \rightarrow_p 0$.

  2. $\{D_n^*(Q, G_0), D^*(Q, G_0): Q \in \mathcal{Q}\}$ is contained in the class of $k_1$-variate cadlag functions on a cube $[0, \tau_o] \subset \mathbb{R}^{k_1}$ in a Euclidean space, and $\sup_{Q \in \mathcal{Q}} \max\big(\|D^*(Q, G_0)\|_v^*, \|D_n^*(Q, G_0)\|_v^*\big) < \infty$.

Then Ψ(Q n ) is asymptotically efficient at P 0.

Therefore, in order to prove Theorem 2, it remains to establish the condition under which (17) holds, which was done in Appendix A above.

B.1 General proof of efficient score equation condition at G 0

This subsection can be skipped for the purpose of proving Theorem 2, but the following result fits here.

Lemma 3

Under the conditions of Theorem 5, if $P_n D^*(Q_n, G_{0n}) = o_P(n^{-1/2})$, then also $P_n D^*(Q_n, G_0) = o_P(n^{-1/2})$. Under the conditions of Theorem 6, if $P_n D_n^*(Q_n, G_0) = o_P(n^{-1/2})$, then also $P_n D^*(Q_n, G_0) = o_P(n^{-1/2})$.

Proof

Firstly, we have

$$P_n D^*(Q_n, G_0) = P_n D_n^*(Q_n, G_0) + P_n\{D^*(Q_n, G_0) - D_n^*(Q_n, G_0)\} = P_n\{D^*(Q_n, G_0) - D_n^*(Q_n, G_0)\} + o_P(n^{-1/2}).$$

In addition, we have

$$P_n\{D^*(Q_n, G_0) - D_n^*(Q_n, G_0)\} = (P_n - P_0)\{D^*(Q_n, G_0) - D_n^*(Q_n, G_0)\} + P_0\{D^*(Q_n, G_0) - D_n^*(Q_n, G_0)\} = o_P(n^{-1/2}) + P_0\{D^*(Q_n, G_0) - D_n^*(Q_n, G_0)\},$$

since $\sup_{Q \in \mathcal{Q}(M)} \max\big(\|D^*(Q, G_0)\|_v^*, \|D_n^*(Q, G_0)\|_v^*\big) < \infty$ and $P_0\{D_n^*(Q_n, G_0) - D^*(Q_n, G_0)\}^2 \rightarrow_p 0$. If $D_n^*(Q_n, G_0) = D^*(Q_n, G_{0n})$, then the first assumption holds if $\sup_{P \in \mathcal{M}} \|D^*(P)\|_v^* < \infty$.

To understand the last term, consider the case that $D_n^*(Q_n, G_0) = D^*(Q_n, G_{0n})$. By the exact second order expansion $\Psi(Q_n) - \Psi(Q_0) = -P_0 D^*(Q_n, G) + R_{20}(Q_n, G, Q_0, G_0)$ for all $G$, we have

$$P_0\{D^*(Q_n, G_0) - D^*(Q_n, G_{0n})\} = R_{20}(Q_n, G_0, Q_0, G_0) - R_{20}(Q_n, G_{0n}, Q_0, G_0).$$

In our general Theorem 5 we assumed $R_{20}(Q_n, G_{0n}, Q_0, G_0) = o_P(n^{-1/2})$, which certainly implies $R_{20}(Q_n, G_0, Q_0, G_0) = o_P(n^{-1/2})$ (this remainder actually equals zero in double robust problems). This then establishes that

$$P_n D^*(Q_n, G_0) = o_P(n^{-1/2}).$$

For a general $D_n^*(Q_n, G_0)$, Theorem 6 simply assumed $P_0\{D^*(Q_n, G_0) - D_n^*(Q_n, G_0)\} = o_P(n^{-1/2})$. □

References

1. Bickel, PJ, Klaassen, CAJ, Ritov, Y, Wellner, J. Efficient and adaptive estimation for semiparametric models. Berlin, Heidelberg, New York: Springer; 1997.

2. Newey, W. The asymptotic variance of semiparametric estimators. Econometrica 1994;62:1349–82. https://doi.org/10.2307/2951752.

3. van der Laan, MJ. Causal effect models for intention to treat and realistic individualized treatment rules. In: Technical report 203. Berkeley: Division of Biostatistics, University of California; 2006. https://doi.org/10.2202/1557-4679.1022.

4. van der Vaart, AW. Asymptotic statistics. Cambridge: Cambridge University Press; 1998. https://doi.org/10.1017/CBO9780511802256.

5. Shen, X. On methods of sieves and penalization. Ann Stat 1997;25:2555–91. https://doi.org/10.1214/aos/1030741085.

6. Shen, X. Large sample sieve estimation of semiparametric models. In: Handbook of econometrics, chapter 76; 2007. https://doi.org/10.1016/S1573-4412(07)06076-X.

7. Giné, E, Nickl, R. A simple adaptive estimator of the integrated square of a density. Bernoulli 2008;14:47–61. https://doi.org/10.3150/07-bej110.

8. Hahn, J. On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica 1998;66:315–31. https://doi.org/10.2307/2998560.

9. Newey, WK, Robins, JM. Cross-fitting and fast remainder rates for semiparametric estimation. arXiv preprint arXiv:1801.09138; 2018. https://doi.org/10.1920/wp.cem.2017.4117.

10. Newey, WK, Hsieh, F, Robins, JM. Twicing kernels and a small bias property of semiparametric estimators. Econometrica 2004;72:947–62. https://doi.org/10.1111/j.1468-0262.2004.00518.x.

11. Robins, JM, Rotnitzky, A. Recovery of information and adjustment for dependent censoring using surrogate markers. In: AIDS epidemiology. Basel: Birkhäuser; 1992. https://doi.org/10.1007/978-1-4757-1229-2_14.

12. van der Laan, MJ, Robins, JM. Unified methods for censored longitudinal data and causality. Berlin, Heidelberg, New York: Springer; 2003. https://doi.org/10.1007/978-0-387-21700-0.

13. van der Laan, MJ. Estimation based on case-control designs with known prevalence probability. Int J Biostat 2008;4:17. https://doi.org/10.2202/1557-4679.1114.

14. van der Laan, MJ, Gruber, S. One-step targeted minimum loss-based estimation based on universal least favorable one-dimensional submodels. Int J Biostat 2016;12:351–78. https://doi.org/10.1515/ijb-2015-0054.

15. van der Laan, MJ, Rose, S. Targeted learning: causal inference for observational and experimental data. Berlin, Heidelberg, New York: Springer; 2011. https://doi.org/10.1007/978-1-4419-9782-1.

16. van der Laan, MJ, Rubin, DB. Targeted maximum likelihood learning. Int J Biostat 2006;2:11. https://doi.org/10.2202/1557-4679.1043.

17. Benkeser, D, van der Laan, MJ. The highly adaptive lasso estimator. In: 2016 IEEE international conference on data science and advanced analytics (DSAA). Montreal, QC, Canada: IEEE; 2016:689–96 pp. https://doi.org/10.1109/DSAA.2016.93.

18. van der Laan, MJ. A generally efficient targeted minimum loss-based estimator. In: Technical report 300. Berkeley: University of California; 2015. Published in Int J Biostat 2017. http://biostats.bepress.com/ucbbiostat/paper343. https://doi.org/10.1515/ijb-2015-0097.

19. Gill, RD, van der Laan, MJ, Wellner, JA. Inefficient estimators of the bivariate survival function for three models. Ann Inst Henri Poincaré 1995;31:545–97.

20. Polley, EC, Rose, S, van der Laan, MJ. Super learner. In: van der Laan, MJ, Rose, S, editors. Targeted learning: causal inference for observational and experimental data. New York, Dordrecht, Heidelberg, London: Springer; 2011. https://doi.org/10.1007/978-1-4419-9782-1_3.

21. van der Laan, MJ, Dudoit, S. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: finite sample oracle inequalities and examples. In: Technical report 130. Berkeley: Division of Biostatistics, University of California; 2003.

22. van der Laan, MJ, Dudoit, S, van der Vaart, AW. The cross-validated adaptive epsilon-net estimator. Stat Decis 2006;24:373–95. https://doi.org/10.1524/stnd.2006.24.3.373.

23. van der Laan, MJ, Polley, EC, Hubbard, AE. Super learner. Stat Appl Genet Mol 2007;6:25. https://doi.org/10.2202/1544-6115.1309.

24. van der Vaart, AW, Dudoit, S, van der Laan, MJ. Oracle inequalities for multi-fold cross-validation. Stat Decis 2006;24:351–71. https://doi.org/10.1524/stnd.2006.24.3.351.

25. Bibaut, A, van der Laan, MJ. Fast rates for empirical risk minimization over cadlag functions with bounded sectional variation norm. In: Technical report. Berkeley: Division of Biostatistics, University of California; 2019.

26. van der Laan, MJ, Bibaut, A. Uniform consistency of the highly adaptive lasso estimator of infinite dimensional parameters. In: Technical report, arXiv:1709.06256. Berkeley: Division of Biostatistics, University of California; 2017.

27. Cai, W, van der Laan, MJ. Nonparametric bootstrap inference for the targeted highly adaptive least absolute shrinkage and selection operator (lasso) estimator. Int J Biostat 2020;16:20170070. https://doi.org/10.1515/ijb-2017-0070.

28. Diaz Munoz, I, van der Laan, MJ. Super learner based conditional density estimation with application to marginal structural models. Int J Biostat 2011;7:1–20. https://doi.org/10.2202/1557-4679.1356.

29. van der Laan, MJ, Wang, Z, van der Laan, LWP. Higher order targeted maximum likelihood estimation; 2021.

30. Ertefaie, A, Hejazi, NS, van der Laan, MJ. Nonparametric inverse probability weighted estimators based on the highly adaptive lasso; 2020.

31. van der Laan, MJ, Benkeser, D, Cai, W. Efficient estimation of pathwise differentiable target parameters with the undersmoothed highly adaptive lasso; 2019.

Received: 2019-08-15
Revised: 2022-04-26
Accepted: 2022-05-09
Published Online: 2022-07-15

© 2022 Walter de Gruyter GmbH, Berlin/Boston
