1 Introduction
Targeted learning [1–3] is a subfield of statistics concerned with the development of asymptotically efficient substitution estimators of specific target parameters of the data distribution, across possible data distributions within a realistic statistical model. By necessity any such procedure will have to integrate the state of the art in data adaptive estimation, but will also have to target such data adaptive estimators of relevant parts of the data distribution so that they are minimally biased for the target parameter.
Efficiency theory [4] and empirical process theory [5] for general statistical models provide the foundation for the construction of such targeted machine learning algorithms. The canonical gradient of the pathwise derivative of the target parameter mapping characterizes asymptotic efficiency: an estimator is asymptotically efficient with respect to the assumed statistical model if and only if it is asymptotically linear with influence curve equal to the canonical gradient, which is the reason that the canonical gradient is also called the efficient influence curve. The construction of an efficient estimator of a pathwise differentiable target parameter will thereby naturally involve this canonical gradient. The one-step estimator (e.g., Ref. [4]) is such a general method; it adds to an initial estimator of the target parameter the empirical mean of the estimated efficient influence curve. Estimating equation methodology [6, 7] represents a related approach that assumes that the efficient influence curve can be represented as an estimating function in the target parameter and a nuisance parameter, and defines the estimator as the solution of the resulting estimating equation. These procedures do not result in substitution estimators and can thereby lack finite sample robustness.
The targeted maximum likelihood estimator (TMLE) is a two-stage estimator obtained by constructing a parametric submodel through an initial estimator of the data distribution whose score, at zero fluctuation of the initial estimator, spans the efficient influence curve, and iteratively maximizing the corresponding parametric likelihood until no further updates occur. At that point the updated initial estimator solves the so-called efficient influence curve equation [3]. The TMLE of the target parameter is the corresponding plug-in estimator. The fact that the targeted estimator of the data distribution solves the efficient influence curve equation provides the basis for establishing the asymptotic efficiency of the TMLE under regularity conditions, beyond the crucial condition that the initial estimator is within a neighborhood (e.g., of size n^{−1/4}) of the true data distribution. To minimize the degree of violation of this crucial rate-of-convergence condition on the initial estimator, we have proposed to construct the initial estimator with the ensemble super-learner template, fully utilizing the power and generality of cross-validation [8–12] while integrating the state of the art in machine learning. This super-learner has been proven to be optimal in the sense that it performs asymptotically as well as the best weighted combination of the candidate estimators in its library.
The parametric submodel through the initial estimator with a score that spans the efficient influence curve that is used in the TMLE procedure is called least favorable because it is the parametric submodel that maximizes the asymptotic variance of the submodel-specific maximum likelihood estimator of the target parameter under sampling from the initial estimator. In this article, we point out that this least favorable parametric submodel can also be interpreted as the submodel that maximizes the absolute infinitesimal change in target parameter (relative to initial estimator) divided by the information-norm of the infinitesimal change in probability distribution (relative to initial estimator). This provides a nice intuition about the targeted maximum likelihood step in TMLE as a fitting procedure that locally maximizes the change in target parameter per unit amount of fitting as measured by unit of information. However, it also shows that this choice of submodel is tailored to be optimal locally around the initial estimator, so that its ability to indeed provide maximal change in the target parameter per unit of information relies on the initial estimator being close enough to the true probability distribution.
This motivates us in this article to define and construct a one-dimensional universal least favorable submodel whose score equals the efficient influence curve at each of its parameter values, not just at 0. We show that such a universal least favorable submodel makes the targeted maximum likelihood estimator perform the desired job in one step, with minimal additional fitting of the data. As a consequence, it maximally preserves the statistical performance of the initial estimator, while achieving the desired targeted bias reduction. In particular, this universal least favorable submodel avoids the need for iterative targeted maximum likelihood estimation, and thereby possible overfitting in finite samples. It also provides the basis for various generalizations as needed for targeted minimum loss-based estimation of a possibly multivariate target parameter. Examples in the current literature in which the TMLE converged in one step happened to already use a universal least favorable submodel.
2 Statistical formulation of the goal and result of this article
Let O_1, …, O_n be n independent and identically distributed copies of a random variable O ∼ P_0 with probability distribution P_0 that is known to be an element of a set m of possible probability distributions. We refer to m as the statistical model for the true data distribution P_0. Let Ψ: m → IR^d be a d-dimensional target parameter mapping, so that ψ_0 = Ψ(P_0) represents the target parameter or estimand of interest that best approximates the answer to the question of interest. We assume that Ψ is pathwise differentiable at each P ∈ m with canonical gradient D∗(P). That is, for each path {P_{ϵ,h}: ϵ} through P at ϵ = 0 with score S_h, indexed by h in some index set H, we have
$$\frac{d}{d\epsilon}\Psi(P_{\epsilon,h})\Big|_{\epsilon=0} = P\,D^*(P)\,S_h,$$
where Pf = ∫ f(o) dP(o) denotes the expectation operator w.r.t. P. D∗(P) is the unique gradient that is also an element of the so-called tangent space T(P), defined as the closure of the linear span of all scores {S_h: h ∈ H} in the Hilbert space L^2_0(P) of functions of O with mean zero under P, endowed with the inner product ⟨S_1, S_2⟩ = P S_1 S_2.
An estimator of ψ_0 is a mapping Ψˆ that maps the empirical probability distribution P_n of O_1, …, O_n into the parameter space Ψ(m) ⊂ IR^d, and the corresponding estimate of ψ_0 is given by ψ_n = Ψˆ(P_n). An estimator Ψˆ(P_n) is asymptotically efficient at P_0 if and only if it is asymptotically linear with influence curve equal to the canonical gradient D∗(P_0):
$$\hat{\Psi}(P_n) - \Psi(P_0) = (P_n - P_0)\,D^*(P_0) + o_P(1/\sqrt{n}).$$
Such an estimator satisfies, by the CLT, that √n(ψ_n − ψ_0) converges in distribution to N(0, Σ_0), with Σ_0 = P_0{D∗(P_0)D∗(P_0)^⊤}, so that statistical inference can be based on an estimator of its influence curve D∗(P_0). The canonical gradient D∗(P_0) of Ψ: m → IR^d is also called the efficient influence curve.
A targeted maximum likelihood estimator (TMLE) is defined as follows. One first constructs an initial estimator P_n^0 ∈ m of P_0. In addition, one defines a local least favorable parametric submodel {P_{n,δ}^{0,lfm}: δ} through P_n^0 at δ = 0, with d-dimensional parameter δ and with score d/dδ log(dP_{n,δ}^{0,lfm}/dP_n^0)|_{δ=0} = D∗(P_n^0). This is used to define the corresponding maximum likelihood estimator δ_0 = argmax_δ P_n log(dP_{n,δ}^{0,lfm}/dP_n^0). The one-step TMLE of P_0 is now defined as P_n^1 = P_{n,δ_0}^{0,lfm}. This process is iterated by defining P_n^{k+1} = P_{n,δ_k}^{k,lfm}, k = 1, 2, …, until a step k = K at which δ_K ≈ 0. The TMLE of P_0 is then defined by the final update P_n^∗ = P_{n,δ_K}^{K,lfm}, which solves P_n D∗(P_n^∗) ≈ 0. The TMLE of ψ_0 is the corresponding plug-in estimator Ψ(P_n^∗). Here "≈ 0" can be replaced by o_P(1/√n): for example, one might iterate until ‖P_n D∗(P_n^K)‖ ≤ 1/n, where one could use the Euclidean norm. Below, we will ignore the numerical approximation error and just write P_n D∗(P_n^∗) = 0.
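To make the iterative template concrete, here is a minimal numerical sketch (all numbers hypothetical) in a toy setting: a nonparametric model on a four-point support, target parameter Ψ(P) = E_P O with efficient influence curve D∗(P)(o) = o − Ψ(P), and the local least favorable submodel p_δ = p(1 + δ D∗(P)), which has score D∗(P) at δ = 0.

```python
import numpy as np

# Toy iterative TMLE: nonparametric model on a finite support,
# target parameter Psi(P) = E_P[O], so D*(P)(o) = o - Psi(P).
support = np.array([0.0, 1.0, 2.0, 3.0])
p = np.array([0.4, 0.3, 0.2, 0.1])      # initial estimator P_n^0 (hypothetical)
pn = np.array([0.1, 0.2, 0.3, 0.4])     # empirical distribution P_n (hypothetical)

def psi(p):
    return np.sum(support * p)

def eif(p):                              # efficient influence curve D*(P)
    return support - psi(p)

for _ in range(500):
    d = eif(p)
    # Admissible delta range keeping the fluctuated density positive.
    lo = -1.0 / d.max() + 1e-9
    hi = -1.0 / d.min() - 1e-9
    # MLE step over the local submodel p_delta = p (1 + delta D*):
    # solve the monotone score equation P_n [D* / (1 + delta D*)] = 0.
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if np.sum(pn * d / (1.0 + mid * d)) > 0:
            lo = mid
        else:
            hi = mid
    delta = 0.5 * (lo + hi)
    p = p * (1.0 + delta * d)            # TMLE update P_n^{k+1}
    if abs(delta) < 1e-10:               # delta_K ~ 0: algorithm has converged
        break

print(abs(np.sum(pn * eif(p))))          # ~0: P_n D*(P_n^*) = 0 at convergence
```

Note that several iterations are needed here because the linear submodel is only locally least favorable; the universal submodels constructed below remove this iteration.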
The asymptotic efficiency of the TMLE, under regularity conditions, is established as follows. First, define the second order term R_2(P, P_0) by the equation Ψ(P) − Ψ(P_0) = (P − P_0)D∗(P) + R_2(P, P_0). Due to D∗(P) being a canonical gradient, R_2(P, P_0) will be a second order difference between P and P_0. Applying this identity to P = P_n^∗, and using that P_n D∗(P_n^∗) = 0, results in the identity:
$$\Psi(P_n^*) - \Psi(P_0) = (P_n - P_0)\,D^*(P_n^*) + R_2(P_n^*, P_0).$$
Assuming that R_2(P_n^∗, P_0) = o_P(1/√n), that D∗(P_n^∗) falls with probability tending to one in a P_0-Donsker class, and that P_0{D∗(P_n^∗) − D∗(P_0)}^2 → 0 in probability as n → ∞, it follows that the substitution estimator Ψ(P_n^∗) is asymptotically efficient. The latter is a very weak consistency condition, so that the really crucial condition is R_2(P_n^∗, P_0) = o_P(n^{−1/2}). To make the latter hold, it is crucial that one uses super-learning incorporating highly adaptive estimators. The Donsker class condition will hold if P_n^∗ is not an overfit, so that its variation norm is controlled, utilizing that the class of functions with bounded variation norm is a Donsker class [5].
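In more detail, the empirical process term in this identity can be split as
$$(P_n - P_0)\,D^*(P_n^*) = (P_n - P_0)\,D^*(P_0) + (P_n - P_0)\{D^*(P_n^*) - D^*(P_0)\},$$
where the Donsker class condition combined with the consistency condition P_0{D∗(P_n^∗) − D∗(P_0)}^2 → 0 guarantees that the second term is o_P(n^{−1/2}), so that the expansion reduces to the definition of asymptotic efficiency.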
TMLE has been generalized to targeted minimum loss-based estimation (still denoted TMLE), in which one utilizes that Ψ(P) can be represented as Ψ_1(Q(P)) for some function Ψ_1, where Q(P) = argmin_Q P L(Q) can be defined as a minimizer of the risk of a loss function L(Q)(O). One notes that D∗(P) = D_1^∗(Q(P), G(P)) for some nuisance parameter G. Given an initial estimator (Q_n^0, G_n^0), one now defines a local least favorable submodel Q_{n,δ}^{0,lfm} so that d/dδ L(Q_{n,δ}^{0,lfm})|_{δ=0} spans D∗(Q_n^0, G_n^0), and one updates Q_n^k by computing δ_k = argmin_δ P_n L(Q_{n,δ}^{k,lfm}) and setting Q_n^{k+1} = Q_{n,δ_k}^{k,lfm}, resulting in a TMLE (Q_n^∗, G_n^0) solving P_n D∗(Q_n^∗, G_n^0) = 0, and corresponding TMLE Ψ_1(Q_n^∗) of ψ_0. To obtain a TMLE that solves additional score equations that serve a certain purpose, one could use a δ of higher dimension than the target parameter, and also simultaneously update G_n^k with a submodel through G_n^k, iterating the updates of G_n^k simultaneously with the updates of Q_n^k, resulting in a TMLE (Q_n^∗, G_n^∗).
In general, TMLE presents an iterative algorithm, utilizing a local parametric submodel with loss-function specific score equal to a user-supplied D(), that maps an initial estimator P_n^0 ∈ m, or an initial estimator (Q_n^0, G_n^0) of (Q_0, G_0), into an updated P_n^∗, or (Q_n^∗, G_n^∗), with improved empirical fit w.r.t. the loss function of P_0 or (Q_0, G_0), so that P_n D(P_n^∗) = 0, or P_n D(Q_n^∗, G_n^∗) = 0. Due to this generality, its statistical applications are diverse and widespread, going beyond the construction of an efficient estimator of a pathwise differentiable target parameter for arbitrary semi-parametric models and pathwise differentiable target parameter mappings: collaborative targeted maximum likelihood estimation (CTMLE) for targeted estimation of the nuisance parameter in the canonical gradient [2, 13–16]; cross-validated TMLE (CV-TMLE) to robustify the bias reduction of the TMLE step [2, 17]; guaranteed improvement w.r.t. a user-supplied asymptotically linear estimator [18, 19]; targeted initial estimation through empirical efficiency maximization [2, 20]; double robust inference by targeting the censoring/treatment mechanism [16]; CV-TMLE to estimate data adaptive target parameters such as the risk of a candidate estimator, and thereby develop a super-learner that uses CV-TMLE instead of the normal cross-validated empirical risk [21, 22]; and higher-order TMLE in order to replace R_2() in the above proof by a higher order term [23, 24].
Even though the TMLE framework has been shown to be flexible enough to handle any of the challenges we have encountered, in many cases the proposed TMLE is iterative and uses a local parametric submodel through the initial estimator that has more, and possibly many more, than d (fluctuation) parameters. This can result in small sample issues regarding convergence of the TMLE algorithm, or cause finite sample instability of the estimator. It also conflicts with the principal goal of TMLE as a procedure that updates the initial estimator into a new efficient estimator with minimal extra data fitting. By using an over-parameterized local submodel or an iterative algorithm, these TMLEs use more fitting of the data than should be needed to achieve the desired goal.
2.1 Goal of article
The goal set out in this article is to construct a parametric submodel {P_{n,ϵ}^0: ϵ} through an initial estimator P_n^0 ∈ m so that the above TMLE algorithm only takes one step, and the dimension of ϵ is smaller than or equal to d. The construction of this parametric submodel, a so-called universal least favorable submodel, is philosophically grounded by being, in a sense, the shortest path (with distance measured by the information/data fitting needed) towards its goal (solving the desired score equation). We will first consider the case d=1 and construct a one-dimensional parametric submodel satisfying this key property, so that the TMLE is a one-step TMLE. We will generalize it to targeted minimum loss-based estimation, with all its variations in choice of loss function, and demonstrate it with various examples. Finally, we consider the general case d>1, and construct a one-dimensional parametric submodel through P_n^0 for which the one-step TMLE solves each of the d desired score equations. Apparently, this one-dimensional path provides a "shortest" path towards its d-dimensional goal. We will show that this result extends to an infinite dimensional target parameter.
3 Intuition of TMLE: local and universal least favorable submodels
Let’s consider one-dimensional target parameters (i.e., d=1). A least favorable submodel at P is a model {P_{ϵ,h∗}: ϵ}, dominated by P, for which P_{ϵ=0,h∗} = P, and that maximizes the submodel-specific Cramer–Rao lower bound for the asymptotic variance of a regular asymptotically linear estimator of Ψ(P_{ϵ=0}), where for a submodel {P_{ϵ,h}: ϵ} this bound is defined by
$$CR(h\mid P) \equiv \frac{\left(\frac{d}{d\epsilon}\Psi(P_{\epsilon,h})\big|_{\epsilon=0}\right)^2}{-P\,\frac{d^2}{d\epsilon^2}\log\frac{dP_{\epsilon,h}}{dP}\Big|_{\epsilon=0}}.$$
It maximizes CR(h|P) over all such parametric submodels {P_{ϵ,h}: ϵ}, with h varying over some index set the closure of whose linear span generates the full tangent space T(P) ⊂ L^2_0(P) of the model at P. Given the pathwise differentiability with canonical gradient D∗(P), and denoting the score of {P_{ϵ,h}: ϵ} at ϵ = 0 by S_h, it follows that this criterion for a submodel can be represented as follows:
$$CR(h\mid P) = \frac{(P\,D^*(P)\,S_h)^2}{P S_h^2}.$$
By the Cauchy-Schwarz inequality, it follows that this is maximized over all scores in the tangent space T(P) by S=D∗(P). Thus, a least favorable model can also be defined as any parametric model through P that has a score at P equal to D∗(P).
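Explicitly, writing ⟨S_1, S_2⟩ = P S_1 S_2 for the inner product in L^2_0(P) and ‖·‖ for the corresponding norm, the Cauchy–Schwarz inequality gives
$$CR(h\mid P) = \frac{\langle D^*(P), S_h\rangle^2}{\|S_h\|^2} \leq \|D^*(P)\|^2,$$
with equality if and only if S_h is proportional to D∗(P).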
By using a second order Taylor expansion of ϵ → P log dP_{ϵ,h}/dP at ϵ = 0, whose second order term is driven by the information P S_h^2, it follows that, under some smoothness assumptions on the submodels, the criterion can also be represented as
$$CR(h\mid P) = \lim_{\epsilon\to 0}\frac{\left(\Psi(P_{\epsilon,h}) - \Psi(P)\right)^2}{-2\,P\log dP_{\epsilon,h}/dP}.$$
This shows that CR(h|P) equals the squared change in the target parameter divided by (minus two times) the change in log-likelihood at P at an infinitesimal ϵ. Therefore, we will say that the path {P_{ϵ,h∗}: ϵ} that maximizes CR(h|P) follows at ϵ = 0 (i.e., locally) a path of maximal change in target parameter per unit of information. To stress that the desired optimality property only applies locally, we will refer to such a submodel as a locally (i.e., at ϵ = 0) least favorable submodel.
This latter representation of the criterion is intuitively appealing, since a sensible goal of a submodel {P_ϵ: ϵ} through P is that a small fluctuation of P yields a maximal change in the target parameter, making the MLE ϵ_n = argmax_ϵ P_n log dP_ϵ/dP (as used in TMLE) for this parametric model locally all about fitting the target parameter, not wasting data on anything else.
The intuition of TMLE has always been to minimally increase the empirical fit of the initial estimator while achieving the desired bias reduction for the target parameter, measured by solving P_n D∗(P_n^∗) = 0 with a good estimator P_n^∗ of P_0 (so not worse than P_n^0). However, if P_n^0 is far away from P_0, the MLE ϵ_n^0 will be far from zero, so that this local optimality no longer applies. Even though the submodel moves in the right direction at ϵ ≈ 0, there is no guarantee that it follows a path of maximal change in target parameter per change in distribution once ϵ moves farther away from zero. In the end that means that the targeted maximum likelihood estimator might not have followed such a targeted path after all, and it might have taken various iterations to finally end up with a local ϵ_n^K ≈ 0, at which point the algorithm stops. The distribution P_n^0 might have changed much more than needed to obtain the bias reduction in the target parameter. That is, the desired bias reduction came at an unnecessary cost of data fitting, so that Ψ(P_n^∗) will have larger finite sample variance than needed. Based on this insight, we would like to construct a TMLE that is based on a path that at each ϵ (not just at ϵ = 0) follows a path of maximal change in target parameter per unit of information. We will refer to such a path as a universal least favorable submodel.
Definition 1 Suppose that, given a P ∈ m, Ulfm(P) = {P_ϵ: ϵ ∈ (−a,a)} ⊂ m is a parametric submodel dominated by P, such that P_{ϵ=0} = P and, for each ϵ ∈ (−a,a) ⊂ IR, we have
$$\frac{d}{d\epsilon}\log\frac{dP_\epsilon}{dP} = D^*(P_\epsilon). \qquad (1)$$
Then, we say that Ulfm(P) is a universal least favorable submodel through P.
That is, this least favorable model is not only least favorable at ϵ = 0; it is a least favorable model at each P_ϵ ∈ Ulfm(P). This article proposes such universal least favorable submodels and corresponding targeted maximum likelihood and targeted minimum loss-based estimators. A very nice by-product of these universal least favorable submodels is that the TMLE always "converges" in one step. This reflects the above intuition of a universal least favorable submodel as a shortest-path submodel in the sense that it achieves the desired bias reduction at a minimal increase in empirical log-likelihood.
4 A universal least favorable submodel for targeted maximum likelihood estimation
4.1 The TMLE based on a universal least favorable submodel takes only one step
Let P_n^0 be an initial estimator of P_0. Suppose that, given a P ∈ m, we can construct a universal least favorable parametric submodel Ulfm(P) = {P_ϵ: ϵ ∈ (−a,a)} ⊂ m. If we use this as the parametric submodel in the TMLE, then the TMLE converges in one step. That is, let
$$\epsilon_n^0 = \arg\max_\epsilon P_n \log\frac{dP_{n,\epsilon}^0}{dP_n^0}.$$
One can replace the maximum by the local maximum ϵ_n^0 closest to ϵ = 0, which is what we recommend in case the selected universal least favorable submodel allows for multiple local maxima. Let P_n^1 = P_{n,ϵ_n^0}^0. Since ϵ_n^0 is a local maximum, it solves its score equation, given by P_n D∗(P_n^1) = 0. That is, it achieves the goal of solving the desired efficient influence curve equation in one step. Further iteration will not yield further updates: the next MLE
$$\epsilon_n^1 = \arg\max_\epsilon P_n \log\frac{dP_{n,\epsilon}^1}{dP_n^1} = 0.$$
Therefore, the TMLE of ψ_0 = Ψ(P_0) is given by the one-step TMLE ψ_n^∗ = Ψ(P_n^1).
In addition, we strongly suspect that a TMLE using such a universal least favorable submodel will often perform better in finite samples, certainly in situations in which the TMLE otherwise requires an iterative algorithm. Moreover, it is philosophically superior by always following a path along ϵ on which the squared change in the target parameter per unit of information is maximized at each ϵ-value.
4.2 An analytic formula for a universal least favorable submodel
This motivates us to consider whether such a universal least favorable model exists and can be constructed. The answer is yes, as our constructions below demonstrate.
In the following we use p_ϵ for the density of P_ϵ w.r.t. P, so that p = 1, but we will still write p (in case one wants to use the formulas for densities w.r.t. another dominating measure). For ϵ ≥ 0, we recursively define
$$p_\epsilon = p\,\exp\left(\int_0^\epsilon D^*(P_x)\,dx\right), \qquad (2)$$
and, for ϵ<0, we recursively define
$$p_\epsilon = p\,\exp\left(-\int_\epsilon^0 D^*(P_x)\,dx\right).$$
Theorem 1 Consider the definition of {P_ϵ: ϵ ∈ (−a,a)} above. We have that {P_ϵ: ϵ ∈ (−a,a)} is a set of probability distributions dominated by P, with P_{ϵ=0} = P, and, for each ϵ ∈ (−a,a), we have
$$\frac{d}{d\epsilon}\log\frac{dP_\epsilon}{dP} = D^*(P_\epsilon).$$
Proof: It follows trivially that for each ϵ, d/dϵ log p_ϵ = D∗(P_ϵ). It remains to verify that p_ϵ satisfies ∫ p_ϵ(o) dP(o) = 1 (obviously, p_ϵ ≥ 0). Define C(ϵ,P) = ∫ p_ϵ dP. Consider the probability density p_{ϵ,1} = C(ϵ,P)^{−1} p_ϵ. Its score at ϵ is given by:
$$S(\epsilon,P) = -\frac{1}{C(\epsilon,P)}\frac{d}{d\epsilon}C(\epsilon,P) + D^*(P_\epsilon).$$
Since a score has mean zero under its own density, P_{ϵ,1} S(ϵ,P) = 0. Since P_ϵ D∗(P_ϵ) = 0, and thus also P_{ϵ,1} D∗(P_ϵ) = 0, this implies that d/dϵ C(ϵ,P) = 0. Thus, C(ϵ,P) = C(0,P) = 1. This completes the proof. □
Note that this recursive relation (2) allows one to recursively solve for p_{ϵ+dϵ}, given {p_x: x ∈ [0,ϵ]}, in the sense that (e.g.) for ϵ > 0,
$$\frac{p_{\epsilon+d\epsilon}}{p_\epsilon} = \exp\left(D^*(P_\epsilon)\,d\epsilon\right) \approx 1 + d\epsilon\,D^*(P_\epsilon).$$
This differential equation is equivalent to stating that d/dϵ log p_ϵ = D∗(P_ϵ). This implies a practical construction that starts with p_{x_0} = p at x_0 = 0 and recursively solves for
$$p_{x_j} = p_{x_{j-1}}\left(1 + (x_j - x_{j-1})\,D^*(P_{x_{j-1}})\right), \quad j = 1, \ldots, N,$$
for an arbitrarily fine grid 0 = x_0 < x_1 < … < x_N = a. Similarly, one determines recursively
$$p_{-x_j} = p_{-x_{j-1}}\left(1 - (x_j - x_{j-1})\,D^*(P_{-x_{j-1}})\right), \quad j = 1, \ldots, N.$$
If the model m is nonparametric, then this practical construction is a submodel of m, but if the model is restricted the practical construction above might select probability distributions Pxj that are not an element of m, even though it has score at xj equal to D∗(Pxj) in the tangent space at Pxj of the model m. Nonetheless, this practical construction of this least favorable model can be used for any model m, as long as one can extend the target parameter Ψ to be well defined on the probability distributions in this discrete approximation of the theoretical least favorable model. The TMLE will still only require one step and be asymptotically efficient for the actual model m under regularity conditions. In addition, in the next subsection Theorem 2 proves that under mild regularity conditions, quite surprisingly, the theoretical formula (2) for this universal least favorable model, defined as a limit of the above practical construction when the partitioning gets finer and finer, is an actual submodel of m. Another way of viewing this result is that by selecting the partitioning fine enough in the above practical construction {pxj,p−xj:j=0,…,N} this submodel will be arbitrarily close to the model m. Below we will also provide an alternative to the above practical construction that does preserve the submodel property while it still approximates the theoretical formula (2).
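As a numerical illustration (a minimal sketch with hypothetical numbers), the following implements the Euler recursion above in a toy setting, a nonparametric model on a four-point support with Ψ(P) = E_P O and D∗(P)(o) = o − Ψ(P), and runs the resulting one-step TMLE: one walks along the universal least favorable path until the empirical log-likelihood stops increasing, at which point P_n D∗(P_n^∗) ≈ 0 up to the step size.

```python
import numpy as np

# One-step TMLE along the universal least favorable submodel (2),
# approximated by Euler steps p <- p (1 + h D*(P)).
support = np.array([0.0, 1.0, 2.0, 3.0])
p = np.array([0.4, 0.3, 0.2, 0.1])      # initial estimator P_n^0 (hypothetical)
pn = np.array([0.1, 0.2, 0.3, 0.4])     # empirical distribution P_n (hypothetical)

def psi(p):
    return np.sum(support * p)

def eif(p):                              # D*(P)(o) = o - Psi(P)
    return support - psi(p)

def loglik(p):                           # P_n log p
    return np.sum(pn * np.log(p))

h = 1e-4                                 # grid width of the practical construction
sign = 1.0 if np.sum(pn * eif(p)) > 0 else -1.0
while True:
    p_next = p * (1.0 + sign * h * eif(p))
    if loglik(p_next) <= loglik(p):      # first local maximum: stop
        break
    p = p_next

print(abs(np.sum(pn * eif(p))))          # ~0: efficient influence curve equation
print(psi(p))                            # one-step TMLE of the target parameter
```

In this particular toy model each Euler step preserves total mass exactly, since Σ_o p(o) D∗(P)(o) = 0.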
4.3 A universal least favorable submodel in terms of a local least favorable submodel
An alternative representation of the above analytic formula (2) is given by a product integral. For ϵ ≥ 0, we define
$$p_\epsilon = p\prod_{x\in(0,\epsilon]}\left(1 + D^*(P_x)\,dx\right),$$
and for ϵ < 0, we define
$$p_\epsilon = p\prod_{x\in[\epsilon,0)}\left(1 - D^*(P_x)\,dx\right).$$
In other words, p_{x+dx} = p_x(1 + D∗(P_x) dx); another way of thinking about this is that p_{x+dx} is obtained by constructing a least favorable submodel through P_x with score D∗(P_x) at P_x, and evaluating it at parameter value dx, slightly away from zero. This suggests the following generalization of the universal least favorable model, whose practical analogue will now still be an actual submodel of m.
Let 0 = x_0 < x_1 < … < x_N = a be an equally spaced fine grid for the interval [0,a]. Let h = x_j − x_{j−1} be the width of the partition elements. We will provide a construction for P_{x_j}, j = 0,…,N. This construction is expressed in terms of a mapping P → {P_δ^{lfm}: δ ∈ (−a,a)} ⊂ m that maps any P ∈ m into a local least favorable submodel of m through P at δ = 0 with score D∗(P) at δ = 0, where a is some positive number. For any estimation problem defined by m and Ψ one is typically able to construct such a local least favorable submodel, so that this is hardly an assumption. Let P_{x_0} = P. Let p_{x_1} = p_{x_0,h}^{lfm}, and, in general, let p_{x_{j+1}} = p_{x_j,h}^{lfm}, j = 1, 2, …, N−1. Similarly, let −a = −x_N < −x_{N−1} < … < −x_1 < x_0 = 0 be the corresponding grid for [−a,0], and define p_{−x_{j+1}} = p_{−x_j,−h}^{lfm}, j = 1, …, N−1. In this manner, we have defined P_{x_j}, P_{−x_j}, j = 0,…,N, and, by construction, each of these is a probability distribution in the model m. The choice of N defines an end value a, but one does not need to select N a priori. One only needs to select a small dx = x_j − x_{j−1}, and continue until the first local MLE is reached. This construction is all we need when using the universal least favorable submodel in practice, such as in the TMLE.
This practical construction implies a theoretical formulation by letting N converge to infinity (i.e., letting the width of the partitioning converge to zero). That is, an analytic way of representing this universal least favorable submodel, given the local least favorable model parameterization (ϵ,P) → p_ϵ^{lfm}, is given by: for ϵ > 0 and dϵ > 0, we have
$$p_{\epsilon+d\epsilon} = p_{\epsilon,d\epsilon}^{lfm}.$$
This allows for recursively solving for p_ϵ starting at p_{ϵ=0} = p, and since p_{ϵ,h}^{lfm} ∈ m, its practical approximation will never leave the model m.
Utilizing that the least favorable model h → p_{ϵ,h}^{lfm} is twice continuously differentiable with score D∗(P_ϵ) at h = 0, we obtain a second order Taylor expansion
$$p_{\epsilon,d\epsilon}^{lfm} = p_\epsilon + \frac{d}{dh}p_{\epsilon,h}^{lfm}\Big|_{h=0}\,d\epsilon + O((d\epsilon)^2) = p_\epsilon\left(1 + d\epsilon\,D^*(P_\epsilon)\right) + O((d\epsilon)^2),$$
so that we obtain
$$p_{\epsilon+d\epsilon} = p_\epsilon\left(1 + d\epsilon\,D^*(P_\epsilon)\right) + O((d\epsilon)^2).$$
This implies:
$$p_\epsilon = p\,\exp\left(\int_0^\epsilon D^*(P_x)\,dx\right).$$
Thus, we obtained the exact same representation (2) as above. This proves that, under mild regularity conditions, this analytic representation (2) is a submodel of m after all, but, when using its practical implementation and approximation, one should use an actual local least favorable submodel in order to guarantee that one stays in the model. We formalize this result in the following theorem.
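The agreement between the iterated local least favorable submodel and the analytic formula (2) can also be checked numerically. In the toy setting below (hypothetical numbers; nonparametric model on a finite support with Ψ(P) = E_P O, so D∗(P)(o) = o − Ψ(P)), formula (2) has the closed form of an exponential tilt p_ϵ ∝ p e^{ϵo}, and iterating the local submodel p_δ = p(1 + δ D∗(P)) in steps of size h reproduces it up to O(h):

```python
import numpy as np

# Check: iterating a local least favorable submodel in small steps
# converges to the analytic universal least favorable submodel (2).
support = np.array([0.0, 1.0, 2.0, 3.0])
p = np.array([0.4, 0.3, 0.2, 0.1])        # hypothetical starting density
eps, N = 0.5, 20000                        # end point of the path, number of steps
h = eps / N

p_iter = p.copy()
for _ in range(N):
    d_star = support - np.sum(support * p_iter)   # D*(P_x)(o) = o - Psi(P_x)
    p_iter = p_iter * (1.0 + h * d_star)          # local lfm step of size h

# Analytic formula (2) for the mean parameter: exponential tilt.
p_exact = p * np.exp(eps * support)
p_exact = p_exact / p_exact.sum()

print(np.max(np.abs(p_iter - p_exact)))   # O(h): the two constructions agree
```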
Theorem 2 Let O be a maximal support, so that the support of any P ∈ m is a subset of O. Suppose there exists a mapping P → {P_δ^{lfm}: δ ∈ (−a,a)} ⊂ m that maps any P ∈ m into a local least favorable submodel of m through P at δ = 0 with score D∗(P) at δ = 0, where a is some positive number independent of P. In addition, assume the following type of second order Taylor expansion:
$$p_{\epsilon,d\epsilon}^{lfm} = p_\epsilon + \frac{d}{dh}p_{\epsilon,h}^{lfm}\Big|_{h=0}\,d\epsilon + R_2(p_\epsilon, d\epsilon),$$
where sup_ϵ sup_{o ∈ O} |R_2(p_ϵ, dϵ)(o)| = O((dϵ)^2). We also assume that sup_ϵ sup_{o ∈ O} |D∗(P_ϵ) p_ϵ|(o) < ∞. Then, the universal least favorable submodel {p_ϵ: ϵ} defined by (2) is an actual submodel of m. Its definition corresponds with p_{ϵ+dϵ} = p_{ϵ,dϵ}^{lfm}, whose corresponding practical approximation will still be a submodel.
We refer to the Appendix for an application of the universal least favorable submodel and a corresponding one-step TMLE for high dimensional parametric models.
5 Universal least favorable model for targeted minimum loss-based estimation
5.1 A universal least favorable submodel w.r.t. specific loss-function
Let’s now generalize this construction of a universal least favorable submodel w.r.t. the log-likelihood loss to general loss functions, so that the resulting universal least favorable submodels can be used in the more general targeted minimum loss-based estimation methodology. We now assume that Ψ(P) = Ψ_1(Q(P)) for some parameter Q defined on the model and real valued function Ψ_1. Here Q(m) = {Q(P): P ∈ m} denotes the parameter space of this parameter Q. Let L(Q)(O) be a loss function for Q(P) in the sense that Q(P) = argmin_{Q ∈ Q(m)} P L(Q). With slight abuse of notation, let D∗(P) = D∗(Q(P), G(P)) be the canonical gradient of Ψ at P, where G: m → G(m) is some nuisance parameter. We consider the case that the efficient influence curve is in the tangent space of Q, so that a least favorable submodel does not need to fluctuate G; otherwise, just include G in the definition of Q. Given (Q, G), let {Q_ϵ^{lfm}: ϵ ∈ (−a,a)} ⊂ Q(m) be a local least favorable model w.r.t. loss function L(Q) at ϵ = 0 so that
$$\frac{d}{d\epsilon}L(Q_\epsilon^{lfm})\Big|_{\epsilon=0} = D^*(Q, G).$$
The dependence of this submodel on G is suppressed in this notation.
Let 0 = x_0 < x_1 < … < x_N = a be an equally spaced fine grid for the interval [0, a]. Let h = x_j − x_{j−1} be the width of the partition elements. We present a construction for Q_{x_j}, j = 0,…,N. Let Q_{x_0} = Q. Let Q_{x_1} = Q_{x_0,h}^{lfm}, and, in general, let Q_{x_{j+1}} = Q_{x_j,h}^{lfm}, j = 1, 2, …, N−1. Similarly, let −a = −x_N < −x_{N−1} < … < −x_1 < x_0 = 0 be the corresponding grid for [−a,0], and define Q_{−x_{j+1}} = Q_{−x_j,−h}^{lfm}, j = 1, …, N−1. In this manner, we have defined Q_{x_j}, Q_{−x_j}, j = 0,…,N, and, by construction, each of these is an element of the parameter space Q(m). This construction is all we need when using this submodel in practice, such as in the TMLE.
An analytic way of representing this loss-function specific universal least favorable submodel for ϵ ≥ 0 (and similarly for ϵ < 0) is given by: for ϵ > 0, dϵ > 0,
$$Q_{\epsilon+d\epsilon} = Q_{\epsilon,d\epsilon}^{lfm}, \qquad (3)$$
allowing for recursively solving for Q_ϵ starting at Q_{ϵ=0} = Q, and since Q_{ϵ,h}^{lfm} ∈ Q(m), its practical approximation never leaves the parameter space Q(m) for Q.
Let’s now derive a corresponding integral equation. Assume that for some L̇(Q)(O), we have
$$\frac{d}{dh}L(Q_{\epsilon,h}^{lfm})\Big|_{h=0} = \dot{L}(Q_\epsilon)\,\frac{d}{dh}Q_{\epsilon,h}^{lfm}\Big|_{h=0}.$$
Then, by the local property of a least favorable submodel,
$$\frac{d}{dh}Q_{\epsilon,h}^{lfm}\Big|_{h=0} = \frac{D^*(Q_\epsilon, G)}{\dot{L}(Q_\epsilon)}.$$
Utilizing that the local least favorable model h → Q_{ϵ,h}^{lfm} is twice continuously differentiable with derivative D∗(Q_ϵ,G)/L̇(Q_ϵ) at h = 0, we obtain the following second order Taylor expansion:
$$Q_{\epsilon,d\epsilon}^{lfm} = Q_\epsilon + \frac{d}{dh}Q_{\epsilon,h}^{lfm}\Big|_{h=0}\,d\epsilon + O((d\epsilon)^2) = Q_\epsilon + \frac{D^*(Q_\epsilon,G)}{\dot{L}(Q_\epsilon)}\,d\epsilon + O((d\epsilon)^2).$$
Note that Q_ϵ can also be represented as Q_{ϵ,0}^{lfm}. This implies the following recursive analytic definition of the universal least favorable model through Q:
$$Q_\epsilon = Q + \int_0^\epsilon \frac{D^*(Q_x,G)}{\dot{L}(Q_x)}\,dx. \qquad (4)$$
Similarly, for ϵ<0, we obtain
$$Q_\epsilon = Q - \int_\epsilon^0 \frac{D^*(Q_x,G)}{\dot{L}(Q_x)}\,dx.$$
As with the log-likelihood loss (and thus Q(P)=P), this shows that, under regularity conditions, this analytic representation for Qϵ is an element in Q(m), although using it in a practical construction (in which integrals are replaced by sums) might easily leave the model space Q(m), so that our above practical construction in terms of the local least favorable model and discrete grid represents the desired practical implementation of this universal least favorable submodel. The following theorem formalizes this result stating that the analytic formulation (4) is indeed a universal least favorable submodel.
Theorem 3 Given any (Q, G) compatible with model m, let {Q_δ^{lfm}: δ ∈ (−a,a)} ⊂ Q(m) be a local least favorable model w.r.t. loss function L(Q) at δ = 0, so that
$$\frac{d}{d\delta}L(Q_\delta^{lfm})\Big|_{\delta=0} = D^*(Q, G).$$
Assume that for some L̇(Q)(O), we have
$$\frac{d}{d\epsilon}L(Q_\epsilon^{lfm})\Big|_{\epsilon=0} = \dot{L}(Q)\,\frac{d}{d\epsilon}Q_\epsilon^{lfm}\Big|_{\epsilon=0}.$$
Consider the corresponding model {Q_ϵ: ϵ} defined by (4). It goes through Q at ϵ = 0, and it satisfies, for all ϵ,
$$\frac{d}{d\epsilon}L(Q_\epsilon) = D^*(Q_\epsilon, G). \qquad (5)$$
In addition, suppose that the a > 0 in the local least favorable submodel above can be chosen independently of the choice (Q,G) ∈ {(Q_ϵ, G): ϵ}, and assume the following second order Taylor expansion:
$$Q_{\epsilon,d\epsilon}^{lfm} = Q_\epsilon + \frac{d}{dh}Q_{\epsilon,h}^{lfm}\Big|_{h=0}\,d\epsilon + R_2(Q_\epsilon,G,d\epsilon) = Q_\epsilon + \frac{D^*(Q_\epsilon,G)}{\dot{L}(Q_\epsilon)}\,d\epsilon + R_2(Q_\epsilon,G,d\epsilon),$$
where sup_ϵ sup_{o ∈ O} |R_2(Q_ϵ,G,dϵ)(o)| = O((dϵ)^2). We also assume that sup_ϵ sup_{o ∈ O} |D∗(Q_ϵ,G)/L̇(Q_ϵ)(o)| < ∞. Then, we also have {Q_ϵ: ϵ} ⊂ Q(m).
Proof: Let ϵ > 0. We have
$$\frac{d}{d\epsilon}L\left(Q + \int_0^\epsilon \frac{D^*(Q_x,G)}{\dot{L}(Q_x)}\,dx\right) = \dot{L}(Q_\epsilon)\,\frac{d}{d\epsilon}Q_\epsilon = \dot{L}(Q_\epsilon)\,\frac{D^*(Q_\epsilon,G)}{\dot{L}(Q_\epsilon)} = D^*(Q_\epsilon,G).$$
This completes the proof of (5). The submodel statement was already shown above, but we have now provided formal sufficient conditions. □
5.2 Example demonstrating that the analytic formula (4) for the universal least favorable submodel is indeed a submodel
Suppose O = (W,A,Y) ∼ P_0, with A ∈ {0,1} and Y ∈ {0,1} binary, and let the statistical model m be the nonparametric model or any model that only restricts the tangent space of the conditional distribution of A, given W. Let Ψ: m → IR be defined by Ψ(P) = E_P E_P(Y | A=1, W). The efficient influence curve is
$$D^*(P)(O) = \frac{A}{\bar{g}(W)}\left(Y - \bar{Q}(W)\right) + \bar{Q}(W) - \Psi(P),$$
where gˉ(W) = P(A=1 | W) and Qˉ(W) = E_P(Y | A=1, W). We note that Ψ(P) = Ψ_1(Q) = Q_W Qˉ, where Q = (Q_W, Qˉ), and Q_W is the probability distribution of W under P. We can decompose D∗(P) = D_1^∗(Qˉ, gˉ) + D_0^∗(Q), where D_1^∗(Qˉ, gˉ) = A/gˉ (Y − Qˉ(W)) is a score of the conditional distribution of Y, given A, W, while D_0^∗(Q) is a score of the marginal distribution of W. Since we estimate Q_{W,0} with the empirical probability distribution of W_1, ..., W_n, there is no need to construct a submodel through Q_W, so that we focus on constructing a submodel through Qˉ only.
A valid loss function for Qˉ is given by
$$L(\bar{Q})(O) = -I(A=1)\left\{Y\log\bar{Q}(W) + (1-Y)\log\left(1-\bar{Q}(W)\right)\right\}.$$
Consider the local least favorable submodel through Qˉ:
$$\mathrm{Logit}\,\bar{Q}_\epsilon^{lfm} = \mathrm{Logit}\,\bar{Q} - \epsilon H(\bar{g}),$$
where H(gˉ)(A,W) = A/gˉ(W). This is indeed a local least favorable submodel for Qˉ since
$$\frac{d}{d\epsilon}L(\bar{Q}_\epsilon^{lfm})\Big|_{\epsilon=0} = D_1^*(\bar{Q}, \bar{g}).$$
Let’s now compute the corresponding theoretical universal least favorable submodel (4). We have
$$\frac{d}{d\epsilon}L(\bar Q_\epsilon)=-I(A=1)\frac{Y-\bar Q_\epsilon}{\bar Q_\epsilon(1-\bar Q_\epsilon)}\,\frac{d}{d\epsilon}\bar Q_\epsilon.$$
Thus,
$$\dot L(\bar Q_\epsilon)=-I(A=1)\frac{Y-\bar Q_\epsilon}{\bar Q_\epsilon(1-\bar Q_\epsilon)}.$$
Thus, the universal least favorable submodel (4) through Qˉ is given by:
$$\bar Q_\epsilon=\bar Q-H(\bar g)\int_0^\epsilon \bar Q_x(1-\bar Q_x)\,dx.$$
This integral equation shows that
$$\frac{d\bar Q_\epsilon/d\epsilon}{\bar Q_\epsilon(1-\bar Q_\epsilon)}=-H(\bar g).$$
This differential equation has as solution $\bar Q_\epsilon=\bar Q^{lfm}_\epsilon$, and since the solution is unique, this proves that the universal least favorable submodel equals $\bar Q^{lfm}_\epsilon$. Indeed, it follows directly that for all $\epsilon$
$$\frac{d}{d\epsilon}L(\bar Q^{lfm}_\epsilon)=D_1^*(\bar Q^{lfm}_\epsilon,\bar g),$$
showing that our local least favorable submodel is already a universal least favorable submodel. As a consequence, the TMLE using Qˉϵlfm requires only one step. In particular, as predicted by our theory, this demonstrates that the analytic formula (4) respects the constraint that Qˉ∈(0,1), even though that is not immediately obvious from its integral or differential representation.
We refer to the supplementary material for the construction of universal least favorable submodels for general loss functions that are allowed to depend on an unknown nuisance parameter, and for a corresponding example from the causal inference literature. These examples also demonstrate that, in examples for which the TMLE based on the local least favorable submodel already converged in one step, the local least favorable submodel is actually already a universal least favorable submodel.
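The claim of this example can also be checked numerically: solving the differential representation of (4) by a forward Euler scheme recovers the closed-form logistic fluctuation and stays inside (0,1). The values below are our own arbitrary illustrative choices, not code from the paper:

```python
import math

expit = lambda x: 1.0 / (1.0 + math.exp(-x))
logit = lambda p: math.log(p / (1.0 - p))

Qbar0 = 0.3    # illustrative value of \bar{Q}(W) at a fixed W
H = 2.5        # illustrative H(\bar{g})(1, W) = 1/\bar{g}(W), i.e. \bar{g}(W) = 0.4

# closed form: Logit \bar{Q}_eps^{lfm} = Logit \bar{Q} - eps * H
Qbar_lfm = lambda eps: expit(logit(Qbar0) - eps * H)

# forward Euler solve of  d \bar{Q}_eps / d eps = -H * \bar{Q}_eps (1 - \bar{Q}_eps)
deps, eps_max = 1e-5, 0.5
Q = Qbar0
for _ in range(round(eps_max / deps)):
    Q -= H * Q * (1.0 - Q) * deps

# the ODE solution coincides with the logistic fluctuation, up to O(deps)
assert abs(Q - Qbar_lfm(eps_max)) < 1e-3 and 0.0 < Q < 1.0
```

The final assertion makes the point of the example concrete: the integral representation and the closed-form logistic submodel agree, so the universal least favorable submodel automatically respects the (0,1) constraint.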
6 Example: One-step TMLE of average treatment effect among the treated
Let O=(W,A,Y)∼P0 and let m be a nonparametric statistical model. Let Ψ:m→IR be defined by Ψ(P)=EP(EP(Y|A=1,W)−EP(Y|A=0,W)|A=1). The efficient influence curve of Ψ at P is given by Ref. [25]:
$$D^*(P)(O)=H_1(g,q)(A,W)\left(Y-\bar Q(A,W)\right)+\frac{A}{q}\left\{\bar Q(1,W)-\bar Q(0,W)-\Psi(P)\right\},$$
where g(a|W)=P(A=a|W), Qˉ(a,W)=EP(Y|A=a,W), q=P(A=1), and
$$H_1(g,q)(A,W)=\frac{A}{q}-\frac{(1-A)\,g(1|W)}{q\,g(0|W)}.$$
We note that
$$\Psi(P)=\Psi_1(Q_W,\bar Q,g,q)=\int\left\{\bar Q(1,w)-\bar Q(0,w)\right\}\frac{g(1|w)}{q}\,dQ_W(w),$$
where QW is the probability distribution of W under P. So, if we define Q=(QW,Qˉ,g,q), then Ψ(P)=Ψ1(Q). For notational convenience, we will use Ψ(P) and Ψ(Q) interchangeably. Since we can estimate QW and q with their empirical probability distributions, we are only interested in a universal least favorable submodel for (Qˉ,g). We can orthogonally decompose D∗(P)=D1∗(P)+D2∗(P)+D3∗(P) in L02(P) into scores of Qˉ,g, and QW, respectively, where
$$\begin{array}{rcl}
D_1^*(P)&=&H_1(g,q)(A,W)\left(Y-\bar Q(A,W)\right)\\
D_2^*(P)&=&H_2(Q)(W)\left(A-g(1|W)\right)\\
D_3^*(P)&=&\frac{g(1|W)}{q}\left\{\bar Q(1,W)-\bar Q(0,W)-\Psi(Q)\right\},
\end{array}$$
and
$$H_2(Q)(W)=\frac{\bar Q(1,W)-\bar Q(0,W)-\Psi(Q)}{q}.$$
Thus the component of the efficient influence curve corresponding with (Qˉ,g) is given by D1∗(Q)+D2∗(Q).
We consider the following loss-functions and local least favorable submodels for Qˉ and g [25]:
$$\begin{array}{rcl}
L_1(\bar Q)(O)&=&-\left\{Y\log\bar Q(A,W)+(1-Y)\log(1-\bar Q(A,W))\right\}\\
\mathrm{Logit}\,\bar Q^{lfm}_\epsilon&=&\mathrm{Logit}\,\bar Q-\epsilon H_1(g,q)\\
L_2(g)(O)&=&-\left\{A\log g(1|W)+(1-A)\log g(0|W)\right\}\\
\mathrm{Logit}\,\bar g^{lfm}_\epsilon&=&\mathrm{Logit}\,\bar g-\epsilon H_2(Q).
\end{array}$$
We now define the sum loss function L(Qˉ,g)=L1(Qˉ)+L2(g) and local least favorable submodel {(Qˉϵlfm,gϵlfm): ϵ} through (Qˉ,g) at ϵ=0 satisfying
$$\left.\frac{d}{d\epsilon}L(\bar Q^{lfm}_\epsilon,g^{lfm}_\epsilon)\right|_{\epsilon=0}=D_1^*(Q)+D_2^*(Q).$$
Thus, we can conclude that this defines indeed a local least favorable submodel for (Qˉ,g).
The universal least favorable submodel (3) is now defined by the following recursive definition: for ϵ≥0 and dϵ>0,
$$\begin{array}{rcl}
\mathrm{Logit}\,\bar Q_{\epsilon+d\epsilon}&=&\mathrm{Logit}\,\bar Q^{lfm}_{\epsilon,d\epsilon}=\mathrm{Logit}\,\bar Q_\epsilon-d\epsilon\,H_1(g_\epsilon,q)\\
\mathrm{Logit}\,\bar g_{\epsilon+d\epsilon}&=&\mathrm{Logit}\,\bar g^{lfm}_{\epsilon,d\epsilon}=\mathrm{Logit}\,\bar g_\epsilon-d\epsilon\,H_2(Q_W,\bar Q_\epsilon,q).
\end{array}$$
Similarly, we have a recursive relation for ϵ<0, but since these formulas are just symmetric versions of the ϵ>0 case, we focus on ϵ>0. This expresses the next (Qˉϵ+dϵ,gϵ+dϵ) in terms of the previously calculated (Qˉx,gx: x≤ϵ), thereby fully defining this universal least favorable submodel. This recursive definition corresponds with the following integral representation of the universal least favorable submodel:
$$\begin{array}{rcl}
\mathrm{Logit}\,\bar Q_\epsilon&=&\mathrm{Logit}\,\bar Q-\int_0^\epsilon H_1(g_x,q)\,dx\\
\mathrm{Logit}\,\bar g_\epsilon&=&\mathrm{Logit}\,\bar g-\int_0^\epsilon H_2(Q_W,\bar Q_x,q)\,dx.
\end{array}$$
Let’s now explicitly verify that this indeed satisfies the key property of a universal least favorable submodel. Clearly, it is a submodel and it contains (Q,g) at ϵ=0. The score of Qˉϵ at ϵ is given by H1(gϵ,q)(Y−Qˉϵ) and the score of gϵ at ϵ is given by H2(QW,Qˉϵ,q)(A−gˉϵ(W)), so that
$$\frac{d}{d\epsilon}L(\bar Q_\epsilon,g_\epsilon)=H_1(g_\epsilon,q)(Y-\bar Q_\epsilon)+H_2(Q_W,\bar Q_\epsilon,q)(A-\bar g_\epsilon(W))=D_1^*(Q_W,\bar Q_\epsilon,g_\epsilon,q)+D_2^*(Q_W,\bar Q_\epsilon,g_\epsilon,q),$$
explicitly proving that indeed this is a universal least favorable model for (Qˉ,g).
In our previous work on the TMLE for the average treatment effect among the treated we implemented the TMLE based on the local least favorable submodel {Qˉϵ1lfm,gˉϵ2lfm: ϵ1,ϵ2}, using a separate ϵ1 and ϵ2 for Qˉ and gˉ. This TMLE can also be implemented with a single ϵ by regressing the dependent variable vector (Y,A) on a stacked design matrix consisting of an offset and the covariate vector (H1(g,q)(A,W),H2(Q)(W)). Either way, whether implemented with a single ϵ or separate (ϵ1,ϵ2), this TMLE requires several iterations until convergence.
The TMLE based on the universal least favorable submodel above is implemented as follows, given an initial estimator (Qˉ,g). One first determines the sign of the derivative at h=0 of PnL(Qˉh,gh). Suppose that the derivative is negative, so that the empirical loss decreases for h>0. Then one keeps iteratively calculating (Qˉϵ+dϵ,gϵ+dϵ) for small dϵ>0, given (Qˉx,gx: x≤ϵ), until PnL(Qˉϵ+dϵ,gϵ+dϵ)≥PnL(Qˉϵ,gϵ), at which point the desired local maximum likelihood ϵn is attained. The TMLE of (Qˉ0,g0) is now given by (Qˉϵn,gϵn), which solves Pn{D1∗(Qϵn)+D2∗(Qϵn)}=0, where Qϵn=(QW,n,Qˉϵn,gϵn,qn), and QW,n, qn are the empirical counterparts of QW,0, q0. Since we also have PnD3∗(Qϵn)=0, it follows that PnD∗(Qϵn)=0. The (one-step) TMLE of Ψ(Q0) is given by the corresponding plug-in estimator Ψ(Qϵn).
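The micro-update loop above can be sketched as follows. This is our own minimal sketch with hypothetical helper names, not the authors' implementation; it assumes binary Y, estimates q=P(A=1) empirically, and stops once the empirical efficient-score equation is solved up to 1/n:

```python
import numpy as np

expit = lambda x: 1.0 / (1.0 + np.exp(-x))
logit = lambda p: np.log(p / (1.0 - p))

def one_step_tmle_att(Q1, Q0, g, A, Y, deps=1e-3, max_steps=100_000):
    """One-step TMLE sketch for the ATT along the universal least favorable
    submodel.  Q1, Q0: initial estimates of Qbar(1,W), Qbar(0,W) (arrays);
    g: initial estimate of g(1|W) (array).  Hypothetical helper, not the
    authors' code."""
    n = len(A)
    q = A.mean()
    Q1, Q0, g = Q1.copy(), Q0.copy(), g.copy()
    for _ in range(max_steps):
        psi = np.mean((Q1 - Q0) * g) / q
        QA = np.where(A == 1, Q1, Q0)
        H1 = A / q - (1 - A) * g / (q * (1 - g))       # clever covariate for Qbar
        H2 = (Q1 - Q0 - psi) / q                        # clever covariate for g
        score = np.mean(H1 * (Y - QA) + H2 * (A - g))   # Pn(D1* + D2*)
        if abs(score) < 1.0 / n:                        # score equation solved
            break
        s = -np.sign(score)         # move eps in the loss-decreasing direction
        # micro-updates along the local least favorable submodel at the
        # current (Q1, Q0, g); H1 evaluated at A=1 and A=0 respectively
        Q1 = expit(logit(Q1) - s * deps / q)
        Q0 = expit(logit(Q0) + s * deps * g / (q * (1 - g)))
        g = expit(logit(g) - s * deps * H2)
    return np.mean((Q1 - Q0) * g) / q
```

The direction of travel is fixed by the sign of Pn(D1∗+D2∗), matching the sign determination step described above; each iteration is a micro step of size dϵ along the current local least favorable submodel.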
7 Simulation studies for the average treatment effect among the treated
The iterative TMLE for estimating the average treatment effect among the treated (ATT) parameter returns to the data several times to make a sequence of local moves that updates the estimate of Qˉn(A,W) and gˉn(A,W) at each iteration. In contrast, the one-step TMLE using the universal least favorable sub-model fits the data once, where the MLE step requires a series of micro updates within a much smaller local neighborhood defined by a tuning parameter step size, dϵ. When there is sufficient information in the data for estimating the target parameter these two approaches can be expected to have comparable performance. When there is sparsity in the data theory suggests the one-step TMLE will be more stable, having lower variance than the iterative TMLE.
Two simulation studies demonstrate these properties. The iterative TMLE was implemented using a single ϵ, the closest analog to the one-step TMLE. For the one-step TMLE, dϵ was set to 0.001. Source code for the estimators and the simulation studies is available as supplementary materials. The parameter of interest is defined by the mapping $\Psi_1(Q)=\int\{\bar Q(1,w)-\bar Q(0,w)\}\frac{g(1|w)}{q}\,dQ_W(w)$. Each TMLE targets initial estimates Qn0(A,W) and gn0(W) towards the parameter of interest. The parameter estimate is evaluated by plugging the updated estimates (Qˉn∗(A,W),gn∗(W)) into the mapping, with the integral approximated by the empirical mean over all observations in the data, $\psi_n=\frac{1}{n}\sum_{i=1}^n\left\{\bar Q_n^*(1,W_i)-\bar Q_n^*(0,W_i)\right\}\frac{g_n^*(1|W_i)}{q_n}$.
7.1 Simulation study I
For this study 1,000 datasets were generated at two sample sizes, n=100 and n=1000. Two normally distributed covariates and one binary covariate were generated as W1∼N(0,1),W2∼N(0,1),W3∼Bern(0.5). All covariates are independent. Treatment assignment probabilities are given by P(A=1|W)=expit(−0.4−0.2W1−0.4W2+0.3W3). A binary outcome, Y was generated by setting P(Y=1|A,W)=expit(−1.2−1.2A−0.1W1−0.2W2−0.1W3). The true value of the ATT parameter is ψ0=−0.1490. There are no theoretical positivity violations (treatment assignment probabilities were typically between 0.07 and 0.87), but at the smaller sample size there is less information in the data for estimating g within some strata of W. This suggests that some of the generated data sets will prove more challenging to the iterative TMLE than to the one-step TMLE. Estimates were obtained using correct and misspecified logistic regression models for the initial estimates of Q and g. Qcor was estimated using a logistic regression of Y on A, W1, W2, W3. Qmis was estimated using a logistic regression of Y on A, W1.
gcor was estimated using a logistic regression of A on W1, W2, W3, and gmis was estimated using a logistic regression of A on W1. Bias, variance, mean squared error (MSE), and relative efficiency (RE=MSEone−step/MSEiter) are shown in Table 1. RE<1 indicates the one-step TMLE has better finite sample efficiency than the iterative TMLE.
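The Study I data-generating mechanism is simple to reproduce. The following sketch (our own; the seed is arbitrary and the model-fitting code is omitted) generates one dataset:

```python
import numpy as np

def simulate_study1(n, rng):
    """Generate one dataset from the Simulation Study I mechanism."""
    expit = lambda x: 1.0 / (1.0 + np.exp(-x))
    W1 = rng.normal(size=n)
    W2 = rng.normal(size=n)
    W3 = rng.binomial(1, 0.5, size=n)
    # treatment assignment P(A=1|W)
    A = rng.binomial(1, expit(-0.4 - 0.2 * W1 - 0.4 * W2 + 0.3 * W3))
    # binary outcome P(Y=1|A,W)
    Y = rng.binomial(1, expit(-1.2 - 1.2 * A - 0.1 * W1 - 0.2 * W2 - 0.1 * W3))
    return W1, W2, W3, A, Y

rng = np.random.default_rng(0)
W1, W2, W3, A, Y = simulate_study1(1000, rng)
```

From here, the correct and misspecified initial fits described above are ordinary logistic regressions of Y and A on the indicated covariate subsets.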
Table 1:
Simulation Study I. Bias, variance, mean squared error (MSE) and relative efficiency (RE) of the one-step TMLE and iterative TMLE over 1000 Monte Carlo simulations (n=1000 and n=100). Results when n=100 are shown with and without omitting 24 challenging runs from the analysis, and when Qˉn is bounded away from 0 and 1 for both TMLEs.*
| | Bias (one-step) | Bias (iterative) | Variance (one-step) | Variance (iterative) | MSE (one-step) | MSE (iterative) | RE |
| n=1000 |
| Q correct | | | | | | | |
| gcor | −0.00042 | −0.00042 | 0.00059 | 0.00059 | 0.00059 | 0.00059 | 1.00 |
| gmis | −0.00050 | −0.00050 | 0.00057 | 0.00057 | 0.00057 | 0.00057 | 1.00 |
| Q misspecified | | | | | | | |
| gcor | −0.00035 | −0.00035 | 0.00059 | 0.00059 | 0.00059 | 0.00059 | 1.00 |
| gmis | 0.01210 | 0.01210 | 0.00049 | 0.00048 | 0.00063 | 0.00063 | 1.00 |
| n=100, all runs |
| Q correct | | | | | | | |
| gcor | 0.00049 | | 0.00694 | | 0.00693 | | |
| gmis | −0.00215 | | 0.00635 | | 0.00635 | | |
| Q misspecified | | | | | | | |
| gcor | 0.00113 | | 0.00685 | | 0.00684 | | |
| gmis | 0.01226 | | 0.00528 | | 0.00543 | | |
| n=100, (24 runs omitted) |
| Q correct | | | | | | | |
| gcor | 0.00296 | 0.00295 | 0.00679 | 0.00678 | 0.00679 | 0.00679 | 1.00 |
| gmis | 0.00023 | 0.00023 | 0.00621 | 0.00621 | 0.00621 | 0.00620 | 1.00 |
| Q misspecified | | | | | | | |
| gcor | 0.00357 | 0.00363 | 0.00671 | 0.00669 | 0.00671 | 0.00670 | 1.00 |
| gmis | 0.01474 | 0.01473 | 0.00509 | 0.00509 | 0.00530 | 0.00530 | 1.00 |
| n=100, Q bounded |
| Q correct | | | | | | | |
| gcor | 0.00049 | −0.00182 | 0.00694 | 0.00781 | 0.00693 | 0.00781 | 0.89 |
| gmis | −0.00215 | −0.00168 | 0.00635 | 0.01033 | 0.00635 | 0.01033 | 0.62 |
| Q misspecified | | | | | | | |
| gcor | 0.00113 | −0.00052 | 0.00685 | 0.00738 | 0.00684 | 0.00738 | 0.93 |
| gmis | 0.01226 | 0.01031 | 0.00528 | 0.00592 | 0.00543 | 0.00602 | 0.90 |
7.2 Results
The one-step and iterative TMLEs exhibit similar performance when n=1000, with RE=1. When n=100 the iterative TMLE failed to converge for 24 of the 1,000 datasets. The performance of the two TMLEs on the remaining 976 datasets was quite similar. However, the fact that the bias, variance, and MSE of the one-step TMLE are larger when evaluated over all 1000 datasets tells us that the 24 omitted datasets on which the iterative TMLE failed were among the most challenging. One way to repair the performance of the iterative TMLE is to bound predicted outcome probabilities away from 0 and 1. We re-analyzed the same 1000 datasets enforcing bounds on Qˉn of (10^−9, 1−10^−9) for both estimators. This minimal bounding prevents the iterative TMLE from failing, and should not introduce truncation bias. Bounding Qˉn allowed the iterative TMLE to produce a result for all analyses, while enforcing the bounds had no effect on estimates produced by the one-step TMLE. This confirms that the strategy of taking many small steps within a local neighborhood whose boundaries shift minutely with each iteration helps avoid extremes. Although the iterative TMLE no longer failed when Qˉn was bounded, it had higher variance and MSE than the one-step TMLE. Efficiency gains of the one-step TMLE were between 7 and 38 percent.
7.3 Simulation study II
This study more closely examines estimator performance when there is sparsity in the data. Sparsity was introduced by overfitting the initial Qˉn0, leaving little signal for the targeting step. Theory suggests the one-step TMLE will be a more stable estimator than the iterative TMLE under these challenging conditions. To explore the impact of overfitting the data on the iterative and one-step TMLEs we constructed a nested sequence of correct logistic regression outcome models. Covariates W1, W2, W3 were generated as above. Eight additional independent and identically distributed covariates W4,…,W12 were drawn from a normal distribution with mean 0 and standard deviation 1. None of the additional covariates were causally related to Y or A. The binary treatment indicator, A was generated in the same way as in study I. The outcome was generated by setting P(Y=1|A,W)=expit(−1.2−1.2A−0.1W1−0.2W2−0.1W3).
The smallest correct model, Qc1, regresses Y on A, W1, W2, W3. Subsequent models were constructed by adding a single covariate to the model. The ten nested models were defined as
Qc1: E[Y|A,W]=expit(A+W1+W2+W3),
Qc2: E[Y|A,W]=expit(A+W1+W2+W3+W4),
Qc3: E[Y|A,W]=expit(A+W1+W2+W3+W4+W5),
Qc4: E[Y|A,W]=expit(A+W1+W2+W3+W4+W5+W6),
Qc5: E[Y|A,W]=expit(A+W1+W2+W3+W4+W5+W6+W7),
Qc6: E[Y|A,W]=expit(A+W1+W2+W3+W4+W5+W6+W7+W8),
Qc7: E[Y|A,W]=expit(A+W1+W2+W3+W4+W5+W6+W7+W8+W9),
Qc8: E[Y|A,W]=expit(A+W1+W2+W3+W4+W5+W6+W7+W8+W9+W10),
Qc9: E[Y|A,W]=expit(A+W1+W2+W3+W4+W5+W6+W7+W8+W9+W10+W11),
Qc10: E[Y|A,W]=expit(A+W1+W2+W3+W4+W5+W6+W7+W8+W9+W10+W11+W12).
Each of these regression models is correct, but as the model grows larger and larger the model fitting procedure begins to respond to random variation in the outcome. This problem is more acute at smaller sample sizes.
Estimates were obtained from 1,000 datasets (n=100), with g modeled correctly as a regression of A on W1, W2, W3. Bias, variance, MSE, and RE are reported in Table 2. The iterative TMLE failed on a large number of datasets. On the less challenging datasets where it did converge, performance of the iterative and one-step TMLEs was quite similar. When bounds on Qˉn were enforced at (10^−9, 1−10^−9), the performance of the one-step TMLE was unchanged, while the iterative TMLE was repaired. The iterative TMLE had larger bias, variance, and MSE than the one-step TMLE, which was up to four times more efficient than the iterative TMLE. These results are plotted in Figure 1, along with estimates obtained when the parameter was evaluated based on each initial non-targeted outcome regression fit. The behavior of the iterative TMLE is erratic, while that of the non-targeted estimator and the one-step TMLE are quite stable.
Table 2:
Simulation Study II. Bias, variance, mean squared error (MSE) and relative efficiency (RE) of the one-step TMLE and iterative TMLE over 1000 Monte Carlo simulations, n=100.
| | Bias (One-Step) | Bias (Iterative) | Variance (One-Step) | Variance (Iterative) | MSE (One-Step) | MSE (Iterative) | RE |
| Qˉn unbounded, problematic runs omitted* |
| Qc1 | 0.00545 | 0.00544 | 0.00368 | 0.00367 | 0.00370 | 0.00370 | 1.00 |
| Qc2 | 0.00509 | 0.00512 | 0.00376 | 0.00376 | 0.00379 | 0.00379 | 1.00 |
| Qc3 | 0.00345 | 0.00346 | 0.00384 | 0.00384 | 0.00385 | 0.00385 | 1.00 |
| Qc4 | 0.00109 | 0.00110 | 0.00408 | 0.00408 | 0.00408 | 0.00408 | 1.00 |
| Qc5 | −0.00047 | −0.00031 | 0.00425 | 0.00425 | 0.00425 | 0.00425 | 1.00 |
| Qc6 | −0.00276 | −0.00279 | 0.00443 | 0.00444 | 0.00444 | 0.00445 | 1.00 |
| Qc7 | −0.00468 | −0.00472 | 0.00461 | 0.00462 | 0.00463 | 0.00464 | 1.00 |
| Qc8 | −0.00821 | −0.00873 | 0.00514 | 0.00550 | 0.00520 | 0.00557 | 0.93 |
| Qc9 | −0.00980 | −0.01013 | 0.00549 | 0.00561 | 0.00558 | 0.00570 | 0.98 |
| Qc10 | −0.01306 | −0.01311 | 0.00579 | 0.00580 | 0.00596 | 0.00597 | 1.00 |
| Qˉn bounded at (10^−9, 1−10^−9) |
| Qc1 | −0.00308 | −0.00687 | 0.00425 | 0.01649 | 0.00426 | 0.01652 | 0.26 |
| Qc2 | −0.00341 | −0.00672 | 0.00435 | 0.01509 | 0.00435 | 0.01512 | 0.29 |
| Qc3 | −0.00480 | −0.00979 | 0.00448 | 0.01363 | 0.00450 | 0.01371 | 0.33 |
| Qc4 | −0.00622 | −0.00909 | 0.00466 | 0.01508 | 0.00470 | 0.01515 | 0.31 |
| Qc5 | −0.00784 | −0.01420 | 0.00482 | 0.00872 | 0.00487 | 0.00891 | 0.55 |
| Qc6 | −0.00958 | −0.01613 | 0.00501 | 0.00708 | 0.00510 | 0.00734 | 0.69 |
| Qc7 | −0.01117 | −0.01519 | 0.00521 | 0.01002 | 0.00533 | 0.01024 | 0.52 |
| Qc8 | −0.01393 | −0.01979 | 0.00565 | 0.00777 | 0.00583 | 0.00816 | 0.72 |
| Qc9 | −0.01528 | −0.02086 | 0.00597 | 0.00778 | 0.00620 | 0.00821 | 0.76 |
| Qc10 | −0.01746 | −0.01745 | 0.00632 | 0.01241 | 0.00662 | 0.01270 | 0.52 |
8 Universal canonical one-dimensional submodel that targets a multidimensional target parameter
Let Ψ: m→H be a Hilbert-space valued pathwise differentiable target parameter. Typically, we simply have H=IRd endowed with the standard inner product $\langle x,y\rangle=\sum_{j=1}^d x_jy_j$. However, we also allow that Ψ(P) is a function $t\mapsto\Psi(P)(t)$ from τ⊂IR to IR in a Hilbert space $L^2(\Lambda)$ endowed with inner product $\langle h_1,h_2\rangle=\int h_1(t)h_2(t)\,d\Lambda(t)$, where Λ is a user-supplied positive measure with $\int d\Lambda(t)<\infty$. For notational convenience, we will often denote the inner product $\langle h_1,h_2\rangle$ by $h_1^\top h_2$, in analogy with the typical notation for the inner product in IRd. Let $\|h\|=\sqrt{\langle h,h\rangle}$ be the Hilbert space norm, which is the standard Euclidean norm in the case that H=IRd. Let D∗(P) be the canonical gradient. If H=IRd, then this is a d-dimensional canonical gradient $D^*(P)=(D_j^*(P):j=1,\dots,d)$, but in general $D^*(P)=(D_t^*(P):t\in\tau)$. Let $L(p)=-\log p$, where $p=dP/d\mu$ is a density of P≪μ w.r.t. some dominating measure μ. In this section we will construct a one-dimensional submodel $\{P_\epsilon:\epsilon\ge0\}$ through P at ϵ=0 so that, for any ϵ≥0,
$$\frac{d}{d\epsilon}P_nL(p_\epsilon)=-\|P_nD^*(p_\epsilon)\|. \qquad (6)$$
The one-step TMLE $P_{\epsilon_n}$ with $\epsilon_n=\arg\min_\epsilon P_nL(P_\epsilon)$, or $\epsilon_n$ chosen large enough so that the absolute value of this derivative is smaller than (e.g.) 1/n, now solves $\left.\frac{d}{d\epsilon}P_nL(P_\epsilon)\right|_{\epsilon=\epsilon_n}=0$ (or <1/n in absolute value), and thus $\|P_nD^*(P_{\epsilon_n})\|=0$ (or <1/n). Note that $\|P_nD^*(P_{\epsilon_n})\|=0$ implies $P_nD_t^*(P_{\epsilon_n})=0$ for all t∈τ, so that the one-step TMLE solves all desired estimating equations.
8.1 A universal canonical submodel that targets a multidimensional target parameter
Consider the following submodel: for ϵ≥0, we define
$$p_\epsilon=p\prod_{[0,\epsilon]}\left(1+\frac{\{P_nD^*(P_x)\}^\top D^*(P_x)}{\|P_nD^*(P_x)\|}\,dx\right)=p\exp\left(\int_0^\epsilon\frac{\{P_nD^*(P_x)\}^\top D^*(P_x)}{\|P_nD^*(P_x)\|}\,dx\right). \qquad (7)$$
Theorem 4. The collection $\{p_\epsilon:\epsilon\ge0\}$ is a family of probability densities; its score at ϵ is a linear combination of $D_t^*(P_\epsilon)$, t∈τ, and is thus in the tangent space $T(P_\epsilon)$; and
$$\frac{d}{d\epsilon}P_nL(P_\epsilon)=-\|P_nD^*(P_\epsilon)\|.$$
As a consequence, $\frac{d}{d\epsilon}P_nL(P_\epsilon)=0$ implies $P_nD^*(P_\epsilon)=0$. As before, our practical construction below demonstrates that, under regularity conditions, we actually have that $\{p_\epsilon:\epsilon\}\subset m$ is also a submodel.
The normalization by $\|P_nD^*(P_x)\|$ is motivated by the practical analogue construction below and provides an important intuition behind this analytic construction. However, we can replace it by any other normalization for which the derivative of the log-likelihood at ϵ equals a norm of $P_nD^*(P_\epsilon)$. To illustrate this, consider the case H=IRd. Let $\Sigma_n(P_x)=P_n\{D^*(P_x)D^*(P_x)^\top\}$ be the empirical covariance matrix of $D^*(P_x)$, and let $\Sigma_n^{-1}(P_x)$ be its inverse. We could then define, for ϵ>0,
$$p_\epsilon=p\exp\left(\int_0^\epsilon\{P_nD^*(P_x)\}^\top\Sigma_n^{-1}(P_x)D^*(P_x)\,dx\right).$$
In this case, we have
$$\frac{d}{d\epsilon}P_n\log p_\epsilon=\{P_nD^*(P_\epsilon)\}^\top\Sigma_n^{-1}(P_\epsilon)\,P_nD^*(P_\epsilon).$$
This is an appropriately normalized (squared) norm, equal to the squared Euclidean norm of the empirical mean of the orthonormalized version of the original $D^*(P_\epsilon)$, so that the one-step TMLE will still satisfy $\|P_nD^*(P_{\epsilon_n})\|=0$.
It is not clear to us whether these choices have finite sample implications for the resulting one-step TMLE (the asymptotics are the same), or whether one choice is better than another; either way, the resulting one-step TMLE ends up with a $P_{\epsilon_n}$ satisfying $P_nD^*(P_{\epsilon_n})=0$ (or $o_P(n^{-1/2})$), the key ingredient in the proof of the asymptotic efficiency of the TMLE.
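The equivalence behind the covariance normalization can be checked with a toy computation (our own illustration; the data are random stand-ins for the influence-curve components): the quadratic form {PnD∗}ᵀΣn⁻¹{PnD∗} equals the squared Euclidean norm of the empirical mean of the whitened (orthonormalized) components.

```python
import numpy as np

rng = np.random.default_rng(1)
D = rng.normal(size=(500, 3))            # stand-in for D*(P)(O_i), d = 3
D[:, 1] += 0.5 * D[:, 0]                 # induce correlation between components

PnD = D.mean(axis=0)                     # empirical mean P_n D*
Sigma = D.T @ D / len(D)                 # Sigma_n = P_n{D* D*^T}

# Mahalanobis-type quadratic form {P_n D*}^T Sigma_n^{-1} {P_n D*}
quad = PnD @ np.linalg.solve(Sigma, PnD)

# whitening: with Sigma_n = L L^T, the components L^{-1} D* are orthonormal
# under P_n, and the squared norm of their empirical mean matches `quad`
L = np.linalg.cholesky(Sigma)
D_white = np.linalg.solve(L, D.T).T
assert np.isclose(D_white.mean(axis=0) @ D_white.mean(axis=0), quad)
```

This is only a linear-algebra identity, but it makes precise the sense in which the Σn-normalization measures PnD∗ in the coordinates in which the influence-curve components are orthonormal.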
8.2 The practical construction of a universal canonical one-dimensional submodel targeting a multidimensional target parameter
Let’s define a local least favorable submodel $\{p^{lfm}_\delta:\delta\}\subset m$ by the following local property: for all δ,
$$\left\langle\left.\frac{d}{d\delta}\log p^{lfm}_\delta\right|_{\delta=0},\delta\right\rangle=\langle D^*(P),\delta\rangle=D^*(P)^\top\delta.$$
For the case that H=IRd, this corresponds with assuming that the score of the submodel at δ=0 equals the canonical gradient D∗(P), while, for a general Hilbert space, it states that the derivative of $\log p^{lfm}_\delta$ at δ=0 in the direction δ (a function in H) equals $\langle D^*(P),\delta\rangle=\int D_t^*(P)\delta(t)\,d\Lambda(t)$.
Consider the log-likelihood criterion $P_n\log p^{lfm}_\delta$, and note that its derivative at δ=0 in the direction δ equals $\langle P_nD^*(P),\delta\rangle=P_nD^*(P)^\top\delta$. For a small number dx, we want to maximize the log-likelihood over all δ with ‖δ‖≤dx, and locally this corresponds with maximizing its linear gradient approximation:
δ→PnD∗(P)⊤δ.
By the Cauchy-Schwarz inequality, it follows that this is maximized over δ with ∥δ∥≤dx by
$$\delta_n^*(P,dx)=\frac{P_nD^*(P)}{\|P_nD^*(P)\|}\,dx\equiv\delta_n^*(P)\,dx,$$
where we defined δn∗(P)=PnD∗(P)/∥PnD∗(P)∥. We can now define our update Pdx=Pδn∗(P,dx)lfm. This process can now be iterated by applying the above with P replaced by Pdx, resulting in an update P2dx, and in general PKdx. So this updating process is defined by the differential equation:
$$P_{x+dx}=P^{lfm}_{x,\delta_n^*(P_x)dx},$$
where Px,δlfm is the local least favorable multidimensional submodel above but now through Px instead of P.
Assuming that the local least favorable model $h\mapsto P^{lfm}_{x,h}$ is continuously twice differentiable with score $D^*(P_x)$ at h=0, we obtain a second order Taylor expansion
$$p^{lfm}_{x,\delta_n^*(P_x)dx}=p_x+\left\{\left.\frac{d}{dh}p^{lfm}_{x,h}\right|_{h=0}\right\}^\top\delta_n^*(P_x)\,dx+O((dx)^2)=p_x\left(1+\{\delta_n^*(P_x)\}^\top D^*(P_x)\,dx\right)+O((dx)^2),$$
so that, under mild regularity conditions, we obtain
$$p_{x+dx}=p_x\left(1+\{\delta_n^*(P_x)\}^\top D^*(P_x)\,dx\right)+O((dx)^2).$$
This implies:
$$p_\epsilon=p\exp\left(\int_0^\epsilon\frac{\{P_nD^*(P_x)\}^\top}{\|P_nD^*(P_x)\|}D^*(P_x)\,dx\right).$$
So we obtain the exact same analytic representation (7) as above. Since the practical construction starts out with P∈m and never leaves the model m, this proves that, under mild regularity conditions, the analytic representation (7) is actually a submodel of m after all; but, when using its practical implementation and approximation, one should use the actual local least favorable submodel in order to guarantee that one stays in the model. We could formalize this in a theorem analogous to Theorem 2, but instead such a theorem will be presented in Section 10 for the more general targeted minimum loss-based estimation methodology.
The above practical construction provides us with an intuition for the normalization by ∥PnD∗(Px)∥.
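The practical construction admits a simple generic sketch. The callback names and signatures below are our own hypothetical conventions (eval_Dstar returns the influence-curve components per observation; lfm_step moves along the local least favorable submodel), not a library API:

```python
import numpy as np

def universal_canonical_update(P, eval_Dstar, lfm_step, dx=1e-3, tol=1e-3,
                               max_steps=10**5):
    """Iterate P_{x+dx} = lfm_step(P_x, delta_n*(P_x) dx) until
    ||P_n D*(P_x)|| < tol.  eval_Dstar(P) -> (n, d) array of D*(P)(O_i);
    lfm_step(P, delta) -> updated parameter.  A schematic sketch only."""
    for _ in range(max_steps):
        D = eval_Dstar(P)                 # influence-curve components at P
        PnD = D.mean(axis=0)
        norm = np.sqrt(PnD @ PnD)
        if norm < tol:                    # efficient score equation solved
            break
        delta = (PnD / norm) * dx         # Cauchy-Schwarz maximizer, ||delta||=dx
        P = lfm_step(P, delta)            # micro step along the local submodel
    return P
```

As a sanity check, plugging in the trivial problem of estimating a d-dimensional mean (where D∗(P)(O)=O−P and the "submodel step" is simply addition) drives P to the empirical mean, i.e., solves PnD∗=0 exactly as the theory prescribes.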
8.3 Existence of MLE or approximate MLE ϵn
Since
$$P_n\log p_\epsilon=P_n\log p+\int_0^\epsilon\|P_nD^*(P_x)\|\,dx,$$
and its derivative thus equals $\|P_nD^*(P_\epsilon)\|$, the log-likelihood is non-decreasing in ϵ.
If the local least favorable submodel in the practical construction of the one-dimensional universal canonical submodel $\{p_\epsilon:\epsilon\ge0\}$ (7) only contains densities with supremum norm smaller than some M<∞ (e.g., this is assumed by the model m), then $\sup_{\epsilon\ge0}\sup_{o\in\mathcal O}p_\epsilon(o)<M<\infty$. This implies that $P_n\log p_\epsilon$ is bounded from above by log M. Let’s first assume that $\lim_{\epsilon\to\infty}P_n\log p_\epsilon<\infty$. Thus, the log-likelihood is a strictly increasing function until it becomes flat, if ever. Suppose that $\limsup_{x\to\infty}\|P_nD^*(P_x)\|>\delta>0$ for some δ>0. Then it follows that the log-likelihood converges to infinity as x→∞, which contradicts the assumption that the log-likelihood is bounded from above by log M<∞. Thus, we know that $\limsup_{x\to\infty}\|P_nD^*(P_x)\|=0$, so that we can find an ϵn such that for ϵ>ϵn, $\|P_nD^*(P_\epsilon)\|<1/n$, as desired.
Suppose now that we are in a case in which the log-likelihood converges to infinity as ϵ→∞, so that our bounded log-likelihood assumption is violated. This might correspond with a case in which each pϵ is a continuous density, but pϵ starts approximating an empirical distribution as ϵ→∞. Even in such a case, one would expect that $\|P_nD^*(P_\epsilon)\|\to0$, just as an NPMLE of a continuous density of a survival time solves the efficient influence curve equation for its survival function.
The above practical construction of the submodel, as an iterative local maximization of the log-likelihood along its gradient, strongly suggests that even without the above boundedness assumption the derivative ∥PnD∗(Pϵ)∥ will converge to zero as ϵ→∞ so that the desired MLE or approximate MLE exists. Our initial practical implementations of this one-step TMLE of a multivariate target parameter demonstrate that it works well and that finding the desired maximum or approximate maximum is not an issue. We will demonstrate the implementation and practical demonstration of such a one-step TMLE for challenging causal inference problems in a future article.
8.4 A universal score-specific one-dimensional submodel targeting a multivariate score equation
In the above two subsections we can simply replace D∗(P) by a user-supplied D(P), giving us a theoretical one-dimensional parametric model {Pϵ: ϵ} so that the derivative of the log-likelihood $\frac{d}{d\epsilon}P_n\log p_\epsilon$ at ϵ equals $\|P_nD(P_\epsilon)\|$, so that a corresponding one-step TMLE will solve $P_nD(P_{\epsilon_n})=0$. Similarly, a given local parametric model whose score at ϵ=0 equals D(P) yields a corresponding practical construction of this universal submodel. One can also use such a universal score-specific submodel to construct a one-step TMLE of a one-dimensional target parameter with extra properties, by making it solve not only the efficient influence curve equation but also other equations of interest (such as PnD1∗(Qn∗)=PnD2∗(Qn∗)=0 in Section 6). In the current literature, solving multiple score equations typically required an iterative TMLE based on a local score-specific submodel, so these estimation problems can be revisited with this new one-step TMLE (see our supplementary material).
9 Example: A one-step TMLE, based on universal canonical one-dimensional submodel, of an infinite dimensional target parameter
An open problem has been the construction of an efficient substitution estimator Ψ(Pn∗) of a pathwise differentiable infinite dimensional target parameter Ψ(P0) such as a survival function. Current approaches would correspond with incompatible estimators such as using a TMLE for each Ψ(P0)(t) separately, resulting in a nonsubstitution estimator such as a non-monotone estimator of a survival function. In this section we demonstrate, through a causal inference example, that our universal canonical submodel allows us to solve this problem with the one-step TMLE defined in the previous section.
Let O=(W,A,T)∼P0, where W are baseline covariates, A∈{0,1} is a point-treatment, and T is a survival time. Consider a statistical model m that only makes assumptions about the conditional distribution g0(a|W)=P0(A=a|W) of A, given W. Let W→d(W)∈{0,1} be a given dynamic treatment satisfying g0(d(W)|W)>0a.e. Let Ψ: m→H be defined by:
Ψ(P)(t)=EPP(T>t|A=d(W),W),t≥0.
Under a causal model and the randomization assumption this equals the counterfactual survival function P(Td>t) of the counterfactual survival time Td under intervention d.
Let H be the Hilbert space of real valued functions on IR≥0 endowed with inner product $h_1^\top h_2=\langle h_1,h_2\rangle=\int h_1(t)h_2(t)\,d\Lambda(t)$ for some user-supplied positive and finite measure Λ. The norm on this Hilbert space is thus given by $\|h\|=\sqrt{\langle h,h\rangle}=\left(\int h(t)^2\,d\Lambda(t)\right)^{1/2}$. Let Qˉt(A,W)=P(T>t|A,W), Y(t)=I(T>t), QW the marginal probability distribution of W, and Q=(Qˉ,QW). The efficient influence curve D∗(P)=(Dt∗(P): t≥0) is defined by:
$$D_t^*(P)(O)=\frac{I(A=d(W))}{g(A|W)}\left(Y(t)-\bar Q_t(A,W)\right)+\left\{\bar Q_t(d(W),W)-\Psi(P)(t)\right\}\equiv D_{1,t}^*(g,\bar Q)+D_{2,t}^*(P),$$
where D1,t∗(g,Qˉ) is the first component of the efficient influence curve that is a score of the conditional distribution of T, given A,W. Notice that Ψ(P)=Ψ1(QW,Qˉ)=(QWQˉt: t≥0). We will estimate QW,0 with the empirical distribution of W1,…,Wn, so that a TMLE will only need to target the estimator of the conditional survival function Qˉ0 of T, given A,W. Let q(t|A,W) be the density of T, given A,W and let qn be an initial estimator of this conditional density. For example, one might use machine learning to estimate the conditional hazard q0/Q¯0, which then implies a corresponding density estimator qn. We are also given an estimator gn of g0.
The universal canonical one-dimensional submodel (7) applied to p=qn is defined by the following recursive relation: for ϵ>0,
$$q_{n,\epsilon}=q_n\exp\left(\int_0^\epsilon\frac{\{P_nD_1^*(g_n,\bar Q_{n,x})\}^\top D_1^*(g_n,\bar Q_{n,x})}{\|P_nD_1^*(g_n,\bar Q_{n,x})\|}\,dx\right).$$
To obtain more insight into this expression, we note, for example, that the inner product is given by:
$$\{P_nD_1^*(g_n,\bar Q_{n,x})\}^\top D_1^*(g_n,\bar Q_{n,x})(o)=\int_t\{P_nD_{1,t}^*(g_n,\bar Q_{n,x})\}\,D_{1,t}^*(g_n,\bar Q_{n,x})(o)\,d\Lambda(t), \qquad (8)$$
and similarly we have such an integral representation of the norm in the denominator. Our Theorem 4, or explicit verification, shows that for all ϵ≥0, qn,ϵ is a conditional density of T, given A,W, and
$$\frac{d}{d\epsilon}P_n\log q_{n,\epsilon}=\|P_nD_1^*(g_n,\bar Q_{n,\epsilon})\|.$$
Thus, as we move ϵ away from zero, the log-likelihood increases, and one searches for the first ϵn at which this derivative is smaller than (e.g.) 1/n. Let qn∗=qn,ϵn, and let $\bar Q^*_{n,t}(A,W)=\int_t^\infty q_n^*(s|A,W)\,ds$ be its corresponding conditional survival function, t≥0. Then our one-step TMLE of the d-specific survival function Ψ(P0) is given by ψn∗=Ψ(QW,n,Qˉn∗)=QW,nQˉn∗:
$$\psi_n^*(t)=\frac{1}{n}\sum_{i=1}^n\bar Q^*_{n,t}(d(W_i),W_i).$$
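The mechanics of targeting a conditional density while preserving its density property can be illustrated in discrete time. The following is our own toy sketch (grid, sample size, hazards, and the single micro step are arbitrary choices, not the paper's code): one micro-update along the local least favorable submodel $q_\delta=q(1+\langle\delta,D_1^*\rangle)$, with Λ the counting measure on five time points and static rule d(W)=1, leaves q a conditional density, so the plug-in survival curve stays monotone.

```python
import numpy as np

rng = np.random.default_rng(2)
n, tgrid = 200, np.arange(1, 6)
W = rng.binomial(1, 0.5, n)
A = rng.binomial(1, 0.5, n)
g1 = 0.5                                    # known randomization: g(1|W) = 0.5
haz = 0.2 + 0.1 * A + 0.1 * W               # discrete hazard; mass capped at t=5
pmf = np.array([haz * (1 - haz) ** (k - 1) for k in tgrid]).T
pmf[:, -1] += 1.0 - pmf.sum(axis=1)
T = np.array([rng.choice(tgrid, p=p) for p in pmf])

# crude initial estimate of q(t|A=1, W=w): smoothed stratum-specific pmf
q = np.ones((2, len(tgrid)))
for w in (0, 1):
    for j, t in enumerate(tgrid):
        q[w, j] += np.sum(T[(A == 1) & (W == w)] == t)
q /= q.sum(axis=1, keepdims=True)

surv = lambda q: 1.0 - np.cumsum(q, axis=1)      # Qbar_t(1, w) = P(T > t)

S = surv(q)
Yt = (T[:, None] > tgrid[None, :]).astype(float)
D1 = (A == 1)[:, None] / g1 * (Yt - S[W])        # D*_{1,t}(g, Qbar)(O_i)
PnD1 = D1.mean(axis=0)
delta = 1e-3 * PnD1 / np.sqrt(PnD1 @ PnD1)       # normalized direction, ||delta|| = dx

q_upd = np.empty_like(q)
for w in (0, 1):
    # score in the response t for stratum w: D(t, s) = (1(t > s) - S[w, s]) / g1
    Dts = (np.greater.outer(tgrid, tgrid).astype(float) - S[w]) / g1
    q_upd[w] = q[w] * (1.0 + Dts @ delta)        # still sums to one in t

psi = surv(q_upd)[W].mean(axis=0)                # plug-in survival curve
```

Because the score has conditional mean zero under the current q, each updated q_upd row sums to one exactly, which is what makes the resulting ψn∗ a genuine (monotone) survival function.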
Since qn∗ is an actual conditional density, it follows that ψn∗ is a survival function. Suppose that the derivative of the log-likelihood at ϵn equals zero exactly (instead of being smaller than 1/n). Then we have PnD∗(gn,QW,n,Qˉn∗)=0, so that for each t≥0, PnDt∗(gn,QW,n,Qˉn∗)=0, making ψn∗(t) a standard TMLE of ψ0(t), so that its asymptotic linearity for a fixed t can be established accordingly. Let’s now consider a proof of weak convergence of $\sqrt n(\psi_n^*-\psi_0)$ as a random function. Firstly, assume that an exact MLE is obtained, so that PnD∗(gn,QW,n,Qˉn∗)=0. Combined with $\Psi(Q_n^*)-\Psi(Q_0)=-P_0D^*(g_n,Q_n^*)+R_2((Q_n^*,g_n),(Q_0,g_0))$, where $R_2()=(R_{2t}():t\in\tau)$ for an explicitly defined $R_{2t}(P,P_0)$, we then obtain
ψn∗−ψ0=(Pn−P0)D∗(gn,Qn∗)+R2((Qn∗,gn),(Q0,g0)).
We now assume that $\{D_t^*(P):P\in m, t\in\tau\}$ is a P0-Donsker class, that $\sup_{t\in\tau}P_0\{D_t^*(g_n,Q_n^*)-D_t^*(g_0,Q_0)\}^2\to0$ in probability, and that $\sup_t|R_{2t}((Q_n^*,g_n),(Q_0,g_0))|=o_P(n^{-1/2})$. Then, it follows that
$$\sqrt n(\psi_n^*-\psi_0)=\sqrt n(P_n-P_0)D^*(P_0)+o_P(1)\Rightarrow_d G_0,$$
that is, $\sqrt n(\psi_n^*-\psi_0)$ converges weakly, as a random element of the cadlag function space endowed with the supremum norm, to a Gaussian process G0 with covariance structure implied by the covariance function $\rho(s,t)=P_0D_s^*(P_0)D_t^*(P_0)$. In particular, if g0 is known, then $R_{2t}((Q_n^*,g_0),(Q_0,g_0))=0$, so that the second order term condition $\sup_t|R_{2t}((Q_n^*,g_n),(Q_0,g_0))|=o_P(n^{-1/2})$ is automatically satisfied with $o_P(n^{-1/2})$ replaced by 0. This also allows the construction of a simultaneous confidence band for ψ0. Due to the double robustness of the efficient influence curve, under appropriate conditions one can also obtain asymptotic linearity and weak convergence with an inefficient influence curve under misspecification of either gn or Qˉn.
If we only have $P_nD^*(P_n^*)=o_P(n^{-1/2})$ (instead of 0), then the above proof still applies, so that we now obtain:
$$\sqrt n(\psi_n^*-\psi_0)=\sqrt n(P_n-P_0)D^*(P_0)+r_n,$$
but where now $r_n=o_P(1)$, so that we obtain asymptotic efficiency and weak convergence in the Hilbert space $L^2(\Lambda)$, beyond the point-wise efficiency of $\psi_n^*(t)$. However, in practice one can actually track the supremum norm $\|P_nD^*(P_{\epsilon_n})\|_\infty=\sup_t|P_nD_t^*(P_{\epsilon_n})|$, and if one observes that for the selected ϵn this supremum norm is smaller than 1/n, then we still obtain the asymptotic efficiency in supremum norm above.
Regarding the practical construction of qn,ϵ, we could use the following infinite dimensional local least favorable submodel through a conditional density q:
$$q^{lfm}_\delta=q\left(1+\delta^\top D_1^*(g,\bar Q)\right),$$
and follow the practical construction described in the previous section for general local least favorable submodels. Notice that here $\delta^\top D_1^*(g,\bar Q)=\int\delta(t)D_{1,t}^*(g,\bar Q)\,d\Lambda(t)$. In order to guarantee that the supremum norm of the density $q^{lfm}_\delta$ for local δ with ‖δ‖<dx remains below a universal constant M<∞, one could parameterize such models through the conditional hazard on a logistic scale that bounds the hazard within [0,M]. However, we doubt that this will be an issue in practice, and since it may be necessary for the continuous density qn,ϵ to approximate an empirical distribution in some sense in order to solve $P_nD^*(P_\epsilon)=0$, we do not want to prevent this from happening.
10 Universal canonical one-dimensional submodel for targeted minimum loss-based estimation of a multidimensional target parameter
10.1 A universal canonical one-dimensional submodel
For the sake of presentation we will focus on the case that the target parameter is Euclidean valued, i.e., H=IRd, but the presentation immediately generalizes to infinite dimensional target parameters, as in the previous section. Let’s now generalize the construction of a universal canonical submodel to the more general targeted minimum loss-based estimation methodology. We now assume that Ψ(P)=Ψ1(Q(P))∈IRd for some target parameter Q: m→Q(m) defined on the model and a real valued function Ψ1: Q(m)→IRd. Let L(Q)(O) be a loss function for Q(P) in the sense that Q(P)=argminQ∈Q(m)PL(Q). Let D∗(P)=D∗(Q(P),G(P)) be the canonical gradient of Ψ at P, where G: m→G(m) is some nuisance parameter. We consider the case that the linear span of the components of the efficient influence curve D∗(P) is in the tangent space of Q, so that a least favorable submodel does not need to fluctuate G; otherwise, one simply includes G in the definition of Q. Given (Q,G), let {Qδlfm: δ}⊂Q(m) be a local d-dimensional least favorable model w.r.t. loss function L(Q) at δ=0 so that
ddδL|(Qδlfm)δ=0=D∗(Q,G).
The dependence of this submodel on G is suppressed in this notation.
Consider the empirical risk $P_nL(Q^{\mathrm{lfm}}_{\delta})$, and note that its gradient at $\delta=0$ equals $P_nD^{*}(Q,G)$. For a small number $dx$, we want to minimize the empirical risk over all $\delta$ with $\|\delta\|\leq dx$, and locally, this corresponds with maximizing its linear gradient approximation:
\[
\delta\rightarrow \{P_nD^{*}(Q,G)\}^{\top}\delta.
\]
By the Cauchy-Schwarz inequality, it follows that this is maximized over $\delta$ with $\|\delta\|\leq dx$ by
\[
\delta_n^{*}(Q,dx)=\frac{P_nD^{*}(Q,G)}{\|P_nD^{*}(Q,G)\|}\,dx\equiv \delta_n^{*}(Q)\,dx,
\]
where we defined $\delta_n^{*}(Q)=P_nD^{*}(Q,G)/\|P_nD^{*}(Q,G)\|$. We can now define our update $Q_{dx}=Q^{\mathrm{lfm}}_{\delta_n^{*}(Q,dx)}$. This process can now be iterated by applying the above with $Q$ replaced by $Q_{dx}$, resulting in an update $Q_{2dx}$, and, in general, $Q_{Kdx}$. So this updating process is defined by the differential equation
\[
Q_{x+dx}=Q^{\mathrm{lfm}}_{x,\delta_n^{*}(Q_x)\,dx},
\]
where $Q^{\mathrm{lfm}}_{x,\delta}$ is the local least favorable multidimensional submodel above, but now through $Q_x$ instead of $Q$.
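The recursive update $Q_{x+dx}$ is an Euler scheme that moves $Q$ a step of length $dx$ along the normalized empirical score $\delta_n^{*}(Q_x)$. The sketch below runs this scheme in a deliberately simple toy problem, with the identity fluctuation $Q_{\delta}=Q+\delta$ and the estimating function $D(Q)(O)=O-Q$ of the $d$-dimensional mean; these choices and the function name `universal_flow` are our illustrative assumptions, not the paper's density-valued construction.

```python
import numpy as np

def universal_flow(O, Q0, dx=1e-3, tol=1e-3, max_steps=100_000):
    """Euler scheme Q_{x+dx} = Q_x + delta_n*(Q_x) dx for the identity
    fluctuation, where delta_n*(Q) = P_n D(Q) / ||P_n D(Q)|| and
    D(Q)(O) = O - Q is the toy estimating function of the mean."""
    Q = np.asarray(Q0, dtype=float)
    for _ in range(max_steps):
        PnD = O.mean(axis=0) - Q          # empirical mean of D(Q)
        norm = np.linalg.norm(PnD)
        if norm <= tol:                   # stop once P_n D(Q_x) ~ 0
            break
        Q = Q + (PnD / norm) * dx         # step along delta_n*(Q_x) dx
    return Q

rng = np.random.default_rng(0)
O = rng.normal(size=(500, 3))
Q_eps = universal_flow(O, Q0=np.zeros(3))
# Q_eps solves P_n D(Q) = 0 up to tol, i.e. it sits at the sample mean
```

In a real TMLE the step `Q + d` would be replaced by the local least favorable submodel $Q^{\mathrm{lfm}}_{x,\delta\,dx}$, but the driving logic, normalizing the empirical score and taking a small step, is exactly the one defined by the differential equation above.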
Assume that for some $\dot{L}(Q)(O)$, we have
\[
\left.\frac{d}{dh}L\left(Q^{\mathrm{lfm}}_{x,h}\right)\right|_{h=0}=\dot{L}(Q_x)\left.\frac{d}{dh}Q^{\mathrm{lfm}}_{x,h}\right|_{h=0}. \tag{9}
\]
Then,
\[
\left.\frac{d}{dh}Q^{\mathrm{lfm}}_{x,h}\right|_{h=0}=\frac{D^{*}(Q_x,G)}{\dot{L}(Q_x)}.
\]
Utilizing that the local least favorable model $h\rightarrow Q^{\mathrm{lfm}}_{x,h}$ is continuously twice differentiable with score $D^{*}(Q_x,G)$ at $h=0$, we obtain the second-order Taylor expansion
\[
Q^{\mathrm{lfm}}_{x,\delta_n^{*}(Q_x)dx}=Q_x+\left.\frac{d}{dh}Q^{\mathrm{lfm}}_{x,h}\right|_{h=0}\delta_n^{*}(Q_x)\,dx+O\big((dx)^2\big)
=Q_x+\frac{D^{*}(Q_x,G)^{\top}}{\dot{L}(Q_x)}\,\delta_n^{*}(Q_x)\,dx+O\big((dx)^2\big).
\]
This implies the following recursive analytic definition of the universal canonical submodel through $Q$:
\[
Q_{\epsilon}=Q+\int_{0}^{\epsilon}\frac{D^{*}(Q_x,G)^{\top}}{\dot{L}(Q_x)}\,\delta_n^{*}(Q_x)\,dx. \tag{10}
\]
Let's now explicitly verify that this indeed satisfies the desired condition, so that the one-step TMLE solves $P_nD^{*}(Q_{\epsilon_n},G)=0$. Assuming only (9), it follows that
\[
\begin{aligned}
\frac{d}{d\epsilon}P_nL(Q_{\epsilon})
&=P_n\frac{d}{d\epsilon}L(Q_{\epsilon})
=P_n\dot{L}(Q_{\epsilon})\frac{d}{d\epsilon}Q_{\epsilon}
=P_n\dot{L}(Q_{\epsilon})\frac{D^{*}(Q_{\epsilon},G)^{\top}}{\dot{L}(Q_{\epsilon})}\,\delta_n^{*}(Q_{\epsilon})\\
&=P_nD^{*}(Q_{\epsilon},G)^{\top}\delta_n^{*}(Q_{\epsilon})
=\{P_nD^{*}(Q_{\epsilon},G)\}^{\top}\frac{P_nD^{*}(Q_{\epsilon},G)}{\|P_nD^{*}(Q_{\epsilon},G)\|}\\
&=\frac{\sum_{j=1}^{d}\{P_nD_j^{*}(Q_{\epsilon},G)\}^{2}}{\|P_nD^{*}(Q_{\epsilon},G)\|}
=\|P_nD^{*}(Q_{\epsilon},G)\|.
\end{aligned}
\]
In addition, under some regularity conditions, so that the following derivation in terms of the local least favorable submodel applies, it also follows that Qϵ∈Q(m).
This proves the following theorem.
Theorem 5. Given any $(Q,G)$ compatible with model $m$, let $\{Q^{\mathrm{lfm}}_{\delta}:\delta\in B_a(0)\}\subset Q(m)$ be a local least favorable model w.r.t. loss function $L(Q)$ at $\delta=0$, so that
\[
\left.\frac{d}{d\delta}L\left(Q^{\mathrm{lfm}}_{\delta}\right)\right|_{\delta=0}=D^{*}(Q,G).
\]
Here $B_a(0)=\{x:\|x\|<a\}$ for some positive number $a$. Assume that for some $\dot{L}(Q)(O)$, we have
\[
\left.\frac{d}{d\epsilon}L\left(Q^{\mathrm{lfm}}_{\epsilon}\right)\right|_{\epsilon=0}=\dot{L}(Q)\left.\frac{d}{d\epsilon}Q^{\mathrm{lfm}}_{\epsilon}\right|_{\epsilon=0}.
\]
Consider the corresponding univariate model $\{Q_{\epsilon}:\epsilon\}$ defined by (10). It goes through $Q$ at $\epsilon=0$, and it satisfies, for all $\epsilon$,
\[
P_n\frac{d}{d\epsilon}L(Q_{\epsilon})=\|P_nD^{*}(Q_{\epsilon},G)\|, \tag{11}
\]
where $\|x\|=\big(\sum_{j=1}^{d}x_j^{2}\big)^{1/2}$ is the Euclidean norm. In addition, assume that $a$ in $B_a(0)$ can be chosen independently of the choice $(Q,G)$ in $\{(Q_{\epsilon},G):\epsilon>0\}$, and assume the following second-order Taylor expansion: for $h=(h_1,\ldots,h_d)$,
\[
Q^{\mathrm{lfm}}_{\epsilon,h}=Q_{\epsilon}+\left.\frac{d}{dh}Q^{\mathrm{lfm}}_{\epsilon,h}\right|_{h=0}h+R_2(Q_{\epsilon},G,\|h\|)
=Q_{\epsilon}+\frac{D^{*}(Q_{\epsilon},G)}{\dot{L}(Q_{\epsilon})}\,h+R_2(Q_{\epsilon},G,\|h\|),
\]
where $\sup_{\epsilon}\sup_{o\in O}|R_2(Q_{\epsilon},G,\|h\|)(o)|=O(\|h\|^{2})$. We also assume that $\sup_{\epsilon}\sup_{o\in O}\left|D^{*}(Q_{\epsilon},G)/\dot{L}(Q_{\epsilon})(o)\right|<\infty$. Then we also have $\{Q_{\epsilon}:\epsilon\geq 0\}\subset Q(m)$.
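The identity (11), that the derivative of the empirical risk along the universal submodel equals the Euclidean norm of the empirical score, can be checked by finite differences in a toy model. Below we take the identity fluctuation $Q_{\delta}=Q+\delta$ with the loss $L(Q)(O)=O^{\top}Q-\|Q\|^{2}/2$, whose gradient in $Q$ is the estimating function $D(Q)(O)=O-Q$; these are toy assumptions chosen only so that $\frac{d}{d\delta}L(Q_{\delta})|_{\delta=0}=D(Q)$, as the construction requires.

```python
import numpy as np

rng = np.random.default_rng(2)
O = rng.normal(size=(300, 3))
Obar = O.mean(axis=0)

def PnL(Q):
    # empirical risk for the toy loss L(Q)(O) = O'Q - ||Q||^2 / 2,
    # whose gradient in Q is the estimating function D(Q)(O) = O - Q
    return (O @ Q).mean() - 0.5 * Q @ Q

Q = np.array([0.3, -0.2, 0.1])
PnD = Obar - Q                        # P_n D(Q)
delta = PnD / np.linalg.norm(PnD)     # delta_n*(Q)

d_eps = 1e-4
lhs = (PnL(Q + delta * d_eps) - PnL(Q)) / d_eps   # d/d_eps P_n L(Q_eps)
rhs = np.linalg.norm(PnD)                          # ||P_n D(Q)||
# lhs matches rhs up to O(d_eps)
```

The finite-difference quotient agrees with $\|P_nD(Q)\|$ up to a term of order $d\epsilon$, which is the discrete counterpart of the chain-rule computation preceding Theorem 5.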
11 Summary
Given a $d$-variate estimating function $(Q,O)\rightarrow D(Q,G)(O)$, a loss function $L(Q)$ for $Q:m\rightarrow Q(m)$, and a local $d$-dimensional submodel $\{Q^{\mathrm{sm}}_{\delta}:\delta\}\subset Q(m)$ so that $\frac{d}{d\delta}L(Q^{\mathrm{sm}}_{\delta})\big|_{\delta=0}=D(Q,G)$, we constructed a one-dimensional universal submodel $\{Q_{\epsilon}:\epsilon\geq 0\}\subset Q(m)$ through $Q$ at $\epsilon=0$ that has the property that, for all $\epsilon\geq 0$,
\[
\frac{d}{d\epsilon}P_nL(Q_{\epsilon})=\|P_nD(Q_{\epsilon},G)\|,
\]
where $\|\cdot\|$ is the Euclidean norm. Our analytic formula for this universal submodel does not depend on the local submodel, but the local submodel can still play a role in the practical construction. In the special case $d=1$, we also constructed a universal one-dimensional submodel so that for all $\epsilon$, $\frac{d}{d\epsilon}L(Q_{\epsilon})=D(Q_{\epsilon},G)$, which then implies $\frac{d}{d\epsilon}P_nL(Q_{\epsilon})=P_nD(Q_{\epsilon},G)$. For each of these universal submodels, the one-step TMLE $Q_{\epsilon_n}$ with $\epsilon_n=\arg\min_{\epsilon}P_nL(Q_{\epsilon})$ solves $P_nD_j(Q_{\epsilon_n},G)=0$, $j=1,\ldots,d$. We showed how this result immediately extends to an infinite-dimensional estimating function $D=(D_t:t\in\tau)$ by replacing the Euclidean inner product with a Hilbert space inner product. If $D$ is the canonical gradient of a target parameter, we referred to this submodel as the universal canonical submodel and, if $d=1$, the universal least favorable submodel.
The constructions of these universal submodels correspond with iteratively defining $Q_{\epsilon+d\epsilon}=Q^{\mathrm{sm}}_{\epsilon,\delta(\epsilon)d\epsilon}$, where $\delta(\epsilon)=P_nD(Q_{\epsilon},G)/\|P_nD(Q_{\epsilon},G)\|$ moves along the gradient of the empirical risk $P_nL(Q_{\epsilon})$ at $\epsilon$. These practical constructions demonstrate that this algorithm succeeds in updating an initial $Q$ into an update $Q_n^{*}=Q_{\epsilon_n}$ that solves the desired equation $P_nD(Q_{\epsilon_n},G)=0$ while minimally decreasing the empirical risk relative to its initial value $P_nL(Q)$. That is, with minimal additional data fitting it achieves the desired goal, while fully preserving the statistical properties of the initial estimator represented by $Q$.
The universal submodels have dramatic implications for the TMLE literature by allowing one to construct a one-step TMLE for any multivariate, and even infinite-dimensional, pathwise differentiable target parameter, solving the desired estimating equation, so that this TMLE is asymptotically efficient and possibly has additional desired properties implied by solving the equation $P_nD(Q_{\epsilon_n},G)=0$. The one-step TMLE step only involves minimizing an empirical risk over a univariate fluctuation parameter $\epsilon$. In the current literature, we defined various iterative TMLEs based on multivariate local submodels; these can now be replaced by a more stable one-step TMLE relying only on minimizing over a univariate $\epsilon$. We demonstrated such new one-step TMLEs for various examples in this article and supplementary material, but obviously this will impact many more problems than the ones presented here. We demonstrated with a simulation study that the one-step TMLE is more robust and stable than the iterative TMLE in finite samples when the targeting step gets challenging.
The important advantages of the TMLE based on a local least favorable submodel relative to estimating equation methods and the one-step estimator have been emphasized in the literature. Since the estimating equation methodology is more limited than the one-step estimator, relying on 1) an estimating function representation of the efficient influence curve, 2) existence of a solution of the estimating equation, and 3) uniqueness of that solution, let's focus on contrasting the TMLE with the one-step estimator. One important advantage of the TMLE relative to the one-step estimator is that it is a substitution estimator, thereby making it in principle more robust by respecting the global constraints of the model $m$. Beyond this, because the TMLE updates an initial estimator through minimization of a loss-function-specific empirical risk, it allows one to further refine the targeted update step, as carried out in C-TMLE. Another advantage is that it actually provides a corresponding data distribution $P_n^{*}\in m$ compatible with the estimator of the target parameter, allowing one, for example, to compare different TMLEs by the empirical risk of $P_n^{*}$. On the other hand, the one-step estimator takes only one step, and that can add important stability relative to a possibly iterative TMLE, making the comparison less clear-cut in the case that the TMLE is iterative. However, our new universal submodels presented in this article make the TMLE a one-step estimator as well, thereby addressing this possible criticism of TMLE.
The benefit of being a substitution estimator is particularly appealing when one estimates an infinite-dimensional target parameter, such as a survival function, with clear global structure. Due to our universal canonical one-dimensional submodel, we could provide a one-step TMLE that completely respects this global structure of the infinite-dimensional target parameter, something a one-step estimator (or estimating equation method) cannot achieve.
Future simulation studies will have to evaluate the practical benefits that come with the new one-step TMLEs based on universal least favorable or canonical submodels, relative to TMLEs based on the typical local least favorable submodel.