
Instrumental variable regression via kernel maximum moment loss

Rui Zhang, Masaaki Imaizumi, Bernhard Schölkopf, and Krikamol Muandet
Published/Copyright: April 4, 2023

Abstract

We investigate a simple objective for nonlinear instrumental variable (IV) regression based on a kernelized conditional moment restriction known as a maximum moment restriction (MMR). The MMR objective is formulated by maximizing the interaction between the residual and instruments belonging to a unit ball in a reproducing kernel Hilbert space. First, it allows us to simplify IV regression into an empirical risk minimization problem, where the risk function depends on the reproducing kernel on the instrument and can be estimated by a U-statistic or V-statistic. Second, on the basis of this simplification, we are able to provide consistency and asymptotic normality results in both parametric and nonparametric settings. Finally, we provide easy-to-use IV regression algorithms with an efficient hyperparameter selection procedure. We demonstrate the effectiveness of our algorithms using experiments on both synthetic and real-world data.

MSC 2010: 46E22; 62G08; 34A55

1 Introduction

Instrumental variables (IVs) have become standard tools for economists, epidemiologists, and social scientists to uncover causal relationships from observational data [1,2]. Randomization of treatments or policies is perceived as the gold standard for such tasks, but it is often prohibitive in real-world scenarios due to time constraints or ethical concerns. When treatment assignment is not randomized, it is generally impossible to distinguish the causal effect of treatments from spurious correlations induced by unobserved factors. IVs instead enable investigators to exploit natural variation through an instrument that is associated with the treatment, but not with the outcome variable, other than through its effect on the treatment. In economics, for instance, the season of birth was used as an IV to study the return from schooling, which measures the causal effect of education on labor market earnings [3]. In genetic epidemiology, the idea of using genetic variants as IVs, known as Mendelian randomization, has also gained increasing popularity [4,5].

There are numerous methods for dealing with IVs. Classical methods for IV regression often rely on a linearity assumption, of which two-stage least squares (2SLS) is the most popular technique [6]. The generalized method of moments (GMM) of ref. [7], which imposes orthogonality restrictions, can also be used for linear IV regression. For nonlinear regression, numerous methodologies have emerged in nonparametric IV estimation [8–11] and, more recently, in machine learning [12–18]. These estimators can be categorized into two general approaches: the former generalizes the two-stage optimization procedure [12,16], whereas the latter requires solving a minimax optimization problem [13,14,17,19,20]. While the equivalence between 2SLS and GMM is well known in the linear setting, the connection between the two-stage procedure and minimax optimization in the nonlinear setting was established only recently; see, e.g., refs [15], [17, Appendix F], and [21, Sec 3.3]. For statistical inference, numerous approaches have been developed that derive asymptotic distributions, pointwise or as uniform bands of the target function, via the sieve method [22–26].

In this article, we propose a novel nonlinear IV regression framework named Maximum Moment Restriction-IV (MMR-IV), which utilizes the reproducing kernel Hilbert spaces (RKHSs) and an associated positive definite kernel from the machine learning domain.

The MMR-IV possesses two main features: (i) it reformulates the original conditional moment restriction (CMR) into a single-step empirical risk minimization (ERM) problem, by using the MMR framework of ref. [27] together with a positive definite kernel and its RKHS; (ii) based on this ERM formulation, the solution admits a closed form via U/V-statistic techniques [28]. Building on this framework, we propose two practical algorithms based on kernel functions and neural networks (NNs), which we call MMR-IV (RKHS) and MMR-IV (NN), respectively.

Our MMR-IV framework has the following advantages. First, the derived closed-form expression is computationally simple and stable. Further, owing to the generic structure of ERM, we can substitute NNs for kernel functions, which is suitable for highly nonlinear modeling. Second, MMR-IV can identify the optimal solution, i.e., the minimizer of the risk function is guaranteed to be unique, by using the positive definiteness of the kernel function. Third, we can identify the distribution that the solution follows in the large-sample limit, i.e., the asymptotic distribution. The identified asymptotic distribution allows us to quantify the uncertainty of algorithms and estimators, e.g., via statistical tests and confidence intervals. This result applies to both parametric and nonparametric models; in the nonparametric case, we utilize the functional differentiation technique [29] to derive the results. Moreover, on the basis of the asymptotic distribution, we discuss how the choice of kernel functions affects the efficiency of MMR-IV. Fourth, we provide a hyperparameter selection scheme by developing a novel interpretation of MMR-IV from a Gaussian process (GP) perspective, obtained by mapping our ERM objective to a GP likelihood. Since hyperparameter selection for IV regression is a nontrivial problem in both the minimax and two-stage formulations, our framework sheds light on efficient methods to solve this problem. In the experimental section, we demonstrate the accuracy and effectiveness of the proposed framework and show its usefulness in real data analysis.

Our contributions can be summarized as follows:

  1. We study the simple ERM problem converted from the minimax optimization and propose the MMR-IV algorithms, based on kernel functions and NNs, to minimize the resulting risk. These algorithms avoid the difficulties of minimax problems and are computationally simple and stable.

  2. Our method can identify unique solutions by selecting positive definite kernel functions. Our analysis also reveals how the asymptotic efficiency varies with the choice of kernel functions.

  3. We derive the asymptotic normal distribution of the MMR-IV estimators in both parametric and nonparametric settings. This result is derived from the ERM form and allows for statistical inference.

  4. We develop an efficient framework to select the hyperparameters of MMR-IV. This approach is based on an interpretation of MMR-IV with kernel functions from a GP perspective.

  5. We compare our methods to a wide range of baselines on different settings and show that MMR-IV has competitive performance.

The rest of this article is organized as follows. Section 2 introduces the IV regression problem and the MMR framework. Our methods are presented in Section 3, followed by the hyperparameter selection in Section 4. Next, we provide the consistency and asymptotic normality results in Section 5. We then provide a thorough discussion on related works in Section 6. Finally, the experimental results are presented in Section 7. All proofs of the main results can be found in the appendix.

2 Preliminaries

2.1 IV regression

Following the standard setting in the literature [12–16], let $X$ be a treatment (endogenous) variable taking values in $\mathcal{X} \subseteq \mathbb{R}^d$ and $Y$ a real-valued outcome variable. Our goal is to estimate a function $f : \mathcal{X} \to \mathbb{R}$ from a structural equation model of the form

(1) $Y = f(X) + \varepsilon, \quad X = t(Z) + g(\varepsilon) + \nu,$

where we assume that $\mathbb{E}[\varepsilon] = 0$ and $\mathbb{E}[\nu] = 0$. Unfortunately, as we can see from (1), $\varepsilon$ is correlated with the treatment $X$, i.e., $\mathbb{E}[\varepsilon \mid X] \neq 0$, and hence standard regression methods cannot be used to estimate $f$. This setting arises, for example, when there exist unobserved confounders (i.e., common causes) between $X$ and $Y$. To illustrate the problem, let us consider an example taken from ref. [12], which aims at predicting sales of airline tickets $Y$ under an intervention on the ticket price $X$. However, there exist unobserved variables that may affect both sales and ticket price, e.g., conferences or the COVID-19 pandemic. This creates a correlation between $\varepsilon$ and $X$ in (1) that prevents us from applying standard regression toolboxes directly to observational data.

Instrumental variable (IV). To address this problem, we assume access to an instrumental variable $Z$ taking values in $\mathcal{Z} \subseteq \mathbb{R}^{d'}$. As we can see in (1), the instrument $Z$ is associated with the treatment $X$, but not with the outcome $Y$, other than through its effect on the treatment. Formally, $Z$ must satisfy (i) Relevance: $Z$ has a causal influence on $X$; (ii) Exclusion restriction: $Z$ affects $Y$ only through $X$, i.e., $Y \perp\!\!\!\perp Z \mid X, \varepsilon$; and (iii) Unconfounded instrument(s): $Z$ is independent of the error, i.e., $\varepsilon \perp\!\!\!\perp Z$. See Figure 1 for a visualization of the three conditions. For example, the instrument $Z$ may be the cost of fuel, which influences sales $Y$ only via the price $X$. For a detailed exposition on IVs, we refer the readers to refs [1,8].

Figure 1

A causal graph depicting an IV Z that satisfies an exclusion restriction and unconfoundedness (there may be a confounder ε acting on X and Y , but it is independent of Z ).

Conditional moment restriction (CMR). Together with the structural assumption in (1), these conditions imply that $\mathbb{E}[\varepsilon \mid Z = z] = 0$ for $P_Z$-almost all $z$. This is known as a CMR, which we can use to estimate $f$ [30]. For every measurable function $h$, the CMR implies a continuum of unconditional moment restrictions [13,14]

(2) $\mathbb{E}[(Y - f(X))\, h(Z)] = 0.$

That is, there exists an infinite number of moment conditions, each of which is indexed by a function $h$. One of the key questions in econometrics is which moment conditions should be used as a basis for estimating the function $f$ [31,32]. In this work, we show that, for the purpose of consistently estimating $f$, it is sufficient to restrict $h$ to a unit ball of an RKHS of real-valued functions on $\mathcal{Z}$.

2.2 Maximum moment restriction (MMR)

Throughout this article, we assume that $h$ is a real-valued function on $\mathcal{Z}$ that belongs to an RKHS $\mathcal{H}_k$ endowed with a reproducing kernel $k : \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$. The RKHS $\mathcal{H}_k$ satisfies two important properties: for all $z \in \mathcal{Z}$ and $h \in \mathcal{H}_k$, (i) $k(z, \cdot) \in \mathcal{H}_k$ and (ii) (reproducing property) $h(z) = \langle h, k(z, \cdot) \rangle_{\mathcal{H}_k}$, where $k(z, \cdot)$ is viewed as a function of its second argument. Furthermore, we define $\Phi_k(z)$ as the canonical feature map of $z$ in $\mathcal{H}_k$. It follows from the reproducing property that $k(z, z') = \langle \Phi_k(z), \Phi_k(z') \rangle_{\mathcal{H}_k}$ for every $z, z' \in \mathcal{Z}$, i.e., an inner product between the feature maps of $z$ and $z'$ can be evaluated through the kernel evaluation. Every positive definite kernel $k$ uniquely determines the RKHS for which $k$ is a reproducing kernel [33]. For a detailed exposition on kernel methods, see, e.g., refs [34–36].

Instead of considering all measurable functions $h$ in (2) as instruments, we restrict ourselves to functions that lie within a unit ball of the RKHS $\mathcal{H}_k$. The risk functional is then defined as the maximum value of the moment restriction over this function class:

(3) $R_k(f) := \sup_{h \in \mathcal{H}_k,\, \|h\|_{\mathcal{H}_k} \leq 1} \left( \mathbb{E}[(Y - f(X))\, h(Z)] \right)^2.$

The benefits of this formulation are twofold. First, it is computationally intractable to learn f from (2) using all measurable functions as instruments. By restricting the function class to a unit ball of the RKHS, the problem becomes computationally tractable, as will be shown in Lemma 1. Second, this restriction still preserves the consistency of parameters estimated using R k ( f ) . In other words, the RKHS is a sufficient class of instruments for the nonlinear IV problem (cf. Theorem 1). A crucial insight for our approach is that the population risk R k ( f ) has an analytic solution.

Lemma 1

([27], Theorem 3.3) Assume that $\mathbb{E}[(Y - f(X))^2 k(Z, Z)] < \infty$. Then, we have the following closed-form expression, where $(X', Y', Z')$ is an independent copy of $(X, Y, Z)$:

(4) $R_k(f) = \mathbb{E}[(Y - f(X))(Y' - f(X'))\, k(Z, Z')].$

We assume throughout that the reproducing kernel k is integrally strictly positive definite (ISPD).

Assumption 1

The kernel $k$ is continuous, bounded (i.e., $\sup_{z \in \mathcal{Z}} k(z, z) < \infty$), and satisfies the condition of ISPD kernels, i.e., for every function $g$ that satisfies $0 < \|g\|_2^2 < \infty$, we have $\int_{\mathcal{Z}} \int_{\mathcal{Z}} g(z)\, k(z, z')\, g(z') \, \mathrm{d}z \, \mathrm{d}z' > 0$.

Popular kernel functions that satisfy Assumption 1 are the Gaussian RBF kernel and Laplacian kernel

$k(z, z') = \exp\left( -\frac{\|z - z'\|_2^2}{2\sigma^2} \right), \qquad k(z, z') = \exp\left( -\frac{\|z - z'\|_1}{\sigma} \right),$

where σ is a positive bandwidth parameter. Another important kernel satisfying this assumption is an inverse multiquadric kernel:

$k(z, z') = (c^2 + \|z - z'\|_2^2)^{-\gamma},$

where c and γ are positive parameters [37, Ch. 4]. This class of kernel functions is closely related to the notions of universal kernels [38] and characteristic kernels [39]. The former ensures that kernel-based classification/regression algorithms can achieve the Bayes risk, whereas the latter ensures that the kernel mean embeddings can distinguish different probability measures. In principle, they guarantee that the corresponding RKHSs induced by these kernels are sufficiently rich for the tasks at hand. We refer the readers to refs [40,41] for more details.
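For concreteness, the three kernels above can be evaluated as follows (a minimal NumPy sketch; the function names, the bandwidth $\sigma$, and the parameters $c$, $\gamma$ are illustrative choices, not part of the original specification):

```python
import numpy as np

def gaussian_kernel(Z1, Z2, sigma=1.0):
    # k(z, z') = exp(-||z - z'||_2^2 / (2 sigma^2))
    sq = np.sum((Z1[:, None, :] - Z2[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def laplacian_kernel(Z1, Z2, sigma=1.0):
    # k(z, z') = exp(-||z - z'||_1 / sigma)
    l1 = np.sum(np.abs(Z1[:, None, :] - Z2[None, :, :]), axis=-1)
    return np.exp(-l1 / sigma)

def inverse_multiquadric_kernel(Z1, Z2, c=1.0, gamma=0.5):
    # k(z, z') = (c^2 + ||z - z'||_2^2)^(-gamma)
    sq = np.sum((Z1[:, None, :] - Z2[None, :, :]) ** 2, axis=-1)
    return (c ** 2 + sq) ** (-gamma)

Z = np.random.default_rng(0).uniform(-3, 3, size=(5, 2))  # 5 instruments in R^2
K = gaussian_kernel(Z, Z, sigma=1.0)                      # 5 x 5 Gram matrix
```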

Next, we further assume that the minimizer of $R_k(f)$ is identified.

Assumption 2

Consider the function space $\mathcal{F}$ and $f^* \in \operatorname{argmin}_{f \in \mathcal{F}} R_k(f)$. Then, for every $g$ with $\mathbb{E}[|g(X)|] < \infty$, $\mathbb{E}[g(X) - f^*(X) \mid Z] = 0$ implies $g = f^*$ almost everywhere with respect to $P_X$.

A sufficient condition for identification follows from the completeness property of X for Z , e.g., the conditional distribution of X given Z belongs to the exponential family [8]. See ref. [42] for more general sufficient conditions. Provided identification, it is straightforward to obtain consistency.

3 Our method

We propose to learn $f$ by minimizing $R_k(f)$ in (4). To this end, we define an optimal function $f^*$ as a minimizer of the aforementioned population risk over a function class $\mathcal{F}$ of real-valued functions on $\mathcal{X}$, i.e.,

$f^* \in \operatorname{argmin}_{f \in \mathcal{F}} R_k(f).$

It is instructive to note that population risk R k depends on the choice of the kernel k . On the basis of Assumption 1 and Lemma 1, we obtain the following result by applying [27, Theorem 3.2], showing that R k ( f ) = 0 if and only if f satisfies the original CMR (see Appendix A.2 for the proof).

Proposition 1

Assume that the kernel $k$ is ISPD. Then, for every real-valued measurable function $f$, $R_k(f) = 0$ if and only if $\mathbb{E}[Y - f(X) \mid Z = z] = 0$ for $P_Z$-almost every $z$.

Proposition 1 holds as long as the kernel k belongs to a class of ISPD kernels. Hence, it allows for more flexibility in terms of kernel choice. Moreover, it is not difficult to show that R k ( f ) is strictly convex in f ; see the proof in Appendix A.3.

Proposition 2

If $\mathcal{F}$ is a convex set and Assumptions 1 and 2 hold, then the risk $R_k$ given in (4) is strictly convex on $\mathcal{F}$.

3.1 Empirical risk minimization

The previous results pave the way for an ERM framework [43] to be used in our work. That is, given an i.i.d. sample $\{(x_i, y_i, z_i)\}_{i=1}^n \sim P^n(X, Y, Z)$ of size $n$, empirical estimates of the risk $R_k(f)$ can be obtained in the form of a U-statistic or a V-statistic [28, Section 5]:

$\hat{R}_U(f) := \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j \neq i} (y_i - f(x_i))(y_j - f(x_j))\, k(z_i, z_j)$

and

$\hat{R}_V(f) := \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} (y_i - f(x_i))(y_j - f(x_j))\, k(z_i, z_j).$

Both forms of the empirical risk can be used as a basis for consistent estimation of $f$. The advantage of $\hat{R}_U$ is that it is a minimum-variance unbiased estimator with appealing asymptotic properties, whereas $\hat{R}_V$ is a biased estimator of the population risk (4), i.e., $\mathbb{E}[\hat{R}_V] \neq R_k$. However, the estimator based on V-statistics employs all pairs of samples and hence may yield a better estimate of the risk than its U-statistic counterpart. Let $\mathbf{x} := [x_1, \ldots, x_n]^\top$, $\mathbf{y} := [y_1, \ldots, y_n]^\top$, and $\mathbf{z} := [z_1, \ldots, z_n]^\top$ be column vectors. Let $K_z$ be the kernel matrix $K(\mathbf{z}, \mathbf{z}) = [k(z_i, z_j)]_{ij}$ evaluated on the instruments $\mathbf{z}$, and let $f(\mathbf{x}) := [f(x_1), \ldots, f(x_n)]^\top$. Then, both $\hat{R}_U$ and $\hat{R}_V$ can be rewritten as follows:

(5) $\hat{R}_{V(U)}(f) = (\mathbf{y} - f(\mathbf{x}))^\top W_{V(U)} (\mathbf{y} - f(\mathbf{x})),$

where $W_{V(U)}$ is an $n \times n$ symmetric weight matrix that depends on the kernel matrix $K_z$. Specifically, $W_U = (K_z - \operatorname{diag}(k(z_1, z_1), \ldots, k(z_n, z_n))) / (n(n-1))$ corresponds to $\hat{R}_U$, where $\operatorname{diag}(a_1, \ldots, a_n)$ denotes an $n \times n$ diagonal matrix whose diagonal elements are $a_1, \ldots, a_n$. As shown in Appendix A.5, $W_U$ is indefinite and may cause problematic inferences. For $\hat{R}_V$, $W_V := K_z / n^2$ is positive definite because the kernel $k$ is ISPD. Finally, our objective (5) also resembles the well-known generalized least squares regression with correlated noise [44, Chapter 2], where the covariance matrix is the $Z$-dependent invertible matrix $W_{V(U)}^{-1}$.
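The weight-matrix form (5) is straightforward to compute; below is a minimal NumPy sketch of $\hat{R}_U$ and $\hat{R}_V$ (the toy data, the Gaussian kernel bandwidth, and the zero candidate model are illustrative assumptions, not part of the original specification).

```python
import numpy as np

def empirical_risks(residual, Kz):
    """V- and U-statistic risks (5) from residuals y - f(x) and the Gram matrix K_z."""
    n = residual.shape[0]
    W_V = Kz / n ** 2                                   # V-statistic weight matrix
    W_U = (Kz - np.diag(np.diag(Kz))) / (n * (n - 1))   # U-statistic weight matrix
    return residual @ W_U @ residual, residual @ W_V @ residual

# toy usage with a Gaussian kernel on the instruments
rng = np.random.default_rng(0)
z = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(z[:, 0]) + rng.normal(size=100)
f_x = np.zeros(100)                                     # any candidate model output
Kz = np.exp(-(z - z.T) ** 2 / 2.0)                      # bandwidth chosen arbitrarily
R_U, R_V = empirical_risks(y - f_x, Kz)
```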

On the basis of $\hat{R}_U$ and $\hat{R}_V$, we estimate $f$ by minimizing the regularized empirical risk over $\mathcal{F}$:

$\hat{f}_{V(U)} \in \operatorname{argmin}_{f \in \mathcal{F}} \hat{R}_{V(U)}(f) + \lambda \Omega(f),$

where $\lambda > 0$ is a regularization constant satisfying $\lambda \to 0$ as $n \to \infty$, and $\Omega(f)$ is a regularizer. Since $\hat{f}_U$ and $\hat{f}_V$ minimize objectives that are regularized U-statistics and V-statistics, they can be viewed as specific forms of M-estimators; see, e.g., [45, Ch. 5]. In this work, we focus on the V-statistic empirical risk and provide practical algorithms when $\mathcal{F}$ is parametrized by deep NNs or is an RKHS of real-valued functions.

Kernelized GMM. We may view the objective (5) from the GMM perspective [32]. The assumption that the instruments $Z$ are exogenous implies that $\mathbb{E}[\Phi_k(Z) \varepsilon] = 0$, where $\Phi_k$ denotes the canonical feature map associated with the kernel $k$. This gives us an infinite number of moments, $g(f) = \mathbb{E}[\Phi_k(Z)(Y - f(X))]$. Hence, we can write the sample moments as $\hat{g}(f) = (1/n) \sum_{i=1}^{n} \Phi_k(z_i)(y_i - f(x_i))$. The intuition behind GMM is to choose a function $f$ that sets these moment conditions as close to zero as possible, motivating the objective function

$J(f) := \|\hat{g}(f)\|_{\mathcal{H}_k}^2 = \langle \hat{g}(f), \hat{g}(f) \rangle_{\mathcal{H}_k} = \frac{1}{n^2} \sum_{i,j} (y_i - f(x_i)) \langle \Phi_k(z_i), \Phi_k(z_j) \rangle_{\mathcal{H}_k} (y_j - f(x_j)) = \hat{R}_V(f).$

Hence, our objective $\hat{R}_V(f)$ in (5) is a special case of the GMM objective in which the weighting operator is the identity. Carrasco et al. [46, Ch. 6] characterized the optimal weighting operator in terms of the inverse covariance operator.

3.1.1 Theoretical perspective: Identification and efficiency

Before introducing practical algorithms, we discuss the theoretical aspects of our framework, specifically how the choice of norm constraint in the formulation affects identifiability and efficiency. The key difference between our risk $R_k(f)$ in (3) and the ordinary framework of unconstrained moment restrictions is that, in the ordinary framework, the CMR (2) is rewritten as the following unconditional moment restriction:

(6) $\sup_{h : \mathbb{E}[(h(Z))^2] < \infty} \left( \mathbb{E}[(Y - f(X))\, h(Z)] \right)^2.$

Note that the test function $h$ is maximized over a class of square-integrable functions, i.e., those satisfying $\mathbb{E}[(h(Z))^2] < \infty$. In contrast, our risk $R_k(f)$ maximizes the test function $h$ over a unit ball of the RKHS, that is, over $h \in \mathcal{H}_k$ with $\|h\|_{\mathcal{H}_k} \leq 1$, where $\|\cdot\|_{\mathcal{H}_k}$ is the RKHS norm. This particular design with the RKHS plays a critical role in our method.

Identification: An advantage of our design (3) with RKHSs is that it uniquely defines the function that minimizes the risk. Since an RKHS is equipped with a positive definite kernel, and we further assume the kernel is ISPD in Assumption 1, the risk $R_k(f)$ has a positive definite Hessian (operator) at the minimum, and hence it has a unique minimizer. In short, the loss design with RKHSs has the advantage that the function class of $h$ can be chosen so that the solution is uniquely identified.

Efficiency of IV: Another important issue is the efficiency of the estimator, i.e., the variance of its asymptotic distribution. For the linear regression problem with finite-dimensional parameters, one can define the optimal IV, i.e., the instrument that minimizes the asymptotic variance of the estimator [8]. In contrast, our regression model in this setup is nonlinear, infinite-dimensional, and also based on U-statistics; hence, it is not trivial to discuss optimality.

To address the latter issue, we provide the variance of the asymptotic distribution of our estimator as a way to measure efficiency. In Proposition 5 in Section 5, we derive asymptotic Gaussian approximations of the estimator, both pointwise and uniformly, that reveal their variance. The result in Proposition 5 is summarized as follows. We first define coefficients Λ max and Λ min as follows:

$\Lambda_{\max} := \sup_{g : \|g\|_{L^2} = 1} \int_{\mathcal{Z}} \int_{\mathcal{Z}} g(z)\, k(z, z')\, g(z') \, \mathrm{d}z \, \mathrm{d}z', \quad \text{and} \quad \Lambda_{\min} := \inf_{g : \|g\|_{L^2} = 1} \int_{\mathcal{Z}} \int_{\mathcal{Z}} g(z)\, k(z, z')\, g(z') \, \mathrm{d}z \, \mathrm{d}z',$

where $\|g\|_{L^2} = \left( \int_{\mathcal{Z}} g(z)^2 \, \mathrm{d}z \right)^{1/2}$ is the $L^2$ norm. The values $\Lambda_{\max}$ and $\Lambda_{\min}$ are the largest and smallest eigenvalues of an embedding operator into the RKHS. Here, we define $\delta_\Lambda := \max\{\Lambda_{\max} - 1, 1 - \Lambda_{\min}\}$. Then, we show that an upper bound of the asymptotic variance of the proposed estimator is increased by

$O(\Lambda_{\max} \delta_\Lambda + (\Lambda_{\max} \delta_\Lambda)^2),$

compared to the case without kernel functions. The case with δ Λ = 1 corresponds to the unconditional moment restriction with the L2 norm, as displayed in (6). This indicates that kernels that are concentrated in the direction of a particular eigenfunction may have worse efficiency. More details will be provided in Section 5.2.1.

3.2 Practical MMR-IV algorithms

A workflow of our algorithm based on $\hat{R}_V$ is summarized in Algorithm 1; we leave the $\hat{R}_U$-based method to future work, in which the inference issues caused by the indefinite $W_U$ need to be resolved. We provide examples of the class $\mathcal{F}$ in both parametric and nonparametric settings.

Algorithm 1 MMR-IV
Input: Dataset $D = \{\mathbf{x}, \mathbf{y}, \mathbf{z}\}$, kernel $k$ with parameters $\theta_k$, a function class $\mathcal{F}$, regularizer $\Omega(\cdot)$, and $\lambda$
Output: The estimate of $f$ in $\mathcal{F}$.
1: Compute the kernel matrix $K = k(\mathbf{z}, \mathbf{z}; \theta_k)$.
2: Define the residual $\varepsilon(f) = \mathbf{y} - f(\mathbf{x})$.
3: $\hat{f}_\lambda \in \operatorname{argmin}_{f \in \mathcal{F}} n^{-2}\, \varepsilon(f)^\top K\, \varepsilon(f) + \lambda \Omega(f)$.
4: return $\hat{f}_\lambda$.

Deep NNs. In the parametric setting, the function class can often be expressed as $\mathcal{F}_\Theta = \{f_\theta : \theta \in \Theta\}$, where $\Theta \subseteq \mathbb{R}^m$ denotes a parameter space. We consider a very common nonlinear model in machine learning, $f(x) = W_0 \Phi(x) + b_0$, where $\Phi : x \mapsto \sigma_h(W_h \sigma_{h-1}(\cdots \sigma_1(W_1 x)))$ denotes the nonlinear feature map of a depth-$h$ NN. Here, $W_i$ for $i = 1, \ldots, h$ are parameter matrices, and each $\sigma_i$ denotes the entry-wise activation function of the $i$th layer. In this case, $\theta = (b_0, W_0, W_1, \ldots, W_h)$. As a result, we can rewrite $\hat{f}_V$ in terms of the parameters as $\hat{\theta}_V \in \operatorname{argmin}_{\theta \in \Theta} \hat{R}_V(f_\theta) + \lambda \|\theta\|_2^2$, where $f_\theta \in \mathcal{F}_\Theta$. We denote $\theta^* \in \operatorname{argmin}_{\theta \in \Theta} R_k(f_\theta)$. In what follows, we refer to this algorithm as MMR-IV (NN); see Algorithm 2.

Algorithm 2 MMR-IV (NN)
Input: Dataset $D = \{\mathbf{x}, \mathbf{y}, \mathbf{z}\}$, kernel function $k$ with parameters $\theta_k$, NN $f_{\mathrm{NN}}$ with parameters $\theta_{\mathrm{NN}}$, regularization parameter $\lambda$.
Output: predictive value at $x^*$.
1: Compute $K = k(\mathbf{z}, \mathbf{z}; \theta_k)$.
2: $\hat{f}_{\mathrm{NN}} = \operatorname{argmin}_{f_{\mathrm{NN}}} (\mathbf{y} - f_{\mathrm{NN}}(\mathbf{x}))^\top K (\mathbf{y} - f_{\mathrm{NN}}(\mathbf{x})) / n^2 + \lambda \|\theta_{\mathrm{NN}}\|_2^2$.
3: return $\hat{f}_{\mathrm{NN}}(x^*)$
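A minimal PyTorch sketch of Algorithm 2 (the network architecture, the Adam optimizer, and the number of steps are illustrative assumptions rather than the exact configuration used in our experiments; inputs are expected as torch tensors):

```python
import torch

def mmr_iv_nn(x, y, Kz, lam=1e-4, steps=2000, lr=1e-3):
    """Minimize the V-statistic risk (5) over a small fully connected network."""
    n = y.shape[0]
    net = torch.nn.Sequential(
        torch.nn.Linear(x.shape[1], 100), torch.nn.LeakyReLU(),
        torch.nn.Linear(100, 100), torch.nn.LeakyReLU(),
        torch.nn.Linear(100, 1),
    )
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        r = y - net(x).squeeze(-1)                 # residuals y - f(x)
        risk = r @ Kz @ r / n ** 2                 # V-statistic MMR objective
        reg = sum((p ** 2).sum() for p in net.parameters())
        (risk + lam * reg).backward()
        opt.step()
    return net
```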

Kernel machines. In a nonparametric setting, the function class $\mathcal{F}$ becomes an infinite-dimensional space. In this work, we take $\mathcal{F}$ to be an RKHS $\mathcal{H}_l$ of real-valued functions on $\mathcal{X}$ with a reproducing kernel $l : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. Then, the regularized solution can be obtained by $\operatorname{argmin}_{f \in \mathcal{H}_l} \hat{R}_V(f) + \lambda \|f\|_{\mathcal{H}_l}^2$. By the representer theorem, every optimal $\hat{f}$ admits the form $\hat{f}(x) = \sum_{i=1}^{n} \alpha_i l(x, x_i)$ for some $(\alpha_1, \ldots, \alpha_n) \in \mathbb{R}^n$ [47], and on the basis of this representation, we rewrite the objective as follows:

(7) $\hat{f}_V = \operatorname{argmin}_{\alpha \in \mathbb{R}^n} (\mathbf{y} - L\alpha)^\top W_V (\mathbf{y} - L\alpha) + \lambda \alpha^\top L \alpha,$

where $L = [l(x_i, x_j)]_{ij}$ is the kernel matrix on $\mathbf{x}$. For the U-statistic version, the quadratic program (7) substitutes the indefinite $W_U$ for $W_V$, so it may not be positive definite, and the value of $\lambda$ needs to be sufficiently large to ensure that (7) is definite. On the other hand, the V-statistic-based objective (7) is definite for every nonzero $\lambda$ since $W_V$ is positive semidefinite. Thus, the optimal $\hat{\alpha}$ can be obtained by solving the first-order stationarity condition, and if $L$ is positive definite, the solution has a closed-form expression, $\hat{\alpha} = (L W_V L + \lambda L)^{-1} L W_V \mathbf{y}$. Hence, we focus on the V-statistic version in our experiments. In the following, we refer to this algorithm as MMR-IV (RKHS).
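A minimal NumPy sketch of the closed-form MMR-IV (RKHS) solution above (the helper names are ours; `np.linalg.solve` is used instead of an explicit inverse for numerical stability):

```python
import numpy as np

def mmr_iv_rkhs(L, Kz, y, lam):
    """Closed-form alpha_hat = (L W_V L + lam L)^{-1} L W_V y with W_V = K_z / n^2."""
    n = y.shape[0]
    W_V = Kz / n ** 2
    A = L @ W_V @ L + lam * L
    b = L @ W_V @ y
    return np.linalg.solve(A, b)        # coefficients alpha_hat

def predict(L_test_train, alpha):
    """f_hat(x*) = sum_i alpha_i l(x*, x_i), via the cross kernel matrix l(x*, x)."""
    return L_test_train @ alpha
```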

Nyström approximation. The MMR-IV (RKHS) algorithm is computationally costly for large datasets as it requires a matrix inversion. To improve the scalability, we resort to Nyström approximation [48] to accelerate the matrix inversion in

$\hat{\alpha} = (L W_V L + \lambda L)^{-1} L W_V \mathbf{y}.$

First, we randomly select a subset of $m$ ($\ll n$) samples from the original dataset and construct the corresponding submatrices of $W_V$, namely $W_{mm}$ and $W_{nm}$, based on this subset. Second, let $V$ and $U$ be the eigenvalue vector and the eigenvector matrix of $W_{mm}$, respectively. Then, the Nyström approximation is given by $W_V \approx \tilde{U} \tilde{V} \tilde{U}^\top$, where $\tilde{U} := \sqrt{m/n}\, W_{nm} U V^{-1}$ and $\tilde{V} := (n/m) V$. We finally apply the Woodbury formula [49, p. 75] to obtain

$(L W_V L + \lambda L)^{-1} L W_V = L^{-1} (W_V + \lambda L^{-1})^{-1} W_V \approx \lambda^{-1} [I - \tilde{U} (\lambda^{-1} \tilde{U}^\top L \tilde{U} + \tilde{V}^{-1})^{-1} \tilde{U}^\top \lambda^{-1} L]\, \tilde{U} \tilde{V} \tilde{U}^\top.$

We will refer to this algorithm as MMR-IV (Nyström); Algorithm 3. The runtime complexity of this algorithm is O ( n m 2 + n 2 ) .

Algorithm 3 MMR-IV (Nyström)
Input: Dataset $D = \{(x_i, y_i, z_i)\}_{i=1}^n$, kernel functions $k$ and $l$ with parameters $\theta_k$ and $\theta_l$, regularization parameter $\lambda$, leave-out data size $M$, Nyström approximation sample size $M$, leave-$M$-out cross-validation (LMOCV) repetitions $m$
Output: predictive value at $x^*$
1: $K = k(\mathbf{z}, \mathbf{z}; \theta_k)$
2: $\hat{\delta}, \hat{\theta}_l = \operatorname{argmin}_{\delta, \theta_l} \sum_{i=1}^{m} (b^{(i)} - y_M^{(i)})^\top K_M^{(i)} (b^{(i)} - y_M^{(i)})$  (Equation (18))
3: $K / n^2 \approx \tilde{U} \tilde{V} \tilde{U}^\top$  (Nyström approximation with $M$ samples)
4: $\hat{\alpha} = \lambda^{-1} [I - \tilde{U} (\lambda^{-1} \tilde{U}^\top L \tilde{U} + \tilde{V}^{-1})^{-1} \tilde{U}^\top \lambda^{-1} L]\, \tilde{U} \tilde{V} \tilde{U}^\top \mathbf{y}$
5: return $\hat{f}(x^*) = l(x^*, \mathbf{x}; \hat{\theta}_l)\, \hat{\alpha}$
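Steps 3–4 of Algorithm 3 can be sketched as follows (NumPy; the subsample size, the random subset selection, and the eigenvalue clipping are illustrative assumptions):

```python
import numpy as np

def mmr_iv_nystrom(L, Kz, y, lam, m=300, seed=0):
    """Nystrom-accelerated MMR-IV (RKHS) coefficients via the Woodbury identity above."""
    n = y.shape[0]
    W_V = Kz / n ** 2
    idx = np.random.default_rng(seed).choice(n, size=m, replace=False)
    W_mm = W_V[np.ix_(idx, idx)]
    W_nm = W_V[:, idx]
    vals, U = np.linalg.eigh(W_mm)                 # eigendecomposition of the m x m block
    vals = np.clip(vals, 1e-12, None)              # guard against (near-)zero eigenvalues
    Ut = np.sqrt(m / n) * W_nm @ U @ np.diag(1.0 / vals)
    Vt = (n / m) * vals                            # W_V ~ Ut diag(Vt) Ut^T
    inner = np.linalg.solve(Ut.T @ L @ Ut / lam + np.diag(1.0 / Vt), Ut.T @ (L / lam))
    proj = (np.eye(n) - Ut @ inner) / lam          # lam^{-1}[I - Ut(...)^{-1} Ut^T lam^{-1} L]
    return proj @ (Ut @ (Vt * (Ut.T @ y)))         # alpha_hat
```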

4 Hyperparameter selection

We discuss the selection of hyperparameters such as the regularization coefficient λ . The hyperparameters in the IV regression problem significantly affect the finite sample performance, but the selection process is nontrivial. We first give an overview of our cross-validation (CV) approach in Section 4.1, and then develop some techniques that make the approach efficient in Sections 4.2 and 4.3. Figure 2 gives an overview of the contents of this section.

Figure 2

A summary of proposed algorithms. Following the KMMR framework, two methods – MMR-IV (NN) and MMR-IV (RKHS) – are proposed, where f is parameterized by a NN and an RKHS function, respectively. An analytical CV error is proposed for MMR-IV (RKHS), and both methods select the kernel k by the median heuristic.

4.1 Leave-M-Out Cross Validation (LMOCV)

We utilize the LMOCV approach for hyperparameter selection, which is described as follows. Let $M$ be an integer such that $M \ll n$, and let $\Lambda$ be a set of hyperparameter candidates.

1: $K \leftarrow n / M$
2: Split the data into $K$ parts $[D_1, D_2, \ldots, D_K]$, where $|D_i| = M$
3: for each $(\lambda, \mathcal{F}) \in \Lambda$ do
4: for each $D_i \in [D_1, D_2, \ldots, D_K]$ do
5: Solve $\hat{f}_\lambda \in \operatorname{argmin}_{f \in \mathcal{F}} \hat{R}_V(f) + \lambda \Omega(f)$ using $D_{-i} = [D_1, \ldots, D_{i-1}, D_{i+1}, \ldots, D_K]$
6: Evaluate $S(\lambda, \mathcal{F}, i) \leftarrow \hat{R}_V(\hat{f}_\lambda)$ using $D_i$
7: end for
8: $S(\lambda, \mathcal{F}) \leftarrow (1/K) \sum_{k=1}^{K} S(\lambda, \mathcal{F}, k)$
9: end for
10: Output $\lambda^*, \mathcal{F}^* \leftarrow \operatorname{argmin}_{\lambda, \mathcal{F}} S(\lambda, \mathcal{F})$
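For estimators without an analytical CV error (e.g., MMR-IV (NN)), the procedure above can be implemented directly as a generic loop; a sketch in which `fit` and `evaluate` are hypothetical user-supplied callbacks and the candidates are hashable hyperparameter settings:

```python
import numpy as np

def lmocv_select(n, candidates, M, fit, evaluate, seed=0):
    """Generic leave-M-out CV: `fit(train_idx, params)` returns an estimator trained on
    the complement of one fold; `evaluate(f_hat, test_idx)` returns its V-statistic risk
    on the held-out fold."""
    K = n // M
    folds = np.array_split(np.random.default_rng(seed).permutation(n), K)
    scores = {}
    for params in candidates:
        fold_scores = []
        for i, fold in enumerate(folds):
            train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
            f_hat = fit(train_idx, params)
            fold_scores.append(evaluate(f_hat, fold))
        scores[params] = np.mean(fold_scores)
    return min(scores, key=scores.get)     # hyperparameters with the smallest CV score
```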

One disadvantage of this approach is that it is computationally expensive because the empirical risk has to be optimized many times. To alleviate this problem, we derive an analytical form of the score function for MMR-IV (RKHS) and MMR-IV (Nyström). The analytical error enables CV with any number of folds; in comparison, related methods rely on a few folds of CV [15,50] or the median heuristic [16] for this task, which have rather limited selection capability. We develop this method in the next subsections.

4.2 GP interpretation

Our interest is to obtain an analytical empirical risk for MMR-IV (Nyström) with the V-statistic objective (7). Toward this goal, we relate our ERM formulation to a stochastic model with a GP, inspired by ref. [51]. Different from its counterpart in ordinary kernel regression, we need additional analysis to deal with the extra weight matrix $W_V$ in the V-statistic risk (7).

We build the connection between the V-statistic risk and the posterior distribution of a GP model in this subsection, after which we show that the MMR-IV (Nyström) estimator is equivalent to the maximum a posteriori (MAP) estimator based on this GP. The relationship is inspired by the similarity of our objective function to that of generalized least squares (GLS).

Prior and likelihood. Let us consider a GP prior over a space of functions, $f(x) \sim \mathcal{GP}(0, \delta\, l(x, x'))$, where $\delta > 0$ is a real constant, and a set of i.i.d. data $D = \{(x_i, y_i, z_i)\}_{i=1}^n$. To define the likelihood $p(D \mid f)$, we recall from Lemma 1 that the risk $R_k(f)$ can be expressed in terms of two independent copies of random variables, $(X, Y, Z)$ and $(X', Y', Z')$. To this end, let $D' = \{(x_i', y_i', z_i')\}_{i=1}^n$ be an independent sample of size $n$ with the same distribution as $D$. Given a pair of samples $(x, y, z)$ and $(x', y', z')$ from $D$ and $D'$, respectively, we then define the likelihood as follows:

$p(\{(x, y, z), (x', y', z')\} \mid f) = \dfrac{\exp\left( -\frac{1}{2}(y - f(x))\, k(z, z')\, (y' - f(x')) \right)}{\int \exp\left( -\frac{1}{2}(\bar{y} - f(\bar{x}))\, k(\bar{z}, \bar{z}')\, (\bar{y}' - f(\bar{x}')) \right) \mathrm{d}(P \otimes P)((\bar{x}, \bar{y}, \bar{z}), (\bar{x}', \bar{y}', \bar{z}'))}.$

Here, $P$ is the joint distribution of $(X, Y, Z)$, and $P \otimes P$ is the corresponding product distribution. Let $t = \bar{y} - f(\bar{x})$ and $t' = \bar{y}' - f(\bar{x}')$, and assume that the kernel $k$ is bounded. Since $\int \exp(-\frac{1}{2} t\, k(\bar{z}, \bar{z}')\, t')\, \mathrm{d}(P \otimes P) \leq \max\{\int \exp(\frac{1}{2} t^2 C)\, \mathrm{d}P, \int \exp(\frac{1}{2} (t')^2 C)\, \mathrm{d}P\}$ for some $C > 0$, the likelihood function is integrable and hence well defined. Since the integral in the denominator does not depend on $(x, y, z)$ and $(x', y', z')$, we regard the denominator as a normalizing constant and simply write

$p(\{(x, y, z), (x', y', z')\} \mid f) \propto \exp\left( -\frac{1}{2}(y - f(x))\, k(z, z')\, (y' - f(x')) \right).$

Hence, the likelihood of both $D$ and $D'$ can be expressed as follows:

(8) $p(\{D, D'\} \mid f) = \prod_{i=1}^{n} \prod_{j=1}^{n} p(\{(x_i, y_i, z_i), (x_j', y_j', z_j')\} \mid f) \propto \exp\left( -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} (y_i - f(x_i))\, k(z_i, z_j')\, (y_j' - f(x_j')) \right).$

This likelihood function can be regarded as a generalized version of the GLS loss function with heteroskedastic noise [52]. That is, if we set $k(z_i, z_j) = \sigma_i^{-2}$ if $i = j$ with $\sigma_i^2 > 0$, and $k(z_i, z_j) = 0$ otherwise, the likelihood corresponds to that of a linear regression model with Gaussian noise having heteroskedastic variance. In practice, however, we only have access to a single copy of the sample, i.e., $D$, but not $D'$. One way of constructing $D'$ is through data splitting: the original dataset is split into two halves of equal size, where the former is used to construct $D$ and the latter $D'$. Unfortunately, this approach reduces the effective sample size that can be used for learning. Alternatively, we propose to estimate the original likelihood (8) by using M-estimators: given the full dataset $D = \{(x_i, y_i, z_i)\}_{i=1}^n$, our approximated likelihood is defined as follows:

(9) $p(D \mid f) = \prod_{i=1}^{n} \prod_{j=1}^{n} p(\{(x_i, y_i, z_i), (x_j, y_j, z_j)\} \mid f(x_i), f(x_j)) \propto \exp\left( -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} (y_i - f(x_i))\, k(z_i, z_j)\, (y_j - f(x_j)) \right) = \exp\left( -\frac{1}{2} (\mathbf{y} - f(\mathbf{x}))^\top K_z (\mathbf{y} - f(\mathbf{x})) \right).$

This likelihood assumes correlated errors under the normal distribution. It is closely related to standard GP regression, which has a similar Gaussian likelihood but with the covariance matrix computed on $\mathbf{x}$ rather than $\mathbf{z}$, as well as to generalized ridge regression and weighted least squares regression. The matrix $K_z$, the kernel matrix defined earlier, plays the role of correlating the residuals $y_i - f(x_i)$. Based on the likelihood $p(D \mid f)$, the maximum likelihood estimator hence coincides with the minimizer of the unregularized version of our objective (5).

Maximum a posteriori (MAP). Combining the aforementioned likelihood function with the prior on f ( x ) , the posterior probability of f ( x ) is expressed as follows:

$p(f(\mathbf{x}) \mid D) = \mathcal{N}(f(\mathbf{x}) \mid c, C) \propto p(f(\mathbf{x}))\, p(D \mid f), \quad C = (K_z + (\delta L)^{-1})^{-1} = \frac{L (L W_V L + (\delta n^2)^{-1} L)^{-1} L}{n^2}, \quad c = C K_z \mathbf{y} = L (L W_V L + (\delta n^2)^{-1} L)^{-1} L W_V \mathbf{y}.$

The MAP estimate of f ( x ) is simply c ,

$\operatorname{argmax}_{f(\mathbf{x})} \log p(f(\mathbf{x}) \mid D) = L (L W_V L + (\delta n^2)^{-1} L)^{-1} L W_V \mathbf{y} = L \hat{\alpha}.$

We can see that the MAP estimator returns the same result as the MMR-IV (RKHS) estimator given that $(n^2 \delta)^{-1} = \lambda$ (see the context around (7)), and $\delta$ plays the role of the regularization parameter. For MMR-IV (Nyström), we consider the GP model with $W_V$ approximated by $\tilde{U} \tilde{V} \tilde{U}^\top$ in $C$ and $c$, and the conclusion still holds:

$p(f(\mathbf{x}) \mid D) = \mathcal{N}(f(\mathbf{x}) \mid c, C), \quad C = \delta L [I - \tilde{U} (\lambda^{-1} \tilde{U}^\top L \tilde{U} + \tilde{V}^{-1})^{-1} \tilde{U}^\top \lambda^{-1} L], \quad c = \delta n^2 L [I - \tilde{U} (\lambda^{-1} \tilde{U}^\top L \tilde{U} + \tilde{V}^{-1})^{-1} \tilde{U}^\top \lambda^{-1} L]\, \tilde{U} \tilde{V} \tilde{U}^\top \mathbf{y}.$

We summarize the result in Proposition 3. Such a GP interpretation is used for an elegant derivation of the analytical CV error in the following subsection.

Proposition 3

Given $\delta = (\lambda n^2)^{-1}$ and $\hat{f}$ the minimizer over $\mathcal{H}_l$ in equation (7) (including its Nyström-approximated version),

$\operatorname{argmax}_{f(\mathbf{x})} p(f(\mathbf{x}) \mid D) = \hat{f}(\mathbf{x}).$
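As a quick numerical check of Proposition 3, the GP posterior mean and the MMR-IV (RKHS) estimate can be compared on synthetic data (a sketch; the kernels, bandwidths, jitter term, and sample size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=(n, 1))
z = rng.normal(size=(n, 1))
y = rng.normal(size=n)
L = np.exp(-(x - x.T) ** 2 / 2.0) + 1e-6 * np.eye(n)   # kernel matrix l on x (with jitter)
Kz = np.exp(-(z - z.T) ** 2 / 2.0)                     # kernel matrix k on z
W_V = Kz / n ** 2
lam = 1e-3
delta = 1.0 / (lam * n ** 2)                           # delta = (lambda n^2)^{-1}

# MMR-IV (RKHS) closed form: f_hat(x) = L alpha_hat
alpha = np.linalg.solve(L @ W_V @ L + lam * L, L @ W_V @ y)
f_rkhs = L @ alpha

# GP posterior mean: c = (K_z + (delta L)^{-1})^{-1} K_z y
c = np.linalg.solve(Kz + np.linalg.inv(delta * L), Kz @ y)

print(np.max(np.abs(f_rkhs - c)))   # close to zero up to numerical error (Proposition 3)
```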

4.3 Analytical form of error

Now, we derive the analytical error of LMOCV via Bayes' rule based on the posterior GP. We split the whole dataset $D$ into training and development (held-out) sets, denoted by $D_{tr}$ and $D_{de} = \{x_{de}, y_{de}, z_{de}\}$, respectively, where $D_{de}$ contains $M$ triplets of data points. Given $D_{tr}$, the predictive probability of $f(x_{de})$ can be obtained by Bayes' rule (see also equation (22) in ref. [51]):

$p(f(x_{de}) \mid D) = p(f(x_{de}) \mid D_{tr}, D_{de}) = p(D_{de} \mid f(x_{de}), D_{tr})\, p(f(x_{de}) \mid D_{tr}) / p(D_{de} \mid D_{tr}) = p(D_{de} \mid f(x_{de}))\, p(f(x_{de}) \mid D_{tr}) / p(D_{de} \mid D_{tr}),$

which implies that

(10) $p(f(x_{de}) \mid D_{tr}) = p(f(x_{de}) \mid D)\, p(D_{de} \mid D_{tr}) / p(D_{de} \mid f(x_{de})),$

where $p(f(x_{de}) \mid D) = \mathcal{N}(c_{de}, C_{de})$, with $c_{de}$ and $C_{de}$ the mean and covariance of $f(x_{de})$ under $p(f(\mathbf{x}) \mid D)$, $p(D_{de} \mid f(x_{de})) \propto \mathcal{N}(y_{de}, K_{de}^{-1})$, and $K_{de} = k(z_{de}, z_{de})$. From this result, we can see that $p(D_{de} \mid f(x_{de}))$ and $p(f(x_{de}) \mid D)$ are in Gaussian form and $p(D_{de} \mid D_{tr})$ is a constant term, so $p(f(x_{de}) \mid D_{tr})$ must also be Gaussian:

$p(f(x_{de}) \mid D_{tr}) = \mathcal{N}(b, B),$

where $B^{-1} = C_{de}^{-1} - K_{de}$ and $b = B(C_{de}^{-1} c_{de} - K_{de} y_{de})$ can be obtained by solving equation (10). By Proposition 3, we know that the prediction on the validation data $f(x_{de})$ given the training data $D_{tr}$ is $b$. Therefore, the error of $m$ repeated LMOCV splits is computed as shown in (11), where we use the superscript $(i)$ to denote the $i$th split of the whole dataset into $D_{de}$ and $D_{tr}$, and $C_{de}^{(i)}$ and $c_{de}^{(i)}$, as mentioned earlier, are obtained directly from $p(f(\mathbf{x}) \mid D)$. As a result, this enables efficient parameter selection with the CV principle: the standard LMOCV procedure requires multiple model fits (e.g., $m$ here) on different $D_{tr}$ and $D_{de}$, whereas our result needs only a single training run on $D$, and the validation error of the $m$ splits is obtained simply by plugging the training result into the analytical error (11):

(11) $\text{LMOCV Error} := \sum_{i=1}^{m} (r^{(i)})^\top K_{de}^{(i)} r^{(i)}, \qquad r^{(i)} := b^{(i)} - y_{de}^{(i)} = (I - C_{de}^{(i)} K_{de}^{(i)})^{-1} (c_{de}^{(i)} - y_{de}^{(i)}).$
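A sketch of how (11) can be evaluated in practice: compute the full posterior mean $c$ and covariance $C$ once, then form the residuals $r^{(i)}$ from sub-blocks (NumPy; the helper name and the list of splits are ours, $\delta = (\lambda n^2)^{-1}$ follows Proposition 3, and $L$ is assumed invertible).

```python
import numpy as np

def lmocv_error(L, Kz, y, lam, splits):
    """Analytical leave-M-out CV error (11) for MMR-IV (RKHS)."""
    n = y.shape[0]
    delta = 1.0 / (lam * n ** 2)
    # GP posterior over f(x) on the full data: C = (K_z + (delta L)^{-1})^{-1}, c = C K_z y
    C = np.linalg.inv(Kz + np.linalg.inv(delta * L))
    c = C @ Kz @ y
    err = 0.0
    for idx in splits:                              # each idx holds the indices of one D_de
        C_de = C[np.ix_(idx, idx)]
        K_de = Kz[np.ix_(idx, idx)]
        r = np.linalg.solve(np.eye(len(idx)) - C_de @ K_de, c[idx] - y[idx])
        err += r @ K_de @ r
    return err
```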

5 Theoretical result: Consistency and asymptotic normality

We provide the consistency and asymptotic normality of f ˆ V . Similar results for f ˆ U , together with all the proofs, are provided in the appendix. Readers interested in the finite sample rate should consult [50, E.4] for further details.

5.1 Consistency

We first show the consistency of $\hat{f}_V$, which depends on the uniform convergence of the risk functions. The result holds for both parametric and nonparametric cases regardless of the shape of $\Omega(f)$, so we can use the regularizer $\|\theta\|_2^2$, which is common for NNs but nonconvex in terms of $f$. In the following consistency result, we endow $\mathcal{F}$ with the topology induced by the distance between the parameters $\theta \in \Theta$ that characterize the functions; that is, for functions $f_\theta, f_{\theta'} \in \mathcal{F}$, their distance is measured by that of their parameters: $d(f_\theta, f_{\theta'}) = \|\theta - \theta'\|_2$.

Proposition 4

(Consistency of $\hat{f}_V$) Assume that $\mathbb{E}[Y^2] < \infty$, $\mathbb{E}[\sup_{f \in \mathcal{F}} f(X)^2] < \infty$, $\mathcal{F}$ is compact, Assumptions 1 and 2 hold, $\Omega(f)$ is a bounded function, and $\lambda \overset{p}{\to} 0$. Then $\hat{f}_V \overset{p}{\to} f^*$.

If $\Omega(f)$ is convex in $f$, consistency can be obtained more easily by [53, Thm. 2.7], which relies on the convexity of the risk functions; in this case, we can avoid several conditions. We provide an additional result for this setting in Appendix A.6.

5.2 Asymptotic normality with finite-dimension case

We analyze the asymptotic normality of the estimator f ˆ V , which is important to advance statistical analysis such as tests. The results are obtained by applying the delta method with finite parameters and functionals, respectively.

We first consider the case where $\hat{f}_V$ is characterized by a finite-dimensional parameter from a parameter space $\Theta$. We rewrite the regularized V-statistic risk in the compact form $\hat{R}_{V,\lambda}(f_\theta) := \frac{1}{n^2} \sum_{i,j} h_\theta(u_i, u_j) + \lambda \Omega(\theta)$ with $h_\theta(u_i, u_j) := (y_i - f_\theta(x_i))(y_j - f_\theta(x_j))\, k(z_i, z_j)$, where $u_i := (x_i, y_i, z_i)$; we consider the case in which $R_k(f_\theta)$ is uniquely minimized at $\theta^* \in \Theta$, and $\rightsquigarrow$ denotes convergence in law.

Theorem 1

(Asymptotic normality of $\hat{\theta}_V$) Suppose that $f_\theta$ and $\Omega(\theta)$ are twice continuously differentiable in $\theta$, $\Theta$ is compact, $H = \mathbb{E}[\nabla_\theta^2 h_{\theta^*}(U, U')]$ is nonsingular, $\mathbb{E}[Y^2] < \infty$, $\mathbb{E}[\sup_{\theta \in \Theta} f_\theta(X)^2] < \infty$, $\mathbb{E}[\sup_{\theta \in \Theta} \|\nabla_\theta f_\theta(X)\|_2^2] < \infty$, $\mathbb{E}[\sup_{\theta \in \Theta} \|\nabla_\theta^2 f_\theta(X)\|_F^2] < \infty$, $\sqrt{n}\, \lambda \overset{p}{\to} 0$, $R_k(f_\theta)$ is uniquely minimized at $\theta^*$, which is an interior point of $\Theta$, and Assumption 1 holds. Then,

$\sqrt{n}\, (\hat{\theta}_V - \theta^*) \rightsquigarrow \mathcal{N}(0, \Sigma_V)$

holds, where $\Sigma_V = 4 H^{-1} \operatorname{diag}(\mathbb{E}_U[\mathbb{E}_{U'}^2[\nabla_\theta h_{\theta^*}(U, U')]]) H^{-1}$.

5.2.1 Connection to efficiency

The variance of the asymptotic distribution varies with the choice of kernel. To see this, we consider constants $\Lambda_{\min} \leq \Lambda_{\max}$ such that

$\Lambda_{\min} = \inf_{g : \|g\|_{L^2} = 1} \int_{\mathcal{Z}} \int_{\mathcal{Z}} g(z)\, k(z, z')\, g(z') \, \mathrm{d}z \, \mathrm{d}z' \leq \sup_{g : \|g\|_{L^2} = 1} \int_{\mathcal{Z}} \int_{\mathcal{Z}} g(z)\, k(z, z')\, g(z') \, \mathrm{d}z \, \mathrm{d}z' = \Lambda_{\max}.$

Their positivity and existence are guaranteed by Assumption 1. By Theorem 4.51 in ref. [37], $\Lambda_{\min}$ and $\Lambda_{\max}$ correspond to the smallest and largest eigenvalues of the embedding operator $T : L^2 \to \mathcal{H}_k$, where $\mathcal{H}_k$ is the RKHS whose reproducing kernel is $k$. Equivalently, by Mercer's theorem (Theorem 4.49 in ref. [37]), we can write $k$ in the form

$k(z, z') = \sum_{j} \lambda_j \phi_j(z) \phi_j(z'),$

where $\{\lambda_j\}_j$ are coefficients and the $\phi_j(\cdot)$ are orthonormal basis functions, with $\Lambda_{\max} = \max_j \lambda_j$ and $\Lambda_{\min} = \min_j \lambda_j$. We also define an asymptotic variance without the kernel $k$. That is, we define a loss without the kernel as $\check{h}_\theta(u_i, u_j) := (y_i - f_\theta(x_i))(y_j - f_\theta(x_j))$, and its associated matrices $\check{H} = \mathbb{E}[\nabla_\theta^2 \check{h}_{\theta^*}(U, U')]$ and $\check{\Sigma}_V = 4 \check{H}^{-1} \operatorname{diag}(\mathbb{E}_U[\mathbb{E}_{U'}^2[\nabla_\theta \check{h}_{\theta^*}(U, U')]]) \check{H}^{-1}$. Let $\|\Sigma_V\|_F = \sqrt{\operatorname{tr}(\Sigma_V^2)}$ denote the Frobenius norm for square matrices.

Proposition 5

Suppose that Assumption 1 holds. Define $\delta_\Lambda := \max\{\Lambda_{\max} - 1, 1 - \Lambda_{\min}\}$. Then, we have

$\|\Sigma_V\|_F \leq \|\check{\Sigma}_V\|_F + O(\Lambda_{\max} \delta_\Lambda + (\Lambda_{\max} \delta_\Lambda)^2).$

This result bounds the increase in the asymptotic variance due to the choice of the kernel. Strictly speaking, as the eigenvalues $\Lambda_{\min}$ and $\Lambda_{\max}$ of the kernel $k$ approach 1, i.e., as the spectrum of the kernel approaches a constant, the bound on the asymptotic variance decreases at the rate of the first and second powers of $\delta_\Lambda$. Therefore, this result suggests that using the simplest possible kernel, one whose spectrum is not concentrated on a specific eigenspace, prevents an increase in the asymptotic variance.

Intuitively, this result reflects the difficulty of choosing the kernel function as the optimal IV in nonlinear IV regression. Unlike the linear case [30], nonlinear IV regression does not provide information such as which eigenspaces of the data have large variance. Thus, anisotropy of the kernel eigenvalues may worsen efficiency in the worst case; however, efficiency can improve if the anisotropy fits the data. If such properties were known, one could develop an optimal kernel selection procedure based on them.

5.3 Asymptotic normality of infinite-dimension case

We show the asymptotic normality of the infinite-dimensional estimator $\hat{f}_V$. That is, we show that the error of $\hat{f}_V$ weakly converges to a GP taking values in a function space $\mathcal{H}_l$. We set $\Omega(f) = \|f\|_{\mathcal{H}_l}^2$ and consider the minimizer $f_{\lambda_0}^* \in \operatorname{argmin}_{f \in \mathcal{H}_l} R_k(f) + \lambda_0 \|f\|_{\mathcal{H}_l}^2$ with arbitrary $\lambda_0 > 0$. We also define $N(\varepsilon, \mathcal{F}, \|\cdot\|)$ as the $\varepsilon$-covering number of $\mathcal{F}$ with respect to the norm $\|\cdot\|$. Then, we obtain the following result, which allows statistical inference for functional estimators such as kernel machines.

Assumption 3

(Low-entropy condition) There exist $s \in (0, 2)$ and a constant $C_H > 0$ such that

$\log N(\varepsilon, \mathcal{H}_l, \|\cdot\|_{L^\infty}) \leq C_H \varepsilon^{-s},$

for every $\varepsilon \in (0, 1)$.

Although the covering number condition $\log N(\varepsilon, \mathcal{H}_l, \|\cdot\|_{L^\infty}) \leq C_H \varepsilon^{-s}$ in Theorem 2 is strong, we can present some examples that satisfy it. A representative example is the set of Lipschitz functions on a bounded interval, for which the condition is satisfied with $s = 1$ (Example 5.10 in ref. [54]); although the support is one-dimensional, this class covers nondifferentiable functions. Another example is the set of $b$-times differentiable functions on a compact set in $\mathbb{R}^d$ with $b > 2d$ (Theorem 2.7.1 in ref. [55]); although this requires higher-order smoothness, it can handle functions with multidimensional inputs. The last example is an RKHS with a kernel characterized by its spectrum: the condition is satisfied when $s$ corresponds to the decay rate of the eigenvalues of the embedding operator into the RKHS. Several common kernels, such as the Gaussian kernel (see Section 4 in ref. [37]) and Mercer kernels (explained in ref. [56]), satisfy it under certain parameter configurations. This characterization is used in several machine learning studies, e.g., [57,58].

Theorem 2

(Asymptotic normality of $\hat{f}_V$) Suppose Assumption 1 holds, $l$ is a bounded kernel, $k$ is a uniformly bounded function, and $\lambda - \lambda_0 = o(n^{-1/2})$ holds. Also, suppose that $\mathcal{X}$, $\mathcal{Z}$, and $\mathcal{Y}$ are compact spaces and that Assumption 3 holds. We define a Gaussian process $G_P$ on $\mathcal{H}_l$ with zero mean and covariance

$\operatorname{Cov}(G_P(f), G_P(f')) = \mathbb{E}_U[P_1 h_f(U)\, P_1 h_{f'}(U)] - \mathbb{E}_U[P_1 h_f(U)]\, \mathbb{E}_U[P_1 h_{f'}(U)]$

with $h_f(u, u') = (y - f(x))(y' - f(x'))\, k(z, z')$ and $P_1 h_f(\cdot) = \frac{1}{2} \int (h_f(u, \cdot) + h_f(\cdot, u)) \, \mathrm{d}P(u)$. Then, there exists a linear operator $S_{P_2, \lambda_0} : \mathcal{H}_l \to \mathcal{H}_l$ such that

$\sqrt{n}\, (\hat{f}_V - f_{\lambda_0}^*) \rightsquigarrow S_{P_2, \lambda_0}(G_P) \quad \text{in } \mathcal{H}_l.$

Note that $S_{P_2, \lambda_0}(G_P)$ is also a GP on $\mathcal{H}_l$ owing to the linearity of $S_{P_2, \lambda_0}$ (Section 3.9.2 in ref. [55]).

We provide the detailed derivation in the appendix. A specific form of the operator S P 2 , λ 0 is also given in the appendix. All conditions for the theorem are valid for many well-known kernels. For the boundedness assumption, many common kernels, such as the Gaussian RBF kernel, the Laplacian kernel, and the Mercer kernel, satisfy it.

This nonparametric asymptotic normality is a generic result that implies finite-dimensional asymptotic normality. That is, for every finite set $\{x_j\}_{j=1}^N \subset \mathcal{X}$, Theorem 2 immediately implies the following convergence:

$\sqrt{n} \begin{pmatrix} \hat{f}_V(x_1) - f_{\lambda_0}^*(x_1) \\ \vdots \\ \hat{f}_V(x_N) - f_{\lambda_0}^*(x_N) \end{pmatrix} \rightsquigarrow N_N(0, \Sigma),$

where N N ( 0 , Σ ) is the N -variate normal distribution with zero mean and a corresponding covariance matrix Σ . This result is a convenient generalization of asymptotic normality in a parametric setting.

Remark 1

(On asymptotic normality) In Theorem 2, taking $\lambda_0 > 0$ results in a biased center $f_{\lambda_0}^*$, but it is essential for obtaining the limit distribution in an infinite-dimensional space. This is one of the reasons why $\sqrt{n}$-convergence can be achieved in Theorem 2. Although this is not entirely satisfactory, we note that deriving the limit distribution is of a different nature than obtaining error bounds, and it is more restrictive than developing a learner with bounded generalization error. A similar difficulty appears in deriving limits for non-Donsker classes [55]. To remove this constraint, we would need to introduce strong assumptions, such as the low-noise condition in classification problems, as discussed in ref. [29]. However, we do not solve this problem in this study, as doing so would go far beyond the scope of this work.

6 Related work

Several extensions of 2SLS and GMM exist for the nonlinear IV problem. In the two-stage approach, the function $f(x)$ is often obtained by solving a Fredholm integral equation of the first kind, $\mathbb{E}[Y \mid Z] = \int f(x) \, \mathrm{d}P(x \mid Z)$. In refs [8,10,11,59], linear regression is replaced by a linear projection onto a set of known basis functions. Like the kernel choice problem in our work, this approach requires choosing an appropriate set of basis functions. A uniform convergence rate for this approach is provided in ref. [60]. In refs [9,61], the first-stage regression is replaced by a conditional density estimation of $P(X \mid Z)$ using a kernel density estimator. In contrast, our approach does not rely on conditional density estimation.

IV regression has also recently received attention in the machine learning community. Hartford et al. [12] proposed to solve the integral equation by first estimating $P(X \mid Z)$ with a mixture of deep generative models, on which the function $f(x)$ can then be learned with another deep NN. Instead of NNs, ref. [16] proposed to model the first-stage regression using the conditional mean embedding of $P(X \mid Z)$ [36,62,63], which is then used in a second-stage kernel ridge regression. In other words, the first-stage estimation in ref. [16] becomes a vector-valued regression problem. In an attempt to move away from the two-stage-estimation approach, refs [15] and [17] reformulate the two-stage procedure as a convex–concave saddle-point problem, i.e., a minimax problem, by studying the Fenchel and Lagrange dual reformulations of the population risk, respectively. By using RKHSs for the inner maximization, DualIV [15] obtains a quadratic objective function similar to ours, but its RKHS is defined over $(Z, Y)$, which cannot be interpreted as a valid instrument. In contrast, the approach of [17, Appendix F] provides an exact dual reformulation of our method and recovers our objective if RKHSs are used for the inner maximization. Besides, the starting objectives of the aforementioned works differ from ours: refs [12,15,16] and [17, Appendix F] start from minimizing the $L^2$ norm of the CMR, whereas we start from minimizing the squared unconditional moment restriction. The fact that they arrive at the same objective suggests a deeper connection that requires further investigation.

Our work follows in spirit many GMM-based approaches for IV regression, namely refs [13,14,27]. We adapt the MMR framework of ref. [27], which only considers a conditional moment testing problem, to the parameter estimation problem of IV regression; hence, our approach and both our theoretical and experimental analyses differ from theirs. In fact, this framework was initially inspired by refs [13,14], which instead parametrize the instruments by deep NNs. By combining the GMM framework with RKHS functions, the objective function can be evaluated in closed form, as shown in our work. As a result, our IV estimate can be obtained by minimizing an empirical risk, as opposed to solving an adversarial optimization [13,14]. It is important to note that these adversarial approaches have recently been improved in a similar manner with RKHSs by refs [18,50], though with a major difference. Ref. [50] extends the work of ref. [13] to a single-stage algorithm similar to our MMR-IV (RKHS) [50, Section 4]. Although both works employ RKHSs in the minimax framework, ref. [50] incorporates a Tikhonov regularization on $h$ in (3) and resorts to the representer theorem [64] to develop an analytical objective function, whereas we impose a unit-ball constraint, which is a form of Ivanov regularization [65]. The basis of our objective is the adversarial GMM objective of ref. [13]. Furthermore, prior works do not study the analytical CV error.

We also discuss connections to existing work in areas outside the IV setting in Section B of the Appendix.

7 Experimental results

We present the experimental results in a wide range of settings for IV estimation. Following refs [13,14], we consider both low- and high-dimensional scenarios. In the experiments, we compare our algorithms to the following baseline algorithms:

  1. DirectNN: A standard least square regression on X and Y using a NN.

  2. 2SLS: A vanilla 2SLS on raw X and Z .

  3. Poly2SLS: The 2SLS that is performed on polynomial features of X and Z via ridge regressions.

  4. SieveIV [26]: This algorithm solves the CMR via the sieve method, which uses orthonormal basis functions. In the implementation, we utilize the trigonometric basis, and the number of basis functions is selected by Lepski’s method. The details are found in ref. [26] and its related studies.

  5. DeepIV [12]: A nonlinear extension of 2SLS using deep NNs. We use the implementation available at https://github.com/microsoft/EconML.

  6. KernelIV [16]: A generalization of 2SLS by modeling relations among X , Y , and Z as nonlinear functions in RKHSs. We use the publicly available implementation at https://github.com/r4hu1-5in9h/KIV.

  7. GMM+NN: An optimally weighted GMM [7] is combined with a NN f ( X ) . The details of this algorithm can be found in [14, Section 5].

  8. AGMM [13]: This algorithm models h ( Z ) by a deep NN and employs a minimax optimization to solve for f ( X ) . The implementation we use is available at https://github.com/vsyrgkanis/adversarial_gmm.

  9. DeepGMM [14]: This algorithm is a variant of AGMM with optimal inverse-covariance weighting matrix. The publicly available implementation at https://github.com/CausalML/DeepGMM is used, and all of the aforementioned baselines are provided in the package except KernelIV.

  10. AGMM-K [50]: This algorithm extends AGMM by modeling h ( Z ) and f ( X ) as RKHSs. Nyström approximation is applied for fast computation. The publicly available implementation at https://github.com/microsoft/AdversarialGMM is used.

  11. DualIV [15]: This algorithm solves the CMR via a saddle-point reformulation and obtains a minimax problem. By using an RKHS for the inner maximization, an objective similar to ours is obtained, but the central kernel matrix is evaluated on Z and Y, which therefore cannot be viewed as instruments, in contrast to our formulation. The publicly available implementation at https://github.com/krikamol/DualIV-NeurIPS2020 is used.

7.1 Low-dimensional scenarios

Following [14], we employ the following data generation process:

$Y = f(X) + e + \delta, \quad X = Z_1 + e + \gamma, \quad Z := (Z_1, Z_2),$

where $e \sim \mathcal{N}(0, 1)$, $\gamma, \delta \sim \mathcal{N}(0, 0.1^2)$, and $Z \sim \mathrm{Uniform}([-3, 3]^2)$ is a two-dimensional IV of which only $Z_1$ has an effect on $X$. The variable $e$ is the confounder that creates the correlation between $X$ and the residual $Y - f(X)$. We vary the true function $f$ among the following cases to enrich the datasets: (i) sin: $f(x) = \sin(x)$; (ii) step: $f(x) = \mathbb{1}\{x \geq 0\}$; (iii) abs: $f(x) = |x|$; (iv) linear: $f(x) = x$. We consider both small-sample ($n = 200$) and large-sample ($n = 2{,}000$) regimes.
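A sketch of this data-generating process (NumPy; the use of `default_rng` and the exact seeding are illustrative assumptions):

```python
import numpy as np

def generate_low_dim(f, n=2000, seed=527):
    """Y = f(X) + e + delta,  X = Z_1 + e + gamma,  Z ~ Uniform([-3, 3]^2)."""
    rng = np.random.default_rng(seed)
    Z = rng.uniform(-3, 3, size=(n, 2))
    e = rng.normal(0, 1, size=n)          # confounder shared by X and Y
    gamma = rng.normal(0, 0.1, size=n)
    delta = rng.normal(0, 0.1, size=n)
    X = Z[:, 0] + e + gamma
    Y = f(X) + e + delta
    return X, Y, Z

# the four true functions
f_funcs = {
    "sin": np.sin,
    "step": lambda x: (x >= 0).astype(float),
    "abs": np.abs,
    "linear": lambda x: x,
}
X, Y, Z = generate_low_dim(f_funcs["sin"])
```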

7.1.1 Experimental settings

For the experiments on low-dimensional data, we consider both small-sample ( n = 200 ) and large-sample ( n = 2,000 ) regimes, in which n points are sampled for training, validation, and test sets. In both regimes, we standardize the values of Y to have zero mean and unit variance for numerical stability.

Hyperparameters of the NNs in Algorithm 2, including the learning rate and the regularization parameter, are chosen by two-fold CV for fair comparison with the baselines. Depending on the dimensionality of $X$, we parametrize $f$ either as a fully connected NN with leaky ReLU activations and two hidden layers of 100 units each for non-image data, or as a deep convolutional neural network (CNN) for MNIST data. We denote the fully connected NN by FCNN(100,100) and refer readers to our code release for exact details of the CNN architecture. Learning rates and regularization parameters are summarized in Table 1. Besides, we use the well-tuned hyperparameter settings of the baselines provided in their packages without changes. We fix the random seed to 527 for all data generation and model initialization.

Table 1

Hyperparameters of NNs used in the experiments

Scenario | Model $f$ | Learning rates | $\lambda$
Low dimensional | FCNN(100,100) | $(10^{-12}, 10^{-11}, 10^{-10}, 10^{-9}, 10^{-8}, 10^{-7}, 10^{-6})$ | $(5 \times 10^{-5}, 10^{-4}, 2 \times 10^{-4})$
MNISTZ | FCNN(100,100) | $(10^{-12}, 10^{-11}, 10^{-10}, 10^{-9}, 10^{-8}, 10^{-7}, 10^{-6})$ | $(5 \times 10^{-5}, 10^{-4}, 2 \times 10^{-4})$
MNISTX | CNN | $(10^{-12}, 10^{-11}, 10^{-10}, 10^{-9}, 10^{-8}, 10^{-7}, 10^{-6})$ | $(5 \times 10^{-5}, 10^{-4}, 2 \times 10^{-4})$
MNISTXZ | CNN | $(10^{-12}, 10^{-11}, 10^{-10}, 10^{-9}, 10^{-8}, 10^{-7}, 10^{-6})$ | $(5 \times 10^{-5}, 10^{-4}, 2 \times 10^{-4})$
Mendelian | FCNN(100,100) | $(10^{-12}, 10^{-11}, 10^{-10}, 10^{-9}, 10^{-8}, 10^{-7}, 10^{-6})$ | $(5 \times 10^{-5}, 10^{-4}, 2 \times 10^{-4})$

In contrast to our NN-based method, the RKHS-based method in Algorithm 3 has an analytic form of the CV error. For the kernel function on the instruments, we employ the sum of Gaussian kernels

$k(z, z') = \frac{1}{3} \sum_{i=1}^{3} \exp\left( -\frac{\|z - z'\|_2^2}{2 \sigma_{k_i}^2} \right)$

and the Gaussian kernel $l(x, x') = \exp(-\|x - x'\|_2^2 / (2\sigma_l^2))$, where $\sigma_{k_1}$ is chosen as the median interpoint distance of $\mathbf{z} = \{z_i\}_{i=1}^n$, $\sigma_{k_2} = 0.1 \sigma_{k_1}$, and $\sigma_{k_3} = 10 \sigma_{k_1}$. The motivation for such a kernel $k$ is to optimize $f$ over multiple kernels, and we leave the parameter selection of $k$ to future work. We combine the training and validation sets and perform leave-2-out CV to select the parameters of the kernel $l$ and the regularization parameter $\lambda$. For the Nyström approximation, we subsample 300 points from the combined set. As a small subset of the Gram matrix is used as Nyström samples, which discards much of the information in the data, we avoid outliers in the test results by averaging 10 test errors obtained with different Nyström approximations in each experiment. All methods are repeated 10 times on each dataset with random initialization.
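The kernel $k$ on the instruments can be constructed as follows (a NumPy sketch of the median heuristic and the sum of three Gaussian kernels described above; the helper name is ours):

```python
import numpy as np

def sum_of_gaussians_kernel(Z):
    """k(z, z') = (1/3) sum_i exp(-||z - z'||_2^2 / (2 sigma_i^2)), sigma_1 from the median heuristic."""
    sq_dists = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    dists = np.sqrt(sq_dists)
    sigma1 = np.median(dists[dists > 0])               # median interpoint distance of {z_i}
    K = np.zeros_like(sq_dists)
    for sigma in (sigma1, 0.1 * sigma1, 10.0 * sigma1):
        K += np.exp(-sq_dists / (2.0 * sigma ** 2))
    return K / 3.0
```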

7.1.2 Experiment results

Tables 2 and 3 report the results for the large-sample and small-sample regimes, respectively. First, under the influence of confounders, DirectNN performs worst because it does not use the instruments. Second, MMR-IVs perform reasonably well in both regimes. For the linear function, 2SLS and Poly2SLS tend to outperform the other algorithms, as the linearity assumption is satisfied in this case. For the nonlinear functions, some NN-based methods are competitive in certain cases. Notably, GMM+NN is unstable because the function $h$ in (2) is designed manually. KernelIV performs well because the simulation models are simple; its kernel parameters are selected by a median heuristic, which is relatively naive compared with CV, and in the more complicated experiments later its performance deteriorates. AGMM-K is similar in principle to our method, yet its errors are high. We suspect that this is due to its hyperparameter selection: a few folds of CV over a small set of candidates are not effective enough, since we observed that AGMM-K's performance improves as the numbers of CV folds and hyperparameter candidates increase. A similar observation holds for DualIV. This shows that it is desirable to have an analytical CV error, which is an advantage of our method over the other RKHS baselines. Moreover, the selection of the weight matrix remains an open question; although DualIV, AGMM-K, and MMR-IV (RKHS) all rely on the median heuristic to select the bandwidth, we believe that this choice affects each method differently and that a fully fair selection is difficult to obtain. We leave this problem to future work. In addition, the performance of the more complicated methods, such as DeepIV and DeepGMM, deteriorates more in the large-sample regime than in the small-sample regime. We suspect that this is because these methods rely on two NNs and thus require properly chosen hyperparameters (e.g., model structures) for the random samples. SieveIV performs well when the true function is the sin function because we use the trigonometric basis for the sieve method; this suggests that the choice of basis functions is important for its performance. In contrast, MMR-IV (Nyström) has the advantage of adaptive hyperparameter selection.

Table 2

The mean square error (MSE) ± one standard deviation in the large-sample regime ( n = 2,000 ). Italics denote the second best

| Algorithm | $f$ = abs | $f$ = linear | $f$ = sin | $f$ = step |
| DirectNN | 0.116 ± 0.000 | 0.035 ± 0.000 | 0.189 ± 0.000 | 0.199 ± 0.000 |
| 2SLS | 0.522 ± 0.000 | 0.000 ± 0.000 | 0.254 ± 0.000 | 0.050 ± 0.000 |
| Poly2SLS | 0.083 ± 0.000 | 0.000 ± 0.000 | 0.133 ± 0.000 | 0.039 ± 0.000 |
| GMM+NN | 0.318 ± 0.000 | 0.044 ± 0.000 | 0.694 ± 0.000 | 0.500 ± 0.000 |
| AGMM | 0.600 ± 0.001 | 0.025 ± 0.000 | 0.274 ± 0.000 | 0.047 ± 0.000 |
| DeepIV | 0.247 ± 0.004 | 0.056 ± 0.003 | 0.165 ± 0.003 | 0.038 ± 0.001 |
| DeepGMM | 0.027 ± 0.009 | 0.005 ± 0.001 | 0.160 ± 0.025 | 0.025 ± 0.006 |
| KernelIV | 0.019 ± 0.000 | 0.009 ± 0.000 | 0.046 ± 0.000 | 0.026 ± 0.000 |
| AGMM-K | 181 ± 0.000 | 2.34 ± 0.000 | 19.4 ± 0.000 | 4.13 ± 0.000 |
| DualIV | 0.344 ± 0.000 | 0.034 ± 0.000 | 0.379 ± 0.000 | 0.345 ± 0.000 |
| SieveIV | 0.170 ± 0.009 | 0.279 ± 0.001 | 0.021 ± 0.001 | 6.89 ± 0.044 |
| MMR-IV (NN) | 0.011 ± 0.002 | 0.005 ± 0.000 | 0.153 ± 0.019 | 0.040 ± 0.004 |
| MMR-IV (Nyström) | 0.011 ± 0.001 | 0.001 ± 0.000 | 0.006 ± 0.002 | 0.020 ± 0.002 |
Table 3

The MSE ± one standard deviation in the small-sample regime ( n = 200 ). Italics denote the second best

| Algorithm | $f$ = abs | $f$ = linear | $f$ = sin | $f$ = step |
| DirectNN | 0.143 ± 0.000 | 0.046 ± 0.000 | 0.404 ± 0.006 | 0.253 ± 0.000 |
| 2SLS | 0.564 ± 0.000 | 0.003 ± 0.000 | 0.304 ± 0.000 | 0.076 ± 0.000 |
| Poly2SLS | 0.125 ± 0.000 | 0.003 ± 0.000 | 0.164 ± 0.000 | 0.077 ± 0.000 |
| GMM+NN | 0.792 ± 0.000 | 0.203 ± 0.000 | 1.56 ± 0.001 | 0.550 ± 0.000 |
| AGMM | 0.031 ± 0.000 | 0.011 ± 0.000 | 0.330 ± 0.000 | 0.080 ± 0.000 |
| DeepIV | 0.204 ± 0.008 | 0.047 ± 0.004 | 0.197 ± 0.004 | 0.039 ± 0.001 |
| DeepGMM | 0.022 ± 0.003 | 0.032 ± 0.016 | 0.143 ± 0.030 | 0.039 ± 0.002 |
| KernelIV | 0.063 ± 0.000 | 0.024 ± 0.000 | 0.086 ± 0.000 | 0.055 ± 0.000 |
| AGMM-K | 12.3 ± 0.000 | 1.32 ± 0.000 | 1.57 ± 0.000 | 1.71 ± 0.000 |
| DualIV | 0.202 ± 0.000 | 0.103 ± 0.000 | 0.251 ± 0.000 | 0.362 ± 0.000 |
| SieveIV | 0.466 ± 0.002 | 0.452 ± 0.001 | 0.363 ± 0.002 | 9.07 ± 0.038 |
| MMR-IV (NN) | 0.019 ± 0.003 | 0.004 ± 0.001 | 0.292 ± 0.024 | 0.075 ± 0.008 |
| MMR-IV (RKHS) | 0.030 ± 0.000 | 0.011 ± 0.000 | 0.075 ± 0.000 | 0.057 ± 0.000 |

We also record the runtimes of all methods in the large-sample regime and report them in Figure 3. Among the NN-based methods (AGMM, DeepIV, and DeepGMM), our MMR-IV (NN) is the most computationally efficient, which is a direct result of its simpler objective. AGMM, which solves a minimax optimization between two NNs, is the least efficient. DeepGMM and DeepIV are more efficient than AGMM but less efficient than MMR-IV (NN). Finally, all four RKHS-based methods, namely, KernelIV, AGMM-K, DualIV, and MMR-IV (RKHS), have similar computational times and are all observed to scale poorly on large datasets. In contrast, SieveIV is the most computationally efficient method in this experiment.

Figure 3

Runtime in the large-sample regime ($n = 2{,}000$). Time for parameter selection is excluded. AGMM-K (No Nyström), DualIV, and MMR-IV (RKHS) overlap due to the same runtime. Python and library versions are included in the "README" and "requirement" files in the software. We use a 2.9 GHz Intel Core i5 for the experiment.

7.2 High-dimensional structured scenarios

In the high-dimensional setting, we employ the same data-generating process as in Section 7.1. We consider only the absolute function $f(x) = |x|$, but map $Z$, $X$, or both $X$ and $Z$ to MNIST images (784-dimensional) [66]. Let us denote the original outputs of Section 7.1 by $X^{\mathrm{low}}$ and $Z^{\mathrm{low}}$, let $\pi(u) \coloneqq \mathrm{round}(\min(\max(1.5u + 5, 0), 9))$ be a transformation mapping inputs to an integer between 0 and 9, and let $\mathrm{RI}(d)$ be a function that selects a random MNIST image from the digit class $d$. Then, the scenarios we consider are (i) MNIST$_Z$: $Z \leftarrow \mathrm{RI}(\pi(Z_1^{\mathrm{low}}))$, (ii) MNIST$_X$: $X \leftarrow \mathrm{RI}(\pi(X^{\mathrm{low}}))$, and (iii) MNIST$_{XZ}$: $X \leftarrow \mathrm{RI}(\pi(X^{\mathrm{low}}))$, $Z \leftarrow \mathrm{RI}(\pi(Z_1^{\mathrm{low}}))$.
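A minimal sketch of this mapping is given below; `mnist_images_by_digit` is a hypothetical lookup from digit class to a pool of MNIST images and stands in for whatever loader the released code uses.

```python
import numpy as np

def pi(u):
    """Map a real-valued input to a digit class in {0, ..., 9}."""
    return int(np.round(np.clip(1.5 * u + 5.0, 0.0, 9.0)))

def random_image(digit, mnist_images_by_digit, rng):
    """RI(d): draw a random 28x28 MNIST image from digit class d."""
    pool = mnist_images_by_digit[digit]
    return pool[rng.integers(len(pool))]

def to_mnist_z(z_low, mnist_images_by_digit, rng):
    """MNIST_Z scenario: replace the instrument by an image of the digit pi(Z_1^low)."""
    return np.stack([random_image(pi(z[0]), mnist_images_by_digit, rng) for z in z_low])
```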

7.2.1 Experimental settings

We sample $n = 10{,}000$ points for each of the training, validation, and test sets and run each method 10 times. To reduce the computational workload, we run MMR-IV (Nyström) only on the training set and use principal component analysis (PCA) to reduce the dimension of $X$ from 784 to 8. We still use the sum of Gaussian kernels for $k(z,z')$ but use the automatic relevance determination (ARD) kernel for $l(x,x')$. We omit KernelIV and DualIV in this experiment since their kernels are not designed for structured data; as we will see, neither is the kernel used by MMR-IV (Nyström). We also omit SieveIV in this setting because the method would require a huge number of basis functions in the high-dimensional case.
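For reference, an ARD Gaussian kernel assigns one lengthscale per (PCA) feature, as in the sketch below; the lengthscale values themselves are assumed to be chosen by the leave-2-out CV described earlier, and the helper name is ours.

```python
import numpy as np

def ard_kernel(x1, x2, lengthscales):
    """ARD Gaussian kernel: exp(-0.5 * sum_d ((x_d - x'_d) / ell_d)^2)."""
    x1s = x1 / lengthscales          # scale each feature by its own lengthscale
    x2s = x2 / lengthscales
    sq = (np.sum(x1s**2, axis=1)[:, None] + np.sum(x2s**2, axis=1)[None, :]
          - 2.0 * x1s @ x2s.T)
    return np.exp(-0.5 * np.clip(sq, 0.0, None))
```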

7.2.2 Experiment results

We report the results in Table 4. Like [14], we observe that the DeepIV code returns NaN when $X$ is high dimensional, so its implementation would have to be improved before it can be used in such scenarios. To reduce the computational workload, we run Ridge2SLS, i.e., Poly2SLS with the degree fixed to one, in place of Poly2SLS; its performance is unsatisfactory. 2SLS has large errors when $X$ is high dimensional because the first-stage regression from $Z$ to $X$ is ill posed. The average performance of GMM+NN suggests that manually designed functions of the instruments are insufficient to extract useful information. Furthermore, MMR-IV (NN) performs competitively across scenarios. On MNIST$_Z$, MMR-IVs perform better than the other methods, which suggests that the sum of Gaussian kernels is an appropriate choice for the kernel $k$; DeepGMM is competitive here as well. On MNIST$_X$ and MNIST$_{XZ}$, MMR-IV (NN) outperforms all other methods. AGMM-K+NN, i.e., the KLayerTrained method of ref. [50], is a variant of AGMM-K that uses NNs to model $f(X)$ and to extract features of $Z$ as inputs to the kernel $k$. To increase flexibility, AGMM-K+NN also trains the kernel hyperparameter, i.e., the scale of the RBF kernel $k$. Compared with MMR-IV (NN), AGMM-K+NN has a more flexible kernel $k$ but lacks a good kernel hyperparameter selection, so it is vulnerable to local minima and shows unstable performance. The ARD kernel on PCA features used by MMR-IV (Nyström) fails because the PCA features are not representative enough. In addition, we observe that DeepGMM often produces unreliable results across settings; we suspect that the hard-to-optimize objective and the complicated optimization procedure are the causes. Compared with DirectNN, most of the baselines can hardly deal with high-dimensional structured IV regression.

Table 4

The MSE ± one standard deviation on high-dimensional structured data. We run each method 10 times

| Algorithm | MNIST$_Z$ | MNIST$_X$ | MNIST$_{XZ}$ |
| DirectNN | 0.134 ± 0.000 | 0.229 ± 0.000 | 0.196 ± 0.011 |
| 2SLS | 0.563 ± 0.001 | > 1,000 | > 1,000 |
| Ridge2SLS | 0.567 ± 0.000 | 0.431 ± 0.000 | 0.705 ± 0.000 |
| GMM+NN | 0.121 ± 0.004 | 0.235 ± 0.002 | 0.240 ± 0.016 |
| AGMM | 0.017 ± 0.007 | 0.732 ± 0.107 | 0.529 ± 0.163 |
| DeepIV | 0.114 ± 0.005 | n/a | n/a |
| DeepGMM | 0.038 ± 0.004 | 0.315 ± 0.130 | 0.333 ± 0.168 |
| AGMM-K+NN | 0.021 ± 0.007 | 1.05 ± 0.366 | 0.327 ± 0.192 |
| MMR-IV (NN) | 0.024 ± 0.006 | 0.124 ± 0.021 | 0.130 ± 0.009 |
| MMR-IV (Nyström) | 0.015 ± 0.002 | 0.442 ± 0.000 | 0.425 ± 0.002 |

7.3 Mendelian randomization

We demonstrate our method in the setting of Mendelian randomization, which relies on genetic variants that satisfy the IV assumptions. The “exposure” X and outcome Y are univariate and generated from the simulation process by ref. [67]:

$$Y = \beta X + c_1 e + \delta, \qquad X = \sum_{i=1}^{d} \alpha_i Z_i + c_2 e + \gamma,$$

where $Z \in \mathbb{R}^d$ with each entry $Z_i \sim B(2, p_i)$, $p_i \sim \mathrm{unif}(0.1, 0.9)$, $e \sim \mathcal{N}(0,1)$, $\alpha_i \sim \mathrm{unif}([0.8/d, 1.2/d])$, and $\gamma, \delta \sim \mathcal{N}(0, 0.1^2)$. The binomial variable $Z_i \sim B(2, p_i)$ mimics the number of copies of a genetic variant carried by an individual. The parameters $\beta$ and $c_1$ control the strength of the exposure and the confounder on the outcome, while $c_2$ and $\alpha_i$ control the strength of the instruments and the confounder on the exposure. We set $\alpha_i \sim \mathrm{unif}([0.8/d, 1.2/d])$ so that, as the number of IVs increases, each IV becomes weaker while the overall strength of the instruments ($\sum_{i=1}^{d}\alpha_i$) remains constant.
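A NumPy sketch of this generating process (with our own function name and default seed) follows:

```python
import numpy as np

def generate_mendelian_data(n, d, beta=1.0, c1=1.0, c2=1.0, rng=None):
    """Simulate the Mendelian randomization design with d genetic instruments."""
    rng = rng or np.random.default_rng(527)
    p = rng.uniform(0.1, 0.9, size=d)                 # variant frequencies p_i
    Z = rng.binomial(2, p, size=(n, d))               # 0, 1, or 2 copies per variant
    alpha = rng.uniform(0.8 / d, 1.2 / d, size=d)     # per-instrument strength ~ 1/d
    e = rng.normal(0.0, 1.0, size=n)                  # confounder
    gamma = rng.normal(0.0, 0.1, size=n)
    delta = rng.normal(0.0, 0.1, size=n)
    X = Z @ alpha + c2 * e + gamma                    # exposure
    Y = beta * X + c1 * e + delta                     # outcome
    return X, Y, Z
```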

In Mendelian randomization, it is known that genetic variants may act as weak IVs [5,68,69], so this experiment aims to evaluate the sensitivity of different methods to the number of instruments ($d$) and the confounder strengths ($c_1$, $c_2$). We consider three scenarios: (i) $d = 8, 16, 32$; (ii) $c_1 = 0.5, 1, 2$; (iii) $c_2 = 0.5, 1, 2$; unmentioned parameters take the default values $\beta = 1$, $d = 16$, $c_1 = 1$, and $c_2 = 1$. We draw 10,000 samples for each of the training, validation, and test sets and train MMR-IV (Nyström) only on the training set. The other settings are the same as those of the low-dimensional scenario.

Figure 4 depicts the results. Overall, 2SLS performs well in all settings owing to the linearity assumption, except for a particular sensitivity to the number of (weak) instruments, which is a well-known property of 2SLS [1]. Although they impose no such assumption, MMR-IVs perform competitively and even more stably: the information in the instruments is effectively captured by the kernel $k$, only a simple objective has to be optimized, and the analytical CV error plays an essential role. KernelIV also achieves competitive and stable performance across settings, but not as good as ours. DirectNN is always among the worst approaches since no instrument is used. Poly2SLS is accurate in the last two experiments, but it is markedly unstable with respect to the number of instruments in Figure 4 (left) because its hyperparameter selection fails. In Figure 4 (middle) and Figure 4 (right), the performance of most approaches deteriorates as the effect of the confounder becomes stronger. MMR-IV (Nyström) performs promisingly and is somewhat more sensitive to $c_2$ than to $c_1$; its good performance, compared with AGMM-K and DualIV, is due to its hyperparameter selection.

Figure 4

The MSE of different methods on the Mendelian randomization experiments as we vary the number of instruments (left), the strength of the confounder on the outcome $c_1$ (middle), and the strength of the confounder on the exposure $c_2$ (right). The MSE is obtained from 10 repetitions of the experiment.

7.4 Vitamin D data

Finally, we apply our algorithm to the Vitamin D data [70, Section 5.1]. The data were collected in a 10-year study of 2,571 individuals aged 40 to 71, and four variables are used: age (at baseline), filaggrin (a binary indicator of filaggrin mutations), VitD (vitamin D level at baseline), and death (a binary indicator of death during the study). The goal is to evaluate the potential effect of VitD on death. We follow [70] by controlling for age in the analysis and using filaggrin as an instrument, and we then apply the MMR-IV (Nyström) algorithm. Sjolander and Martinussen [70] modeled the effect of VitD on death by a generalized linear model (GLM) and found the effect insignificant by 2SLS (the $p$-value of the estimated coefficient is 0.13, above the 0.05 threshold). More details are given in Section 7.4.1. The estimated effect is illustrated in Figure 5 in the appendix. We observe that (i) by using the instrument, both our method and [70] produce more reasonable results than those obtained without the instrument: a low VitD level has only a slight effect on death at a young age, but a more adverse effect at an old age [71]; and (ii) unlike [70], our method allows a more flexible, nonlinear causal effect.

Figure 5

Estimated effect of vitamin D level on mortality rate, controlled for age. All plots depict normalized contours of $f'(X,C)$ defined in Section 7.4.1, where blue represents a low mortality rate and yellow the opposite. We can divide each plot roughly into a left (young age) and a right (old age) part. While the right parts convey similar information (i.e., lower vitamin D levels and old age lead to a higher mortality rate), the left parts differ. (a) A high level of vitamin D at a young age can result in a high mortality rate, which is counterintuitive; a plausible explanation is that this is caused by unobserved confounders between vitamin D levels and mortality rates. (b) This spurious effect disappears when the filaggrin mutation is used as an instrument, i.e., a low vitamin D level has only a slight effect on death at a young age, but a more adverse effect at an old age [71]. This comparison demonstrates the benefit of an IV. (c) and (d) correspond to the results obtained with the GLM of [70], from which similar conclusions can be drawn. It is noteworthy that MMR-IV allows a more flexible, nonlinear causal effect. (a) Kernel ridge regression (IV: none). (b) MMR-IV (IV: filaggrin mutation). (c) GLM (IV: none). (d) 2SLS + GLM (IV: filaggrin mutation).

7.4.1 Experimental settings on Vitamin D data

We normalize each variable to have zero mean and unit variance to reduce the influence of the different scales. We consider two cases: (i) without the instrument and (ii) with the instrument. By "without the instrument," we mean that the matrix $W_V$ of MMR-IV (Nyström) is replaced by an identity matrix. Following [70], we assess the effect of vitamin D (exposure) on the mortality rate (outcome), control for age in the analysis, and use filaggrin as the instrument. The original vitamin D, age, and death variables are illustrated in Figure 6. We randomly pick 300 Nyström samples (random seed 527) and use leave-2-out CV to select the hyperparameters. The GLM of ref. [70] consists of a linear function in the first step and a logistic regression model in the second step.

Figure 6

Distribution of the Vitamin D data. Data points are plotted in the middle; the solid curve and histogram on the right show the kernel density estimate and histogram of vitamin D, and those on the top show the same for age.

In this experiment, we use age as a control variable by considering a structural equation model similar to model (1), except for the presence of the controlled (exogenous) variable $C$:

(12) $$Y = f(X, C) + \varepsilon, \qquad X = t(Z, C) + g(\varepsilon) + \nu,$$

where $E[\varepsilon] = 0$ and $E[\nu] = 0$. We further assume that the instrument $Z$ satisfies the following three conditions:

  1. Relevance: $Z$ has a causal influence on $X$;

  2. Exclusion restriction: $Z$ affects $Y$ only through $X$, i.e., $Y \perp\!\!\!\perp Z \mid X, \varepsilon, C$;

  3. Unconfounded instrument(s): $Z$ is conditionally independent of the error, i.e., $\varepsilon \perp\!\!\!\perp Z \mid C$.

Unlike the conditions specified in the main text, (ii) and (iii) also involve the controlled variable $C$. A similar model is employed in ref. [12]. From assumption (iii), we have $E[\varepsilon \mid C, Z] = E[\varepsilon \mid C]$, and on this basis we further obtain

(13) $$E\big[(Y - f(X,C) - E[\varepsilon \mid C])\, h(Z,C)\big] = 0$$

for every measurable function $h$. Note that $E[\varepsilon \mid C]$ is conditioned only on $C$, remains constant for arbitrary values of $C$, and is typically nonzero. To adapt our method to this model, we only need to use the kernel $k$ with $(Z, C)$ as inputs and keep in mind that the output of the method is an estimate of $f'(X,C) \coloneqq f(X,C) + E[\varepsilon \mid C]$ rather than of $f(X,C)$ itself. For simplicity, we fit $f$ directly to the binary $Y$ without a link such as the logistic transform $g(x) = e^x/(1+e^x)$, because such a transform would require a nontrivial modification of the analytical CV error, and we want to test MMR-IV (Nyström) with the exact analytical error proposed in this article.
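In practice, using the kernel $k$ "with $(Z, C)$ as inputs" can be implemented by concatenating the instrument and the control before evaluating the kernel; this concatenation is our reading of that statement, not a prescription from the released code.

```python
import numpy as np

def kernel_on_instrument_and_control(kernel, z, c):
    """Evaluate an instrument kernel on the concatenated pair (Z, C).

    `kernel` is any function mapping two sample arrays to a Gram matrix,
    e.g. the sum-of-Gaussians kernel sketched in Section 7.1.1.
    """
    zc = np.column_stack([z, c])      # each row is (z_i, c_i)
    return kernel(zc, zc)
```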

8 Conclusion and discussion

Learning causal relations when hidden confounders are present is a cornerstone of reliable decision-making. IV regression is a standard tool for this task, but it currently faces challenges in nonlinear settings. This work presents a simple kernel-based framework that overcomes some of these challenges. We employ RKHS theory to reformulate the CMR as an MMR, based on which we can approach the problem from the ERM perspective. As we demonstrate, this framework not only facilitates theoretical analysis but also results in easy-to-use algorithms that perform well in practice compared to existing methods. This article also shows a way of combining the elegant theoretical machinery of kernel methods with the practical merits of deep NNs. Despite these advantages, the optimal choice of the kernel $k$ in the MMR objective remains an open question that we aim to address in future work.

The efficiency issue, i.e., the magnitude of the error of our estimator, is one of the major unsolved problems. Importantly, since the efficiency depends heavily on the choice of the kernel function $k$, an appropriate method of kernel selection must be discussed. This dependence has been clarified by the analysis of errors in the minimax problem in ref. [50], which shows that the error is upper bounded by a metric entropy of the RKHS associated with the kernel $k$. Hence, the selection of optimal kernel functions is an important direction for future work.

In the classical IV problem with parametric models, the optimal choice of IV is based on the variance of the asymptotic distribution of the estimators [30]. However, our setup corresponds to a nonparametric regression problem, which requires a more involved approach. A recent work [72] tackles this problem, but some arbitrariness remains. Another relevant development is the recent work on the variational method of moments [73], which provides an optimally weighted objective and a closed-form expression of the estimator that can achieve the semiparametric efficiency bound for solving the CMR. This approach would be highly relevant to the discussion of efficiency in our setting.

Acknowledgments

We thank Yuchen Zhu, Vasilis Syrgkanis, and Heiner Kremer for fruitful discussions. We are also indebted to anonymous reviewers for their constructive feedback on the initial draft of this manuscript.

  1. Funding information: Rui Zhang was sponsored by PhD scholarship from Data61 and by the Empirical Inference Department of Max Planck Institute. Masaaki Imaizumi was supported by JSPS KAKENHI (Grant Number 18K18114) and JST Presto (Grant Number JPMJPR1852). Bernhard Schölkopf is a member of the Machine Learning Cluster of Excellence, EXC number 2064/1 – Project number 390727645. This work was partly supported by the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039B.

  2. Author contributions: All authors have accepted responsibility for the entire content of this article and approved its submission. The study and manuscript have received significant contributions from all authors whose contributions are reflected in the list of authors.

  3. Conflict of interest: The authors have a conflict of interest with the Australian National University, Data61 of Commonwealth Scientific and Industrial Research Organisation (CSIRO), Australia, Max Planck Institute (MPI), CISPA–Helmholtz Center for Information Security, and the University of Tokyo.

  4. Ethical approval: The conducted research is not related to either human or animal use.

  5. Data availability statement: The datasets generated during and analyzed during the current study are available at https://github.com/RuiZhang2016/MMRIV, where the implementation of the approaches is also provided.

Appendix A Detailed proofs

This section contains detailed proofs of the results that are omitted from the main article. Most of the proofs of consistency and asymptotic normality take advantage of the useful resource of ref. [53]; readers are referred to it for more detailed discussions of the assumptions. Note that our proofs are based on normed spaces.

A.1 Proof of Lemma 1

Proof

Since $\mathcal{H}_k$ is an RKHS, we can rewrite (3) as follows:

(A1) $$R_k(f) = \sup_{h \in \mathcal{H}_k, \|h\| \le 1} \big(E[(Y - f(X))\langle h, k(Z,\cdot)\rangle]\big)^2 = \sup_{h \in \mathcal{H}_k, \|h\| \le 1} \big(\langle h, E[(Y - f(X)) k(Z,\cdot)]\rangle\big)^2 = \big\|E[(Y - f(X)) k(Z,\cdot)]\big\|_{\mathcal{H}_k}^2,$$

where we used the reproducing property of $k$ in the first equality and the fact that $\mathcal{H}_k$ is a vector space in the last equality. By assumption, $E[(Y - f(X))k(Z,\cdot)]$ is Bochner integrable [37, Def. A.5.20]. Hence, we can write (A1) as follows:

$$\begin{aligned}
\big\|E[(Y - f(X))k(Z,\cdot)]\big\|_{\mathcal{H}_k}^2 &= \big\langle E[(Y - f(X))k(Z,\cdot)],\, E[(Y' - f(X'))k(Z',\cdot)]\big\rangle_{\mathcal{H}_k} \\
&= E\big[\big\langle (Y - f(X))k(Z,\cdot),\, E[(Y' - f(X'))k(Z',\cdot)]\big\rangle_{\mathcal{H}_k}\big] \\
&= E\big[\big\langle (Y - f(X))k(Z,\cdot),\, (Y' - f(X'))k(Z',\cdot)\big\rangle_{\mathcal{H}_k}\big] \\
&= E\big[(Y - f(X))(Y' - f(X'))k(Z,Z')\big],
\end{aligned}$$

as required.□
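Numerically, the empirical versions of this risk (the V-statistic and U-statistic estimators mentioned in the abstract) are simple quadratic forms in the residuals. The snippet below, which uses an illustrative Gaussian kernel and synthetic residuals, is only a sanity check of these two sample forms, not part of the algorithm itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=(n, 1))                      # instrument samples
residual = rng.normal(size=n)                    # stands in for y_i - f(x_i)

sq = (z - z.T) ** 2                              # pairwise squared distances (1-d instrument)
K = np.exp(-sq / 2.0)                            # Gaussian kernel Gram matrix k(z_i, z_j)

risk_v = residual @ K @ residual / n**2          # V-statistic estimate of R_k(f)
risk_u = (residual @ K @ residual
          - residual**2 @ np.diag(K)) / (n * (n - 1))   # U-statistic estimate (i != j only)
print(risk_v, risk_u)
```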

A.2 Proof of Proposition 1

Proof

First, the law of iterated expectation implies that

$$E[(Y - f(X))k(Z,\cdot)] = E_Z\big[E_{XY}[(Y - f(X))k(Z,\cdot) \mid Z]\big] = E_Z\big[E_{XY}[Y - f(X) \mid Z]\, k(Z,\cdot)\big].$$

By Lemma 1, we know that $R_k(f) = \|E[(Y - f(X))k(Z,\cdot)]\|_{\mathcal{H}_k}^2$. As a result, $R_k(f) = 0$ if $E[Y - f(X)\mid z] = 0$ for $P_Z$-almost all $z$. To show the converse, we assume that $R_k(f) = 0$ and rewrite it as follows:

$$R_k(f) = \int_{\mathcal{Z}}\int_{\mathcal{Z}} g(z)\, k(z, z')\, g(z')\, \mathrm{d}z\, \mathrm{d}z' = 0,$$

where we define $g(z) \coloneqq E_{XY}[Y - f(X)\mid z]\, p(z)$. Since $k$ is ISPD by assumption, this implies that $g$ is a zero function with respect to $P_Z$, i.e., $E[Y - f(X)\mid z] = 0$ for $P_Z$-almost all $z$.□

A.3 Proof of Proposition 2

Proof

Given $\alpha \in (0,1)$ and any functions $f, g : \mathcal{X} \to \mathbb{R}$, we will show that

$$R_k(\alpha f + (1-\alpha) g) - \alpha R_k(f) - (1-\alpha) R_k(g) < 0.$$

By Lemma 1, we know that $R_k(f) = \|E[(Y - f(X))k(Z,\cdot)]\|_{\mathcal{H}_k}^2$. Hence, we can rewrite the aforementioned expression as follows:

$$\begin{aligned}
&R_k(\alpha f + (1-\alpha)g) - \alpha R_k(f) - (1-\alpha)R_k(g) \\
&\quad= \big\|E[(Y - \alpha f(X) - (1-\alpha)g(X))k(Z,\cdot)]\big\|_{\mathcal{H}_k}^2 - \alpha\big\|E[(Y - f(X))k(Z,\cdot)]\big\|_{\mathcal{H}_k}^2 - (1-\alpha)\big\|E[(Y - g(X))k(Z,\cdot)]\big\|_{\mathcal{H}_k}^2 \\
&\quad\overset{(a)}{=} \alpha(\alpha-1)\big\|E[(Y - f(X))k(Z,\cdot)]\big\|_{\mathcal{H}_k}^2 + \alpha(\alpha-1)\big\|E[(Y - g(X))k(Z,\cdot)]\big\|_{\mathcal{H}_k}^2 - 2\alpha(\alpha-1)\big\langle E[(Y - f(X))k(Z,\cdot)],\, E[(Y - g(X))k(Z,\cdot)]\big\rangle_{\mathcal{H}_k} \\
&\quad\overset{(b)}{=} \alpha(\alpha-1)\underbrace{\big\|E[(f(X) - g(X))k(Z,\cdot)]\big\|_{\mathcal{H}_k}^2}_{>0} < 0.
\end{aligned}$$

The equality (a) is obtained by writing $Y = \alpha Y + (1-\alpha)Y$ in $\|E[(Y - \alpha f(X) - (1-\alpha)g(X))k(Z,\cdot)]\|_{\mathcal{H}_k}^2$ on the left-hand side of (a). We note that the right-hand side of (a) is quadratic in $\|E[(Y - f(X))k(Z,\cdot)]\|_{\mathcal{H}_k}$ and $\|E[(Y - g(X))k(Z,\cdot)]\|_{\mathcal{H}_k}$ and can be further expressed as the square of a binomial, as on the right-hand side of (b). Therefore, the strict convexity follows from the facts that $k$ is an ISPD kernel, $E[f(X) - g(X)\mid Z]$ is not identically zero, and $\alpha(\alpha-1) < 0$.□

A.4 Uniform convergence of risk functionals

The results presented in this section are used to prove the consistency of $\hat{f}_V$ and $\hat{f}_U$.

Lemma A1

(Uniform consistency of $\hat{R}_V(f)$) Assume that $E[Y^2] < \infty$, $\mathcal{F}$ is compact, $E[\sup_{f\in\mathcal{F}} f(X)^2] < \infty$, and Assumption 1 holds. Then, the risk $R_k(f)$ is continuous in $f$ and $\sup_{f\in\mathcal{F}} |\hat{R}_V(f) - R_k(f)| \to_p 0$.

Proof

First, let $u \coloneqq (x,y,z)$, $u' \coloneqq (x',y',z')$, and $h_f(u,u') \coloneqq (y - f(x))(y' - f(x'))k(z,z')$ for $(x,y,z), (x',y',z') \in \mathcal{X}\times\mathcal{Y}\times\mathcal{Z}$. To prove that $\hat{R}_V$ converges uniformly to $R_k$, we need to show that (i) $h_f(u,u')$ is continuous at each $f\in\mathcal{F}$ with probability one and (ii) $E_{U,U'}[\sup_{f\in\mathcal{F}} |h_f(U,U')|] < \infty$ and $E_{U}[\sup_{f\in\mathcal{F}} |h_f(U,U)|] < \infty$ [53, Lemma 8.5]. To this end, it is easy to see that

$$|h_f(u,u')| = |(y - f(x))(y' - f(x'))k(z,z')| \le |y - f(x)|\,|y' - f(x')|\,|k(z,z')| \le |y - f(x)|\,|y' - f(x')|\sqrt{k(z,z)k(z',z')} \le (|y| + |f(x)|)(|y'| + |f(x')|)\sqrt{k(z,z)k(z',z')}.$$

The second inequality follows from the Cauchy–Schwarz inequality $|k(z,z')| \le \sqrt{k(z,z)k(z',z')}$. Since $\mathcal{F}$ is compact, $|f(x)|$ is bounded for every $f\in\mathcal{F}$ and bounded $x$; and since $k(\cdot,\cdot)$ is bounded by Assumption 1, we have $|h_f(u,u')| < \infty$, so $h_f(u,u')$ is continuous at each $f\in\mathcal{F}$ with probability one. Furthermore, we obtain the following inequalities:

$$\begin{aligned}
E_{U,U'}\Big[\sup_{f\in\mathcal{F}} |h_f(U,U')|\Big] &\le E\Big[\sup_{f\in\mathcal{F}} (|Y| + |f(X)|)(|Y'| + |f(X')|)\sqrt{k(Z,Z)k(Z',Z')}\Big] \le E\Big[\sup_{f\in\mathcal{F}}(|Y| + |f(X)|)\sup_{f\in\mathcal{F}}(|Y'| + |f(X')|)\Big]\sup_z k(z,z) \\
&= \Big(E\Big[\sup_{f\in\mathcal{F}}(|Y| + |f(X)|)\Big]\Big)^2 \sup_z k(z,z) = \Big(E\Big[|Y| + \sup_{f\in\mathcal{F}}|f(X)|\Big]\Big)^2 \sup_z k(z,z) < \infty, \\
E_{U}\Big[\sup_{f\in\mathcal{F}} |h_f(U,U)|\Big] &\le E\Big[\sup_{f\in\mathcal{F}}(|Y| + |f(X)|)^2\Big]\sup_z k(z,z) = E\Big[\Big(|Y| + \sup_{f\in\mathcal{F}}|f(X)|\Big)^2\Big]\sup_z k(z,z) \le 2\Big(E[Y^2] + E\Big[\sup_{f\in\mathcal{F}} f(X)^2\Big]\Big)\sup_z k(z,z) < \infty.
\end{aligned}$$

Hence, our assertion follows from [53, Lemma 8.5].□

Lemma A2

(Uniform consistency of $\hat{R}_U(f)$) Assume that $E[|Y|] < \infty$, $\mathcal{F}$ is compact, $E[|f(X)|] < \infty$, and Assumption 1 holds. Then, $R_k(f)$ is continuous in $f$ and $\sup_{f\in\mathcal{F}}|\hat{R}_U(f) - R_k(f)| \to_p 0$.

Proof

First, let $u \coloneqq (x,y,z)$, $u' \coloneqq (x',y',z')$, and $h_f(u,u') \coloneqq (y - f(x))(y' - f(x'))k(z,z')$ for $(x,y,z),(x',y',z') \in \mathcal{X}\times\mathcal{Y}\times\mathcal{Z}$. To prove the uniform consistency of $\hat{R}_U$, we need to show that (i) $h_f(u,u')$ is continuous at each $f \in \mathcal{F}$ with probability one; (ii) there is $d(u,u')$ with $|h_f(u,u')| \le d(u,u')$ for all $f \in \mathcal{F}$ and $E_{U,U'}[d(U,U')] < \infty$ [53, Lemma 2.4]; and (iii) $(u_i, u_j)_{i\neq j}^{n,n}$ is strictly stationary and ergodic in the sense of [53, Footnote 18, p. 2129]. To this end, it is easy to see that

$$|h_f(u,u')| = |(y - f(x))(y' - f(x'))k(z,z')| \le |y - f(x)|\,|y' - f(x')|\,|k(z,z')| \le |y - f(x)|\,|y' - f(x')|\sqrt{k(z,z)k(z',z')} \le (|y| + |f(x)|)(|y'| + |f(x')|)\sqrt{k(z,z)k(z',z')} \eqqcolon d(u,u').$$

The second inequality again follows from the Cauchy–Schwarz inequality. Since $\mathcal{F}$ is compact, $|f(x)|$ is bounded for every $f\in\mathcal{F}$ and bounded $x$, and since $k(\cdot,\cdot)$ is bounded by Assumption 1, we have $|h_f(u,u')| < \infty$; this proves that (i) $h_f(u,u')$ is continuous at each $f\in\mathcal{F}$ with probability one. To prove (ii) $E_{U,U'}[d(U,U')] < \infty$, we show that

$$E_{U,U'}[d(U,U')] \le \big(E[|Y| + |f(X)|]\big)^2 \sup_z k(z,z) < \infty.$$

Furthermore, we show that $(u_i,u_j)_{i\neq j}^{n,n}$ is strictly stationary and ergodic. Strict stationarity means that the distribution of a set of data $(u_i, u_{j\neq i})_{i=1,j=1}^{i+m,\, j+m'}$ does not depend on the starting indices $i, j$ for any $m$ and $m'$, which is easy to check. Ergodicity means that $\hat{R}_U(f) \to_p R_k(f)$ for all $f\in\mathcal{F}$ with $E[|h_f(U,U')|] < \infty$. We have already shown that $h_f(u,u')$ is bounded and $E[|h_f(U,U')|] < \infty$, so $\hat{R}_U(f) \to_p R_k(f)$ follows by [74, p. 25]. Therefore, ergodicity holds, and we have shown all conditions required by the extended results of [53, Lemma 2.4]. It then follows that $\sup_{f\in\mathcal{F}}|\hat{R}_U(f) - R_k(f)| \to_p 0$ and that $R_k(f)$ is continuous.□

A.5 Indefiniteness of the weight matrix $W_U$

Theorem A1

If Assumption 1 holds, $W_U$ is indefinite.

Proof

By definition, we have

$$W_U = \frac{1}{n(n-1)}\big[K(\mathbf{z},\mathbf{z}) - \mathrm{diag}(k(z_1,z_1), \ldots, k(z_n,z_n))\big] \eqqcolon \frac{1}{n(n-1)} K_U,$$

where $\mathrm{diag}(a_1,\ldots,a_n)$ denotes the $n\times n$ diagonal matrix with diagonal elements $a_1,\ldots,a_n$. The diagonal elements of $K_U$ are zero, and therefore $\mathrm{trace}(W_U) = 0$. Let us denote the eigenvalues of $W_U$ by $\{\lambda_i\}_{i=1}^n$. Since $\sum_{i=1}^n \lambda_i = \mathrm{trace}(W_U) = 0$, there must exist both positive and negative eigenvalues (all eigenvalues being zero would yield the trivial case $W_U = 0$). As a result, $W_U$ is indefinite.□
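The indefiniteness of $W_U$ is easy to verify numerically: its trace is zero, so unless it vanishes it must have eigenvalues of both signs. A short check with an illustrative Gaussian kernel:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
z = rng.normal(size=(n, 1))
K = np.exp(-(z - z.T) ** 2 / 2.0)                  # Gram matrix K(z, z)

W_U = (K - np.diag(np.diag(K))) / (n * (n - 1))    # zero out the diagonal
eig = np.linalg.eigvalsh(W_U)
print(np.isclose(np.trace(W_U), 0.0), eig.min() < 0 < eig.max())
```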

A.6 Consistency of $\hat{f}_V$ with convex $\Omega(f)$

Theorem A2

(Consistency of $\hat{f}_V$ with convex $\Omega(f)$) Assume that $\mathcal{F}$ is a convex set, $f^*$ is an interior point of $\mathcal{F}$, $\Omega(f)$ is convex in $f$, $\lambda \to_p 0$, and Assumptions 1 and 2 hold. Then, $\hat{f}_V$ exists with probability approaching one and $\hat{f}_V \to_p f^*$.

Proof

Given that $\Omega(f)$ is convex in $f$, we prove the consistency based on [53, Theorem 2.7], which requires that (i) $R_k(f)$ is uniquely minimized at $f^*$; (ii) $\hat{R}_V(f) + \lambda\Omega(f)$ is convex; and (iii) $\hat{R}_V(f) + \lambda\Omega(f) \to_p R_k(f)$ for all $f\in\mathcal{F}$.

Recall that $\hat{R}_V(f) = \big\|\frac{1}{n}\sum_{i=1}^n (y_i - f(x_i))k(z_i,\cdot)\big\|_{\mathcal{H}_k}^2$, and by the law of large numbers, $\frac{1}{n}\sum_{i=1}^n (y_i - f(x_i))k(z_i,\cdot) \to_p E[(Y - f(X))k(Z,\cdot)]$. Then $\hat{R}_V(f) \to_p R_k(f)$ follows from the continuous mapping theorem [75] and the continuity of the function $g(\cdot) = \|\cdot\|_{\mathcal{H}_k}^2$. As $\lambda \to_p 0$, we obtain (iii) $\hat{R}_V(f) + \lambda\Omega(f) \to_p R_k(f)$ by Slutsky's theorem [45, Lemma 2.8]. Moreover, $\hat{R}_V(f)$ is convex because the weight matrix $W_V$ is positive definite, and hence (ii) $\hat{R}_V(f) + \lambda\Omega(f)$ is convex since $\Omega(f)$ is convex. Finally, condition (i) follows directly from Proposition 2, and given that $f^*$ is an interior point of the convex set $\mathcal{F}$, our assertion follows from [53, Theorem 2.7].□

A.7 Proof of Proposition 4

Proof

From the conditions of Lemma A1, we know that $\mathcal{F}$ is compact, $R_k(f)$ is continuous in $f$, and $\sup_{f\in\mathcal{F}}|\hat{R}_V(f) - R_k(f)| \to_p 0$. As Assumptions 1 and 2 hold, $R_k(f)$ is uniquely minimized at $f^*$. On the basis of the conditions that $\Omega(f)$ is bounded and $\lambda \to_p 0$, we obtain by Slutsky's theorem that

$$\sup_{f\in\mathcal{F}}\big|\hat{R}_V(f) + \lambda\Omega(f) - R_k(f)\big| \le \sup_{f\in\mathcal{F}}\big|\hat{R}_V(f) - R_k(f)\big| + \lambda\sup_{f\in\mathcal{F}}|\Omega(f)| \to_p 0.$$

Consequently, we assert the conclusion by [53, Theorem 2.1].□

A.8 Consistency of $\hat{f}_U$

Theorem A3

(Consistency of $\hat{f}_U$) Assume that the conditions of Lemma A2 and Assumption 2 hold, $\Omega(f)$ is a bounded function, and $\lambda \to_p 0$. Then $\hat{f}_U \to_p f^*$.

Proof

By the conditions of Lemma A2, we know that $\mathcal{F}$ is compact, $R_k(f)$ is continuous in $f$, and $\sup_{f\in\mathcal{F}}|\hat{R}_U(f) - R_k(f)| \to_p 0$. As Assumptions 1 and 2 hold, $R_k(f)$ is uniquely minimized at $f^*$. On the basis of the conditions that $\Omega(f)$ is bounded and $\lambda \to_p 0$, we obtain by Slutsky's theorem that

$$\sup_{f\in\mathcal{F}}\big|\hat{R}_U(f) + \lambda\Omega(f) - R_k(f)\big| \le \sup_{f\in\mathcal{F}}\big|\hat{R}_U(f) - R_k(f)\big| + \lambda\sup_{f\in\mathcal{F}}|\Omega(f)| \to_p 0.$$

Consequently, we assert the conclusion by [53, Theorem 2.1].□

A.9 Asymptotic normality of $\hat\theta_U$

In this section, we consider the regularized U-statistic risk $\hat{R}_{U,\lambda}(f_\theta)$. For $u_i \coloneqq (x_i,y_i,z_i)$ and $u_j \coloneqq (x_j,y_j,z_j)$, we express it in the compact form

$$\hat{R}_{U,\lambda}(f_\theta) \coloneqq \underbrace{\frac{1}{n(n-1)}\sum_{i=1}^n\sum_{j\neq i}^n h_\theta(u_i,u_j)}_{\hat{R}_U(f_\theta)} + \lambda\Omega(\theta), \qquad h_\theta(u_i,u_j) \coloneqq (y_i - f_\theta(x_i))\, k(z_i,z_j)\, (y_j - f_\theta(x_j)).$$

We will assume that $f_\theta$ and $\Omega(\theta)$ are twice continuously differentiable in $\theta$. The first-order derivative $\nabla_\theta \hat{R}_{U,\lambda}(f_\theta)$ can be written as

$$\nabla_\theta \hat{R}_{U,\lambda}(f_\theta) = \underbrace{\frac{1}{n(n-1)}\sum_{i=1}^n\sum_{j\neq i}^n \nabla_\theta h_\theta(u_i,u_j)}_{\nabla_\theta \hat{R}_U(f_\theta)} + \lambda\nabla_\theta\Omega(\theta), \qquad \nabla_\theta h_\theta(u_i,u_j) = -\big[(y_i - f_\theta(x_i))\nabla_\theta f_\theta(x_j) + (y_j - f_\theta(x_j))\nabla_\theta f_\theta(x_i)\big] k(z_i,z_j).$$
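As a concrete illustration of these formulas, the sketch below evaluates $\hat{R}_{U,\lambda}$ and its gradient for a linear model $f_\theta(x) = x^\top\theta$ with $\Omega(\theta) = \|\theta\|_2^2$; both the linear model and this choice of regularizer are assumptions made only for the example.

```python
import numpy as np

def u_risk_and_grad(theta, x, y, K, lam):
    """Regularized U-statistic risk and gradient for f_theta(x) = x @ theta, Omega = ||theta||^2."""
    n = len(y)
    r = y - x @ theta                      # residuals y_i - f_theta(x_i)
    K_off = K - np.diag(np.diag(K))        # keep only the i != j terms of the Gram matrix
    risk = r @ K_off @ r / (n * (n - 1)) + lam * theta @ theta
    grad = -2.0 * x.T @ (K_off @ r) / (n * (n - 1)) + 2.0 * lam * theta
    return risk, grad
```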

A.10 Asymptotic normality of $\nabla_\theta \hat{R}_{U,\lambda}(f_{\theta^*})$

We first show the asymptotic normality of $\nabla_\theta \hat{R}_{U,\lambda}(f_{\theta^*})$. We assume that there exists $z \in \mathcal{Z}$ such that $E_X[\nabla_\theta f_{\theta^*}(X)\mid z]\, p(z) \neq 0$ or $E_{XY}[Y - f_{\theta^*}(X)\mid z]\, p(z) \neq 0$. If both terms were zero for all $z\in\mathcal{Z}$, the matrix $\nabla^2_\theta \hat{R}_U(f_{\theta^*})$ would be singular and the asymptotic distribution would become much more complicated to analyze.

Lemma A3

Suppose that $f_\theta$ and $\Omega(\theta)$ are continuously differentiable in $\theta$, $E[\|\nabla_\theta h_{\theta^*}(U,U')\|_2^2] < \infty$, there exists $z\in\mathcal{Z}$ such that $E_X[\nabla_\theta f_{\theta^*}(X)\mid z]p(z) \neq 0$ or $E_{XY}[Y - f_{\theta^*}(X)\mid z]p(z) \neq 0$, and $\sqrt{n}\lambda \to_p 0$. Then,

$$\sqrt{n}\,\nabla_\theta \hat{R}_{U,\lambda}(f_{\theta^*}) \to_d \mathcal{N}\big(0,\ 4\,\mathrm{diag}\big(E_U\big[E_{U'}^2[\nabla_\theta h_{\theta^*}(U,U')]\big]\big)\big).$$

Proof

The proof follows [28, Sections 5.5.1 and 5.5.2]; we need to show that (i) $\nabla_\theta \hat{R}_U(f_{\theta^*}) \to_p 0$ and (ii) $\mathrm{Var}_U[E_{U'}[\nabla_\theta h_{\theta^*}(U,U')]] > 0$. (i) follows from the law of large numbers because $\nabla_\theta \hat{R}_U(f_{\theta^*})$ is a sample average estimate of $\nabla_\theta R_k(f_{\theta^*}) = 0$.

To prove (ii), we first note that $\mathrm{Var}_U[E_{U'}[\nabla_\theta h_{\theta^*}(U,U')]] = E_U[E^2_{U'}[\nabla_\theta h_{\theta^*}(U,U')]] - \underbrace{E^2_{U,U'}[\nabla_\theta h_{\theta^*}(U,U')]}_{=0} \ge 0$, where equality holds only if $E_{U'}[\nabla_\theta h_{\theta^*}(U,U')] = 0$ for any $U$, i.e.,

$$E_{U'}[\nabla_\theta h_{\theta^*}(U,U')] = -E_{X'Z'}\big[\nabla_\theta f_{\theta^*}(X')k(Z,Z')\big]\,(Y - f_{\theta^*}(X)) - E_{X'Y'Z'}\big[(Y' - f_{\theta^*}(X'))k(Z,Z')\big]\,\nabla_\theta f_{\theta^*}(X) = 0.$$

As this equation holds for any $Y$, the coefficient of $Y$ must be zero:

$$E_{X'Z'}\big[\nabla_\theta f_{\theta^*}(X')k(Z,Z')\big] = E_{Z'}\big[E_{X'}[\nabla_\theta f_{\theta^*}(X')\mid Z']\,k(Z,Z')\big] = 0,$$

which, since $k$ is ISPD, implies that $E[\nabla_\theta f_{\theta^*}(X')\mid Z']\,p(Z') = 0$ for any $Z'$. Similarly, the coefficient of $\nabla_\theta f_{\theta^*}(X)$ must be zero, which implies that $E_{X'Y'}[Y' - f_{\theta^*}(X')\mid Z']\,p(Z') = 0$ for any $Z'$. The two coefficients cannot both vanish (otherwise the given conditions would be contradicted), so $\mathrm{Var}_U[E_{U'}[\nabla_\theta h_{\theta^*}(U,U')]] > 0$. Further, by the condition $E[\|\nabla_\theta h_{\theta^*}(U,U')\|_2^2] < \infty$, we obtain $\sqrt{n}\,\nabla_\theta \hat{R}_U(f_{\theta^*}) \to_d \mathcal{N}(0, 4E_U[E^2_{U'}[\nabla_\theta h_{\theta^*}(U,U')]])$ as per [28, Section 5.5.1]. Finally, since $\sqrt{n}\lambda \to_p 0$ and $\|\nabla_\theta\Omega(\theta^*)\| < \infty$ (as $\Omega(\theta)$ is continuously differentiable), we assert the conclusion by Slutsky's theorem,

$$\sqrt{n}\,\nabla_\theta \hat{R}_{U,\lambda}(f_{\theta^*}) = \sqrt{n}\,\nabla_\theta \hat{R}_U(f_{\theta^*}) + \sqrt{n}\lambda\nabla_\theta\Omega(\theta^*) \to_d \mathcal{N}\big(0,\ 4\,\mathrm{diag}\big(E_U\big[E^2_{U'}[\nabla_\theta h_{\theta^*}(U,U')]\big]\big)\big).$$

This concludes the proof.□

A.11 Uniform consistency of $\nabla^2_\theta \hat{R}_{U,\lambda}(f_\theta)$

Next, we consider the second derivative $\nabla^2_\theta \hat{R}_{U,\lambda}(f_\theta)$ and show its uniform consistency. In what follows, $\|\cdot\|_F$ denotes the Frobenius norm. We can express $\nabla^2_\theta \hat{R}_{U,\lambda}(f_\theta)$ as

$$\nabla^2_\theta \hat{R}_{U,\lambda}(f_\theta) = \underbrace{\frac{1}{n(n-1)}\sum_{i=1}^n\sum_{j\neq i}^n \nabla^2_\theta h_\theta(u_i,u_j)}_{\nabla^2_\theta \hat{R}_U(f_\theta)} + \lambda\nabla^2_\theta\Omega(\theta), \qquad \nabla^2_\theta h_\theta(u_i,u_j) = \big[\nabla_\theta f_\theta(x_i)\nabla_\theta f_\theta(x_j)^\top - (y_i - f_\theta(x_i))\nabla^2_\theta f_\theta(x_j) + \nabla_\theta f_\theta(x_j)\nabla_\theta f_\theta(x_i)^\top - (y_j - f_\theta(x_j))\nabla^2_\theta f_\theta(x_i)\big] k(z_i,z_j).$$

Lemma A4

Suppose that $f_\theta$ and $\Omega(\theta)$ are twice continuously differentiable in $\theta$, $\Theta$ is compact, $E[|f_\theta(X)|] < \infty$, $E[\|\nabla_\theta f_\theta(X)\|_2] < \infty$, $E[\|\nabla^2_\theta f_\theta(X)\|_F] < \infty$, $E[|Y|] < \infty$, $\lambda \to_p 0$, and Assumption 1 holds. Then, $E[\nabla^2_\theta h_\theta(U,U')]$ is continuous in $\theta$ and

$$\sup_{\theta\in\Theta}\big\|\nabla^2_\theta \hat{R}_{U,\lambda}(f_\theta) - E[\nabla^2_\theta h_\theta(U,U')]\big\|_F \to_p 0.$$

Proof

The proof is similar to that of Lemma A2; both apply the extended results of [53, Lemma 2.4]. As the strict stationarity of $(u_i,u_j)_{i\neq j}$ in the sense of [53, Footnote 18, p. 2129] has been shown in Lemma A2, we only need to show that (i) $\nabla^2_\theta h_\theta(u,u')$ is continuous at each $\theta\in\Theta$ with probability one and (ii) there exists $d(u,u') \ge \|\nabla^2_\theta h_\theta(u,u')\|_F$ for all $\theta\in\Theta$ with $E[d(U,U')] < \infty$. Using the triangle inequality for the Frobenius norm, we obtain

$$\|\nabla^2_\theta h_\theta(u,u')\|_F \le \big[2\|\nabla_\theta f_\theta(x)\nabla_\theta f_\theta(x')^\top\|_F + (|y| + |f_\theta(x)|)\|\nabla^2_\theta f_\theta(x')\|_F + (|y'| + |f_\theta(x')|)\|\nabla^2_\theta f_\theta(x)\|_F\big]\,|k(z,z')| \eqqcolon d(u,u').$$

We first show that $d(u,u')$ is bounded for bounded $u, u'$. As $f_\theta$ is twice continuously differentiable in $\theta$ and $\Theta$ is compact, $f_\theta(x)$ is bounded, as is each entry of $\nabla_\theta f_\theta(x)$ and $\nabla^2_\theta f_\theta(x)$, for bounded $x$. Taking further into account that $k(\cdot,\cdot)$ is bounded by Assumption 1, we have $d(u,u') < \infty$ whenever $u, u'$ are bounded, and (i) follows: $\nabla^2_\theta h_\theta(u,u')$ is continuous at each $\theta\in\Theta$ with probability one because $f_\theta$ is twice continuously differentiable.

We then show (ii) $E_{U,U'}[d(U,U')] < \infty$ via the following inequalities:

$$E_{U,U'}[d(U,U')] \le 2E\big[\|\nabla_\theta f_\theta(X)\nabla_\theta f_\theta(X')^\top\|_F + (|Y| + |f_\theta(X)|)\|\nabla^2_\theta f_\theta(X')\|_F\big]\sup_z k(z,z) = 2\big(E[\|\nabla_\theta f_\theta(X)\|_2]\,E[\|\nabla_\theta f_\theta(X')\|_2] + E[|Y| + |f_\theta(X)|]\,E[\|\nabla^2_\theta f_\theta(X')\|_F]\big)\sup_z k(z,z) < \infty,$$

where each expectation is finite by the assumptions of the lemma.

Therefore, $\sup_{\theta\in\Theta}\|\nabla^2_\theta \hat{R}_U(f_\theta) - E[\nabla^2_\theta h_\theta(U,U')]\|_F \to_p 0$ follows from the extended results in the remarks of [53, Lemma 2.4]. Furthermore, since $\Omega(\theta)$ is twice continuously differentiable and $\Theta$ is compact, $\|\nabla^2_\theta\Omega(\theta)\|_F < \infty$ for any $\theta\in\Theta$. Finally, it follows from Slutsky's theorem that

$$\sup_{\theta\in\Theta}\big\|\nabla^2_\theta \hat{R}_{U,\lambda}(f_\theta) - E[\nabla^2_\theta h_\theta(U,U')]\big\|_F \le \sup_{\theta\in\Theta}\big\|\nabla^2_\theta \hat{R}_U(f_\theta) - E[\nabla^2_\theta h_\theta(U,U')]\big\|_F + \lambda\sup_{\theta\in\Theta}\|\nabla^2_\theta\Omega(\theta)\|_F \to_p 0.$$

This concludes the proof.□

Theorem A4

(Asymptotic normality of $\hat\theta_U$) Suppose that $H \coloneqq E[\nabla^2_\theta h_{\theta^*}(U,U')]$ is nonsingular, $\Theta$ is compact, $E[|f_\theta(X)|] < \infty$, $E[|Y|] < \infty$, $f_\theta$ and $\Omega(\theta)$ are twice continuously differentiable in $\theta$, $E[\|\nabla_\theta f_\theta(X)\|_2] < \infty$, $E[\|\nabla^2_\theta f_\theta(X)\|_F] < \infty$, $\sqrt{n}\lambda \to_p 0$, $R_k(f_\theta)$ is uniquely minimized at $\theta^*$, which is an interior point of $\Theta$, $E[\|\nabla_\theta h_{\theta^*}(U,U')\|_2^2] < \infty$, and Assumption 1 holds. Then

$$\sqrt{n}(\hat\theta_U - \theta^*) \to_d \mathcal{N}\big(0,\ 4H^{-1}\,\mathrm{diag}\big(E_U\big[E^2_{U'}[\nabla_\theta h_{\theta^*}(U,U')]\big]\big)\,H^{-1}\big).$$

Proof

The proof follows [53, Theorem 3.1]; we need to show that (i) $\hat\theta_U \to_p \theta^*$; (ii) $\hat{R}_{U,\lambda}(\theta)$ is twice continuously differentiable; (iii) $\sqrt{n}\,\nabla_\theta \hat{R}_{U,\lambda}(f_{\theta^*}) \to_d \mathcal{N}(0, 4E_U[E^2_{U'}[\nabla_\theta h_{\theta^*}(U,U')]])$; (iv) there is $H(\theta)$ that is continuous at $\theta^*$ with $\sup_{\theta\in\Theta}\|\nabla^2_\theta \hat{R}_{U,\lambda}(f_\theta) - H(\theta)\|_F \to_p 0$; and (v) $H(\theta^*)$ is nonsingular.

The proof of (i) is very similar to that of Theorem A3, except that we consider a finite-dimensional parameter space instead of a function space, so we omit the details. One first shows the uniform consistency $\sup_{\theta\in\Theta}|\hat{R}_{U,\lambda}(f_\theta) - R_k(f_\theta)| \to_p 0$ and the continuity of $R_k(f_\theta)$ in $\theta$ similarly to Lemma A2; here, the argument uses the conditions that $E[|Y|] < \infty$, $\Theta$ is compact, $E[|f_\theta(X)|] < \infty$, $f_\theta$ is twice continuously differentiable in $\theta$, and Assumption 1 holds. Then $\hat\theta_U \to_p \theta^*$ follows as in Theorem A3, using the extra condition that $R_k(f_\theta)$ is uniquely minimized at $\theta^*$.

Furthermore, from the conditions that $\Theta$ is compact, $f_\theta$ is twice continuously differentiable in $\theta$, $E[|f_\theta(X)|] < \infty$, $E[\|\nabla_\theta f_\theta(X)\|_2] < \infty$, $E[\|\nabla^2_\theta f_\theta(X)\|_F] < \infty$, $E[|Y|] < \infty$, and $k(z,z')$ is bounded (as implied by Assumption 1), we obtain (ii): $\hat{R}_{U,\lambda}(\theta)$ is twice continuously differentiable in $\theta$. Given that $H = E[\nabla^2_\theta h_{\theta^*}(U,U')] = \nabla^2_\theta R_k(\theta^*)$ is nonsingular and $R_k(f_\theta)$ is uniquely minimized at $\theta^*$, the Hessian matrix $H$ is positive definite,

$$H = 2E_{(XYZ),(X'Y'Z')}\big[\big(\nabla_\theta f_{\theta^*}(X)\nabla_\theta f_{\theta^*}(X')^\top - (Y - f_{\theta^*}(X))\nabla^2_\theta f_{\theta^*}(X')\big)k(Z,Z')\big] \succ 0.$$

If $E_X[\nabla_\theta f_{\theta^*}(X)\mid z]p(z) = 0$ and $E_{XY}[Y - f_{\theta^*}(X)\mid z]p(z) = 0$ for all $z\in\mathcal{Z}$, then $H = 0$, which contradicts $H \succ 0$. Therefore, there must exist $z$ such that $E_X[\nabla_\theta f_{\theta^*}(X)\mid z]p(z) \neq 0$ or $E_{XY}[Y - f_{\theta^*}(X)\mid z]p(z) \neq 0$. It then follows from Lemma A3 that (iii) $\sqrt{n}\,\nabla_\theta \hat{R}_{U,\lambda}(f_{\theta^*}) \to_d \mathcal{N}(0, 4E_U[E^2_{U'}[\nabla_\theta h_{\theta^*}(U,U')]])$.

Finally, by Lemma A4, we know that $H(\theta) = E[\nabla^2_\theta h_\theta(U,U')]$ and $H(\theta^*) = H$, so (iv) and (v) are satisfied. As all conditions of [53, Theorem 3.1] are now satisfied, we assert the conclusion.□

A.12 Proof of Theorem 1

We restate the notation:

$$\hat{R}_{V,\lambda}(f_\theta) \coloneqq \underbrace{\frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n h_\theta(u_i,u_j)}_{\hat{R}_V(f_\theta)} + \lambda\Omega(\theta), \qquad h_\theta(u_i,u_j) \coloneqq (y_i - f_\theta(x_i))\,k(z_i,z_j)\,(y_j - f_\theta(x_j)).$$

Lemma A5

Suppose that the conditions of Lemma A3 hold. Then $\sqrt{n}\,\nabla_\theta \hat{R}_{V,\lambda}(f_{\theta^*}) \to_d \mathcal{N}(0, 4E_U[E^2_{U'}[\nabla_\theta h_{\theta^*}(U,U')]])$.

Proof

Since $E[\|\nabla_\theta h_{\theta^*}(U,U')\|_2^2] < \infty$, $\sqrt{n}\,\nabla_\theta \hat{R}_V(f_{\theta^*})$ has the same limiting distribution as $\sqrt{n}\,\nabla_\theta \hat{R}_U(f_{\theta^*})$ by [28, Section 5.7.3]. Furthermore, since $\sqrt{n}\lambda \to_p 0$ and $\|\nabla_\theta\Omega(\theta^*)\| < \infty$ (as $\Omega(\theta)$ is continuously differentiable), we assert the conclusion by Slutsky's theorem,

$$\sqrt{n}\,\nabla_\theta \hat{R}_{V,\lambda}(f_{\theta^*}) = \sqrt{n}\,\nabla_\theta \hat{R}_V(f_{\theta^*}) + \sqrt{n}\lambda\nabla_\theta\Omega(\theta^*) \to_d \mathcal{N}\big(0, 4E_U\big[E^2_{U'}[\nabla_\theta h_{\theta^*}(U,U')]\big]\big).\ \square$$

Lemma A6

Suppose that $f_\theta$ and $\Omega(\theta)$ are twice continuously differentiable in $\theta$, $\Theta$ is compact, $E[\sup_{\theta\in\Theta} f_\theta(X)^2] < \infty$, $E[\sup_{\theta\in\Theta}\|\nabla_\theta f_\theta(X)\|_2^2] < \infty$, $E[\sup_{\theta\in\Theta}\|\nabla^2_\theta f_\theta(X)\|_F^2] < \infty$, $E[Y^2] < \infty$, $\lambda \to_p 0$, and Assumption 1 holds. Then, $E[\nabla^2_\theta h_\theta(U,U')]$ is continuous in $\theta$ and $\sup_{\theta\in\Theta}\|\nabla^2_\theta \hat{R}_V(f_\theta) - E[\nabla^2_\theta h_\theta(U,U')]\|_F \to_p 0$.

Proof

We apply [53, Lemma 8.5] and need to show that (i) $\nabla^2_\theta h_\theta(u,u')$ is continuous at each $\theta\in\Theta$ with probability one, and (ii) $E[\sup_{\theta\in\Theta}\|\nabla^2_\theta h_\theta(U,U')\|_F] < \infty$ and $E[\sup_{\theta\in\Theta}\|\nabla^2_\theta h_\theta(U,U)\|_F] < \infty$.

We first observe that $\|\nabla^2_\theta h_\theta(u,u')\|_F$ and $\|\nabla^2_\theta h_\theta(u,u)\|_F$ are bounded for finite $u, u'$ because $f_\theta$ is twice continuously differentiable in $\theta$; it follows that (i) $\nabla^2_\theta h_\theta(u,u')$ is continuous in $\theta$ with probability one. We then derive upper bounds on $E[\sup_{\theta\in\Theta}\|\nabla^2_\theta h_\theta(U,U')\|_F]$ and $E[\sup_{\theta\in\Theta}\|\nabla^2_\theta h_\theta(U,U)\|_F]$ to show their boundedness,

$$E\Big[\sup_{\theta\in\Theta}\|\nabla^2_\theta h_\theta(U,U')\|_F\Big] \le 2E\Big[\sup_{\theta\in\Theta}\big(\|\nabla_\theta f_\theta(X)\nabla_\theta f_\theta(X')^\top\|_F + (|Y| + |f_\theta(X)|)\|\nabla^2_\theta f_\theta(X')\|_F\big)\Big]\sup_z k(z,z) \le \Big(2E\Big[\sup_{\theta\in\Theta}\|\nabla_\theta f_\theta(X)\|_2\Big]^2 + 2E\Big[|Y| + \sup_{\theta\in\Theta}|f_\theta(X)|\Big]E\Big[\sup_{\theta\in\Theta}\|\nabla^2_\theta f_\theta(X)\|_F\Big]\Big)\sup_z k(z,z) < \infty,$$

and

$$\begin{aligned}
E\Big[\sup_{\theta\in\Theta}\|\nabla^2_\theta h_\theta(U,U)\|_F\Big] &\le 2E\Big[\sup_{\theta\in\Theta}\big(\|\nabla_\theta f_\theta(X)\nabla_\theta f_\theta(X)^\top\|_F + (|Y| + |f_\theta(X)|)\|\nabla^2_\theta f_\theta(X)\|_F\big)\Big]\sup_z k(z,z) \\
&\le \Big(2E\Big[\sup_{\theta\in\Theta}\|\nabla_\theta f_\theta(X)\|_2^2\Big] + 2\sqrt{E\Big[\big(|Y| + \sup_{\theta\in\Theta}|f_\theta(X)|\big)^2\Big]E\Big[\sup_{\theta\in\Theta}\|\nabla^2_\theta f_\theta(X)\|_F^2\Big]}\Big)\sup_z k(z,z) \\
&\le \Big(2E\Big[\sup_{\theta\in\Theta}\|\nabla_\theta f_\theta(X)\|_2^2\Big] + E\Big[\big(|Y| + \sup_{\theta\in\Theta}|f_\theta(X)|\big)^2\Big] + E\Big[\sup_{\theta\in\Theta}\|\nabla^2_\theta f_\theta(X)\|_F^2\Big]\Big)\sup_z k(z,z) \\
&\le \Big(2E\Big[\sup_{\theta\in\Theta}\|\nabla_\theta f_\theta(X)\|_2^2\Big] + 2E\Big[Y^2 + \sup_{\theta\in\Theta}f_\theta(X)^2\Big] + E\Big[\sup_{\theta\in\Theta}\|\nabla^2_\theta f_\theta(X)\|_F^2\Big]\Big)\sup_z k(z,z) < \infty.
\end{aligned}$$

Thus, we assert the conclusion by [53, Lemma 8.5].□

Proof of Theorem 1

The proof is the same as that of Theorem A4 except that $\hat{R}_U$ is replaced by $\hat{R}_V$ (with Lemmas A5 and A6 used in place of Lemmas A3 and A4).□

A.13 Asymptotic normality in the infinite-dimensional case

We first state the asymptotic normality theorem for $\hat{f}_U$ and prove it. Afterward, we provide the proof of Theorem 2, which is a slightly modified version of the proof for $\hat{f}_U$.

Theorem A5

Suppose that Assumption 1 holds, $l$ is a bounded kernel, $k$ is a uniformly bounded function, and $\lambda \to \lambda_0$. Also, suppose that $\mathcal{X}$, $\mathcal{Z}$, and $\mathcal{Y}$ are compact spaces and that there exist $s \in (0,2)$ and a constant $C_{\mathcal{H}} > 0$ such that $\log N(\varepsilon, \mathcal{H}_l, \|\cdot\|_{L^\infty}) \le C_{\mathcal{H}}\,\varepsilon^{-s}$ for any $\varepsilon \in (0,1)$. If $|\lambda - \lambda_0| = o(n^{-1/2})$ holds, then there exists a Gaussian process $\mathbb{G}_P$ such that

$$\sqrt{n}\big(\hat{f}_U - f_{\lambda_0}\big) \rightsquigarrow \mathbb{G}_P \quad \text{in } \mathcal{H}_l.$$

The exact covariance of $\mathbb{G}_P$ is described in the proof. The proof is based on the uniform convergence of U-processes over the function space [76] and the functional delta method applied to the asymptotic expansion of the loss function [29]. This asymptotic normality allows us to perform statistical inference, such as tests, even in the nonparametric case.

To prove the theorem, we introduce some notation. Let $P$ be the probability measure that generates $u = (x,y,z)$ and let $\mathcal{W} \coloneqq \mathcal{X}\times\mathcal{Y}\times\mathcal{Z}$. Also, define $h_f(u,u') = (y - f(x))(y' - f(x'))k(z,z')$ and let $\mathcal{H} \coloneqq \{h_f : \mathcal{W}\times\mathcal{W} \to \mathbb{R} \mid f \in \mathcal{H}_l\}$. For preparation, we define $P_1 h_f : \mathcal{W} \to \mathbb{R}$ as $P_1 h_f(\cdot) \coloneqq \frac{1}{2}\int \big(h_f(u,\cdot) + h_f(\cdot,u)\big)\,\mathrm{d}P(u)$ for $h_f \in \mathcal{H}$. For a signed measure $Q$ on $\mathcal{W}$, we define the product measure $Q^2 \coloneqq Q\otimes Q$ on $\mathcal{W}\times\mathcal{W}$. Then, we can rewrite the U-statistic risk as

$$\hat{R}_U(f) = \frac{(n-2)!}{n!}\sum_{i=1}^n\sum_{j\neq i}^n h_f(u_i,u_j) \eqqcolon U_n^2 h_f,$$

where $U_n^2$ is the empirical measure for the U-statistic. Similarly, we can rewrite the V-statistic risk as

$$\hat{R}_V(f) = \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n h_f(u_i,u_j) \eqqcolon P_n^2 h_f,$$

where $P_n$ is the empirical measure of $u$.

Further, we define functionals associated with measures. We consider the function spaces $\mathcal{G}_1 \coloneqq \{g : \mathcal{W}\times\mathcal{W} \to [0,1] \mid \exists\ \text{a convex set } \omega\ \text{s.t. } g = \mathbf{1}\{\omega\}\}$ and $\mathcal{G}_2 \coloneqq \{g : \mathcal{W}\times\mathcal{W} \to [0,1] \mid \exists f, f' \in \mathcal{H}_l,\ g(u,u') = h_f(u,u')(f'(x) + f'(x'))\}$. Note that $\mathcal{G}_1$ contains functions corresponding to $U_n^2$. Then, we consider the set of functionals

$$B_S \coloneqq \Big\{F : \mathcal{G}_1\cup\mathcal{G}_2 \to \mathbb{R} \ \Big|\ \exists\ \text{nonzero finite } Q^2\ \text{s.t. } F(g) = \int g\,\mathrm{d}Q^2,\ \forall g\in\mathcal{G}_1\cup\mathcal{G}_2\Big\},$$

and we let $B_0$ be the closed linear span of $B_S$. For a functional $F\in B_S$, let $\iota(F)$ be the measure that satisfies

$$F(g) = \int g\,\mathrm{d}\iota(F).$$

Uniform central limit theorem: We first establish the uniform convergence. Note that the measure $P^2$ satisfies $P^2 h_f = E_{(U,U')}[h_f(U,U')]$. The convergence theorem for U-processes is as follows.

Theorem A6

(Theorem 4.4 in ref. [76]) Suppose that $\mathcal{H}$ is a class of uniformly bounded, symmetric functions such that $\{P_1 h_f \mid h_f\in\mathcal{H}\}$ is a Donsker class and

$$\lim_{n\to\infty} E\big[n^{-1/2}\log N\big(n^{-1/2}\varepsilon, \mathcal{H}, L_1(U_n^2)\big)\big] = 0$$

holds for all $\varepsilon > 0$. Then, we obtain

$$\sqrt{n}\big(U_n^2 - P^2\big) \rightsquigarrow 2\mathbb{G}_{P_1} \quad \text{in } \ell^\infty(\mathcal{H}).$$

Here, $\mathbb{G}_{P_1}$ denotes a Brownian bridge, i.e., a Gaussian process on $\mathcal{H}$ with zero mean and covariance

$$E_U\big[P_1 h_f(U)\, P_1 h_{f'}(U)\big] - E_U\big[P_1 h_f(U)\big]\, E_U\big[P_1 h_{f'}(U)\big],$$

for $h_f, h_{f'} \in \mathcal{H}$.

To apply this theorem, we have to show that $\mathcal{H}$ satisfies its conditions. We first provide the following bound.

Lemma A7

For any $f, f' \in \mathcal{H}_l$ such that $\max(\|f\|_{L^\infty}, \|f'\|_{L^\infty}) \le B$, any $y, y' \in [-B, B]$, and any $u, u' \in \mathcal{W}$, we have

$$|h_f(u,u') - h_{f'}(u,u')| \le 4B\,|k(z,z')|\,\|f - f'\|_{L^\infty}.$$

Proof of Lemma A7

We simply obtain the following:

$$\begin{aligned}
h_f(u,u') - h_{f'}(u,u') &= (y - f(x))k(z,z')(y' - f(x')) - (y - f'(x))k(z,z')(y' - f'(x')) \\
&= k(z,z')\big[(y - f(x))(y' - f(x')) - (y - f'(x))(y' - f'(x'))\big] \\
&= k(z,z')\big[-y(f(x') - f'(x')) - y'(f(x) - f'(x)) + f(x)f(x') - f'(x)f'(x')\big] \\
&= k(z,z')\big[-y(f(x') - f'(x')) - y'(f(x) - f'(x)) + f(x)(f(x') - f'(x')) + f'(x')(f(x) - f'(x))\big],
\end{aligned}$$
so that
$$|h_f(u,u') - h_{f'}(u,u')| \le |k(z,z')|\big\{|y|\,|f(x') - f'(x')| + |y'|\,|f(x) - f'(x)| + |f(x)|\,|f(x') - f'(x')| + |f'(x')|\,|f(x) - f'(x)|\big\} \le 4B\,|k(z,z')|\,\|f - f'\|_{L^\infty},$$

as required.□

Lemma A8

Suppose that the assumptions of Theorem 2 hold. Then, the following hold:

  1. $\{P_1 h_f \mid h_f \in \mathcal{H}\}$ is a Donsker class.

  2. For any $\varepsilon > 0$,

    $$\lim_{n\to\infty} E\big[n^{-1/2}\log N\big(n^{-1/2}\varepsilon, \mathcal{H}, L_1(U_n^2)\big)\big] = 0.$$

Proof of Lemma A8

For preparation, fix $\varepsilon > 0$ and set $N \coloneqq N(\varepsilon, \mathcal{H}_l, \|\cdot\|_{L^\infty})$. Also, let $Q$ be an arbitrary finite discrete measure. Then, by the definition of a bracketing number, there exist $N$ functions $\{f_i\}_{i=1}^N \subset \mathcal{H}_l$ such that for any $f\in\mathcal{H}_l$ there exists $i\in\{1,2,\ldots,N\}$ with $\|f - f_i\|_{L^\infty} \le \varepsilon$.

For the first condition, as shown in equation (2.1.7) in ref. [55], it is sufficient to show that

$$\sup_Q \log N\big(\varepsilon, \{P_1 h_f \mid h_f\in\mathcal{H}\}, L_2(Q)\big) \le c\,\varepsilon^{-2\delta} \quad (\varepsilon \to 0),$$

for arbitrary $\delta\in(0,1)$, where $c > 0$ is some constant and the supremum is taken over all possible finite discrete measures $Q$. To this end, it is sufficient to show that $\log N(\varepsilon, \{P_1 h_f \mid h_f\in\mathcal{H}\}, L_2(Q)) \le c'\log N(\varepsilon, \mathcal{H}_l, \|\cdot\|_{L^\infty})$ for a constant $c' > 0$. Fix $P_1 h_f \in \{P_1 h_f \mid h_f\in\mathcal{H}\}$ arbitrarily and pick $f_i$ with $\|f - f_i\|_{L^\infty} \le \varepsilon$. Then, we have

$$\begin{aligned}
\|P_1 h_f - P_1 h_{f_i}\|^2_{L_2(Q)} &= \int\Big(\int \tfrac{1}{2}\big(h_f(u,u') + h_f(u',u)\big) - \tfrac{1}{2}\big(h_{f_i}(u,u') + h_{f_i}(u',u)\big)\,\mathrm{d}P(u)\Big)^2 \mathrm{d}Q(u') \\
&= \int\Big(\int h_f(u,u') - h_{f_i}(u,u')\,\mathrm{d}P(u)\Big)^2 \mathrm{d}Q(u') \le C\int\Big(\int |k(z,z')|\,\mathrm{d}P(u)\Big)^2 \mathrm{d}Q(u')\,\|f - f_i\|^2_{L^\infty} \le C'\|f - f_i\|^2_{L^\infty} \le C'\varepsilon^2,
\end{aligned}$$

with constants $C, C' > 0$. The first inequality follows from Lemma A7 together with the boundedness of $f$, $f_i$, and $Y$; the second inequality follows from the boundedness of $k$ assumed in Theorem A5. Hence, the entropy condition yields the first statement.

For the second condition, we follow a similar strategy. For any $h_f\in\mathcal{H}$, we choose $i\in\{1,2,\ldots,N\}$ such that $\|f - f_i\|_{L^\infty} \le \varepsilon$ and bound

$$\|h_f - h_{f_i}\|_{L_1(U_n^2)} = \int |h_f(u,u') - h_{f_i}(u,u')|\,\mathrm{d}U_n^2(u,u') \le C\|f - f_i\|_{L^\infty} \le C\varepsilon,$$

with a constant $C > 0$. Hence, we have

$$E\big[n^{-1/2}\log N\big(n^{-1/2}\varepsilon, \mathcal{H}, L_1(U_n^2)\big)\big] \le n^{-1/2}\log N\big(n^{-1/2}\varepsilon, \mathcal{H}_l, \|\cdot\|_{L^\infty}\big) \le C n^{-1/2}\big(n^{-1/2}\varepsilon\big)^{-s} = C' n^{(s-1)/2} \to 0 \quad (n\to\infty),$$

since $s \in (0,1)$.□

From Theorem A6 and Lemma A8, we can restate the central limit theorem in terms of functionals. Note that $\iota^{-1}(U_n^2), \iota^{-1}(P^2) \in B_S$. Then, we obtain

$$\sqrt{n}\big(\iota^{-1}(U_n^2) - \iota^{-1}(P^2)\big) \rightsquigarrow 2\mathbb{G}_{P_1} \quad \text{in } \ell^\infty(\mathcal{H}).$$

Learning map and functional delta method: We consider a learning map $S_\lambda : B_S \to \mathcal{H}_l$. For a functional $F\in B_S$, we define

$$S_\lambda(F) \coloneqq \operatorname*{argmin}_{f\in\mathcal{H}_l}\ \iota(F)h_f + \lambda\|f\|^2_{\mathcal{H}_l}.$$

Obviously, we have

$$\hat{f} = S_\lambda\big(\iota^{-1}(U_n^2)\big), \qquad f_{\lambda_0} = S_{\lambda_0}\big(\iota^{-1}(P^2)\big).$$

We consider a derivative of $S_\lambda$ in the sense of Gateaux differentiation via the following steps.

First, we define a partial derivative of the map $R_{Q^2}(f)$. We investigate the optimality of the minimizer of

$$R_{Q^2,\lambda}(f) \coloneqq \int h_f(u,u')\,\mathrm{d}Q^2(u,u') + \lambda\|f\|^2_{\mathcal{H}_l}.$$

To this end, we consider the derivative $\partial R_{Q^2,\lambda}[f] : \mathcal{H}_l \to \mathbb{R}$ in a direction $f'\in\mathcal{H}_l$, defined by

$$\partial R_{Q^2,\lambda}[f](f') \coloneqq 2\lambda\langle f, f'\rangle_{\mathcal{H}_l} + \int \partial_{f,1}h_f(u,u')\,f'(x) + \partial_{f,2}h_f(u,u')\,f'(x')\,\mathrm{d}Q^2(u,u').$$

Here, $\partial_{f,1}h_f$ is the partial derivative of $h_f$ with respect to its first argument $f(x)$:

$$\partial_{f,1}h_f(u,u') = \partial_t\big|_{t=f(x)}\,(y - t)\,k(z,z')\,(y' - f(x')) = -(y' - f(x'))\,k(z,z'),$$

and $\partial_{f,2}h_f$ is defined analogously with respect to $f(x')$. The following lemma validates this derivative.

Lemma A9

If the assumptions in Theorem 2 hold, then $\partial R_{Q^2,\lambda}[f]$ is the Gateaux derivative of $R_{Q^2,\lambda}$ at $f$ in the direction $f'\in\mathcal{H}_l$.

Proof of Lemma A9

We consider a sequence of functions $h_n \in \mathcal{H}_l$, $n\in\mathbb{N}$, such that $h_n(x) \neq 0$ for all $x\in\mathcal{X}$ and $\|h_n\|_{L^\infty} \to 0$ as $n\to\infty$. Then, for $f\in\mathcal{H}_l$, a simple calculation yields

$$\frac{\big|R_{Q^2,\lambda}(f+h_n) - R_{Q^2,\lambda}(f) - \partial R_{Q^2,\lambda}[f](h_n)\big|}{\|h_n\|_{L^\infty}} \le \|h_n\|_{L^\infty}^{-1}\left|\int k(z,z')\,h_n(x)\,h_n(x')\,\mathrm{d}Q^2(u,u')\right| \le \|h_n\|_{L^\infty}^{-1}\int |k(z,z')|\,\|h_n\|_{L^\infty}^2\,\mathrm{d}Q^2(u,u') \le \|h_n\|_{L^\infty}\int |k(z,z')|\,\mathrm{d}Q^2(u,u') \to 0 \quad (n\to\infty),$$

where the first inequality uses the expansion of $h_{f+h_n} - h_f$: the first-order terms $(y - f(x))h_n(x')k(z,z')$ and $(y' - f(x'))h_n(x)k(z,z')$ cancel against $\partial R_{Q^2,\lambda}[f](h_n)$, leaving only the term involving $h_n(x)h_n(x')$.

The convergence follows from the definition of $h_n$ and the absolute integrability of $k$, which in turn follows from the boundedness of $k$ and the compactness of $\mathcal{Z}$. This proves the statement.□

Here, we consider an RKHS-type formulation of $R_{Q^2,\lambda}$, which is convenient for describing the minimizer. Let $\Phi_l : \mathcal{X} \to \mathcal{H}_l$ be the feature map associated with the RKHS $\mathcal{H}_l$, such that $\langle\Phi_l[x], f\rangle_{\mathcal{H}_l} = f(x)$ for any $x\in\mathcal{X}$ and $f\in\mathcal{H}_l$. Let $\tilde{R}_{Q^2,\lambda} : \mathcal{H}_l \to \mathcal{H}_l$ be the operator

$$\tilde{R}_{Q^2,\lambda}(f) \coloneqq 2\lambda f + \int \partial_{f,1}h_f(u,u')\,\Phi_l[x](\cdot) + \partial_{f,2}h_f(u,u')\,\Phi_l[x'](\cdot)\,\mathrm{d}Q^2(u,u').$$

Obviously, $\partial R_{Q^2,\lambda}[f](\cdot) = \langle\tilde{R}_{Q^2,\lambda}(f), \cdot\rangle_{\mathcal{H}_l}$. Now, we can describe the first-order condition of the minimizer of the risk, namely,

$$\hat{f} = \operatorname*{argmin}_{f\in\mathcal{H}_l} R_{Q^2,\lambda}(f) \iff \tilde{R}_{Q^2,\lambda}(\hat{f}) = 0.$$

This equivalence follows from Theorem 7.4.1 and Lemma 8.7.1 in ref. [77].

Next, we apply the implicit function theorem to obtain an explicit formula for the derivative of $S_\lambda$. To this end, we consider the second-order derivative $\partial^2\tilde{R}_{Q^2,\lambda} : \mathcal{H}_l \to \mathcal{H}_l$ given by

$$\partial^2\tilde{R}_{Q^2,\lambda}(f) \coloneqq 2\lambda f + \int k(z,z')\big(f(x')\,\Phi_l[x](\cdot) + f(x)\,\Phi_l[x'](\cdot)\big)\,\mathrm{d}Q^2(u,u'),$$

which follows from (b) in Lemma A.2 in ref. [29]. Its basic properties are provided in the following result.

Lemma A10

If Assumption 1 and the assumptions in Theorem 2 hold, then $\partial^2\tilde{R}_{Q^2,\lambda}$ is a continuous linear operator and is invertible.

Proof of Lemma A10

By (b) in Lemma A.2 in ref. [29], $\partial^2\tilde{R}_{Q^2,\lambda}$ is a continuous linear operator. In the following, we define $A : \mathcal{H}_l \to \mathcal{H}_l$ as $A(f) = \int k(z,z')\big(f(x')\Phi_l[x](\cdot) + f(x)\Phi_l[x'](\cdot)\big)\,\mathrm{d}Q^2(u,u')$. To show that $\partial^2\tilde{R}_{Q^2,\lambda}$ is invertible, it is sufficient to show that (i) $\partial^2\tilde{R}_{Q^2,\lambda}$ is injective and (ii) $A$ is a compact operator.

For the injectivity, we fix a nonzero $f\in\mathcal{H}_l$ and obtain

$$\begin{aligned}
\|\partial^2\tilde{R}_{Q^2,\lambda}(f)\|^2_{\mathcal{H}_l} &= \langle 2\lambda f + A(f),\, 2\lambda f + A(f)\rangle_{\mathcal{H}_l} = 4\lambda^2\|f\|^2_{\mathcal{H}_l} + 4\lambda\langle f, A(f)\rangle_{\mathcal{H}_l} + \|A(f)\|^2_{\mathcal{H}_l} > 4\lambda\langle f, A(f)\rangle_{\mathcal{H}_l} \\
&= 4\lambda\Big\langle f, \int k(z,z')f(x')\Phi_l[x]\,\mathrm{d}Q^2(u,u')\Big\rangle_{\mathcal{H}_l} + 4\lambda\Big\langle f, \int k(z,z')f(x)\Phi_l[x']\,\mathrm{d}Q^2(u,u')\Big\rangle_{\mathcal{H}_l} \\
&= 4\lambda\int k(z,z')f(x)f(x')\,\mathrm{d}Q^2(u,u') + 4\lambda\int k(z,z')f(x')f(x)\,\mathrm{d}Q^2(u,u') \ge 0.
\end{aligned}$$

The last equality follows from the reproducing property of $\Phi_l$, and the last inequality follows from the ISPD property in Assumption 1.

For the compactness, we follow Lemma A.5 in ref. [29] and obtain that the operators $f \mapsto \int k(z,z')\, f(x)\, \Phi_l[x](\cdot) \,\mathrm{d}Q^2(u,u')$ and $f \mapsto \int k(z,z')\, f(x')\, \Phi_l[x'](\cdot) \,\mathrm{d}Q^2(u,u')$ are compact.□

We now define the Gateaux derivative of $S$. For a functional $F' \in \ell^\infty(G_1 \cup G_2)$, we define the following function:

$$S'_{Q^2,\lambda}(F') := -\nabla^2 \tilde{R}_{Q^2,\lambda}^{-1} \left( \int \partial_{f,1} h_{f_{Q^2}}(u,u')\, \Phi_l[x](\cdot) + \partial_{f,2} h_{f_{Q^2}}(u,u')\, \Phi_l[x'](\cdot) \,\mathrm{d}\iota(F')(u,u') \right),$$

where $f_{Q^2} = S_\lambda(\iota^{-1}(Q^2))$ and $Q^2$ is a signed measure on $\mathcal{W} \times \mathcal{W}$. Then, we provide the following derivative theorem.

Proposition A1

Suppose the assumptions in Theorem 2 hold. For $F \in B_S$, $F' \in \ell^\infty(G_1 \cup G_2)$, and $s \in \mathbb{R}$ such that $F + sF' \in B_S$, $S'_{\iota(F),\lambda}(F')$ is a Gateaux derivative of $S_\lambda$, namely,

$$\lim_{s \to 0} \left\| \frac{S_\lambda(F + sF') - S_\lambda(F)}{s} - S'_{\iota(F),\lambda}(F') \right\|_{\mathcal{H}_l} = 0 .$$

Proof of Proposition A1

This proof consists of the following two steps: (i) define a proxy operator $\Gamma$, and then (ii) prove the statement by the implicit function theorem.

(i) Define $\Gamma$: Note that $\iota(F')$ exists since $F + sF' \in B_S$ implies $F' \in B_0$. We define the following operator $\Gamma(s, f, \lambda) \in \mathcal{H}_l$ for $f \in \mathcal{H}_l$:

$$\begin{aligned}
\Gamma(s, f, \lambda) &:= \nabla \tilde{R}_{\iota(F) + s\iota(F'),\lambda}(f) \\
&= 2\lambda f + \int \partial_{f,1} h_f(u,u')\, \Phi_l[x](\cdot) + \partial_{f,2} h_f(u,u')\, \Phi_l[x'](\cdot) \,\mathrm{d}\iota(F)(u,u') \\
&\quad + s \int \partial_{f,1} h_f(u,u')\, \Phi_l[x](\cdot) + \partial_{f,2} h_f(u,u')\, \Phi_l[x'](\cdot) \,\mathrm{d}\iota(F')(u,u') .
\end{aligned}$$

For the simple derivatives of $\Gamma$, Lemma A.2 in ref. [29] provides

$$\partial_s \Gamma(s, f, \lambda) = \int \partial_{f,1} h_f(u,u')\, \Phi_l[x](\cdot) + \partial_{f,2} h_f(u,u')\, \Phi_l[x'](\cdot) \,\mathrm{d}\iota(F')(u,u')$$

and

$$\partial_f \Gamma(s, f, \lambda) = \nabla^2 \tilde{R}_{\iota(F) + s\iota(F'),\lambda} .$$

(ii) Apply the implicit function theorem: By the definition of $\Gamma$ and the optimality conditions, we have

$$\Gamma(s, f, \lambda) = 0 \iff f = S_\lambda(\iota(F) + s\iota(F')) .$$

Also, we obtain

$$\partial_f \Gamma(0, S_\lambda(F), \lambda) = \nabla^2 \tilde{R}_{\iota(F),\lambda} .$$

Then, for each $\lambda > 0$, by the implicit function theorem, there exists a smooth map $\varphi_\lambda : \mathbb{R} \to \mathcal{H}_l$ such that

$$\Gamma(s, \varphi_\lambda(s), \lambda) = 0, \quad \forall s,$$

and it satisfies

$$\partial_s \varphi_\lambda(0) = -\big( \partial_f \Gamma(0, \varphi_\lambda(0), \lambda) \big)^{-1} \big( \partial_s \Gamma(0, \varphi_\lambda(0), \lambda) \big) = S'_{\iota(F),\lambda}(F') .$$

Also, we have $\varphi_\lambda(s) = S(Q^2 + s\mu^2)$. Then, we have

$$\lim_{s \to 0} \left\| \frac{S_\lambda(F + sF') - S_\lambda(F)}{s} - S'_{\iota(F),\lambda}(F') \right\|_{\mathcal{H}_l} = \lim_{s \to 0} \left\| \frac{\varphi_\lambda(s) - \varphi_\lambda(0)}{s} - \partial_s \varphi_\lambda(0) \right\|_{\mathcal{H}_l} = 0 .$$

Then, we obtain the statement.□
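The following finite-dimensional analogue (ours, for intuition only) mirrors the implicit-function-theorem step: $\Gamma(s, \theta)$ is the gradient of a perturbed quadratic risk, $\theta(s)$ solves $\Gamma(s, \theta(s)) = 0$, and the derivative $\partial_s \theta(0) = -(\partial_\theta \Gamma)^{-1}(\partial_s \Gamma)$ is compared with a finite difference. The matrices and vectors are arbitrary placeholders.

```python
# Finite-dimensional illustration of the implicit function theorem step.
import numpy as np

rng = np.random.default_rng(2)
d = 3
M0 = rng.normal(size=(d, d))
M0 = M0 @ M0.T + np.eye(d)                 # base Hessian (positive definite)
M1 = rng.normal(size=(d, d))
M1 = M1 + M1.T                             # symmetric perturbation direction
b0, b1, lam = rng.normal(size=d), rng.normal(size=d), 0.5

def theta_of_s(s):
    """Minimizer of 0.5 theta^T (M0 + s M1) theta - (b0 + s b1)^T theta + lam ||theta||^2."""
    return np.linalg.solve(M0 + s * M1 + 2 * lam * np.eye(d), b0 + s * b1)

theta0 = theta_of_s(0.0)
# implicit function theorem: d theta / ds = -(d_theta Gamma)^{-1} (d_s Gamma)
d_theta_Gamma = M0 + 2 * lam * np.eye(d)
d_s_Gamma = M1 @ theta0 - b1
ift = -np.linalg.solve(d_theta_Gamma, d_s_Gamma)

s = 1e-6
fd = (theta_of_s(s) - theta0) / s
print(np.max(np.abs(fd - ift)))            # agrees up to O(s)
```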

Now, we are ready to prove Theorems 2 and A5.

Proof of Theorem A5

As a preparation, we mention that $S_\lambda$ is differentiable in the Hadamard sense: it is Gateaux differentiable by Proposition A1, and Lemmas A.7 and A.8 in ref. [29] show that $S_{\iota(F),\lambda,\iota(G)}$ is Hadamard-differentiable for any $\lambda$, $G$, and $F$.

Then, we apply the functional delta method. As shown in Theorem A6 and Lemma A8, we have

$$\sqrt{n}\, \big( \iota^{-1}(U_n^2) - \iota^{-1}(P^2) \big) \rightsquigarrow 2\, \mathbb{G}_{P^1} .$$

Hence, we obtain

$$\sqrt{n}\, \big( (\lambda_0/\lambda)\, \iota^{-1}(U_n^2) - \iota^{-1}(P^2) \big) = \frac{\lambda_0}{\lambda} \sqrt{n}\, \big( \iota^{-1}(U_n^2) - \iota^{-1}(P^2) \big) - \frac{\sqrt{n}\,(\lambda - \lambda_0)}{\lambda}\, \iota^{-1}(P^2) \rightsquigarrow 2\, \mathbb{G}_{P^1} ,$$

since $\lambda - \lambda_0 = o(n^{-1/2})$. By utilizing this result, we can obtain

$$\sqrt{n}\, (\hat{f} - f_{\lambda_0}) = \sqrt{n}\, \Big( S_{\lambda_0}\big( (\lambda_0/\lambda)\, \iota^{-1}(U_n^2) \big) - S_{\lambda_0}\big( \iota^{-1}(P^2) \big) \Big) + o_P(1) \rightsquigarrow S'_{P^2,\lambda_0}(2\, \mathbb{G}_{P^1})$$

in $\mathcal{H}$. The convergence follows from the functional delta method.□

Proof of Theorem 2

This proof is completed by replacing the central limit theorem part (Theorem A6) in the proof of Theorem A5 with its V-statistic counterpart. By Section 3 in ref. [78], the V- and U-processes have the same limit distribution asymptotically, so the same result holds.□
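As a small numerical illustration (ours) of why this substitution is harmless, the following snippet compares the U- and V-statistic estimates of the kernelized moment risk on simulated data; their difference is the diagonal correction, which shrinks roughly at the rate $O(1/n)$.

```python
# U- vs V-statistic estimates of the kernelized moment risk (illustrative).
import numpy as np

rng = np.random.default_rng(3)
for n in (100, 400, 1600):
    x = rng.normal(size=n)
    z = 0.5 * x + rng.normal(size=n)
    y = np.sin(x) + 0.3 * rng.normal(size=n)
    r = y - np.sin(x)                                  # residual at the true f
    K = np.exp(-0.5 * (z[:, None] - z[None, :]) ** 2)  # kernel on the instrument
    H = r[:, None] * K * r[None, :]                    # matrix of h(u_i, u_j)
    v_stat = H.sum() / n**2
    u_stat = (H.sum() - np.trace(H)) / (n * (n - 1))
    print(n, v_stat - u_stat)                          # shrinks roughly like 1/n
```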

A.14 Proof of Proposition 5

Proof of Proposition 5

For brevity, we write $\underline{\Lambda} = \Lambda_{\min}$ and $\overline{\Lambda} = \Lambda_{\max}$.

We start by investigating $H$ and $\check{H}$. By using the independence of $U = (X, Y, Z)$ and $U' = (X', Y', Z')$, we have

$$\begin{aligned}
H &= \mathrm{E}\big[ \nabla_\theta^2 \{ (Y - f_\theta(X)) (Y' - f_\theta(X'))\, k(Z,Z') \} \big] \\
&= -\mathrm{E}\big[ \nabla_\theta^2 f_\theta(X)\, \varepsilon'\, k(Z,Z') \big] - \mathrm{E}\big[ \nabla_\theta^2 f_\theta(X')\, \varepsilon\, k(Z,Z') \big] + 2\, \mathrm{E}\big[ \nabla_\theta f_\theta(X)\, k(Z,Z')\, \nabla_\theta f_\theta(X')^\top \big] \\
&= 2\, \mathrm{E}_{Z,Z'}\big[ \mathrm{E}_X[ \nabla_\theta f_\theta(X) \mid Z ]\, k(Z,Z')\, \nabla_\theta \mathrm{E}_X[ f_\theta(X) \mid Z' ]^\top \big] \\
&= 2 \int q(z)\, q(z')^\top\, k(z,z') \,\mathrm{d}P(z,z') \\
&= 2 \int\!\!\int q(z) p(z)\, q(z')^\top p(z')\, k(z,z') \,\mathrm{d}z\, \mathrm{d}z' ,
\end{aligned}$$

where $q : \mathcal{Z} \to \mathbb{R}^d,\ z \mapsto \mathrm{E}_X[ \nabla_\theta f_\theta(X) \mid z ]$, with $d$ the parameter dimension. The third equality follows from the fact that $\varepsilon = Y - f_\theta(X)$ is an independent noise variable, and the last equality follows from the independence of $Z$ and $Z'$. In the same way, $\check{H}$ is rewritten as follows:

$$\check{H} = 2 \int\!\!\int q(z) p(z)\, q(z')^\top p(z') \,\mathrm{d}z\, \mathrm{d}z' .$$

On the basis of these facts, we obtain the following form:

$$H = \check{H} \circ W ,$$

where $W$ is a $d \times d$ matrix whose elements all belong to $[\underline{\Lambda}, \overline{\Lambda}]$, and $\circ$ denotes the Hadamard product of matrices. In words, each element of $H$ is obtained by multiplying the corresponding element of $\check{H}$ by a coefficient, and the coefficients belong to $[\underline{\Lambda}, \overline{\Lambda}]$. By the same calculation, we obtain $\mathrm{E}_U[ \mathrm{E}_{U'}^2[ h_\theta(U,U') ] ] = \mathrm{E}_U[ \mathrm{E}_{U'}^2[ \check{h}_\theta(U,U') ] ]\, w^2$ with some $w \in [\underline{\Lambda}, \overline{\Lambda}]$.

By using the result, we study $H^{-1}$ with the decomposition

$$H^{-1} = \check{H}^{-1} + ( H^{-1} - \check{H}^{-1} ) .$$

Let $I$ be the $d \times d$ matrix whose elements are all $1$. We have the following difference:

$$H^{-1} - \check{H}^{-1} = H^{-1} ( \check{H} - H ) \check{H}^{-1} = H^{-1} ( \check{H} - W \circ \check{H} ) \check{H}^{-1} = H^{-1} \big( ( I - W ) \circ \check{H} \big) \check{H}^{-1} = \big( H^{-1} \check{H} \check{H}^{-1} \big) \circ ( I - W ) = H^{-1} \circ ( I - W ) .$$

Finally, we study $\Sigma_V$. It is decomposed as follows:

$$\begin{aligned}
\Sigma_V &= 4 H^{-1} \operatorname{diag}\!\big( \mathrm{E}_U[ \mathrm{E}_{U'}^2[ h_\theta(U,U') ] ] \big) H^{-1} \\
&= 4 \big( \check{H}^{-1} + ( H^{-1} - \check{H}^{-1} ) \big) \Big( \operatorname{diag}\!\big( \mathrm{E}_U[ \mathrm{E}_{U'}^2[ \check{h}_\theta(U,U') ] ] \big) + \big( \operatorname{diag}\!\big( \mathrm{E}_U[ \mathrm{E}_{U'}^2[ h_\theta(U,U') ] ] \big) - \operatorname{diag}\!\big( \mathrm{E}_U[ \mathrm{E}_{U'}^2[ \check{h}_\theta(U,U') ] ] \big) \big) \Big) \big( \check{H}^{-1} + ( H^{-1} - \check{H}^{-1} ) \big) \\
&= 4 \check{H}^{-1} \operatorname{diag}\!\big( \mathrm{E}_U[ \mathrm{E}_{U'}^2[ \check{h}_\theta(U,U') ] ] \big) \check{H}^{-1} + 4 ( H^{-1} - \check{H}^{-1} ) \operatorname{diag}\!\big( \mathrm{E}_U[ \mathrm{E}_{U'}^2[ \check{h}_\theta(U,U') ] ] \big) \check{H}^{-1} \\
&\quad + 8 \check{H}^{-1} \big( \operatorname{diag}\!\big( \mathrm{E}_U[ \mathrm{E}_{U'}^2[ h_\theta(U,U') ] ] \big) - \operatorname{diag}\!\big( \mathrm{E}_U[ \mathrm{E}_{U'}^2[ \check{h}_\theta(U,U') ] ] \big) \big) \check{H}^{-1} \\
&\quad + 8 \check{H}^{-1} \big( \operatorname{diag}\!\big( \mathrm{E}_U[ \mathrm{E}_{U'}^2[ h_\theta(U,U') ] ] \big) - \operatorname{diag}\!\big( \mathrm{E}_U[ \mathrm{E}_{U'}^2[ \check{h}_\theta(U,U') ] ] \big) \big) ( H^{-1} - \check{H}^{-1} ) \\
&\quad + 4 ( H^{-1} - \check{H}^{-1} )^2 \operatorname{diag}\!\big( \mathrm{E}_U[ \mathrm{E}_{U'}^2[ \check{h}_\theta(U,U') ] ] \big) + 4 ( H^{-1} - \check{H}^{-1} )^2 \big( \operatorname{diag}\!\big( \mathrm{E}_U[ \mathrm{E}_{U'}^2[ h_\theta(U,U') ] ] \big) - \operatorname{diag}\!\big( \mathrm{E}_U[ \mathrm{E}_{U'}^2[ \check{h}_\theta(U,U') ] ] \big) \big) \\
&= \check{\Sigma}_V + 8 ( w^2 - 1 ) \operatorname{diag}\!\big( \mathrm{E}_U[ \mathrm{E}_{U'}^2[ \check{h}_\theta(U,U') ] ] \big) \check{H}^{-2} + 4 ( 2 - w^2 ) \operatorname{diag}\!\big( \mathrm{E}_U[ \mathrm{E}_{U'}^2[ \check{h}_\theta(U,U') ] ] \big) \check{H}^{-1} ( H^{-1} - \check{H}^{-1} ) + 4 ( H^{-1} - \check{H}^{-1} )^2 \operatorname{diag}\!\big( \mathrm{E}_U[ \mathrm{E}_{U'}^2[ h_\theta(U,U') ] ] \big) \\
&= \check{\Sigma}_V + 8 ( w^2 - 1 ) \operatorname{diag}\!\big( \mathrm{E}_U[ \mathrm{E}_{U'}^2[ \check{h}_\theta(U,U') ] ] \big) \check{H}^{-2} + 4 ( 2 - w^2 ) \operatorname{diag}\!\big( \mathrm{E}_U[ \mathrm{E}_{U'}^2[ \check{h}_\theta(U,U') ] ] \big) \check{H}^{-1} \big( H^{-1} \circ ( I - W ) \big) + 4 w^2 \operatorname{diag}\!\big( \mathrm{E}_U[ \mathrm{E}_{U'}^2[ \check{h}_\theta(U,U') ] ] \big) \big( H^{-1} \circ ( I - W ) \big)^2 .
\end{aligned}$$

By using this statement, we study the Frobenius norm $\| \Sigma_V \|_F$. As preparation, we set $c := \| \operatorname{diag}( \mathrm{E}_U[ \mathrm{E}_{U'}^2[ \check{h}_\theta(U,U') ] ] ) \|_F$ and recall the inequality $\| AB \|_F \le \| A \|_F \| B \|_F$. Also, recall the definition $\delta_\Lambda := \max\{ \overline{\Lambda} - 1, 1 - \underline{\Lambda} \}$. We obtain

$$\begin{aligned}
\| \Sigma_V \|_F &\le \| \check{\Sigma}_V \|_F + 8\, | w^2 - 1 |\, c\, \| \check{H}^{-2} \|_F + 4\, | 2 - w^2 |\, c\, \| \check{H}^{-1} \|_F^2\, \| I - W \|_F + 4 w^2\, c\, \| \check{H}^{-1} \|_F^2\, \| I - W \|_F^2 \\
&\le \| \check{\Sigma}_V \|_F + 8\, \overline{\Lambda} \delta_\Lambda\, c\, \| \check{H}^{-2} \|_F + 8\, c\, \| \check{H}^{-1} \|_F^2\, d\, \delta_\Lambda + 4\, \overline{\Lambda}^2\, c\, \| \check{H}^{-1} \|_F^2\, d^2 \delta_\Lambda^2 \\
&= \| \check{\Sigma}_V \|_F + O\big( \overline{\Lambda} \delta_\Lambda + ( \overline{\Lambda} \delta_\Lambda )^2 \big) .
\end{aligned}$$

Landau’s big-$O$ notation in the last equality collects the terms that depend only on $\overline{\Lambda}$ and $\underline{\Lambda}$.□
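The following snippet (ours, purely illustrative) numerically checks the two elementary facts used in the bound: for a matrix $W$ with entries in $[\underline{\Lambda}, \overline{\Lambda}]$ and the all-ones matrix $I$, one has $\| I - W \|_F \le d\, \delta_\Lambda$, and the Frobenius norm is submultiplicative for both the ordinary and the Hadamard product. The specific matrices are random placeholders.

```python
# Numerical check of the Hadamard decomposition and the Frobenius bounds.
import numpy as np

rng = np.random.default_rng(4)
d, lam_min, lam_max = 5, 0.8, 1.3
delta = max(lam_max - 1.0, 1.0 - lam_min)

W = rng.uniform(lam_min, lam_max, size=(d, d))
ones = np.ones((d, d))
print(np.linalg.norm(ones - W), "<=", d * delta)        # ||I - W||_F <= d * delta

H_check = rng.normal(size=(d, d))
H_check = H_check @ H_check.T + d * np.eye(d)           # stand-in for H_check
H = H_check * W                                         # Hadamard product: H = H_check o W
A, B = rng.normal(size=(d, d)), rng.normal(size=(d, d))
print(np.linalg.norm(A @ B), "<=", np.linalg.norm(A) * np.linalg.norm(B))
print(np.linalg.norm(A * B), "<=", np.linalg.norm(A) * np.linalg.norm(B))
# each element of H differs from the corresponding element of H_check by a
# factor lying in [lam_min, lam_max]
print(np.all((H / H_check >= lam_min) & (H / H_check <= lam_max)))
```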

B Related studies beyond IV regression setting

Beyond the IV regression setting, there are numerous prior studies related to ours, especially in policy evaluation, reinforcement learning, and causal inference. First, the idea of a “kernel loss” was proposed in ref. [79] to estimate the value function in reinforcement learning. Similar ideas have been used to estimate the importance ratio of two state or state-action distributions in refs [80,81] and to estimate the average policy effect (APE) and for policy learning in refs [82,83]. Second, in the area of causal inference, ref. [84] employed a similar technique to estimate an average treatment effect. Despite the methodological similarity to our work, these works considered neither the IV regression setting nor the analytical cross-validation error. We first introduce the objective functions of the aforementioned works for a better understanding of the connection and then highlight the differences, challenges, and novelties of our work.

Although they estimate different targets $f$, refs [80] and [79] employ similar population objective functions of the following form:

(A2) $\min_f \ \mathrm{E}_X \mathrm{E}_{X'} \big[ \big( (Af)(X) - f(X) \big)\, k(X, X')\, \big( (Af)(X') - f(X') \big) \big],$

where $A$ is a task-specific operator acting on the target $f$ and $X'$ is an independent copy of $X$. More specifically, ref. [80] aims to estimate, as $f(X)$, the importance ratio of the state distributions of two policies. In the task of value function estimation in ref. [79], $f(X)$ is the value function. In both works, $X$ represents the state variable in the context of reinforcement learning. Further, ref. [82] studies the estimation of the APE, which, similarly to the average treatment effect, is an average effect of a given treatment policy conditioned on a covariate.

Compared with our population and empirical risks (4) and (5), the kernel loss (A2) has a similar quadratic form. The difference is that in the kernel loss the same variable enters both the kernel function and the residual, whereas the kernel function in our objective depends on the instrument $Z$, which does not appear in the residual. Moreover, our analytical cross-validation error was not studied in these works, and the similar forms of the objectives make it possible to adapt our result to their approaches.
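To make this structural difference concrete, the following sketch (ours, with illustrative data and kernels) writes down the empirical V-statistic forms of the kernel loss (A2) and of our MMR risk side by side: in (A2) the same variable $X$ enters both the residual and the kernel, whereas in the MMR risk the kernel acts on the instrument $Z$ only.

```python
# Empirical V-statistic forms of the kernel loss (A2) and the MMR risk (sketch).
import numpy as np

rng = np.random.default_rng(5)
n = 500
x = rng.normal(size=n)
z = 0.6 * x + rng.normal(size=n)
y = 2.0 * x + 0.3 * rng.normal(size=n)

gauss = lambda a: np.exp(-0.5 * (a[:, None] - a[None, :]) ** 2)

def kernel_loss_A2(resid_x):
    """V-statistic form of (A2): residual and kernel are both indexed by X."""
    return resid_x @ gauss(x) @ resid_x / n**2

def mmr_risk(f_vals):
    """V-statistic form of the MMR risk: residual on (X, Y), kernel on Z."""
    r = y - f_vals
    return r @ gauss(z) @ r / n**2

# in (A2) one would plug in the residual (A f)(X) - f(X) for a task-specific
# operator A; here we only contrast which variable drives the kernel weights
print(kernel_loss_A2(y - 2.0 * x), mmr_risk(2.0 * x))
```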

The difference in losses also simplifies the estimation in the related work and requires us to perform new analyses. Specifically, the consistency of the estimators in the aforementioned related work holds under mild conditions. That is, in the estimation of the value function, driving the population objective to zero guarantees a solution of the Bellman equation, which coincides with the value function by the uniqueness of the solution of the Bellman equation [79]; for the importance ratio and the APE estimation, consistency holds under mild conditions on the data distributions [80,82]. In contrast, the estimator obtained by minimizing the MMR risk (4) to zero is in general not consistent for the true $f(X)$ under such mild conditions, because there can be $\hat{f} \neq f$ satisfying $\mathrm{E}[Y - \hat{f}(X) \mid Z] = 0$ almost surely. Therefore, we first introduce the completeness condition in Assumption 2 for the estimation, which is classical in the IV regression literature and is needed to identify $f(X)$ from the CMR. Second, a new theoretical analysis is necessary to clarify the nature of the estimation of $f(X)$, such as consistency and asymptotic normality. We develop a novel theory for this point in Section 5, which has not been studied by these related works.
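The identification issue can be seen in a toy discrete example (ours): with a binary instrument and a ternary treatment, the conditional moment restriction leaves a one-dimensional null direction, so two different structural functions produce exactly the same conditional moments; this is precisely the situation that the completeness condition rules out. All numbers below are arbitrary.

```python
# Toy example: the conditional moment restriction alone does not identify f.
import numpy as np

# rows: Z in {0, 1}; columns: X in {0, 1, 2}; each row is P(X = x | Z = z)
P_x_given_z = np.array([[0.5, 0.3, 0.2],
                        [0.2, 0.5, 0.3]])

f_true = np.array([1.0, 2.0, 3.0])          # true structural values f(x)

# find g with E[g(X) | Z = z] = 0 for z = 0, 1 (null space of the 2x3 matrix)
_, _, vh = np.linalg.svd(P_x_given_z)
g = vh[-1]                                   # spans the one-dimensional null space
print(P_x_given_z @ g)                       # ~ [0, 0]

f_alt = f_true + 5.0 * g                     # a different f with the same moments
print(P_x_given_z @ f_true, P_x_given_z @ f_alt)   # identical conditional means
```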

References

[1] Angrist JD, Pischke JS. Mostly harmless econometrics: an empiricist's companion. Princeton and Oxford: Princeton University Press; 2008. doi: 10.2307/j.ctvcm4j72.

[2] Klungel O, Jamal Uddin M, de Boer A, Belitser S, Groenwold R, Roes K. Instrumental variable analysis in epidemiologic studies: an overview of the estimation methods. Pharm Anal Acta. 2015;6(353):2. doi: 10.4172/2153-2435.1000353.

[3] Card D. The causal effect of education on earnings. In: Ashenfelter O, Card D, editors. Handbook of labor economics. vol. 3. Elsevier; 1999. p. 1801–63. doi: 10.1016/S1573-4463(99)03011-4.

[4] Burgess S, Small DS, Thompson SG. A review of instrumental variable estimators for Mendelian randomization. Statist Meth Med Res. 2017;26(5):2333–55. doi: 10.1177/0962280215597579.

[5] Burgess S, Foley CN, Allara E, Staley JR, Howson JMM. A robust and efficient method for Mendelian randomization with hundreds of genetic variants. Nature Commun. 2020;11(1):376. doi: 10.1038/s41467-019-14156-4.

[6] Angrist JD, Imbens GW, Rubin DB. Identification of causal effects using instrumental variables. J Amer Stat Assoc. 1996;91(434):444–55. doi: 10.1080/01621459.1996.10476902.

[7] Hansen LP. Large sample properties of generalized method of moments estimators. Econometrica. 1982;50(4):1029–54. doi: 10.2307/1912775.

[8] Newey WK, Powell JL. Instrumental variable estimation of nonparametric models. Econometrica. 2003;71(5):1565–78. doi: 10.1111/1468-0262.00459.

[9] Hall P, Horowitz JL. Nonparametric methods for inference in the presence of instrumental variables. Ann Stat. 2005;33(6):2904–29. doi: 10.1214/009053605000000714.

[10] Blundell R, Chen X, Kristensen D. Semi-nonparametric IV estimation of shape-invariant Engel curves. Econometrica. 2007;75(6):1613–69. doi: 10.1111/j.1468-0262.2007.00808.x.

[11] Horowitz JL. Applied nonparametric instrumental variables estimation. Econometrica. 2011;79(2):347–94. doi: 10.3982/ECTA8662.

[12] Hartford J, Lewis G, Leyton-Brown K, Taddy M. Deep IV: a flexible approach for counterfactual prediction. In: Proceedings of the 34th ICML. vol. 70. PMLR; 2017. p. 1414–23.

[13] Lewis G, Syrgkanis V. Adversarial generalized method of moments. arXiv preprint: 1803.07164; 2018.

[14] Bennett A, Kallus N, Schnabel T. Deep generalized method of moments for instrumental variable analysis. In: NeurIPS 32. Curran Associates, Inc.; 2019. p. 3564–74.

[15] Muandet K, Mehrjou A, Lee SK, Raj A. Dual instrumental variable regression. In: NeurIPS 33. Curran Associates, Inc.; 2020.

[16] Singh R, Sahani M, Gretton A. Kernel instrumental variable regression. In: NeurIPS 32. Curran Associates, Inc.; 2019. p. 4593–605.

[17] Liao L, Chen Y, Yang Z, Dai B, Wang Z, Kolar M. Provably efficient neural estimation of structural equation model: an adversarial approach. In: NeurIPS 33. Curran Associates, Inc.; 2020.

[18] Bennett A, Kallus N. The variational method of moments. arXiv preprint: 2012.09422; 2020.

[19] Daskalakis C, Panageas I. The limit points of (optimistic) gradient descent in min-max optimization. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems. Curran Associates, Inc.; 2018. p. 9256–66.

[20] Bailey JP, Gidel G, Piliouras G. Finite regret and cycles with fixed step-size via alternating gradient descent-ascent. In: Conference on Learning Theory. PMLR; 2020. p. 391–407.

[21] Mastouri A, Zhu Y, Gultchin L, Korba A, Silva R, Kusner MJ, et al. Proximal causal learning with kernels: two-stage estimation and moment restriction. In: Proceedings of the 38th International Conference on Machine Learning. vol. 139 of Proceedings of Machine Learning Research. PMLR; 2021. p. 7512–23.

[22] Horowitz JL, Lee S. Uniform confidence bands for functions estimated nonparametrically with instrumental variables. J Econom. 2012;168(2):175–88. doi: 10.1016/j.jeconom.2011.12.001.

[23] Chen X, Christensen TM. Optimal uniform convergence rates and asymptotic normality for series estimators under weak dependence and weak conditions. J Econom. 2015;188(2):447–65. doi: 10.1016/j.jeconom.2015.03.010.

[24] Chen X, Christensen TM. Optimal sup-norm rates and uniform inference on nonlinear functionals of nonparametric IV regression. Quant Econom. 2018;9(1):39–84.

[25] Babii A. Honest confidence sets in nonparametric IV regression and other ill-posed models. Econometric Theory. 2020;36(4):658–706. doi: 10.1017/S0266466619000380.

[26] Chen X, Christensen T, Kankanala S. Adaptive estimation and uniform confidence bands for nonparametric IV. 2021. arXiv preprint: 2107.11869.

[27] Muandet K, Jitkrittum W, Kübler J. Kernel conditional moment test via maximum moment restriction. In: Proceedings of the 36th Conference on UAI. vol. 124 of Proceedings of Machine Learning Research. PMLR; 2020. p. 41–50.

[28] Serfling R. Approximation theorems of mathematical statistics. New York, NY: John Wiley & Sons; 1980. doi: 10.1002/9780470316481.

[29] Hable R. Asymptotic normality of support vector machine variants and other regularized kernel methods. J Multivariate Anal. 2012;106:92–117. doi: 10.1016/j.jmva.2011.11.004.

[30] Newey W. Efficient estimation of models with conditional moment restrictions. In: Handbook of statistics. vol. 11. Elsevier; 1993. p. 419–54. doi: 10.1016/S0169-7161(05)80051-3.

[31] Donald SG, Newey WK. Choosing the number of instruments. Econometrica. 2001;69(5):1161–91. doi: 10.1111/1468-0262.00238.

[32] Hall AR. Generalized method of moments. Advanced texts in econometrics. Oxford: Oxford University Press; 2005.

[33] Aronszajn N. Theory of reproducing kernels. Trans Am Math Soc. 1950;68(3):337–404. doi: 10.1090/S0002-9947-1950-0051437-7.

[34] Schölkopf B, Smola A. Learning with Kernels: support vector machines, regularization, optimization, and beyond. Cambridge, MA, USA: MIT Press; 2002. doi: 10.7551/mitpress/4175.001.0001.

[35] Berlinet A, Thomas-Agnan C. Reproducing kernel Hilbert spaces in probability and statistics. Amsterdam: Kluwer Academic Publishers; 2004. doi: 10.1007/978-1-4419-9096-9.

[36] Muandet K, Fukumizu K, Sriperumbudur B, Schölkopf B. Kernel mean embedding of distributions: a review and beyond. Foundations Trends Machine Learn. 2017;10(1–2):1–141. doi: 10.1561/2200000060.

[37] Steinwart I, Christmann A. Support vector machines. New York, NY: Springer Science & Business Media; 2008.

[38] Steinwart I. On the influence of the kernel on the consistency of support vector machines. JMLR. 2002;2:67–93.

[39] Fukumizu K, Bach FR, Jordan MI. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. JMLR. 2004;5:73–99. doi: 10.21236/ADA446572.

[40] Sriperumbudur BK, Fukumizu K, Lanckriet GRG. Universality, characteristic kernels and RKHS embedding of measures. JMLR. 2011;12:2389–410.

[41] Simon-Gabriel CJ, Schölkopf B. Kernel distribution embeddings: universal kernels, characteristic kernels and kernel metrics on distributions. JMLR. 2018;19(44):1–29.

[42] D’Haultfoeuille X. On the completeness condition in nonparametric instrumental problems. Econom Theory. 2011;27(3):460–71. doi: 10.1017/S0266466610000368.

[43] Vapnik VN. Statistical learning theory. Hoboken, NJ, USA: Wiley-Interscience; 1998.

[44] Kariya T, Kurata H. Generalized least squares. Hoboken, NJ, USA: John Wiley & Sons; 2004. doi: 10.1002/0470866993.

[45] Van der Vaart A. Asymptotic statistics. Cambridge, UK: Cambridge University Press; 2000.

[46] Carrasco M, Florens JP, Renault E. Linear inverse problems in structural econometrics estimation based on spectral decomposition and regularization. In: Heckman JJ, Leamer EE, editors. Handbook of econometrics. vol. 6B. 1st ed. Elsevier; 2007. doi: 10.1016/S1573-4412(07)06077-1.

[47] Schölkopf B, Herbrich R, Smola AJ. A generalized representer theorem. In: International Conference on Computational Learning Theory. Springer; 2001. p. 416–26.

[48] Williams CKI, Seeger M. Using the Nyström method to speed up kernel machines. In: NeurIPS 13. Cambridge, MA: MIT Press; 2001. p. 682–8.

[49] Flannery BP, Press WH, Teukolsky SA, Vetterling W. Numerical recipes in C. vol. 24. New York: Press Syndicate of the University of Cambridge; 1992. p. 78.

[50] Dikkala N, Lewis G, Mackey L, Syrgkanis V. Minimax estimation of conditional moment models. CoRR. 2020;abs/2006.07201.

[51] Vehtari A, Mononen T, Tolvanen V, Sivula T, Winther O. Bayesian leave-one-out cross-validation approximations for Gaussian latent variable models. JMLR. 2016;17(1):3581–618.

[52] Greene WH. Econometric analysis. Delhi: Pearson Education India; 2003.

[53] Newey WK, McFadden D. Chapter 36 Large sample estimation and hypothesis testing. In: Handbook of econometrics. vol. 4. Elsevier; 1994. p. 2111–245. doi: 10.1016/S1573-4412(05)80005-4.

[54] Wainwright MJ. High-dimensional statistics: a non-asymptotic viewpoint. vol. 48. Cambridge, UK: Cambridge University Press; 2019. doi: 10.1017/9781108627771.

[55] Van der Vaart A, Wellner J. Weak convergence and empirical processes. New York: Springer; 1996. doi: 10.1007/978-1-4757-2545-2.

[56] Zhou DX. The covering number in learning theory. J Complexity. 2002;18(3):739–67. doi: 10.1006/jcom.2002.0635.

[57] Steinwart I, Hush DR, Scovel C. Optimal rates for regularized least squares regression. In: COLT; 2009. p. 79–93.

[58] Caponnetto A, De Vito E. Optimal rates for the regularized least-squares algorithm. Foundation Comput Math. 2007;7(3):331–68. doi: 10.1007/s10208-006-0196-8.

[59] Chen X, Pouzo D. Estimation of nonparametric conditional moment models with possibly nonsmooth generalized residuals. Econometrica. 2012;80(1):277–321. doi: 10.3982/ECTA7888.

[60] Chen X, Christensen TM. Optimal sup-norm rates and uniform inference on nonlinear functionals of nonparametric IV regression. Quant Econ. 2018;9(1):39–84. doi: 10.3982/QE722.

[61] Darolles S, Fan Y, Florens JP, Renault E. Nonparametric instrumental regression. Econometrica. 2011;79(5):1541–65. doi: 10.3982/ECTA6539.

[62] Song L, Huang J, Smola A, Fukumizu K. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In: Proceedings of the 26th ICML; 2009. p. 961–8. doi: 10.1145/1553374.1553497.

[63] Song L, Fukumizu K, Gretton A. Kernel embeddings of conditional distributions: a unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine. 2013;30(4):98–111. doi: 10.1109/MSP.2013.2252713.

[64] Schölkopf B, Herbrich R, Smola AJ. A generalized representer theorem. In: Helmbold D, Williamson B, editors. Computational learning theory. Berlin, Heidelberg: Springer; 2001. p. 416–26. doi: 10.1007/3-540-44581-1_27.

[65] Ivanov VK, Vasin VV, Tanana VP. Theory of linear ill-posed problems and its applications. Inverse and ill-posed problems series. Boston: De Gruyter; 2002. doi: 10.1515/9783110944822.

[66] LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE. 1998;86(11):2278–324. doi: 10.1109/5.726791.

[67] Hartwig FP, Davey Smith G, Bowden J. Robust inference in summary data Mendelian randomization via the zero modal pleiotropy assumption. Int J Epidemiol. 2017;46(6):1985–98. doi: 10.1093/ije/dyx102.

[68] Kuang Z, Sala F, Sohoni N, Wu S, Córdova-Palomera A, Dunnmon J, et al. Ivy: instrumental variable synthesis for causal inference. In: Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics. vol. 108 of Proceedings of Machine Learning Research. PMLR; 2020. p. 398–410.

[69] Hartford JS, Veitch V, Sridhar D, Leyton-Brown K. Valid causal inference with (some) invalid instruments. CoRR. 2020;abs/2006.11386.

[70] Sjolander A, Martinussen T. Instrumental variable estimation with the R package ivtools. Epidemiol Meth. 2019;8(1):20180024. doi: 10.1515/em-2018-0024.

[71] Meehan M, Penckofer S. The role of vitamin D in the aging adult. J Aging Gerontol. 2014;2(2):60. doi: 10.12974/2309-6128.2014.02.02.1.

[72] Zhang R, Muandet K, Schölkopf B, Imaizumi M. Instrument space selection for kernel maximum moment restriction. 2021. arXiv preprint: 2106.03340.

[73] Bennett A, Kallus N. The variational method of moments. 2020. arXiv preprint: 2012.09422.

[74] Hoeffding W. Probability inequalities for sums of bounded random variables. J Am Stat Assoc. 1963;58(301):13–30. doi: 10.1080/01621459.1963.10500830.

[75] Mann HB, Wald A. On stochastic limit and order relationships. Ann Math Stat. 1943;14(3):217–26. doi: 10.1214/aoms/1177731415.

[76] Arcones MA, Gine E. Limit theorems for U-processes. Ann Probab. 1993;21(3):1494–542. doi: 10.1214/aop/1176989128.

[77] Luenberger DG. Optimization by vector space methods. Hoboken, NJ: John Wiley & Sons; 1997.

[78] Akritas MG. Empirical processes associated with V-statistics and a class of estimators under random censoring. Ann Stat. 1986;14(2):619–37. doi: 10.1214/aos/1176349942.

[79] Feng Y, Li L, Liu Q. A kernel loss for solving the Bellman equation. In: NeurIPS 32. Curran Associates, Inc.; 2019.

[80] Liu Q, Li L, Tang Z, Zhou D. Breaking the curse of horizon: infinite-horizon off-policy estimation. In: NeurIPS 31. Curran Associates, Inc.; 2018.

[81] Uehara M, Huang J, Jiang N. Minimax weight and Q-function learning for off-policy evaluation. In: ICML. PMLR; 2020. p. 9659–68.

[82] Kallus N. Balanced policy evaluation and learning. In: NeurIPS 31. Curran Associates, Inc.; 2018.

[83] Kallus N. Generalized optimal matching methods for causal inference. JMLR. 2020;21(62):1–54.

[84] Wong RK, Chan KCG. Kernel-based covariate functional balancing for observational studies. Biometrika. 2018;105(1):199–213. doi: 10.1093/biomet/asx069.

Received: 2022-11-07
Accepted: 2023-02-02
Published Online: 2023-04-04

© 2023 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
