
Asymptotic Properties of ReLU FFN Sieve Estimators

  • Frank J. Fabozzi, Hasan Fallahgoul, Vincentius Franstianto and Grégoire Loeper
Published/Copyright: September 5, 2024

Abstract

Recently, machine learning algorithms have become increasingly popular tools for economic and financial forecasting. While there are several machine learning algorithms for doing so, a powerful and efficient algorithm for forecasting purposes is the multi-layer, multi-node neural network with the rectified linear unit (ReLU) activation function – the deep neural network (DNN). Studies have demonstrated the empirical applications of the DNN but have devoted less attention to its statistical properties, mainly because of its severe nonlinearity and heavy parametrization. By borrowing tools from the non-parametric regression framework of sieve estimation, we first show that a sieve estimator exists for a DNN. We next establish three asymptotic properties of the ReLU network – consistency, a sieve-based convergence rate, and asymptotic normality – and then validate our theoretical results using Monte Carlo analysis.

JEL Classification: C1; C5

1 Introduction

There has been increased application of several machine learning (ML) algorithms to the forecasting of economic and financial variables, e.g. asset pricing (Gu, Kelly, and Xiu 2020) and forecasting bond risk premiums (Bianchi, Büchner, and Tamoni 2021). One such ML technique that has been used in several studies is the feed-forward network (FFN). This ML algorithm involves three types of layer: (i) an input layer consisting of the features or predictors, (ii) one or more hidden layers where the features can interact in a nonlinear fashion, resulting in a transformation of the features, and (iii) an output layer that aggregates the results from the hidden layers to provide a forecast of the target variable. This algorithm, also referred to as a deep network or multi-layer perceptron, was one of the first and most successful ML applications to forecasting in economics and finance.[1]

In applying an FFN, several decisions have to be made: the number of hidden layers, the number of neurons in each hidden layer, and the activation function. The activation function determines how each node's output is expressed. There are two types of activation function that can be used in an FFN: linear or non-linear. A linear (or identity) activation function does not confine the output to a range. A non-linear activation function, the most commonly used type, allows generalization or adaptation of the model to different domain problems. There are four main types of non-linear activation function: the rectified linear unit (ReLU), the leaky ReLU, the sigmoid (logistic) function, and the hyperbolic tangent function.[2] The most popular activation function used in economics and finance applications of FFNs is the ReLU, and we refer to the resulting network as a ReLU FFN.

With few exceptions (e.g. Farrell, Liang, and Misra 2021; Kohler and Langer 2021; Schmidt-Hieber 2020), and despite the fact that the ReLU FFN is the most commonly used ML algorithm for forecasting, its asymptotic properties have been little explored.[3] Although it has been shown that the ReLU FFN achieves considerable accuracy in statistical regression and classification tasks, the lack of these properties has left ReLU networks, like many other neural networks, as “black boxes”. Moreover, the lack of understanding of this model has been an obstacle to the development of statistical inference for ReLU network regressions. It has been shown that the regularization of neural networks requires more care than for their machine learning counterparts, due to severe nonlinearity and heavy parametrization. These challenges have limited the application of neural networks to problems in economics and finance. Thus, investigating the asymptotic properties of the ReLU FFN is critical to the further development and acceptance of this model for forecasting.

Using results from sieve estimation, Shen et al. (2023) establish several statistical inference results for single-layer FFN estimators with the sigmoid activation function: consistency, rates of convergence, and asymptotic normality.[4] Our paper shares the same objectives as Shen et al. (2023), extending their results to ReLU deep neural networks (DNNs) – ReLU FFNs with more than one hidden layer. We derive the asymptotic properties for a neural network with a fixed number of layers, that is, where the depth of the network does not grow with the sample size. We prove the convergence rate for both a growing and a fixed number of hidden nodes per hidden layer. Furthermore, since the neural network function can be considered within a non-parametric regression framework, it can be viewed as a form of sieve estimator. Therefore, it is natural to explore formally the existence of a solution for such a network. According to Chen (2007), several conditions are required for a least-squares sieve estimator to exist in the sieve space. Proving the existence of the exact least-squares solution instead of assuming it makes the asymptotic results of a sieve estimator – consistency, a sieve-based convergence rate, and asymptotic normality – more rigorous.[5]

There are several challenges in adapting the results from Shen et al. (2023). First, there is the issue of satisfying the third existence condition, which we refer to as “EC3”.[6] We need to prove the continuity of the multi-layer ReLU network when it is viewed as a mapping of the Euclidean vector of weight parameters of the ReLU neural network. In the context of a one-layer sigmoid network (as in Shen et al. 2023), this is much easier because the sigmoid activation function is bounded, and thus we can easily bound the semi-norm distance of the mapping by the Euclidean distance of the parameter vector multiplied by a certain constant. However, in the context of a DNN, we need to introduce an upper bound on each of the weight parameters of the multi-layer ReLU network so that we can apply the proof of EC3.

Another challenge is bounding the metric entropy of the ReLU network function space. We need to apply Theorem 14.5 of Anthony and Bartlett (2009) to bound the metric entropy integral that controls the Rademacher complexity of the function space. Note that the relevant function space consists of the differences between an arbitrary member of the ReLU function space and the target function of the regression (the function representing the true mean of the response variable). The boundedness of the sigmoid activation function makes applying Theorem 14.5 of Anthony and Bartlett (2009) much easier (as in Shen 1997). However, for multi-layer ReLU networks, an upper bound on the weight parameters needs to be introduced to ensure that we can bound the metric entropy integral.[7]

We prefer to explore the theoretical properties of ReLU neural networks as sieve estimators rather than the other non-parametric estimators described in Chen and Shen (1998) because the former have an inherent advantage in regression. A ReLU neural network is a powerful tool for regression. It can detect very complex, non-linear interactions between independent variables in each of the hidden nodes, which use linear aggregations of the nodes from the preceding layers (and these aggregations represent the interactions between independent variables) as inputs. This network can also “turn on/off” some of its hidden nodes to fit the real interactions of the independent variables,[8] which often cannot be represented by common single polynomial, non-linear regressors such as those discussed in Chen and Shen (1998) due to their complex variations. Our results can be extended to broader classes of continuous activation functions under the assumption of bounded DNN weight parameters. For example, neural networks with the Exponential Linear Unit (ELU) activation function and bounded weights will also exhibit the asymptotic behavior described in this paper. This bounded-weight assumption is necessary to leverage Theorem 14.5 in Anthony and Bartlett (2009) and to ensure the normality condition for the target function.

The approximation rate using the ReLU network is $O(W_n^{-1/d})$, where $W_n$ is the total number of weight parameters of the ReLU neural network (see Proposition 1 in Yarotsky 2018). This approximation rate is similar to that of non-parametric sieve approximation, i.e. $O(K^{-s/d})$, where $s$ is the smoothness of $E[Y|X]$ and $K$ is the order of the sieve. It is well known that non-parametric sieve estimation suffers from the curse-of-dimensionality problem when $d$ is large.[9] If we allow the depth of the network (the number of hidden layers) to grow with the sample size, then the network has more power to detect much more complex interactions, since mathematically it admits many more variations of the forms of polynomial expansions by turning some of its hidden nodes on or off in the estimation process (see Section 5). The same is true if we replace the ReLU activation function with other smooth, non-polynomial activation functions. Classical non-parametric methods cannot detect this high-dimensional complexity as well as multi-layer neural networks because their mathematical forms are composed of single functions.

Note that we strengthen the assumption on the target function when deriving the asymptotic normality result. Specifically, we assume that the target function is not only continuous but also belongs to the Sobolev space $W^{k,\infty}$, where $k \in \mathbb{N}$. This is necessary because we require the results from Yarotsky (2017) to obtain a stronger convergence rate for the ReLU least-squares sieve estimator, which is then used to derive asymptotic normality. Hence, although a stronger assumption is needed for the target function, a stronger property of the ReLU sieve estimator (asymptotic normality) can be derived. This result regarding the asymptotic distribution can be very helpful for developing statistical tests based on the ReLU sieve estimator; see, for example, Shen et al. (2021).

Our current analysis establishes asymptotic normality only for the simple sample average, a specific functional of the DNN sieve estimator. While its independence from the covariates is a strength, this limits the generality compared to the pointwise asymptotic normality for each $\boldsymbol{x}_i$ suggested by Farrell, Liang, and Misra (2021). Although Farrell, Liang, and Misra (2021) prove consistency in a broad framework that is more general than the consistency result of this paper, they do not prove the existence of the sieve estimator. Knowledge of the asymptotic distribution of the ReLU sieve estimator opens the door to new research on statistical testing for ReLU regressors, which is often very useful in many economic and financial applications involving regressions. For instance, the asymptotic normality of the ReLU sieve estimator can help other researchers who want to construct a statistical test for assessing the overall importance of all independent variables in a regression (equivalent to the F-test in a linear regression); see Fallahgoul, Franstianto, and Lin (2024) and Horel and Giesecke (2020).

The paper is organized as follows. Section 2 provides an overview of the ReLU FFN. Section 3 sets forth the theoretical setting needed to prove the asymptotic properties of ReLU neural networks. Section 4 presents the main theoretical results of the paper (consistency, sieve-based convergence rate, and asymptotic normality). Section 5 explores the validity of the theoretical results in simulations. Section 6 concludes our paper.

Notation:

For the rest of the paper, vectors are denoted by either bold or capital fonts, e.g. $\boldsymbol{x}_i \in [0,1]^d$. $N(\epsilon, \mathcal{D}, \rho)$ is the covering number of a pseudo-metric space $(\mathcal{D}, \rho)$,[10] i.e. the minimum number of $\epsilon$-balls of the pseudo-metric $\rho$ needed to cover $\mathcal{D}$, with possible overlapping. The asymptotic inequality $a_n \lesssim b_n$ means that $\exists R > 0$ and $N_1 \in \mathbb{N}$ such that $\forall n \geq N_1$, $a_n \leq R b_n$; $a_n \sim b_n$ means $\lim_{n \to \infty} a_n / b_n = 1$. $C^0([0,1]^d)$ is the set of all continuous functions on $[0,1]^d$, and $W^{k,\infty}([0,1]^d)$ denotes the Sobolev space of order $k$ on $[0,1]^d$.[11] We denote big-O and small-o by $O$ and $o$, respectively; $O_P$ and $o_P$ denote big-O and small-o in probability. EC and CC are abbreviations for existence condition and consistency condition, respectively. We use $n$ for the sample size. To indicate that a set or function grows with the sample size, we use $n$ as a subscript, e.g. $\mathcal{F}_n$, $f_n$, etc.

2 Feed-Forward Networks

In this section, as the ReLU FFN is perhaps less familiar to economists and other social scientists, we provide a non-technical discussion of the architecture of a ReLU FFN. The algebraic definition of the ReLU FFN is provided in Section 3. Anthony and Bartlett (2009) and Goodfellow, Bengio, and Courville (2016) provide a detailed exposition. Readers familiar with neural networks can skip this section and move to the sections that follow. Note that the notation used in this section is simply meant to explain neural networks to those unfamiliar with this machine learning algorithm, and it is not used in other parts of the paper.

In regression, we observe the y_i's, whose values are driven by the underlying target function f0, which is a function of a d-dimensional vector x_i. Each element of x_i is an observed predictor/independent variable. As the exact functional form of f0 is rarely known, the target function f0 is estimated using a specific function of x_i. In our context, this estimating function is a multi-layer neural network with the ReLU activation function.

Figure 1 provides an example of a ReLU network. This is an example of an FFN where information propagates only forward.[12] This network begins by taking input from d initial nodes, i.e. x i . The number of initial nodes in Figure 1 is two, i.e. x i = (x1,i, x2,i). Each initial node corresponds to each element of the d-dimensional predictor vector x i , and the layer consisting of these nodes is called the input layer. These initial nodes can be seen as impulse receptors in biological neural networks.

Figure 1:

Architecture of a fully connected feed-forward network with two hidden layers. An example of the multi-layer feed-forward network being described. The green, blue, and orange layers indicate the input, hidden, and output layers, respectively. The number of hidden layers is L_n = 2, and there are H_n = 3 nodes per hidden layer. For the input layer, node x_i indicates the ith predictor (node x1 is the first predictor, x2 the second). For the hidden and output layers, the indices indicate the ReLU function h_{u,j} associated with the related node, where u and j are the layer and node indices, respectively, and u = 0 and u = 3 correspond to the input and output layers, respectively. For example, node 3 in the second hidden layer is the function ${h}_{2,3}\left(x\right)=\mathrm{ReLU}\left({\sum }_{k=1}^{{H}_{n}=3}{\gamma }_{2,3,k}\cdot {h}_{1,k}\left(x\right)+{\gamma }_{2,3,0}\right)$, where ReLU(x) = max(x, 0). A directed arrow going from node k in layer u − 1 to node j in layer u represents the parameter γ_{u,j,k}. For example, the arrow from node 2 in the first hidden layer to node 3 in the second represents the parameter γ_{2,3,2}.

The inputs are then transformed into signals that are propagated forward to the next layer called the first hidden layer, equivalent to neurons in the biological counterpart. This layer contains H n hidden nodes, and each of them is connected to all nodes in the input layer. The value of each node, Y, is specified in the following way. First, one needs to calculate

(1) $Y = \sum_i w_i x_i + b$

where $w_i$, $x_i$, and $b$ are the weights, the values of the inputting nodes, and the bias, respectively. Since the value of Y can lie anywhere in (−∞, ∞), one needs a function to decide whether the neuron associated with Y should be “fired” or not. An activation function is used for this purpose. While there are various types of activation function, in this paper we use the most popular one, the rectified linear unit, which is given by ReLU(x) = max(x, 0).
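As a concrete illustration of equation (1) and the ReLU activation, the following minimal NumPy sketch evaluates a single hidden node; the weights and bias are hypothetical values chosen only for this example.

```python
import numpy as np

def relu(z):
    """Rectified linear unit: ReLU(z) = max(z, 0)."""
    return np.maximum(z, 0.0)

def hidden_node(x, w, b):
    """Value of one hidden node: ReLU(sum_i w_i * x_i + b), cf. equation (1)."""
    return relu(np.dot(w, x) + b)

# Example: a node with two inputs (d = 2), as in Figure 1.
x = np.array([0.3, 0.8])      # one predictor vector x_i in [0, 1]^2
w = np.array([1.5, -2.0])     # hypothetical weights
b = 0.1                       # hypothetical bias
print(hidden_node(x, w, b))   # ReLU(1.5*0.3 - 2.0*0.8 + 0.1) = ReLU(-1.05) = 0.0
```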

The ReLU activation function is used for both regression and classification. Its popularity is due to easier training compared with other activation functions such as the sigmoid or tanh. The derivative of the ReLU function, a step function whose values are either 0 or 1, explains why ReLU networks are well suited to stochastic gradient descent (SGD)-based algorithms: this form of the derivative stabilizes the gradients of the network, whose formulas involve the ReLU derivative. Hence, we can simply adjust other parameters of the algorithm if we want to train the network for a longer period, which simplifies the training process for ReLU neural networks.

In plain words, the input signals are taken in the form of linear combinations of all input nodes, whose weights depend on each hidden node. Then, these hidden nodes may be activated based on the activation function used. The activation is the same for all nodes in the hidden layer. We refer to this network as the ReLU network. If the input is positive, then the hidden node is activated and produces a positive output. If it is not, then the node is not activated and produces a zero output. In Figure 1, the first and second hidden layers have three nodes.

The outputs of this hidden layer are then used as new inputs to the next layer, and again each node of the next layer is connected to all of the inputting nodes from the first hidden layer. If the next layer is also a hidden layer, the input signals from the preceding hidden layer are processed by each node of this next hidden layer and turned into outputs in the same manner as the nodes of the preceding hidden layer process their inputs.[13] If the next layer is the output layer, equivalent to the response to the impulse in the biological counterpart, the input signals are still linear aggregations, but no activation function is applied to them; they are simply taken as linear combinations. We denote the number of hidden layers in a ReLU network by L_n, which is L_n = 2 in Figure 1.

Our ReLU networks come from the multi-layer perceptron (MLP) class. Any network belonging to this class has the following characteristics. First, a node belonging to a hidden layer or the output layer is fully connected to all nodes of the preceding layer. Second, there is only a single node in the output layer (meaning that there is only one output, which is the estimated value of the target function). Our analysis requires us to use ReLU networks with this structure. Note that the analysis in Farrell, Liang, and Misra (2021) is also based on MLPs. FFNs that do not have the MLP structure can always be transformed into MLPs; see Lemma 1 of Farrell, Liang, and Misra (2021).

A crucial step in approximating the target function f0 via neural networks is training the model – finding the unknown parameters, i.e. the weights w_i and biases. The unknown parameters are estimated by minimizing a loss function.[14] One advantage of neural networks over their machine learning counterparts is that training a neural network allows for joint updates of all model parameters at each step of the optimization procedure. However, the optimization procedure can be highly computationally intensive due to the high degree of nonlinearity and heavy parameterization. To overcome this problem, SGD has been used to train neural networks.[15]
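To make the training step concrete, here is a minimal NumPy sketch of mini-batch SGD on a one-hidden-layer ReLU network with squared loss. The data-generating function, network width, learning rate, and number of steps are all hypothetical choices for illustration; the point is only the joint gradient update of all weights and biases at each step.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

# Hypothetical data y_i = f0(x_i) + eps_i with an illustrative f0 (not one from the paper).
n, d, H = 1000, 2, 16
X = rng.uniform(0.0, 1.0, size=(n, d))
y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(n)

# One-hidden-layer ReLU network: f(x) = w2 . ReLU(W1' x + b1) + b2.
W1 = 0.5 * rng.standard_normal((d, H))
b1 = np.zeros(H)
w2 = 0.5 * rng.standard_normal(H)
b2 = 0.0

lr, batch = 0.05, 32
for step in range(2000):
    idx = rng.integers(0, n, size=batch)        # draw a mini-batch
    Xb, yb = X[idx], y[idx]
    Z = Xb @ W1 + b1                            # pre-activations
    Hh = relu(Z)                                # hidden-layer outputs
    pred = Hh @ w2 + b2                         # linear output layer
    g_pred = 2.0 * (pred - yb) / batch          # gradient of the mean squared loss w.r.t. predictions
    # Backpropagate and update all weights and biases jointly.
    g_w2 = Hh.T @ g_pred
    g_b2 = g_pred.sum()
    g_Z = np.outer(g_pred, w2) * (Z > 0)        # ReLU derivative is a 0/1 step function
    g_W1 = Xb.T @ g_Z
    g_b1 = g_Z.sum(axis=0)
    W1 -= lr * g_W1; b1 -= lr * g_b1
    w2 -= lr * g_w2; b2 -= lr * g_b2

mse = np.mean((relu(X @ W1 + b1) @ w2 + b2 - y) ** 2)
print(f"training MSE after SGD: {mse:.4f}")
```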

3 The Setting

In this section, we discuss the setting that is needed to establish the main results in Section 4.

3.1 Non-Parametric Estimation

In this non-parametric regression model, we observe $n$ values of the vectors $\boldsymbol{x}_i \in [0,1]^d$ and $n$ responses $y_i \in \mathbb{R}$ from the model

$$y_i = f_0(\boldsymbol{x}_i) + \epsilon_i, \quad i = 1, 2, \ldots, n$$

where the $\epsilon_i$ are independent and identically distributed random variables defined on a complete probability space $(\Omega, \mathcal{A}, P)$, with $E[\epsilon_i] = 0$, $\mathrm{Var}(\epsilon_i) = \sigma^2 < \infty$ (homoscedasticity), and $f_0 \in \mathcal{F} := \{f \in C^0 \mid f: [0,1]^d \to \mathbb{R}\}$. Note also that in our case the randomness comes from the error term $\epsilon$, while the covariate vectors $\boldsymbol{x}_i$ are not random, following the assumptions in Shen et al. (2023).

The main statistical objective is to estimate the unknown function $f_0$ from the sample $\{(\boldsymbol{x}_i, y_i)\}_{i=1}^n$. To do so, various non-parametric approaches have been applied in the literature: traditional kernel smoothing, lasso methods, series estimators/wavelets and splines, and sigmoid-based shallow neural networks.[16] In this paper, similar to Farrell, Liang, and Misra (2021) and Schmidt-Hieber (2020), we consider estimating $f_0$ by fitting an MLP to the data. It has been shown that this estimator achieves nearly optimal convergence rates under various constraints on the regression function; see, e.g. Farrell, Liang, and Misra (2021).

Remark 1.

Note that in this paper we assume $f_0$ to be a continuous function instead of a member of a Sobolev space as in Farrell, Liang, and Misra (2021) (although we later strengthen this assumption when proving asymptotic normality). While this might seem more restrictive (or stronger), it should be noted that this condition enables us to prove the existence of the ReLU network sieve estimator. In many applications involving regression in economics and finance, the mean function of the response variable can safely be assumed to be a continuous function of the independent variables. It should be noted that Theorem 2 of Farrell, Liang, and Misra (2021) does not need any assumption regarding the error term $\epsilon$ of the regression model.

To prove consistency, to derive the convergence rate, and to obtain asymptotic normality for the ReLU FFN, we use the method of sieves.[17] In non-parametric econometrics terms, neural networks can be thought of as a (complex) type of sieve estimation in which the basis functions are flexibly learned from the data. Specifically, the estimation problem amounts to optimizing an empirical criterion function over an infinite-dimensional function space (which in our case is $\mathcal{F}$; see the following subsection). However, solving such an optimization problem is infeasible because the space is infinite-dimensional, and it can be computationally difficult to estimate the objective function using finite samples. Furthermore, even if one could solve the problem of optimizing a sample criterion over an infinite-dimensional function space, the resulting estimator may have undesirable large-sample properties.[18] The method of sieves provides one way to tackle this infeasible problem by optimizing the empirical criterion function over a sequence of approximating spaces, called sieves, which are significantly less complex than the original function space.

There are a couple of important steps in applying the method of sieves. First, we need to construct the sieve space and its estimator – the sieves – in terms of ReLU FFNs, and explore whether a sieve estimator based on the ReLU network exists. Second, as explained in Chen (2007), to ensure consistency of the method we require that the complexity of the sieves increases with the sample size, so that in the limit the sieves are dense in the original function/parameter space (denoted by $\mathcal{F}$).

In the next section, we discuss the details of the sieve space and the sieves for the ReLU FFN.

3.2 Sieve Space and Sieves in Terms of ReLU FFNs

In this section, we define the sieve space and estimator in terms of the ReLU FFNs.

Define the sample squared loss on $f \in \mathcal{F}$ and the population criterion function, respectively, as

$$\hat{Q}_n(f) := \frac{1}{n}\sum_{i=1}^{n} (y_i - f(\boldsymbol{x}_i))^2 = \frac{1}{n}\sum_{i=1}^{n} (f_0(\boldsymbol{x}_i) - f(\boldsymbol{x}_i))^2 + \frac{2}{n}\sum_{i=1}^{n} \epsilon_i (f_0(\boldsymbol{x}_i) - f(\boldsymbol{x}_i)) + \frac{1}{n}\sum_{i=1}^{n} \epsilon_i^2$$
$$Q_n(f) := E\left[\hat{Q}_n(f)\right] = E\left[\frac{1}{n}\sum_{i=1}^{n} (y_i - f(\boldsymbol{x}_i))^2\right] = \frac{1}{n}\sum_{i=1}^{n} (f(\boldsymbol{x}_i) - f_0(\boldsymbol{x}_i))^2 + \sigma^2.$$

In regression, we are interested in finding $\hat{f}$ such that

$$\hat{f} := \arg\min_{f \in \mathcal{F}} \hat{Q}_n(f).$$

However, if $\mathcal{F}$ is too rich, $\hat{f}$ may be inconsistent.[19] Hence, we are interested in finding a sequence of nested function spaces $\mathcal{F}_n$ that satisfies

$$\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots \subseteq \mathcal{F}_n \subseteq \mathcal{F}_{n+1} \subseteq \cdots \subseteq \mathcal{F}$$

where $\forall f \in \mathcal{F}$, $\exists f_n \in \mathcal{F}_n$ s.t. $\lim_{n\to\infty} \rho(f, f_n) = 0$; more precisely, $\mathcal{F}_n$ is dense in $\mathcal{F}$. $\mathcal{F}_n$ itself is called a sieve space of $\mathcal{F}$ with respect to the pseudo-metric $\rho$, and the sequence $f_n$ is called a sieve. Take $\rho \equiv \rho_n$, where $\rho_n: \mathcal{F} \to [0, \infty)$,

$$\rho_n(f) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} f(\boldsymbol{x}_i)^2}$$

where $\rho_n$ is a pseudo-norm.[20]
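For concreteness, a minimal sketch of the sample criterion $\hat{Q}_n$ and the pseudo-norm $\rho_n$, assuming the candidate function has already been evaluated at the design points $x_1, \ldots, x_n$:

```python
import numpy as np

def rho_n(f_vals):
    """Pseudo-norm rho_n(f) = sqrt((1/n) * sum_i f(x_i)^2), given the values f(x_1), ..., f(x_n)."""
    f_vals = np.asarray(f_vals, dtype=float)
    return np.sqrt(np.mean(f_vals ** 2))

def Q_hat_n(f_vals, y):
    """Sample squared loss Q_hat_n(f) = (1/n) * sum_i (y_i - f(x_i))^2."""
    f_vals, y = np.asarray(f_vals, dtype=float), np.asarray(y, dtype=float)
    return np.mean((y - f_vals) ** 2)
```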

Assumption 1

(Approximate Sieve Estimator). Instead of $\hat{f}$, we consider an approximate sieve estimator $\hat{f}_n$ that satisfies the following inequality:

$$\hat{Q}_n(\hat{f}_n) \leq \inf_{f \in \mathcal{F}_n} \hat{Q}_n(f) + O_P(\eta_n)$$

where $\eta_n$ represents the order of the difference between the exact sieve estimator and the approximate sieve estimator, and $\lim_{n\to\infty} \eta_n = 0$.

Note that the sequence $\eta_n$ does not need to converge at a specific rate in $n$; its role is to define the approximate sieve estimator. The approximate sieve estimator is considered instead of the exact sieve estimator (the exact minimizer over $\mathcal{F}_n$) because in practice we can only obtain an approximate version of the estimator rather than the exact one.

Remark 2.

If Assumption 1 is violated, then there is no guarantee that the asymptotic loss of our sieve estimator $\hat{f}_n$ is no greater than that of $\pi_n f_0$ (the sieve sequence inside $\mathcal{F}_n$ that converges to $f_0$). However, the analysis of the asymptotic properties is conducted on $\hat{f}_n$ because, in practice, $\pi_n f_0$ cannot be obtained, while $\hat{f}_n$ can be obtained by finding the $f$ that minimizes the squared loss function $\hat{Q}_n(f)$.

The next question is how to construct $\mathcal{F}_n$ so that it is dense in $\mathcal{F}$. First of all, consider the following fixed-width ReLU FFN function space indexed by $W_n$:

$$\mathcal{F}_{W_n} := \left\{ h_{L_n+1,1}(\boldsymbol{x}) : \boldsymbol{x} \in [0,1]^d \right\}$$

where $h_{u,j}(\boldsymbol{x})$ is the output of the $j$th node of layer $u$ in the ReLU network with input $\boldsymbol{x}$; $u = 0$ and $u = L_n + 1$ correspond to the input and output layers, respectively, and $1 \leq u \leq L_n$ corresponds to the $u$th hidden layer. Note that $W_n$ is the number of parameters in each of the ReLU networks in the sieve space $\mathcal{F}_{W_n}$. It is indexed by $n$ (the number of samples) because the value of $W_n$ grows with $n$.

We also have $j \in \{1, 2, \ldots, H_{n,u}\}$, where $H_{n,u}$ is the number of nodes in the $u$th layer, $H_{n,0} = d$, $H_{n,L_n+1} = 1$, and $H_{n,u} = H_n$ for the other layers. For $1 \leq u \leq L_n$, the formula for $h_{u,j}(\boldsymbol{x})$ is

$$h_{u,j}(\boldsymbol{x}) = \mathrm{ReLU}\left(\sum_{k=1}^{H_{n,u-1}} \gamma_{u,j,k}\, h_{u-1,k}(\boldsymbol{x}) + \gamma_{u,j,0}\right)$$

where $h_{0,k}(\boldsymbol{x}) = x_k$, the $k$th element of $\boldsymbol{x}$. It should be noted that $\gamma_{u,j,k}$ and $\gamma_{u,j,0}$ are equivalent to $w_i$ and $b$ in (1).

We use the upper bound $\max_{1 \leq j \leq H_{n,u}} \sum_{k=0}^{H_{n,u-1}} |\gamma_{u,j,k}| \leq M_{n,u}$, $\forall\, 1 \leq u \leq L_n + 1$, where $M_{n,u} > 1$, $M_{n,u}$ can depend on $n$, and $M_{n,0} = 1$ as $\mathcal{X} = [0,1]^d$. This upper bound is used in the bound on the entropy number. $W_n$ itself is the number of parameters $\gamma_{u,j,k}$ in a single ReLU network, with $W_n = \sum_{u=0}^{L_n} (H_{n,u} + 1) H_{n,u+1}$.

Note that the neural network function $h_{L_n+1,1}$ represents the output of an MLP whose specifications (number of hidden layers, weight upper bound, and so on) are described above. An example of an FFN in MLP form with L = 2 hidden layers and H = 3 hidden nodes per hidden layer is shown in Figure 1. We emphasize that any FFN can be rewritten as an MLP, as described in Section 2.
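A minimal sketch of this forward recursion and of the parameter count $W_n$; the layer widths and the example network are illustrative, not values used in the paper.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def mlp_output(x, weights, biases):
    """Forward recursion: h_{u,j}(x) = ReLU(sum_k gamma_{u,j,k} h_{u-1,k}(x) + gamma_{u,j,0})
    for the hidden layers, and a purely linear output node h_{L_n+1,1}.
    weights[u] has shape (H_{n,u+1}, H_{n,u}); biases[u] has shape (H_{n,u+1},)."""
    h = np.asarray(x, dtype=float)                 # h_{0,k}(x) = x_k
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)                        # hidden layers
    return (weights[-1] @ h + biases[-1]).item()   # output layer: linear, no activation

def count_parameters(d, H_n, L_n):
    """W_n = sum_{u=0}^{L_n} (H_{n,u} + 1) * H_{n,u+1}, with H_{n,0} = d and H_{n,L_n+1} = 1."""
    widths = [d] + [H_n] * L_n + [1]
    return sum((widths[u] + 1) * widths[u + 1] for u in range(L_n + 1))

# The network of Figure 1: d = 2 inputs, L_n = 2 hidden layers, H_n = 3 nodes per hidden layer.
print(count_parameters(d=2, H_n=3, L_n=2))  # (2+1)*3 + (3+1)*3 + (3+1)*1 = 25
```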

According to Proposition 1 in Yarotsky (2018), $\forall f_0 \in \mathcal{F}$, $\exists \pi_{W_n} f_0 \in \mathcal{F}_{W_n}$ s.t.

$$\left\|\pi_{W_n} f_0 - f_0\right\|_\infty := \sup_{\boldsymbol{x} \in [0,1]^d} \left|\pi_{W_n} f_0(\boldsymbol{x}) - f_0(\boldsymbol{x})\right| \leq O\!\left(\omega_{f_0}\!\left(O\!\left(W_n^{-1/d}\right)\right)\right)$$

with $\omega_{f_0}: [0, \infty] \to [0, \infty)$, $\omega_{f_0}(r) = \max\{|f_0(\boldsymbol{x}) - f_0(\boldsymbol{y})| : \boldsymbol{x}, \boldsymbol{y} \in [0,1]^d, \|\boldsymbol{x} - \boldsymbol{y}\| < r\}$. It is clear that $\|\cdot\|_\infty$ is a pseudo-metric, $\mathcal{F}_{W_n}$ is a sieve space of $\mathcal{F}$, and $\pi_{W_n} f_0$ is the related sieve. It is imperative that $W_n^{-1/d} \downarrow 0$ as $n \uparrow \infty$ to ensure that $\mathcal{F}_{W_n}$ becomes dense. Also, for $\Gamma_n := \{\gamma_{u,j,k} : u, j, k\}$,[21] we need to ensure that the range of the parameters inside $\cup_n \Gamma_n$ can span $\mathbb{R}^{W_n}$. These requirements for having a dense $\mathcal{F}_{W_n}$ can be summarized as

$$W_n \uparrow \infty \quad \text{and} \quad \Gamma_n \uparrow \mathbb{R}^{W_n}, \quad \text{as } n \uparrow \infty$$

and the chosen set $\Gamma_n$ is a compact set,

$$\Gamma_n = \prod_{1 \leq u \leq L_n+1,\ 1 \leq j \leq H_{n,u},\ 0 \leq k \leq H_{n,u-1}} \left[-M^{(\gamma)}_{n,u,j,k},\ M^{(\gamma)}_{n,u,j,k}\right] \subset \mathbb{R}^{W_n}$$

such that $\forall u, j, k$, $|\gamma_{u,j,k}| \leq M^{(\gamma)}_{n,u,j,k}$, and also $\sum_{j=1}^{H_{n,u}} \sum_{k=0}^{H_{n,u-1}} M^{(\gamma)}_{n,u,j,k} = M_{n,u}$. Hence, the two requirements for the denseness of $\mathcal{F}_{W_n}$ are given under the following assumption.

Assumption 2

(Assumption for a Dense $\mathcal{F}_{W_n}$). $H_n \uparrow \infty$ and $M^{(\gamma)}_{n,u,j,k} \uparrow \infty$ as $n \uparrow \infty$, $\forall\, 1 \leq u \leq L_n + 1$, $\forall\, 1 \leq j \leq H_{n,u}$, $\forall\, 0 \leq k \leq H_{n,u-1}$.

Note that we need to impose this upper bound because we use Theorem 14.5 of Anthony and Bartlett (2009) when proving the boundedness of the metric entropy integral later in the proof. Farrell, Liang, and Misra (2021) do not explicitly assume bounded weights for DNNs, but instead implicitly assume the boundedness of the weight parameters (see equation (2.4) in Farrell, Liang, and Misra 2021). For the following sections, the results and proofs are obtained by applying the same strategies employed by Shen et al. (2023). It should be noted that this adaptation is not straightforward, as it requires the accommodation of the results of Yarotsky (2017) and Yarotsky (2018).

Remark 3.

Although the number of weights, $W_n$, might seem irrelevant to applications, $W_n$ is directly related to $H_n$. This is obvious, as $W_n$ is the number of weights (or parameters) in the linear aggregations taken by the hidden/output nodes of a ReLU network, while $H_n$ is the number of hidden nodes per hidden layer. In practice, we often increase $H_n$ (or $L_n$, the number of hidden layers) when dealing with high-dimensional data with a large number of samples, due to the extremely complex interactions among variables. Practitioners therefore often increase $W_n$ indirectly by increasing $H_n$. However, in our case, $W_n$ is used mainly for deriving theoretical properties.

4 Main Results

We discuss our three main theoretical results regarding sieve-estimator consistency, convergence rate, and asymptotic normality in this section. All proofs are provided in an Online Appendix.

4.1 Existence

The existence of a sieve estimator requires several conditions to hold. The following remark states the conditions needed for Theorem 1.

Remark 4

(Existence Conditions). (Remark 2.1. in Chen 2007). There exists an approximate sieve estimator $\hat{f}_n$ inside $\mathcal{F}_{W_n}$ if the following statements hold:

  1. $\hat{Q}_n(f)$ is a measurable function of the data $(\boldsymbol{x}_i, y_i)$, $i \in \{1, 2, \ldots, n\}$.

  2. $\hat{Q}_n(f)$ is lower semicontinuous on $\mathcal{F}_{W_n}$ under the pseudo-metric $\rho_n$, for each $\omega \in \Omega$ fixing the sequence $\{(\boldsymbol{x}_i, y_i(\omega))\}_{i=1}^n$.

  3. $\mathcal{F}_{W_n}$ is a sieve of $\mathcal{F}$ and compact under $\rho_n$.

It should be noted that EC1 is satisfied by $\hat{Q}_n(f)$, as $y_i = f_0(\boldsymbol{x}_i) + \epsilon_i$. Note also that fixing $(\boldsymbol{x}_i, y_i(\omega))$ is equivalent to fixing $\epsilon_i(\omega)$, $\forall \omega \in \Omega$. To prove EC2 and EC3, we make use of the following lemma.

Lemma 1.

For each $n$, $1 \leq u \leq L_n + 1$, and $1 \leq j \leq H_{n,u}$,

$$\sup_{1 \leq j \leq H_{n,u}} \|h_{u,j}\|_\infty \leq M^*_{n,u} := \prod_{i=0}^{u} M_{n,i} \vee 1$$

and this implies

$$\sup_{f \in \mathcal{F}_{W_n}} \|f\|_\infty \leq M^*_{n,L_n+1} = \prod_{u=0}^{L_n+1} M_{n,u} \vee 1.$$

The Online Supplementary Material provides the proof of this lemma, EC2, and EC3. Note also that $\hat{Q}_n(f)$ can be proven to be continuous on $(\mathcal{F}_{W_n}, \rho_n)$, which is a stronger property than EC2. Thus, the existence of $\hat{f}_n$ is justified. We are now ready to state the existence theorem for a sieve estimator.

Theorem 1

(Existence). There exists an approximate sieve estimator $\hat{f}_n$ in $\mathcal{F}_{W_n}$.

We emphasize that existence is a crucial property that needs to be proven before proving the consistency of the least-squares ReLU sieve estimator. The existence result complements the recent work of Farrell, Liang, and Misra (2021). If the sieve space $\mathcal{F}_{W_n}$ is not compact in the original function space $\mathcal{F}$ under the norm used, for instance, then it is possible that the sieve estimator does not exist at all. Even if derived successfully, the consistency property is meaningless if the sieve estimator itself does not exist. Additionally, even if the exact sieve estimator does exist, locating it could be challenging. Therefore, it is crucial for us to focus on developing an approximate sieve estimator. The next subsection will focus on the general consistency theorem, which can be validly proven once existence has been established.

4.2 Consistency

Define the product space $(\Omega^*, \mathcal{A}^*, P^*) = \prod_{i=1}^n (\Omega, \mathcal{A}, P) \times (\mathcal{Z}, \mathcal{C}, P_Z)$, where the last probability space contains additional random variables independent of $\prod_{i=1}^n (\Omega, \mathcal{A}, P)$. The consistency of $\hat{f}_n$ holds under this probability measure, with a condition on the growth of the number of parameters. The conditions for consistency are given in the following remark.

Remark 5

(Consistency Conditions). (Remark 3.1.(3) in Chen 2007). The approximate sieve estimator $\hat{f}_n$ in the sieve space $\mathcal{F}_{W_n}$ of $\mathcal{F}$ satisfies

$$\operatorname{plim}_{n\to\infty} \rho_n(\hat{f}_n - f_0) = 0$$

if the following conditions are satisfied:

  1. $Q_n(f)$ is continuous at $f_0$ in $\mathcal{F}$, with $Q_n(f_0) < \infty$.

  2. For all $\zeta > 0$, $Q_n(f_0) < \inf_{\{f \in \mathcal{F}:\ \rho_n(f - f_0) \geq \zeta\}} Q_n(f)$.

  3. $\hat{Q}_n(f)$ is a measurable function of the data $(\boldsymbol{x}_i, y_i)$, $i \in \{1, 2, \ldots, n\}$.

  4. $\hat{Q}_n(f)$ is lower semicontinuous on $\mathcal{F}_{W_n}$ under $\rho_n$, for each $\omega \in \Omega$ fixing the sequence $\{(\boldsymbol{x}_i, y_i(\omega))\}_{i=1}^n$.

  5. $(\mathcal{F}_{W_n}, \rho_n)$ is a compact sieve space.

  6. (Uniform convergence) $\operatorname{plim}_{n\to\infty} \sup_{f \in \mathcal{F}_{W_n}} \left|\hat{Q}_n(f) - Q_n(f)\right| = 0$, for each $W_n$.

Note that $Q_n(f) = \frac{1}{n}\sum_{i=1}^n (f(\boldsymbol{x}_i) - f_0(\boldsymbol{x}_i))^2 + \sigma^2$ is continuous on $\mathcal{F}$. The proof of its continuity is very similar to the proof of the lower semi-continuity of $\hat{Q}_n(f)$, where the related constant is now $\hat{D}_{n,L_n+1} = n^{-1}\left(2 M^*_{n,L_n+1} + 2\max_{\boldsymbol{x} \in [0,1]^d} |f_0(\boldsymbol{x})|\right)$, and hence CC1 is satisfied. It is obvious that CC2 is satisfied, as $f_0$ minimizes $Q_n$ in $\mathcal{F}$. As CC3, CC4, and CC5 are the existence conditions, we already have them. The last condition that needs to be dealt with is CC6.

Lemma 2

(CC6 Satisfaction). If $\left(M^*_{n,L_n+1}\right)^2 C^*_{n,d,L_n+1,W_n} = o(n)$, then CC6 is satisfied under $(\Omega^*, \mathcal{A}^*, P^*)$.

Having satisfied the last consistency condition, we now proceed to state the assumption needed to prove consistency.

Assumption 3.

All multi-layer neural networks belonging to F W n are bounded above and below.

We are now ready to state the consistency theorem.

Theorem 2

(Consistency). Define

$$M^{(all)}_{n,L_n+1} := \max_{1 \leq i \leq L_n+1} M_{n,i} > 1, \qquad C^*_{n,d,L_n+1,W_n} := W_n \ln\!\left(d\, M^*_{n,L_n+1}\, W_n \left(M^{(all)}_{n,L_n+1}\right)^{L_n}\right)$$

If $\left(M^*_{n,L_n+1}\right)^2 C^*_{n,d,L_n+1,W_n} = o(n)$, then

$$\operatorname{plim}_{n\to\infty} \rho_n(\hat{f}_n - f_0) = 0, \quad \text{under } (\Omega^*, \mathcal{A}^*, P^*).$$

Remark 6.

The purpose of Assumption 3 is to allow us to employ Theorem 14.5 of Anthony and Bartlett (2009) in the proof of consistency, which requires that the activation function be bounded. As the existence of these bounds does not affect the proofs of our results, the following discussions involving multi-layer ReLU networks simply assume that the actual activation function used is a bounded ReLU activation function, bounded below and above by $LB_{f_0}$ and $UB_{f_0}$, respectively. In fact, the introduction of the weight parameter bounds for the ReLU networks already ensures that the whole ReLU network function is bounded below and above. As most $f_0$ encountered in practice are rarely that large, we can treat bounded ReLU networks as if they were unbounded ReLU networks in most applications. Note that this truncation of the ReLU activation function serves solely for proving the theorem; it has no inherent advantage or disadvantage in practice, as $UB_{f_0}$ and $LB_{f_0}$ can be taken arbitrarily large.
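In implementation terms, this truncation simply amounts to clipping the activation output to a very wide interval; a one-line sketch, with hypothetical (arbitrarily large) bounds:

```python
import numpy as np

def bounded_relu(z, lower=-1e12, upper=1e12):
    """ReLU truncated to [lower, upper]; the bounds are arbitrary large constants used only as a proof device."""
    return np.clip(np.maximum(z, 0.0), lower, upper)
```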

In the next subsection, we show that the convergence rate of $\hat{f}_n$ can be bounded in terms of the order of $\eta_n$.

4.3 Rate of Convergence

The following remark underlies the Rate of Convergence Theorem proof.

Remark 7

(Convergence Rate of $\rho_n(\hat{f}_n - \pi_{W_n} f_0)$). (Theorem 3.4.1 in Vaart and Wellner 1996) For each $n$, let $\delta_n$ satisfying $0 \leq \delta_n \leq \alpha$ be arbitrary ($\delta_n$ is typically a multiple of $\rho_n(\pi_{W_n} f_0 - f_0)$). Suppose that, for every $n$ and $\delta_n < \delta \leq \alpha$,

$$\sup_{\substack{f \in \mathcal{F}_{W_n} \\ \delta/2 \leq \rho_n(f - \pi_{W_n} f_0) \leq \delta}} \left( Q_n(\pi_{W_n} f_0) - Q_n(f) \right) \lesssim -\delta^2, \qquad E_{P^*}\!\left[ \sup_{\substack{f \in \mathcal{F}_{W_n} \\ \delta/2 \leq \rho_n(f - \pi_{W_n} f_0) \leq \delta}} \sqrt{n}\,\left| \left(\hat{Q}_n - Q_n\right)(\pi_{W_n} f_0) - \left(\hat{Q}_n - Q_n\right)(f) \right| \right] \lesssim \phi_n(\delta)$$

for functions $\phi_n$ such that $\delta \mapsto \phi_n(\delta)/\delta^\beta$ is decreasing on $(\delta_n, \alpha)$ for some $\beta < 2$. Let $r_n \leq \delta_n^{-1}$ satisfy

$$r_n^2\, \phi_n\!\left(\frac{1}{r_n}\right) \leq \sqrt{n}, \quad \text{for every } n.$$

If the approximate sieve estimator $\hat{f}_n$ satisfies $\hat{Q}_n(\hat{f}_n) \leq \hat{Q}_n(\pi_{W_n} f_0) + O_P(r_n^{-2})$ and $\rho_n(\hat{f}_n - \pi_{W_n} f_0)$ converges to zero in outer probability defined on $(\Omega^*, \mathcal{A}^*, P^*)$, then

$$\rho_n(\hat{f}_n - \pi_{W_n} f_0) = O_{P^*}\!\left(r_n^{-1}\right).$$

If the displayed conditions are valid for $\alpha = \infty$, then the condition that $\hat{f}_n$ is consistent is unnecessary.

The two supremum-upper-bound conditions in Remark 7 have been proven in Shen et al. (2023). We state them in the following remark.

Remark 8.

(Lemma 1 and Lemma 2 in the Supplementary Materials of Shen et al. 2023)

  1. For every $n$ and $\delta > 8\rho_n(\pi_{W_n} f_0 - f_0)$, we have

    $$\sup_{\substack{f \in \mathcal{F}_{W_n} \\ \delta/2 \leq \rho_n(f - \pi_{W_n} f_0) \leq \delta}} \left( Q_n(\pi_{W_n} f_0) - Q_n(f) \right) \lesssim -\delta^2$$
  2. For every sufficiently large $n$ and $\delta > 8\rho_n(\pi_{W_n} f_0 - f_0)$, we have

    $$E_{P^*}\!\left[ \sup_{\substack{f \in \mathcal{F}_{W_n} \\ \delta/2 \leq \rho_n(f - \pi_{W_n} f_0) \leq \delta}} \sqrt{n}\,\left| \left(\hat{Q}_n - Q_n\right)(\pi_{W_n} f_0) - \left(\hat{Q}_n - Q_n\right)(f) \right| \right] \lesssim \int_0^\delta \sqrt{\ln N\!\left(\eta, \mathcal{F}_{W_n}, \rho_n\right)}\, d\eta.$$

We emphasize again that Theorem 3 also applies when we work with $\mathcal{F}_{W^*_n}$; in that case the parameters involved, such as $\pi_{W_n} f_0$, $C^*_{n,d,L_n+1,W_n}$, and $M^*_{n,L_n+1}$, are replaced with their counterparts for $\mathcal{F}_{W^*_n}$, such as $\pi_{W^*_n} f_0$, $C^*_{n,d,L^*_n+1,W^*_n}$, and $M^*_{n,L^*_n+1}$, and so on. We can now state the rate of convergence theorem for $\hat{f}_n$.

Theorem 3.

(Rate of Convergence) Suppose that

$$\eta_n = O\!\left(\max\left\{\rho_n(\pi_{W_n} f_0 - f_0)^2,\ \left(\frac{C^*_{n,d,L_n+1,W_n}}{n}\right)^{2/3}\right\}\right)$$

where $C^*_{n,d,L_n+1,W_n}$ is defined in the Consistency Theorem and $\left(M^*_{n,L_n+1}\right)^2 C^*_{n,d,L_n+1,W_n} = o(n)$. Then

$$\rho_n\left(\hat{f}_n - f_0\right) = O_{P^*}\!\left(\max\left\{\rho_n(\pi_{W_n} f_0 - f_0),\ \left(\frac{C^*_{n,d,L_n+1,W_n}}{n}\right)^{1/3}\right\}\right).$$

4.4 Asymptotic Normality

In this subsection, our main objective is to show the asymptotic normality of the $\sqrt{n}$-scaled sample average of the difference between $\hat{f}_n$ and $f_0$ at the design points. The procedure that we follow is the same as the proof of the asymptotic normality of the one-layer sigmoid network in Shen et al. (2023), which is inspired by the general theory on asymptotic normality in Shen (1997).

Assumption 4.

We require $f_0 \in \left\{ f \in C^0([0,1]^d) \cap W^{k,\infty}([0,1]^d) : \|f\|_W \leq M_W \right\}$, for some $M_W > 0$ and $k \in \mathbb{N}$.

$W^{k,\infty}([0,1]^d)$ is the Sobolev space defined on $[0,1]^d$, composed of functions whose derivatives up to order $k$ are defined in the weak sense, in terms of partial integration for $d = 1$ or of distributions for $d > 1$. This space is a Banach space with respect to the norm $\|f\|_W := \max_{\boldsymbol{k}: 0 \leq |\boldsymbol{k}| \leq k} \left\| D^{\boldsymbol{k}} f(\boldsymbol{x}) \right\|_{L^\infty([0,1]^d)}$, where $\boldsymbol{k} \in (\mathbb{N} \cup \{0\})^d$, $D^{\boldsymbol{k}} f(\boldsymbol{x}) := \frac{\partial^{|\boldsymbol{k}|} f}{\partial x_1^{k_1} \partial x_2^{k_2} \cdots \partial x_d^{k_d}}$ is the related weak derivative, and $x_1, \ldots, x_d$ and $k_1, \ldots, k_d$ are the elements of the vectors $\boldsymbol{x}$ and $\boldsymbol{k}$, respectively. We use the Gâteaux derivative of $\hat{Q}_n(f)$ at $f_0$ in the direction of $f - f_0$.[22] It is obvious that

$$d\hat{Q}_n(f_0; f - f_0) = \lim_{\tau \to 0} \frac{\hat{Q}_n\left(f_0 + \tau(f - f_0)\right) - \hat{Q}_n\left(f_0\right)}{\tau} = \lim_{\tau \to 0} \frac{\sum_{i=1}^n \left(y_i - f_0(\boldsymbol{x}_i) - \tau\left(f(\boldsymbol{x}_i) - f_0(\boldsymbol{x}_i)\right)\right)^2 - \sum_{i=1}^n \left(y_i - f_0(\boldsymbol{x}_i)\right)^2}{n\tau} = -\frac{2}{n}\sum_{i=1}^n \epsilon_i \left(f(\boldsymbol{x}_i) - f_0(\boldsymbol{x}_i)\right)$$

with the related first-order Taylor remainder term

$$R_1\left(f_0; f - f_0\right) = \hat{Q}_n(f) - \hat{Q}_n(f_0) - d\hat{Q}_n(f_0; f - f_0) = \frac{1}{n}\left[\sum_{i=1}^n \left(y_i - f(\boldsymbol{x}_i)\right)^2 - \sum_{i=1}^n \left(y_i - f_0(\boldsymbol{x}_i)\right)^2\right] + \frac{2}{n}\sum_{i=1}^n \epsilon_i \left(f(\boldsymbol{x}_i) - f_0(\boldsymbol{x}_i)\right) = \frac{1}{n}\left[\sum_{i=1}^n \left(\epsilon_i + f_0(\boldsymbol{x}_i) - f(\boldsymbol{x}_i)\right)^2 - \sum_{i=1}^n \epsilon_i^2\right] + \frac{2}{n}\sum_{i=1}^n \epsilon_i \left(f(\boldsymbol{x}_i) - f_0(\boldsymbol{x}_i)\right) = \frac{1}{n}\sum_{i=1}^n \left(f(\boldsymbol{x}_i) - f_0(\boldsymbol{x}_i)\right)^2 = \rho_n\left(f - f_0\right)^2.$$

Note that Gâteaux derivatives of $\hat{Q}_n$ can be defined because $\mathcal{F}$ is a convex vector space. We define a pseudo-scalar product $\langle \cdot, \cdot \rangle_{\rho_n}: \mathcal{F} \times \mathcal{F} \to \mathbb{R}$ with the mapping rule

$$\langle f, g \rangle_{\rho_n} = \frac{1}{n}\sum_{i=1}^n f(\boldsymbol{x}_i)\, g(\boldsymbol{x}_i)$$

where the subscript $\rho_n$ indicates that $\langle f - g, f - g \rangle_{\rho_n} = \rho_n(f - g)^2$. The proof that $\langle \cdot, \cdot \rangle_{\rho_n}$ is indeed a pseudo-scalar product is the proof of Proposition 6.2 in Shen et al. (2023).

We also make use of the following remark, which is useful for bounding the empirical process $\sqrt{n}\, d\hat{Q}_n(f_0; f - f_0)$ in the proof.

Remark 9.

(Lemma 5.1. in Shen et al. 2023). Let $X_1, \ldots, X_n$ be independent random variables, with $X_i$ distributed under probability measure $P_i$. Define the empirical process $\nu_n(g)$ as

$$\nu_n(g) := \frac{1}{\sqrt{n}} \sum_{i=1}^n \left( g(X_i) - E_{P_i}\left[g(X_i)\right] \right).$$

Let $\mathcal{G}_n = \{g : \|g\|_\infty \leq M_n\}$, $\epsilon > 0$, and $V \geq \sup_{g \in \mathcal{G}_n} \frac{1}{n}\sum_{i=1}^n \mathrm{Var}\left(g(X_i)\right)$ be arbitrary. Define $\psi(B, n, V) := B^2 \Big/ \left[2V\left(1 + \frac{B M_n}{2\sqrt{n}\, V}\right)\right]$. If $\ln N\!\left(u, \mathcal{G}_n, \|\cdot\|_\infty\right) \leq A_n u^{-r}$ for some $0 < r < 2$ and $u \in (0, a]$, where $a$ is a small positive number, and there exist positive constants $K_i = K_i(r, \epsilon)$, $i = 1, 2$, such that

B K 1 A n 2 r + 2 M n 2 r r + 2 n r 2 2 ( r + 2 ) K 2 A 2 1 / 2 V 2 r 4 .

Then

$$P^*\!\left(\sup_{g \in \mathcal{G}_n} \nu_n(g) > B\right) \leq 5 \exp\left(-(1 - \epsilon)\, \psi(B, n, V)\right).$$

Now, we are ready to state the asymptotic Gaussianity exhibited by f ̂ n .

Theorem 4.

(Asymptotic Normality) Suppose that $\eta_n = o(r_n^{-2})$, and also

$$r_n^{-1} = o\!\left(n^{-1/2}\right), \qquad M^*_{n,L_n+1}\, C^*_{n,d,L_n+1,W_n} = o\!\left(n^{1/4}\right), \qquad \rho_n\left(\pi_{W_n} f_0 - f_0\right) = o\!\left(\min\left\{n^{-1/4},\ n^{-1/6}\left(C^*_{n,d,L_n+1,W_n}\right)^{-1/3}\right\}\right)$$

then the distribution of the statistic

$$\frac{1}{\sqrt{n}} \sum_{i=1}^n \left( \hat{f}_n(\boldsymbol{x}_i) - f_0(\boldsymbol{x}_i) \right)$$

approaches $\mathcal{N}(0, \sigma^2)$ as $n \to \infty$.

Remark 10.

It should also be noted that Theorem 4 establishes asymptotic normality for a functional that is different from the $\sqrt{n}$-scaled sample mean of the centered sample values $y_i - f_0(\boldsymbol{x}_i)$, whose asymptotic normality follows straightforwardly from the Central Limit Theorem, since $y_i - f_0(\boldsymbol{x}_i) = \epsilon_i$ and the $\epsilon_i$ are the errors in the regression setting, which already have a normal distribution.

4.4.1 Asymptotic Normality Conditions Satisfaction for Sufficiently Smooth f0

Here we discuss the requirements implied by the conditions of the Asymptotic Normality Theorem. We require that

$$f_0 \in \left\{ f \in C^0([0,1]^d) \cap W^{k,\infty}([0,1]^d) : \|f\|_W \leq M_3 \right\}$$

for some $M_3 > 0$ and every $k \in \mathbb{N}$, where $W^{k,\infty}([0,1]^d)$ is the Sobolev space on $[0,1]^d$. The following remark is needed to derive the main result of this subsubsection.

Remark 11.

(Proposition 1 from Yarotsky 2017). For any function

$$f \in \mathcal{G}^* := \left\{ f \in C^0([0,1]^d) \cap W^{k,\infty}([0,1]^d) : \|f\|_W \leq 1 \right\}$$

and any $k, d \in \mathbb{N}$, $\varepsilon \in (0,1)$, there is a feed-forward ReLU network, whose layers may be connected to layers beyond their adjacent layers, with a weight assignment that

  1. is capable of expressing $f$ with error $\varepsilon$.

  2. has depth at most $c(\ln(1/\varepsilon) + 1)$ and at most $c\,\varepsilon^{-d/k}(\ln(1/\varepsilon) + 1)$ weights and hidden-layer nodes, for some constant $c = c(d, k)$.

It should be noted that

$$\forall f \text{ with } \|f\|_W \leq K,\ K > 0, \text{ we have } f^* := \frac{1}{K} f \in \mathcal{G}^*,$$

and this allows us to connect our target function $f_0$ to the function $f^* \in \mathcal{G}^*$.

The main intention of this subsubsection is to discuss the conditions required on the weak differentiation order $k$ and the sieve-sequence error order $\varepsilon = \varepsilon_n$ from Remark 11. This remark makes it possible to construct a sieve sequence $\{\pi_{W_n} f_0\}$ that satisfies $\|\pi_{W_n} f_0 - f_0\|_\infty = \varepsilon_n \in (0, 1)$, $\varepsilon_n \downarrow 0$.

One might question the possibility of obtaining such weight assignments from a compact $\Gamma_n$ that can adjust $\varepsilon_n$. We emphasize that the range of $\Gamma_n$ can be made as large as desired. For example, one can simply take $M^{(\gamma)}_{n,u,j,k} = M'$, where $M'$ can be taken arbitrarily large, and replace the $\Gamma_n$ constructed from element-wise bounds on $\gamma_{u,j,k}$ by the set made from 1-norm bounds on $\sum_{k=0}^{H_{n,u-1}} |\gamma_{u,j,k}|$, where the bounds are $M_{n,u} = \sum_{j=1}^{H_{n,u}} \sum_{k=0}^{H_{n,u-1}} M'$. The resulting $\Gamma_n$ is still compact. However, as $M'$ can be made arbitrarily large, the sieve sequence $\{\pi_{W_n} f_0\}$ has its tail in $\Gamma_n$ for sufficiently large $n$, as it converges to $f_0$.

The convergence rate of the sieve sequence $\{\pi_{W_n} f_0\}$ needs to be established using the results of Yarotsky (2017) instead of Yarotsky (2018). To meet the requirement in Yarotsky (2017), we now assume that $f_0$ is a member of $W^{k,\infty}([0,1]^d) \cap C^0([0,1]^d)$ instead of just $C^0([0,1]^d)$. Again, proving asymptotic normality requires a faster convergence rate of $\{\pi_{W_n} f_0\}$ than the one needed for consistency, as we can see from the second and third conditions of Theorem 4 when compared with the conditions of Theorem 3.

As the feed-forward networks that are stated in the remark above might not be in the form of MLPs, we need to show that the ReLU feed-forward network required by the remark can be contained in a ReLU network with layers connected only to their adjacent layers, which is what we referred to earlier as an MLP. Our idea shares similarities with Lemma 1 in Farrell, Liang, and Misra (2021), although we use the hidden-layer-nodes upper bound instead of a weight bound.

Lemma 3.

If $\theta$ is a ReLU FFN with non-adjacent-layer connections, $N_n$ hidden-layer nodes, and $L_n$ hidden layers, then there is a ReLU MLP $\theta'$ with full previous-layer connections, $H_n$ nodes per hidden layer, and $L_n$ hidden layers such that $\theta(\boldsymbol{x}) = \theta'(\boldsymbol{x})$, where $\boldsymbol{x}$ is the input vector, and $H_n \leq N_n L_n + d$.

We can now derive the conditions under which all conditions of the Asymptotic Normality Theorem are satisfied, which requires Lemma 3. Suppose that the order of weak differentiation of $f_0$ (denoted by $k$) satisfies $k = ud$, $u \in \mathbb{N}$, and that the sieve-sequence error satisfies $\varepsilon_n = n^{-a}$ for some $a > 0$. If we assume an $n$-polynomial growth condition on $\varepsilon_n$, then both $H_n$ and $L_n$ have $n$-polynomial growth rates. Therefore, $W_n = O(H_n^2 L_n)$. As $\rho_n(\pi_{W_n} f_0 - f_0) \leq \varepsilon_n$ and $C^*_{n,d,L_n+1,W_n} = O(W_n L_n)$, the two asymptotic Gaussianity rate conditions can thus be written as

$$H_n^2 L_n^2 = o\!\left(n^{1/4}\right), \qquad \varepsilon_n = o\!\left(\min\left\{n^{-1/4},\ n^{-1/6}\left(C^*_{n,d,L_n+1,W_n}\right)^{-1/3}\right\}\right)$$

where H n = O ( N n L n ) by Lemma 3, and also N n is the number of hidden unit nodes in the original, possibly non-MLP ReLU networks from which the ReLU sieve sequence { π W n f 0 } is constructed. Remark 11 tells us that

$$N_n = c\, n^{a/u}\left(\frac{a}{u}\ln(n) + 1\right), \qquad L_n = c\left(\frac{a}{u}\ln(n) + 1\right)$$

and these together with the rewritten rate conditions yield

$$N_n^2 L_n^3 = O\!\left(c^5 n^{2a/u}\left(\frac{a}{u}\ln(n) + 1\right)^5\right), \qquad n^{-a} = o\!\left(n^{-1/4}\right), \qquad n^{-a} = o\!\left(n^{-1/6}\, c^{-5/3}\, n^{-2a/(3u)}\left(\frac{a}{u}\ln(n) + 1\right)^{-5/3}\right)$$

and these conditions lead to

$$n^{2a/u} < n^{1/4}, \qquad n^{a} > n^{1/4}, \qquad \text{and} \qquad n^{-a} < n^{-1/6}\, n^{-2a/(3u)}$$

which then simplify to

(2) $\frac{1}{4} < a < \frac{u}{8}$ and $a > \frac{u}{6u - 4}$.

The last two conditions can be satisfied for every $u \geq 3$, as the function $b: (0, \infty) \to \mathbb{R}$, $b(x) = \frac{x}{6x - 4}$, is decreasing on $[1, \infty)$.
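A quick numerical check of condition (2) – an illustration only, not part of the paper's proofs – confirms that the admissible range for $a$ is empty for $u = 2$ and non-empty for every $u \geq 3$:

```python
from fractions import Fraction

def admissible_interval(u):
    """Admissible range of a in condition (2): max(1/4, u/(6u-4)) < a < u/8."""
    lower = max(Fraction(1, 4), Fraction(u, 6 * u - 4))
    upper = Fraction(u, 8)
    return (lower, upper) if lower < upper else None

for u in range(2, 7):
    print(u, admissible_interval(u))
# u = 2 yields no admissible a; every u >= 3 yields a nonempty interval, e.g. (1/4, 3/8) for u = 3.
```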

Note that some might find the distinction between this paper and Farrell, Liang, and Misra (2021) less clear, as we allow the number of hidden layers (denoted by L) to grow with the sample size n instead of keeping it fixed and dependent only on the number of independent variables d. To address this vagueness, we stress again that Farrell, Liang, and Misra (2021) did not prove the asymptotic normality of $\hat{f}_n$ or of any functional of $\hat{f}_n$ (e.g. $\frac{1}{n}\sum_i \hat{f}_n(\boldsymbol{x}_i)$). Their normality result concerns some second-stage finite-dimensional parameter, which is “independent” of $f_0$. This property is very useful as it opens a new path for the development of statistical tests based on the ReLU sieve estimator, which often require knowledge of the asymptotic distribution, e.g. Shen et al. (2021).

5 Monte Carlo Analysis

The simulations presented in this section are meant to confirm that the multi-layer ReLU network sieve estimator $\hat{f}_n$ does indeed converge to the true regression function $f_0$, and that a suitable functional of their difference is asymptotically normal. As $f_0$ rarely has the same form as the estimating neural network $\hat{f}_n$, parameter comparisons such as those in Section 4.1 of Shen et al. (2023) are not practically important, because they cannot be carried out when $\hat{f}_n$ and $f_0$ have different functional forms. Instead of studying parameter consistency, which is impractical here, one can study the asymptotic properties of the estimating function without considering parameter consistency.[23]

5.1 Consistency of ReLU Feed-Forward Network

We conduct a simulation of $y_i = f_0(x_i) + \epsilon_i$ to show the probabilistic convergence of $\hat{f}_n$. We simulate $x_i$ from the uniform distribution on $[0,1]$, i.e. $x_i \sim U[0,1]$, and the residuals are independent and identically normally distributed with mean zero and standard deviation 0.7, i.e. $\epsilon_i \sim \text{i.i.d. } \mathcal{N}(0, 0.7^2)$. The functions that serve as the true mean function $f_0(x)$ are listed below (a short data-generating sketch follows the list):

  1. A sigmoid function

    $f_0(x) = 5 + 18\sigma(9x - 2) - 12\sigma(2 - 9x)$
  2. A periodic function

    $f_0(x) = \sin(2\pi x) + \frac{1}{3}\cos(3\pi x + 3)$
  3. A non-differentiable function

    $f_0(x) = \begin{cases} 8\left(\frac{1}{2} - x\right), & \text{if } x \in \left[0, \frac{1}{2}\right] \\ 10\sqrt{x - \frac{1}{2}}\,(2 - x), & \text{if } x \in \left(\frac{1}{2}, 1\right] \end{cases}$
  4. A superposition of a sigmoid and a periodic function

    $f_0(x) = 5\sin(8\pi x) + 18\sigma(9x - 2) - 12\sigma(2 - 9x).$
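The following minimal NumPy sketch defines these four functions and the data-generating process; the vectorized forms, helper names, and random seed are our own illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The four true mean functions f0 used in the consistency simulations.
f0_sigmoid  = lambda x: 18 * sigmoid(9 * x - 2) - 12 * sigmoid(2 - 9 * x) + 5
f0_periodic = lambda x: np.sin(2 * np.pi * x) + np.cos(3 * np.pi * x + 3) / 3
f0_kink     = lambda x: np.where(x <= 0.5,
                                 -8 * (x - 0.5),
                                 10 * np.sqrt(np.clip(x - 0.5, 0, None)) * (2 - x))
f0_mixed    = lambda x: 18 * sigmoid(9 * x - 2) - 12 * sigmoid(2 - 9 * x) + 5 * np.sin(8 * np.pi * x)

def simulate(f0, n, sigma=0.7, seed=0):
    """Draw x_i ~ U[0, 1] and y_i = f0(x_i) + eps_i with eps_i ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=n)
    y = f0(x) + sigma * rng.standard_normal(n)
    return x, y

x, y = simulate(f0_periodic, n=2_000)
```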

Note that we have chosen functions with functional forms similar to the simulation functions in Shen et al. (2023), but with larger parameter values. Although defined on the very short range [0,1], they exhibit significant variation in their values. They are harder to fit than the functions used in Shen et al. (2023), which are much gentler. This difficulty in fitting makes the comparison between $f_0$ and $\hat{f}_n$ more interesting, as the two functions are much more likely to have visibly different plots for smaller values of $n$. Also, as we compare the performance of multi-layer ReLU and one-layer sigmoid networks, the better-performing network is more likely to show significantly better numerical and visual convergence when $f_0$ is challenging to fit.

To conduct the simulation, we take $M^{(\gamma)}_{n,u,j,k}$ for the ReLU networks and $V_n$ for the sigmoid networks to be $M'$ and $M' r_n$, respectively,[24] where $M'$ is a very large number, one possible example being $M' = 10^{100{,}000}$. We can replace the original $\Gamma_n$ with the new compact set built from 1-norm bounds on $\sum_{k=0}^{H_{n,u-1}} |\gamma_{u,j,k}|$, where the bounds are $M_{n,u} = \sum_{j=1}^{H_{n,u}} \sum_{k=0}^{H_{n,u-1}} M'$. Remark 6 guarantees that our output-unbounded ReLU networks can be seen as output-bounded, with upper and lower bounds that are very large and very small, respectively, and these bounds are independent of the sample size $n$.

By bounding the parameter sets and the output with a large real number, we can conduct the training minimization as an unbounded optimization. This is meant to simplify the implementation, as one can use common gradient descent algorithms instead of the subgradient projection algorithm when doing unbounded minimization. The subgradient projection algorithm projects the iterate at each gradient descent step onto the convex set (such as $\Gamma_n$) to which the parameters are assumed to belong, and thus it reduces to standard gradient/subgradient descent if the parameters stay inside the convex set at every iteration.[25]

The training is done using Keras 2.2.4 for Python 3.7 in Spyder 3.3.4. The gradient algorithm used is Nadam with learning rate 0.001. The simulation is conducted by setting the growth rate of $H_n$ and of $r_n$ (for one-layer sigmoid networks) to $n^{0.4}$. For multi-layer ReLU networks, $L_n = 2$. The sample sizes are $n \in \{2 \times 10^3, 5 \times 10^3, 2 \times 10^4, 5 \times 10^4\}$. For the superposition $f_0$, because the convergence is slower, we also consider $n \in \{2 \times 10^5, 5 \times 10^5\}$. The results can be seen in Table 1. The values of $\hat{Q}_n(\hat{f}_n)$ are considered good if they are close to $Q_n(f_0) = 0.49$.
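For concreteness, a sketch of one fitting run under the stated settings (Nadam with learning rate 0.001, H_n ≈ n^0.4 nodes per hidden layer, L_n = 2, and batch/epoch numbers of 32 as in the figure captions). It is written against the modern tf.keras API rather than Keras 2.2.4, and it reuses the hypothetical simulate and f0_periodic helpers from the earlier sketch; treat it as an illustration, not the authors' exact code.

```python
import numpy as np
from tensorflow import keras  # the paper uses Keras 2.2.4 / Python 3.7; recent tf.keras is used here

def build_relu_net(n, d=1, L_n=2):
    """Two-hidden-layer ReLU MLP with H_n = round(n**0.4) nodes per hidden layer and a linear output."""
    H_n = int(round(n ** 0.4))
    model = keras.Sequential()
    model.add(keras.layers.Dense(H_n, activation="relu", input_shape=(d,)))
    for _ in range(L_n - 1):
        model.add(keras.layers.Dense(H_n, activation="relu"))
    model.add(keras.layers.Dense(1))
    model.compile(optimizer=keras.optimizers.Nadam(learning_rate=0.001), loss="mse")
    return model

n = 20_000
x, y = simulate(f0_periodic, n)                      # helpers from the earlier sketch
model = build_relu_net(n)
model.fit(x.reshape(-1, 1), y, batch_size=32, epochs=32, verbose=0)
f_hat = model.predict(x.reshape(-1, 1), verbose=0).ravel()
rho_sq = np.mean((f_hat - f0_periodic(x)) ** 2)      # rho_n(f_hat - f0)^2, as reported in Table 1
print(rho_sq)
```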

Table 1:

Errors and least-squares losses for different choices of $f_0$. $\rho_n(\hat{f}_n - f_0)^2$ and $\hat{Q}_n(\hat{f}_n)$ are the squared error and the least-squares loss, respectively. $n$ is the sample size of the training data. $\hat{f}_n$ is the approximate sieve estimator. $\sigma(\cdot)$ is the sigmoid function. ReLU, rectified linear unit. The visualizations of the convergence for the sigmoid, periodic, non-differentiable, and sigmoid-plus-periodic $f_0$ are in Figures 2–5, respectively.

f0(x) = 18σ(9x − 2) − 12σ(2 − 9x) + 5
n       | ρ_n(f̂_n − f0)² ReLU | ρ_n(f̂_n − f0)² Sigmoid | Q̂_n(f̂_n) ReLU | Q̂_n(f̂_n) Sigmoid
2 × 10³ | 13.9306 | 51.6624 | 14.6018 | 52.8889
5 × 10³ | 13.6968 | 13.4850 | 14.1075 | 13.9171
2 × 10⁴ | 0.0340  | 0.0439  | 0.5223  | 0.5330
5 × 10⁴ | 0.0070  | 0.0140  | 0.4950  | 0.5025

f0(x) = sin(2πx) + (1/3)cos(3πx + 3)
n       | ρ_n(f̂_n − f0)² ReLU | ρ_n(f̂_n − f0)² Sigmoid | Q̂_n(f̂_n) ReLU | Q̂_n(f̂_n) Sigmoid
2 × 10³ | 0.1469 | 0.4475 | 0.6433 | 0.9428
5 × 10³ | 0.0378 | 0.4472 | 0.5305 | 0.9281
2 × 10⁴ | 0.0018 | 0.4413 | 0.4907 | 0.9299
5 × 10⁴ | 0.0079 | 0.4134 | 0.4958 | 0.9077

f0(x) = −8(x − 1/2)1{0 ≤ x ≤ 0.5} + 10√(x − 1/2)(2 − x)1{0.5 < x ≤ 1}
n       | ρ_n(f̂_n − f0)² ReLU | ρ_n(f̂_n − f0)² Sigmoid | Q̂_n(f̂_n) ReLU | Q̂_n(f̂_n) Sigmoid
2 × 10³ | 0.8408 | 3.6109 | 1.3705 | 4.1554
5 × 10³ | 0.5048 | 2.3753 | 1.0156 | 2.9082
2 × 10⁴ | 0.0187 | 0.8677 | 0.5058 | 1.3548
5 × 10⁴ | 0.0194 | 0.1739 | 0.5076 | 0.6694

f0(x) = 18σ(9x − 2) − 12σ(2 − 9x) + 5 sin(8πx)
n       | ρ_n(f̂_n − f0)² ReLU | ρ_n(f̂_n − f0)² Sigmoid | Q̂_n(f̂_n) ReLU | Q̂_n(f̂_n) Sigmoid
2 × 10³ | 26.7791 | 56.6896 | 27.5223 | 57.9058
5 × 10³ | 14.7668 | 25.8588 | 15.3108 | 26.3610
2 × 10⁴ | 8.4030  | 12.3960 | 8.8938  | 12.8759
5 × 10⁴ | 8.2855  | 11.4466 | 8.7745  | 11.9541
2 × 10⁵ | 0.9574  | 7.8232  | 1.4476  | 8.3020
5 × 10⁵ | 0.1372  | 6.4662  | 0.6274  | 6.9519

An inspection of the errors, $\rho_n(\hat{f}_n - f_0)^2$, and the least-squares losses, $\hat{Q}_n(\hat{f}_n)$, reveals two major points. First, as the sample size increases, $\rho_n(\hat{f}_n - f_0)^2$ converges to zero and $\hat{Q}_n(\hat{f}_n)$ approaches $Q_n(f_0) = 0.49$, whether the activation function is ReLU or sigmoid, across all simulated functions $f_0$. In fact, the errors $\rho_n(\hat{f}_n - f_0)^2$ follow a decreasing pattern as the sample size increases for both ReLU and sigmoid. Second, when the simulated function $f_0$ has a more complicated structure, the two-layer ReLU network outperforms the one-layer sigmoid in terms of convergence rate. Overall, the consistency of the estimated function $\hat{f}_n$ is confirmed by the results reported in Table 1.

We close this section with a detailed comparison of two-layer ReLU networks with the one-layer sigmoid networks used in Shen et al. (2023). An inspection of Table 1 reveals that the one-layer sigmoid network's convergence speed matches that of the multi-layer ReLU network when $f_0$ is the sigmoid function. This result holds both numerically (the first panel of Table 1) and visually (Figure 2). This is not surprising, as the sigmoid neural networks themselves are linear combinations of sigmoid functions.

Figure 2:

The graphs of multi-layer ReLU and one-layer sigmoid neural networks approximating f0(x) = 18σ(9x − 2) − 12σ(2 − 9x) + 5 for different sample sizes. Both the batch and epoch numbers used during training are 32. The numbers of nodes per layer, obtained by rounding n^0.4 to the nearest integer for each choice of n, are 21, 30, 53, and 76, respectively, where n is the sample size. For ReLU, the number of hidden layers L_n is 2.

As evidenced by Figures 3–5, the two-layer ReLU network detects the fluctuating patterns and the non-differentiable point better and more quickly than the one-layer sigmoid network. The sigmoid networks become somewhat wavy and less accurate when approaching the point of non-differentiability at larger n (see Figure 4). As expected, the ReLU networks also have a faster numerical convergence speed for fluctuating and non-differentiable f0, as indicated by Table 1.

Figure 3:

The graphs of multi-layer ReLU and one-layer sigmoid neural networks approximating f0(x) = sin(2πx) + (1/3)cos(3πx + 3) for different sample sizes. Both the batch and epoch numbers used during training are 32. The numbers of nodes per layer, obtained by rounding n^0.4 to the nearest integer for each choice of n, are 21, 30, 53, and 76, respectively, where n is the sample size. For ReLU, the number of hidden layers L_n is 2.

Figure 4: The graphs of multi-layer ReLU and one-layer sigmoid neural networks approximating f0(x) = −8(x − 1/2) 1{0 ≤ x ≤ 0.5} + 10 √(x − 1/2) (2 − x) 1{0.5 < x ≤ 1} for different sample sizes. Both the batch size and the number of epochs used during training are 32. The number of nodes per layer, obtained by rounding n^0.4 to the nearest integer, is 21, 30, 53, and 76 for each choice of the sample size n, respectively. For ReLU, the number of hidden layers L_n is 2.

Figure 5: The graphs of multi-layer ReLU and one-layer sigmoid neural networks approximating f0(x) = 18σ(9x − 2) − 12σ(2 − 9x) + 5 sin(8πx) for different sample sizes. Both the batch size and the number of epochs used during training are 32. The number of nodes per layer, obtained by rounding n^0.4 to the nearest integer, is 21, 30, 53, 76, 132, and 190 for each choice of the sample size n, respectively. For ReLU, the number of hidden layers L_n is 2.

5.2 Asymptotic Normality

Here we focus on simulating the Asymptotic Normality Theorem. For this simulation, the number of nodes per hidden layer is chosen to be H_n = 9n^0.1(0.1 ln(n) + 1)^2, and the number of hidden layers is L_n = 3(0.1 ln(n) + 1). This growth rate follows the bounding argument discussed in Remark 11. As deep neural networks are notorious for their training difficulties, we conduct the training with batch size 4 and 40 epochs, so the number of training iterations is more than 8 times that of the fixed-depth ReLU networks. The training is still conducted with the same device, operating system, Python library, method, and learning rate as the consistency simulations, using data of sample size n ∈ {2 × 10^3, 5 × 10^3, 2 × 10^4, 5 × 10^4, 2 × 10^5}.
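A minimal sketch of the increasing-depth architecture implied by these choices is given below. Rounding H_n and L_n to the nearest integers is our assumption, and the optimizer and learning rate are left as placeholders because they are carried over from the consistency simulations.

    import numpy as np
    from tensorflow import keras
    from tensorflow.keras import layers

    def build_deep_relu(n):
        """Increasing-depth ReLU network with width H_n and depth L_n.

        H_n = 9 n^0.1 (0.1 ln n + 1)^2 and L_n = 3 (0.1 ln n + 1);
        rounding both to integers is an assumption of this sketch.
        """
        g = 0.1 * np.log(n) + 1.0
        width = int(round(9.0 * n ** 0.1 * g ** 2))   # H_n
        depth = int(round(3.0 * g))                   # L_n
        model = keras.Sequential()
        model.add(layers.Dense(width, activation="relu", input_shape=(1,)))
        for _ in range(depth - 1):
            model.add(layers.Dense(width, activation="relu"))
        model.add(layers.Dense(1))
        return model

    # net = build_deep_relu(n)
    # net.compile(optimizer=..., loss="mse")   # optimizer/learning rate as in the
    # net.fit(x, y, batch_size=4, epochs=40)   # consistency simulations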

For this simulation, the true regression functions f_0 are the first two functions from the consistency simulation, together with the sigmoid–periodic superposition

f_0(x) = 10 sin(16πx) + 12σ(2x − 9) − 18σ(9x − 2),  x ∈ [0, 1].

We choose these functions because all of them are infinitely differentiable, which satisfies the smoothness requirement of the asymptotic normality result in (2). As before, the true target functions f_0 used in this asymptotic normality exercise are significantly steeper than those used in the normality simulation of Shen et al. (2023), which makes attaining accurate and stable estimates more challenging. After training, we repeat the data simulation 200 times (as in Shen et al. 2023) to obtain samples of the statistic (1/√n) Σ_{i=1}^n (f̂_n(x_i) − f_0(x_i)). Note that the ρ_n(f̂_n − f_0)² and Q_n(f̂_n) values reported in Table 2 verify the consistency of the increasing-depth ReLU networks, as shown in Section 4.2.
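The Monte Carlo loop behind these samples can be sketched as follows. The uniform design, the Gaussian noise and its scale, and the exact 1/√n scaling of the statistic are our assumptions (the subsequent standardization makes any fixed scaling immaterial), and build_and_fit stands in for whichever training routine is used.

    import numpy as np

    def simulate_statistic(f0, build_and_fit, n, reps=200, sigma=1.0, seed=0):
        """Monte Carlo samples of the centred statistic, returned standardized.

        build_and_fit(x, y) must return a fitted estimator callable on x.
        The uniform design, Gaussian noise with scale `sigma`, and the
        1/sqrt(n) scaling are illustrative assumptions.
        """
        rng = np.random.default_rng(seed)
        stats = np.empty(reps)
        for r in range(reps):
            x = rng.uniform(0.0, 1.0, size=n)
            y = f0(x) + sigma * rng.standard_normal(n)
            f_hat = build_and_fit(x, y)
            stats[r] = np.sum(f_hat(x) - f0(x)) / np.sqrt(n)
        return (stats - stats.mean()) / stats.std(ddof=1)   # standardized samples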

Table 2:

Goodness-of-fit test results for different functions f0. ρ_n(f̂_n − f_0)² and Q_n(f̂_n) are the estimation error and the least-squares error, respectively. n is the sample size of the training data, f̂_n is the approximated sieve estimator, and σ(.) is the sigmoid function. KS, Kolmogorov–Smirnov test; SW, Shapiro–Wilk test; AP, d'Agostino–Pearson test. The Q–Q plots of the standardized data are provided in Figure 6.

f_0(x) = 18σ(9x − 2) − 12σ(2 − 9x) + 5
n    ρ_n(f̂_n − f_0)²    Q_n(f̂_n)    KS (p-value)    SW (p-value)    AP (p-value)
2 × 103 0.0722 0.5801 0.0505 (0.7300) 0.9967 (0.9485) 0.0103 (0.9948)
5 × 103 0.1054 0.5870 0.0427 (0.8668) 0.9955 (0.8232) 1.3818 (0.5011)
2 × 104 0.0897 0.5775 0.0495 (0.7050) 0.9954 (0.8200) 0.2040 (0.9029)
5 × 104 0.0722 0.5549 0.0332 (0.9783) 0.9849 (0.0314) 7.8790 (0.0194)
2 × 105 0.0490 0.5416 0.0491 (0.7105) 0.9936 (0.5432) 0.4241 (0.8089)
f_0(x) = sin(2πx) + (1/3) cos(3πx + 3)
2 × 103 0.0103 0.5032 0.0403 (0.9230) 0.9939 (0.5901) 1.5961 (0.4501)
5 × 103 0.0470 0.5323 0.0435 (0.8507) 0.9948 (0.7213) 2.2750 (0.3206)
2 × 104 0.0075 0.4953 0.0376 (0.9376) 0.9928 (0.4426) 1.0288 (0.5978)
5 × 104 0.0148 0.4992 0.0449 (0.8060) 0.9912 (0.2695) 5.0601 (0.0796)
2 × 105 0.0048 0.4967 0.0442 (0.8209) 0.9957 (0.8561) 0.5117 (0.7742)
f_0(x) = 12σ(2 − 9x) − 18σ(9x − 2) + 10 sin(16πx)
2 × 103 41.3207 41.6786 0.0485 (0.7752) 0.9949 (0.7458) 0.0539 (0.9733)
5 × 103 21.0600 21.6952 0.0582 (0.5192) 0.9929 (0.4553) 2.3510 (0.3086)
2 × 104 0.5636 1.0433 0.0465 (0.7750) 0.9939 (0.5947) 1.2024 (0.5481)
5 × 104 0.9429 1.4291 0.0348 (0.9662) 0.9942 (0.6418) 1.3650 (0.5053)
2 × 105 0.2481 0.7400 0.0317 (0.9866) 0.9959 (0.8807) 1.5539 (0.4597)

Next, after standardizing the samples, we construct Q–Q plots against N(0, 1) and conduct normality tests. The statistical tests used are the Kolmogorov–Smirnov, Shapiro–Wilk, and d'Agostino–Pearson tests. We standardize the data even for the Kolmogorov–Smirnov test: our interest is in the shape of the asymptotic distribution itself rather than in its mean and variance, so the Kolmogorov–Smirnov test is used only to check the normality of the standardized distribution.
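These checks map directly onto standard SciPy routines. The sketch below is one way to produce the Q–Q plot and the three p-values reported in Table 2 from an array of standardized statistics; the function name and the plotting details are ours.

    import matplotlib.pyplot as plt
    from scipy import stats

    def normality_checks(z):
        """Q-Q plot against N(0,1) and the three normality tests for the
        standardized samples z (a 1-D NumPy array)."""
        ks_stat, ks_p = stats.kstest(z, "norm")     # Kolmogorov-Smirnov vs N(0,1)
        sw_stat, sw_p = stats.shapiro(z)            # Shapiro-Wilk
        ap_stat, ap_p = stats.normaltest(z)         # d'Agostino-Pearson
        stats.probplot(z, dist="norm", plot=plt)    # Q-Q plot
        plt.title("Q-Q plot of standardized statistics")
        plt.show()
        return {"KS": (ks_stat, ks_p), "SW": (sw_stat, sw_p), "AP": (ap_stat, ap_p)}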

The Q–Q plots in Figure 6 clearly indicate the normality of the 200 sampled statistics.[26] Almost all of the test results in Table 2 fail to reject normality at the 5 % significance level. The notable exception is the sigmoid f_0 with n = 5 × 10^4 (Table 2), for which the Shapiro–Wilk and d'Agostino–Pearson tests reject normality. Our explanation is the presence of two extreme outliers that are separated from the other samples in this case (Figure 6), which makes the tails slightly heavier. This mild tail heaviness affects the Shapiro–Wilk test, which is sensitive to the sample variance, and the d'Agostino–Pearson test, which is based on the samples' skewness and kurtosis. The bulk of the sample, however, remains consistent with normality, and for all other values of n both tests fail to reject it.

Figure 6: The Q–Q plots for multi-layer ReLU network estimation of f0(x) = 5 + 18σ(9x − 2) − 12σ(2x − 9), x ∈ [0, 1], for different sample sizes. The theoretical quantiles are those of N(0, 1). The batch size and the number of epochs used in the training are 4 and 40, respectively. The number of nodes per hidden layer and the depth of the ReLU networks are H_n = 9n^0.1(0.1 ln(n) + 1)^2 and L_n = 3(0.1 ln(n) + 1), respectively, where n is the sample size.

5.3 A High-Dimensional Simulation

In this section, we compare the performance of a multi-layer ReLU network with that of another non-parametric estimator, the random forest, in the context of approximating a function with a large number of variables.[27]

The multi-dimensional simulation is interesting to analyze because this setting is notorious for the curse-of-dimensionality problem. As explained in the introduction, the data become sparse relative to the number of independent variables (the dimension). Many conventional regression methods, especially those with a fixed functional form (such as linear or polynomial regressors), struggle to capture the dynamics of a target function with so many variables. Even machine learning methods such as kernel regression and decision trees may struggle with high-dimensional problems. One of the main reasons practitioners use multi-layer neural networks, however, is their ability to handle such problems. It is therefore interesting to compare the performance of multi-layer ReLU networks and other machine learning regressors in this simulation.

Each of the chosen target functions has 500 variables. The functions are

f_0(x) = Σ_{j=1}^{10} x_j sin(x_{j+1} x_{j+2} + x_{j+3}),  x ∈ [0, 1]^500,

f_0(x) = Σ_{1≤j≤100} 5 e^{x_j} sin(x_{j+1}) + Σ_{101≤j≤200} [10 cos(x_j) sin(x_{j+1}) + cos(x_{j+1}) e^{x_{j+1}}] + Σ_{201≤j≤300} [9 sin(x_j) + 3 cos(x_{j+1})] + Σ_{301≤j≤400} 8 cos(e^{x_j}) + Σ_{401≤j≤500} 10 ReLU(sin(5 cos(e^{x_j}))),  x ∈ [0, 1]^500.
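A NumPy transcription of the two target functions, under our reading of how the terms group inside each block of the second function (the bracketing is not unambiguous in the displayed formula), is sketched below; indices are shifted to 0-based.

    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)

    def f0_first(x):
        """f0(x) = sum_{j=1}^{10} x_j sin(x_{j+1} x_{j+2} + x_{j+3}), x in [0,1]^500.
        With 0-based indexing, x_j corresponds to x[j - 1]."""
        j = np.arange(1, 11)
        return np.sum(x[j - 1] * np.sin(x[j] * x[j + 1] + x[j + 2]))

    def f0_second(x):
        """Second target function; the grouping of terms within each block
        follows our reading of the displayed formula."""
        j1 = np.arange(1, 101)
        j2 = np.arange(101, 201)
        j3 = np.arange(201, 301)
        j4 = np.arange(301, 401)
        j5 = np.arange(401, 501)
        s1 = np.sum(5.0 * np.exp(x[j1 - 1]) * np.sin(x[j1]))
        s2 = np.sum(10.0 * np.cos(x[j2 - 1]) * np.sin(x[j2])
                    + np.cos(x[j2]) * np.exp(x[j2]))
        s3 = np.sum(9.0 * np.sin(x[j3 - 1]) + 3.0 * np.cos(x[j3]))
        s4 = np.sum(8.0 * np.cos(np.exp(x[j4 - 1])))
        s5 = np.sum(10.0 * relu(np.sin(5.0 * np.cos(np.exp(x[j5 - 1])))))
        return s1 + s2 + s3 + s4 + s5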

The first target function is a fluctuating smooth function. We choose this trigonometric form because, in many applications, wavy periodic functions play a fundamental role in describing the fluctuations of the target function. In finance, for example, practitioners may use trigonometric functions to model stock excess returns as functions of macroeconomic indicators, since sine and cosine waver between positive and negative values. When the value is positive, the macroeconomic conditions contribute positively to the excess return, which is typical when the economy of the issuing company's home country is performing well; when the value is negative, it reflects a negative macroeconomic impact, indicating an economic downturn in that country.

Although it might be difficult to identify an empirical application of the second target function, its form is very complex and very hard to estimate accurately. The function combines fluctuating trigonometric terms, exponential terms, and a non-differentiable ReLU term. Simple regressors are unlikely to deliver a correct, unbiased estimate of this function, and even non-neural-network regression methods such as polynomial functions may not capture its dynamics accurately. Hence, it is interesting to see how a neural network estimate of this function performs when pitted against another machine learning regressor, which in our case is the random forest.

For training, we use the Python console in Spyder 4.0.0 with the Keras library for the neural network and the Scikit-learn library for the random forest. There are several reasons for choosing the random forest (denoted by RF) as a benchmark for the ReLU MLP. First, RF is a variant of the decision tree, a non-linear machine learning method known to perform better than standard linear regression, and its mathematical form is not tied to a specific functional form, unlike polynomial regressors. Second, RF is one of the most popular decision-tree methods among machine learning practitioners, and as an ensemble of trees it yields more stable estimates of the target value than a single decision tree. RF therefore has every potential to be a good competitor of the ReLU MLP. For the training specifications, the ReLU network has two hidden layers with 32 nodes each, while the RF follows the standard specification of the Scikit-learn 0.24.2 package. The neural network is trained with the Nadam optimizer from the Keras 2.4.0 package, with batch size 1, 800 epochs, a learning rate of 10^−4, and the ϵ parameter of the Nadam algorithm set to 1.0.
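A sketch of this training setup, using only the hyperparameters stated above, might look as follows; the test-set construction in the commented usage lines is illustrative rather than part of the paper's procedure.

    import numpy as np
    from tensorflow import keras
    from tensorflow.keras import layers
    from sklearn.ensemble import RandomForestRegressor

    def build_relu_mlp(input_dim=500):
        """Two hidden layers of 32 ReLU nodes, as described in the text."""
        model = keras.Sequential([
            layers.Dense(32, activation="relu", input_shape=(input_dim,)),
            layers.Dense(32, activation="relu"),
            layers.Dense(1),
        ])
        model.compile(
            optimizer=keras.optimizers.Nadam(learning_rate=1e-4, epsilon=1.0),
            loss="mse",
        )
        return model

    # x_train: (n, 500) array of inputs, y_train: (n,) responses
    # nn = build_relu_mlp()
    # nn.fit(x_train, y_train, batch_size=1, epochs=800, verbose=0)
    # rf = RandomForestRegressor()        # Scikit-learn 0.24.2 defaults
    # rf.fit(x_train, y_train)
    # mse_nn = np.mean((nn.predict(x_test).ravel() - y_test) ** 2)
    # mse_rf = np.mean((rf.predict(x_test) - y_test) ** 2)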

Table 3 shows that for target functions with a large number of variables, such as the two functions f_0 above, the multi-layer ReLU network approximates the target much better than a conventional non-parametric estimator such as RF: the mean squared errors (MSEs) of the multi-layer ReLU network are roughly 100–1,000 times smaller than those of the RF. The network's ability to capture rich inter-variable interactions within each hidden layer is not matched by the simpler, tree-based method. The networks' capacity to represent various functional forms (by switching some of their hidden units on or off) allows flexible and efficient estimation of the target function in high-dimensional settings, something RF cannot do. This finding helps justify the use of neural networks in regressions with a very large number of independent variables. The results in Table 3 show that multi-layer ReLU networks converge better than RF, another popular and powerful machine learning method, which serves as further justification for exploring the asymptotic properties of multi-layer ReLU networks.

Table 3:

The MSE results for predicting high-dimensional functions. This table reports the mean squared error (MSE) for predicting two high-dimensional functions, comparing a multi-layer ReLU network with a random forest for estimating functions with a large number of variables (500 in this case). For training, we use the Python console in Spyder 4.0.0 with the Keras library for the neural network and the Scikit-learn library for the random forest. The ReLU network has two hidden layers with 32 nodes each and is trained using the Nadam optimizer with batch size 1, 800 epochs, a learning rate of 10^−4, and the ϵ parameter of the Nadam algorithm set to 1.0. The RF specification follows the default settings of Scikit-learn 0.24.2. NN, neural network; RF, random forest.

f_0(x) = Σ_{j=1}^{10} x_j sin(x_{j+1} x_{j+2} + x_{j+3})
Sample size MSE NN MSE RF
200 0.0266 118.4978
500 0.2032 115.2601
2000 0.4624 126.9473
f_0(x) = Σ_{1≤j≤100} 5 e^{x_j} sin(x_{j+1}) + Σ_{101≤j≤200} [10 cos(x_j) sin(x_{j+1}) + cos(x_{j+1}) e^{x_{j+1}}] + Σ_{201≤j≤300} [9 sin(x_j) + 3 cos(x_{j+1})] + Σ_{301≤j≤400} 8 cos(e^{x_j}) + Σ_{401≤j≤500} 10 ReLU(sin(5 cos(e^{x_j})))
200 1.3360 × 10−5 0.0016
500 7.6086 × 10−5 0.0020
2000 3.3205 × 10−5 0.0020

Remark 12.

We emphasize again that, when dealing with high-dimensional data, the quantity most computational practitioners care about is the bias, which is directly related to convergence; this is the main reason the multi-dimensional simulation is conducted. However, this does not diminish the importance of the asymptotic distribution established in the previous section. It remains one of the most sought-after theoretical properties of regression estimators, as it opens up the possibility of deriving test statistics, which requires knowledge of the asymptotic distribution. Fallahgoul, Franstianto, and Lin (2024) and Horel and Giesecke (2020), for example, provide more information about statistical tests related to neural networks. The results in this subsection also serve as an additional technical gain relative to Shen et al. (2023), who did not provide simulations illustrating the advantage of neural networks over simpler machine learning methods.

6 Summary

Recently, machine learning (ML) algorithms have become popular tools for tackling economic and financial problems such as prediction, with the deep neural network (DNN) with rectified linear unit (ReLU) activation function being the most popular. Empirical studies show that the regularization of a DNN requires more care than that of its ML counterparts, due to its severe non-linearity and heavy parameterization. DNNs have also been perceived as “black boxes” that offer little insight into how predictions are made. To tackle the explainability issue of DNNs, we apply tools from a non-parametric regression framework (the sieve estimator) and show that such a sieve estimator exists for the DNN. We then establish three asymptotic properties of the ReLU network: consistency, a sieve-based convergence rate, and asymptotic normality. Finally, a Monte Carlo analysis confirms our theoretical findings.


Corresponding author: Hasan Fallahgoul, School of Mathematics and Centre for Quantitative Finance and Investment Strategies, Monash University, 9 Rainforest Walk, 3800, Melbourne, Victoria, Australia, E-mail:

Acknowledgments

We thank Yan Dolinsky, Loriano Mancini, and Juan-Pablo Ortega for comments on the earlier draft. We also thank the editor (Jeremy Piger) and an anonymous referee for their invaluable comments, which improved the clarity and content of the paper. Monash Centre for Quantitative Finance and Investment Strategies has been supported by BNP Paribas.

References

Anthony, M., and P. L. Bartlett. 2009. Neural Network Learning: Theoretical Foundations. Cambridge: Cambridge University Press.

Bach, F. 2017. “Breaking the Curse of Dimensionality with Convex Neural Networks.” Journal of Machine Learning Research 18: 629–81.

Bianchi, D., M. Büchner, and A. Tamoni. 2021. “Bond Risk Premiums with Machine Learning.” Review of Financial Studies 34: 1046–89. https://doi.org/10.1093/rfs/hhaa062.

Chen, X. 2007. “Large Sample Sieve Estimation of Semi-nonparametric Models.” In Handbook of Econometrics, Vol. 6B, edited by J. Heckman, and E. Leamer, 5549–632. https://doi.org/10.1016/S1573-4412(07)06076-X.

Chen, J. 2019. “Estimating Latent Group Structure in Time-Varying Coefficient Panel Data Models.” The Econometrics Journal 22: 223–40. https://doi.org/10.1093/ectj/utz008.

Chen, X., and X. Shen. 1998. “Sieve Extremum Estimates for Weakly Dependent Data.” Econometrica 66 (2): 289–314. https://doi.org/10.2307/2998559.

Clevert, D.-A., T. Unterthiner, and S. Hochreiter. 2015. “Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs).” arXiv preprint arXiv:1511.07289.

Fallahgoul, H., V. Franstianto, and X. Lin. 2024. “Asset Pricing with Neural Networks: Significance Tests.” Journal of Econometrics 238: 105574. https://doi.org/10.1016/j.jeconom.2023.105574.

Farrell, M. H., T. Liang, and S. Misra. 2021. “Deep Neural Networks for Estimation and Inference.” Econometrica 89 (1): 181–213. https://doi.org/10.3982/ecta16901.

Goodfellow, I., D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. 2013. “Maxout Networks.” In International Conference on Machine Learning, 1319–27. PMLR.

Goodfellow, I., Y. Bengio, and A. Courville. 2016. Deep Learning. Cambridge: MIT Press.

Gu, S., B. Kelly, and D. Xiu. 2020. “Empirical Asset Pricing via Machine Learning.” Review of Financial Studies 33: 2223–73. https://doi.org/10.1093/rfs/hhaa009.

Györfi, L., M. Kohler, A. Krzyżak, and H. Walk. 2002. A Distribution-Free Theory of Nonparametric Regression. New York: Springer. https://doi.org/10.1007/b97848.

Hansen, B. E. 2014. “Nonparametric Sieve Regression: Least Squares, Averaging Least Squares, and Cross-Validation.” In Handbook of Applied Nonparametric and Semiparametric Econometrics and Statistics.

Hastie, T., R. Tibshirani, and J. Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer-Verlag. https://doi.org/10.1007/978-0-387-84858-7.

He, K., X. Zhang, S. Ren, and J. Sun. 2015. “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification.” In Proceedings of the IEEE International Conference on Computer Vision, 1026–34. https://doi.org/10.1109/ICCV.2015.123.

Horel, E., and K. Giesecke. 2020. “Significance Tests for Neural Networks.” Journal of Machine Learning Research 21: 1–29.

Hubner, S., and P. Čížek. 2019. “Quantile-Based Smooth Transition Value at Risk Estimation.” The Econometrics Journal 22: 241–61. https://doi.org/10.1093/ectj/utz009.

Kohler, M., and S. Langer. 2021. “On the Rate of Convergence of Fully Connected Deep Neural Network Regression Estimates.” Annals of Statistics 49: 2231–49. https://doi.org/10.1214/20-aos2034.

Krizhevsky, A., I. Sutskever, and G. E. Hinton. 2012. “ImageNet Classification with Deep Convolutional Neural Networks.” Advances in Neural Information Processing Systems 25: 1097–105.

Leoni, G. 2017. A First Course in Sobolev Spaces, 2nd ed. Providence: American Mathematical Society. https://doi.org/10.1090/gsm/181.

Nair, V., and G. E. Hinton. 2010. “Rectified Linear Units Improve Restricted Boltzmann Machines.” In Proceedings of the 27th International Conference on Machine Learning. Haifa.

Schmidt-Hieber, J. 2020. “Nonparametric Regression Using Deep Neural Networks with ReLU Activation Function.” Annals of Statistics 48: 1875–97. https://doi.org/10.1214/19-aos1875.

Shen, X. 1997. “On Methods of Sieves and Penalization.” Annals of Statistics 25: 2555–91. https://doi.org/10.1214/aos/1030741085.

Shen, X., C. Jiang, L. Sakhanenko, and Q. Lu. 2021. “A Goodness-of-Fit Test Based on Neural Network Sieve Estimators.” Statistics & Probability Letters 174: 109100. https://doi.org/10.1016/j.spl.2021.109100.

Shen, X., C. Jiang, L. Sakhanenko, and Q. Lu. 2023. “Asymptotic Properties of Neural Network Sieve Estimators.” Journal of Nonparametric Statistics 35: 839–68. https://doi.org/10.1080/10485252.2023.2209218.

Stone, C. J. 1982. “Optimal Global Rates of Convergence for Nonparametric Regression.” Annals of Statistics 10 (4): 1040–53. https://doi.org/10.1214/aos/1176345969.

Stone, C. J. 1985. “Additive Regression and Other Nonparametric Models.” Annals of Statistics 13 (2): 689–705. https://doi.org/10.1214/aos/1176349548.

Stone, C. J. 1994. “The Use of Polynomial Splines and Their Tensor Products in Multivariate Function Estimation.” Annals of Statistics 22 (1): 118–71. https://doi.org/10.1214/aos/1176325361.

Tsybakov, A. B. 2009. Introduction to Nonparametric Estimation. New York: Springer Series in Statistics. https://doi.org/10.1007/b13794.

Vaart, A. W., and J. A. Wellner. 1996. Weak Convergence and Empirical Processes: With Applications to Statistics. New York: Springer-Verlag.

Wasserman, L. 2006. All of Nonparametric Statistics. New York: Springer Texts in Statistics.

Wilson, D. R., and T. R. Martinez. 2003. “The General Inefficiency of Batch Training for Gradient Descent Learning.” Neural Networks 16: 1429–51. https://doi.org/10.1016/s0893-6080(03)00138-2.

Yarotsky, D. 2017. “Error Bounds for Approximations with Deep ReLU Networks.” Neural Networks 94: 103–14. https://doi.org/10.1016/j.neunet.2017.07.002.

Yarotsky, D. 2018. “Optimal Approximation of Continuous Functions by Very Deep ReLU Networks.” In Proceedings of the 31st Conference on Learning Theory, Vol. 75, 639–49. PMLR.

Zhou, P., and J. Feng. 2018. “Understanding Generalization and Optimization Performance of Deep CNNs.” arXiv preprint arXiv:1805.10767.


Supplementary Material

This article contains supplementary material (https://doi.org/10.1515/snde-2023-0072).


Received: 2023-09-15
Accepted: 2024-08-11
Published Online: 2024-09-05

© 2024 the author(s), published by De Gruyter, Berlin/Boston

This work is licensed under the Creative Commons Attribution 4.0 International License.
