
On stochastic accelerated gradient with convergence rate

  • Xingxing Zha, Yongquan Zhang, and Yiyuan Cheng
Published/Copyright: October 13, 2022

Abstract

This article studies the regression learning problem from given sample data by using a stochastic approximation (SA) type algorithm, namely, the accelerated SA. We focus on problems without strong convexity, for which all well-known algorithms achieve a convergence rate for function values of $O(1/\sqrt{n})$. We consider and analyze an accelerated SA algorithm that achieves a rate of $O(1/n)$ for the classical least-square regression and logistic regression problems, respectively. Compared with the well-known results, we need fewer conditions to obtain the tight convergence rate for the least-square regression and logistic regression problems.

MSC 2010: 68Q19; 68Q25; 68Q30

1 Introduction

Large-scale machine learning problems are becoming ubiquitous in science, engineering, government, business, and almost all areas. Faced with huge amounts of data, investigators typically prefer algorithms that process each observation only once, or a few times. Stochastic approximation (SA) algorithms such as stochastic gradient descent (SGD), although introduced more than 60 years ago [1], are still widely used and studied in many contexts (see [2–26]).

To our knowledge, Robbins and Monro [1] first proposed the SA approach to the gradient descent method. Since then, SA algorithms have been widely used in stochastic optimization and machine learning. Polyak [2] and Polyak and Juditsky [3] developed an important improvement of the SA method by using longer stepsizes with consequent averaging of the obtained iterates. The mirror-descent SA was demonstrated by Nemirovski et al. [6], who showed that it exhibits an unimprovable expected rate for solving nonstrongly convex programming (CP) problems. Shalev-Shwartz et al. [5] and Nemirovski et al. [6] studied averaged SGD and achieved the rate of $O(1/(\mu n))$ in the strongly convex case, but only $O(1/\sqrt{n})$ in the nonstrongly convex case. Bach and Moulines [10] considered and analyzed SA algorithms that achieve a rate of $O(1/n)$ for least-square regression and logistic regression learning problems in the nonstrongly convex case; this convergence rate is almost optimal for each of the two problems. However, their analysis requires assumptions A1–A6. It is therefore natural to ask whether the $O(1/n)$ convergence rate for least-square regression can be obtained under fewer assumptions. In this article, we consider an accelerated SA type learning algorithm for solving the least-square regression and logistic regression problems and achieve a rate of $O(1/n)$ for least-square regression learning problems under assumptions A1–A4 of [10]. For solving a class of CP problems, Nesterov presented the accelerated gradient method in a celebrated work [12]. The accelerated gradient method has since been generalized by Beck and Teboulle [13], Tseng [14], and Nesterov [15,16] to solve an emerging class of composite CP problems. In 2012, Lan [17] further showed that the accelerated gradient method is optimal for solving not only smooth CP problems but also general nonsmooth and stochastic CP problems. The accelerated stochastic approximation (AC-SA) algorithm was proposed by Ghadimi and Lan [18,19] by properly modifying Nesterov's optimal method for smooth CP. They [20,21] also developed a generic AC-SA algorithmic framework, which can be specialized to yield optimal or nearly optimal methods for solving strongly convex stochastic composite optimization problems. Motivated by the works mentioned above, we aim to consider and analyze an accelerated SA algorithm that achieves a rate of $O(1/n)$ for classical least-square regression and logistic regression problems, respectively.

Zhu [25] introduced Katyusha, a direct, primal-only stochastic gradient method with a provably accelerated convergence rate in convex (offline) stochastic optimization; it can be incorporated into variance-reduction-based algorithms to speed them up, in terms of both sequential and parallel performance. A new gradient-based optimization approach that automatically adjusts the learning rate was proposed by Cao [26]. This approach can be applied to design both nonadaptive and adaptive learning rates, and it offers an alternative way to optimize the learning rate of the SGD algorithm besides the current nonadaptive learning rate methods (e.g., SGD, momentum, and Nesterov) and the adaptive learning rate methods (e.g., AdaGrad, AdaDelta, and Adam).

In this article, we consider minimizing a convex function $f$, defined on a closed convex set in a Euclidean space, given by $f(\theta) = \frac{1}{2}E[\ell(y, \langle\theta, x\rangle)]$, where $(x, y) \in X \times \mathbb{R}$ denotes the sample data and $\ell$ denotes a loss function that is convex with respect to its second variable. This class of loss functions includes least-square regression and logistic regression. In the SA framework, $z = \{z_i\}_{i=1}^n = \{(x_i, y_i)\}_{i=1}^n \in Z^n$ denotes a set of random samples, which are drawn independently according to the unknown probability measure $\rho$, and the predictor defined by $\theta$ is updated after each pair is seen.
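For concreteness, the two losses covered by this formulation are the least-squares loss and the logistic loss; the short Python snippet below (ours, with hypothetical function names) writes them out for a single sample $(x, y)$.

```python
import numpy as np

def least_squares_loss(theta, x, y):
    """Least-squares loss (1/2)(<theta, x> - y)^2 for a single pair (x, y)."""
    return 0.5 * (theta @ x - y) ** 2

def logistic_loss(theta, x, y):
    """Logistic loss log(1 + exp(-y <theta, x>)) for a label y in {-1, +1}."""
    return np.logaddexp(0.0, -y * (theta @ x))
```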

The rest of this article is organized as follows. In Section 2, we give a brief introduction to the accelerated gradient algorithm for least-square regression. In Section 3, we study the accelerated gradient algorithm for logistic regression. In Section 4, we compare our results with the known related work. Finally, we conclude this article with the obtained results.

2 The stochastic accelerated gradient algorithm for least-square regression

In this section, we consider the accelerated gradient algorithm for least-square regression. The novelty of this article is that our convergence result attains a nonasymptotic rate of $O(1/n)$. To give the convergence property of the stochastic accelerated gradient algorithm for the regression problem, we make the following assumptions (a)–(d):

  (a) The underlying space is a $d$-dimensional Euclidean space, with $d \ge 1$.

  (b) Let $(X, d)$ be a compact metric space and let $Y = \mathbb{R}$. Let $\rho$ be a probability distribution on $Z = X \times Y$, and let $(X, Y)$ be a corresponding random variable.

  (c) $E\|x_k\|^2$ is finite, i.e., $E\|x_k\|^2 \le M$ for any $k \ge 1$.

  (d) The global minimum of $f(\theta) = \frac{1}{2}E[\langle\theta, x_k\rangle^2 - 2y_k\langle\theta, x_k\rangle]$ is attained at a certain $\theta^* \in \mathbb{R}^d$. Let $\xi_k = (y_k - \langle\theta^*, x_k\rangle)x_k$ denote the residual. For any $k \ge 1$, we have $E\xi_k = 0$. We also assume that $E\|\xi_k\|^2 \le \sigma^2$ for every $k$, and we write $\bar{\xi}_k = \frac{1}{k}\sum_{i=1}^k \xi_i$.

Assumptions (a)–(d) are standard in SA (see, e.g., [9,10,22]). Compared with the work of Bach and Moulines [10], we do not need the condition that the covariance operator $E(x_k \otimes x_k)$ is invertible for any $k \ge 1$, nor the conditions that $E[\xi_k \otimes \xi_k] \preceq \sigma^2 E(x_k \otimes x_k)$ and $E(\|x_k\|^2\, x_k \otimes x_k) \preceq R^2 E(x_k \otimes x_k)$ for a positive number $R$.

Let $\theta_0 \in \mathbb{R}^d$ be given, and let $\{\alpha_k\}$ satisfy $\alpha_1 = 1$ and $\alpha_k > 0$ for any $k \ge 2$, $\beta_k > 0$, and $\lambda_k > 0$. The algorithm proceeds as follows.

  (i) Set the initial $\theta_0^{ag} = \theta_0$ and

    (1) $\theta_k^{md} = (1-\alpha_k)\theta_{k-1}^{ag} + \alpha_k\theta_{k-1}$.

  (ii) Set

    (2) $\theta_k = \theta_{k-1} - \lambda_k f'(\theta_k^{md}) = \theta_{k-1} - \lambda_k E[\langle\theta_k^{md}, x_k\rangle x_k - y_k x_k]$,

    (3) $\theta_k^{ag} = \theta_k^{md} - \beta_k\big(f'(\theta_k^{md}) + \bar{\xi}_k\big) = \theta_k^{md} - \beta_k\big(E[\langle\theta_k^{md}, x_k\rangle x_k - y_k x_k] + \bar{\xi}_k\big)$.

  (iii) Set $k \leftarrow k+1$ and go to step (i). A minimal sketch of this procedure appears below.
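The following Python sketch implements recursion (1)–(3) on a data stream. It is only an illustration of the update rule, under two assumptions that are ours rather than the paper's: the population gradient $E[\langle\theta, x_k\rangle x_k - y_k x_k]$ is replaced by the per-sample gradient, and the residual $\xi_k$, which involves the unknown $\theta^*$, is approximated with the current iterate; the stepsize schedule is taken from Corollary 1 below.

```python
import numpy as np

def accelerated_sa_least_squares(stream, theta0, M, n):
    """Sketch of recursion (1)-(3) for least-square regression.

    Assumptions of this sketch (not prescribed by the paper): the population
    gradient in (2)-(3) is replaced by the per-sample gradient
    <theta, x_k> x_k - y_k x_k, and the residual xi_k = (y_k - <theta*, x_k>) x_k
    is approximated with the current iterate theta_k^{md}, since theta* is
    unknown in practice.  Stepsizes follow Corollary 1:
    alpha_k = 1/(k+1), beta_k = 1/(M(k+1)), lambda_k = 1/(2M).
    """
    theta = np.array(theta0, dtype=float)      # theta_k
    theta_ag = np.array(theta0, dtype=float)   # theta_k^{ag}, initialized to theta_0
    xi_sum = np.zeros_like(theta)              # running sum of residual proxies
    for k in range(1, n + 1):
        x, y = next(stream)
        alpha, beta, lam = 1.0 / (k + 1), 1.0 / (M * (k + 1)), 1.0 / (2.0 * M)
        theta_md = (1.0 - alpha) * theta_ag + alpha * theta          # (1)
        grad = (theta_md @ x) * x - y * x                            # per-sample gradient
        xi_sum += (y - theta_md @ x) * x                             # residual proxy
        xi_bar = xi_sum / k
        theta = theta - lam * grad                                   # (2)
        theta_ag = theta_md - beta * (grad + xi_bar)                 # (3)
    return theta_ag
```

The averaged output $\theta_n^{ag}$ is the quantity analyzed in Theorem 1 below.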

To establish the convergence rate of the accelerated gradient algorithm, we need the following Lemma (see Lemma 1 of [7]).

Lemma 1

Let $\alpha_k$ be the stepsizes in the accelerated gradient algorithm, and suppose that the sequence $\{\eta_k\}$ satisfies

$\eta_k \le (1-\alpha_k)\eta_{k-1} + \tau_k, \quad k = 1, 2, \ldots,$

where

(4) $\Gamma_k = \begin{cases} 1, & k = 1, \\ (1-\alpha_k)\Gamma_{k-1}, & k \ge 2. \end{cases}$

Then we have $\eta_k \le \Gamma_k \sum_{i=1}^k \frac{\tau_i}{\Gamma_i}$ for any $k \ge 1$.
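As a quick numerical sanity check (ours, not part of the paper), the recursion of Lemma 1 can be run with random stepsizes and compared against the stated bound; when the recursion holds with equality, the bound is attained.

```python
import numpy as np

# Numerical check of Lemma 1: run eta_k = (1 - alpha_k) eta_{k-1} + tau_k with
# random alpha_k in (0, 1] (alpha_1 = 1) and nonnegative tau_k, and compare
# with Gamma_k * sum_{i<=k} tau_i / Gamma_i, where Gamma_k follows (4).
rng = np.random.default_rng(0)
n = 50
alpha = np.concatenate(([1.0], rng.uniform(0.1, 0.9, n - 1)))
tau = rng.uniform(0.0, 1.0, n)

eta, gamma, bound_sum = 0.0, 1.0, 0.0
for k in range(n):
    eta = (1.0 - alpha[k]) * eta + tau[k]
    gamma = 1.0 if k == 0 else (1.0 - alpha[k]) * gamma   # definition (4)
    bound_sum += tau[k] / gamma
    assert eta <= gamma * bound_sum + 1e-8                # conclusion of Lemma 1
print("Lemma 1 bound holds on this random instance.")
```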

We now establish the convergence rate of the developed algorithm. The goal is to bound the expectation $E[f(\theta_n^{ag}) - f(\theta^*)]$. Theorem 1 describes the convergence property of the accelerated gradient algorithm for least-square regression.

Theorem 1

Let $\{\theta_k^{md}, \theta_k^{ag}\}$ be computed by the accelerated gradient algorithm and let $\Gamma_k$ be defined in (4). Assume that (a)–(d) hold. If $\{\alpha_k\}$, $\{\beta_k\}$, and $\{\lambda_k\}$ are chosen such that

$\alpha_k\lambda_k \le \beta_k \le \frac{1}{2M}, \qquad \frac{\alpha_1\lambda_1}{\Gamma_1} \ge \frac{\alpha_2\lambda_2}{\Gamma_2} \ge \cdots,$

then for any $n \ge 1$, we have

$E[f(\theta_n^{ag}) - f(\theta^*)] \le \frac{\Gamma_n}{2\lambda_1}\|\theta_0 - \theta^*\|^2 + M\sigma^2\Gamma_n \sum_{k=1}^n \frac{\beta_k^2}{k\Gamma_k}.$

Proof

By the Taylor expansion of the function $f$ and (3), we have

$f(\theta_k^{ag}) \le f(\theta_k^{md}) + \langle f'(\theta_k^{md}), \theta_k^{ag}-\theta_k^{md}\rangle + (\theta_k^{ag}-\theta_k^{md})^T \nabla^2 f(\theta_k^{md})(\theta_k^{ag}-\theta_k^{md}) \le f(\theta_k^{md}) - \beta_k\|f'(\theta_k^{md})\|^2 - \beta_k\langle f'(\theta_k^{md}), \bar{\xi}_k\rangle + \beta_k^2\, E\|x_k\|^2\, \|f'(\theta_k^{md}) + \bar{\xi}_k\|^2 \le f(\theta_k^{md}) - \beta_k\|f'(\theta_k^{md})\|^2 - \beta_k\langle f'(\theta_k^{md}), \bar{\xi}_k\rangle + \beta_k^2 M\, \|f'(\theta_k^{md}) + \bar{\xi}_k\|^2,$

where the last inequality follows from the assumption (c).

Since

$f(\mu) - f(\nu) = \langle f'(\nu), \mu-\nu\rangle + \frac{1}{2}(\mu-\nu)^T E(x_k x_k^T)(\mu-\nu),$

we have

(5) $f(\nu) - f(\mu) = \langle f'(\nu), \nu-\mu\rangle - \frac{1}{2}(\mu-\nu)^T E(x_k x_k^T)(\mu-\nu) \le \langle f'(\nu), \nu-\mu\rangle,$

where the inequality follows from the positive semidefiniteness of the matrix $E(x_k x_k^T)$.

By (1) and (5), we have

$f(\theta_k^{md}) - [(1-\alpha_k) f(\theta_{k-1}^{ag}) + \alpha_k f(\theta)] = \alpha_k[f(\theta_k^{md}) - f(\theta)] + (1-\alpha_k)[f(\theta_k^{md}) - f(\theta_{k-1}^{ag})] \le \alpha_k\langle f'(\theta_k^{md}), \theta_k^{md}-\theta\rangle + (1-\alpha_k)\langle f'(\theta_k^{md}), \theta_k^{md}-\theta_{k-1}^{ag}\rangle = \langle f'(\theta_k^{md}), \alpha_k(\theta_k^{md}-\theta) + (1-\alpha_k)(\theta_k^{md}-\theta_{k-1}^{ag})\rangle = \alpha_k\langle f'(\theta_k^{md}), \theta_{k-1}-\theta\rangle.$

So we obtain

$f(\theta_k^{ag}) \le (1-\alpha_k) f(\theta_{k-1}^{ag}) + \alpha_k f(\theta) + \alpha_k\langle f'(\theta_k^{md}), \theta_{k-1}-\theta\rangle - \beta_k\|f'(\theta_k^{md})\|^2 - \beta_k\langle f'(\theta_k^{md}), \bar{\xi}_k\rangle + \beta_k^2 M\|f'(\theta_k^{md}) + \bar{\xi}_k\|^2.$

It follows from (2) that

$\|\theta_k - \theta\|^2 = \|\theta_{k-1} - \lambda_k f'(\theta_k^{md}) - \theta\|^2 = \|\theta_{k-1}-\theta\|^2 - 2\lambda_k\langle f'(\theta_k^{md}), \theta_{k-1}-\theta\rangle + \lambda_k^2\|f'(\theta_k^{md})\|^2.$

Then, we have

(6) $\langle f'(\theta_k^{md}), \theta_{k-1}-\theta\rangle = \frac{1}{2\lambda_k}[\|\theta_{k-1}-\theta\|^2 - \|\theta_k-\theta\|^2] + \frac{\lambda_k}{2}\|f'(\theta_k^{md})\|^2,$

and meanwhile,

(7) $\|f'(\theta_k^{md}) + \bar{\xi}_k\|^2 = \|f'(\theta_k^{md})\|^2 + \|\bar{\xi}_k\|^2 + 2\langle f'(\theta_k^{md}), \bar{\xi}_k\rangle.$

Combining the aforementioned two equalities (6) and (7), we obtain

$f(\theta_k^{ag}) \le (1-\alpha_k) f(\theta_{k-1}^{ag}) + \alpha_k f(\theta) + \frac{\alpha_k}{2\lambda_k}[\|\theta_{k-1}-\theta\|^2 - \|\theta_k-\theta\|^2] - \beta_k\Big(1 - \frac{\lambda_k\alpha_k}{2\beta_k} - \beta_k M\Big)\|f'(\theta_k^{md})\|^2 + M\beta_k^2\|\bar{\xi}_k\|^2 + \langle\bar{\xi}_k, (2\beta_k^2 M - \beta_k) f'(\theta_k^{md})\rangle.$

The aforementioned inequality is equivalent to

$f(\theta_k^{ag}) - f(\theta) \le (1-\alpha_k)[f(\theta_{k-1}^{ag}) - f(\theta)] + \frac{\alpha_k}{2\lambda_k}[\|\theta_{k-1}-\theta\|^2 - \|\theta_k-\theta\|^2] - \beta_k\Big(1 - \frac{\lambda_k\alpha_k}{2\beta_k} - \beta_k M\Big)\|f'(\theta_k^{md})\|^2 + M\beta_k^2\|\bar{\xi}_k\|^2 + \langle\bar{\xi}_k, (2\beta_k^2 M - \beta_k) f'(\theta_k^{md})\rangle.$

By using Lemma 1, we have

$f(\theta_n^{ag}) - f(\theta) \le \Gamma_n\sum_{k=1}^n\frac{\alpha_k}{2\lambda_k\Gamma_k}[\|\theta_{k-1}-\theta\|^2 - \|\theta_k-\theta\|^2] - \Gamma_n\sum_{k=1}^n\frac{\beta_k}{\Gamma_k}\Big(1 - \frac{\lambda_k\alpha_k}{2\beta_k} - \beta_k M\Big)\|f'(\theta_k^{md})\|^2 + \Gamma_n\sum_{k=1}^n\frac{\beta_k^2 M}{\Gamma_k}\|\bar{\xi}_k\|^2 + \Gamma_n\sum_{k=1}^n\frac{1}{\Gamma_k}\langle\bar{\xi}_k, (2\beta_k^2 M - \beta_k) f'(\theta_k^{md})\rangle.$

Since

$\frac{\alpha_1\lambda_1}{\Gamma_1} \ge \frac{\alpha_2\lambda_2}{\Gamma_2} \ge \cdots, \qquad \alpha_1 = \Gamma_1 = 1,$

then

$\sum_{k=1}^n\frac{\alpha_k}{2\lambda_k\Gamma_k}[\|\theta_{k-1}-\theta\|^2 - \|\theta_k-\theta\|^2] \le \frac{\alpha_1}{2\lambda_1\Gamma_1}\|\theta_0-\theta\|^2 = \frac{1}{2\lambda_1}\|\theta_0-\theta\|^2.$

So we obtain

(8) $f(\theta_n^{ag}) - f(\theta) \le \frac{\Gamma_n}{2\lambda_1}\|\theta_0-\theta\|^2 + \Gamma_n\sum_{k=1}^n\frac{\beta_k^2 M}{\Gamma_k}\|\bar{\xi}_k\|^2 + \Gamma_n\sum_{k=1}^n\frac{1}{\Gamma_k}\langle\bar{\xi}_k, (2\beta_k^2 M - \beta_k) f'(\theta_k^{md})\rangle,$

where the inequality follows from the assumption

$\alpha_k\lambda_k \le \beta_k \le \frac{1}{2M}.$

Under assumption (d), we have

$E\bar{\xi}_k = \frac{1}{k}\sum_{i=1}^k E\xi_i = 0, \qquad E\|\bar{\xi}_k\|^2 = E\Big\|\frac{1}{k}\sum_{i=1}^k\xi_i\Big\|^2 \le \frac{\sigma^2}{k}.$

Taking expectation on both sides of inequality (8) with respect to $(x_i, y_i)$, we obtain, for any $\theta \in \mathbb{R}^d$,

$E[f(\theta_n^{ag}) - f(\theta)] \le \frac{\Gamma_n}{2\lambda_1}\|\theta_0-\theta\|^2 + M\sigma^2\Gamma_n\sum_{k=1}^n\frac{\beta_k^2}{k\Gamma_k}.$

Now, fixing $\theta = \theta^*$, we have

$E[f(\theta_n^{ag}) - f(\theta^*)] \le \frac{\Gamma_n}{2\lambda_1}\|\theta_0-\theta^*\|^2 + M\sigma^2\Gamma_n\sum_{k=1}^n\frac{\beta_k^2}{k\Gamma_k}.$

This finishes the proof of Theorem 1.□

In the following, we apply the result of Theorem 1 to some particular selections of $\{\alpha_k\}$, $\{\beta_k\}$, and $\{\lambda_k\}$, and obtain the following Corollary 1.

Corollary 1

Suppose that $\alpha_k$, $\beta_k$, and $\lambda_k$ in the accelerated gradient algorithm for regression learning are set to

(9) $\alpha_k = \frac{1}{k+1}, \qquad \beta_k = \frac{1}{M(k+1)}, \qquad \lambda_k = \frac{1}{2M}, \quad k \ge 1;$

then for any $n \ge 1$, we have

$E[f(\theta_n^{ag}) - f(\theta^*)] \le \frac{M^2\|\theta_0-\theta^*\|^2 + \sigma^2}{M(n+1)}.$

Proof

In view of (4) and (9), we have, for $k \ge 2$,

$\Gamma_k = (1-\alpha_k)\Gamma_{k-1} = \frac{k}{k+1}\times\frac{k-1}{k}\times\frac{k-2}{k-1}\times\cdots\times\frac{2}{3}\times\Gamma_1 = \frac{2}{k+1}.$

It is easy to verify

$\alpha_k\lambda_k = \frac{1}{2M(k+1)} \le \beta_k = \frac{1}{M(k+1)} \le \frac{1}{2M}, \qquad \frac{\alpha_1\lambda_1}{\Gamma_1} = \frac{\alpha_2\lambda_2}{\Gamma_2} = \cdots = \frac{1}{4M}.$

Then, we obtain

$M\Gamma_n\sigma^2\sum_{k=1}^n\frac{\beta_k^2}{k\Gamma_k} = \frac{2\sigma^2}{n+1}\sum_{k=1}^n\frac{M}{M^2(k+1)^2}\cdot\frac{k+1}{2k} = \frac{\sigma^2}{M(n+1)}\sum_{k=1}^n\frac{1}{k(k+1)} = \frac{\sigma^2}{M(n+1)}\Big(1 - \frac{1}{2} + \frac{1}{2} - \frac{1}{3} + \cdots + \frac{1}{n} - \frac{1}{n+1}\Big) \le \frac{\sigma^2}{M(n+1)}.$

From the result of Theorem 1, we have

$E[f(\theta_n^{ag}) - f(\theta^*)] \le \frac{M}{n+1}\|\theta_0-\theta^*\|^2 + \frac{\sigma^2}{M(n+1)} = \frac{M^2\|\theta_0-\theta^*\|^2 + \sigma^2}{M(n+1)}.$

The proof of Corollary 1 is completed.□
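The parameter choices (9) can be checked against the hypotheses of Theorem 1 in exact rational arithmetic. The script below (ours; the value of $M$ is hypothetical) verifies $\Gamma_k = 2/(k+1)$, the two stepsize conditions, and the bound on the noise term used in the proof.

```python
from fractions import Fraction

M = Fraction(3)   # any positive bound on E||x_k||^2; hypothetical value
n = 200
gamma = Fraction(1)
series = Fraction(0)
for k in range(1, n + 1):
    alpha = Fraction(1, k + 1)
    beta = 1 / (M * (k + 1))
    lam = 1 / (2 * M)
    gamma = Fraction(1) if k == 1 else (1 - alpha) * gamma   # definition (4)
    assert gamma == Fraction(2, k + 1)                       # Gamma_k = 2/(k+1)
    assert alpha * lam <= beta <= 1 / (2 * M)                # first condition of Theorem 1
    assert alpha * lam / gamma == 1 / (4 * M)                # constant, equal to 1/(4M)
    series += beta ** 2 / (k * gamma)
# gamma now equals Gamma_n; the noise term of Theorem 1 (without sigma^2)
assert M * gamma * series <= 1 / (M * (n + 1))
print("Corollary 1 parameter conditions verified up to n =", n)
```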

Corollary 1 shows that the developed algorithm achieves a convergence rate of $O(1/n)$ without strong convexity or Lipschitz-continuous-gradient assumptions.

3 The stochastic accelerated gradient algorithm for logistic regression

In this section, we consider the convergence property of the accelerated gradient algorithm for logistic regression.

We make the following assumptions (B1)–(B4):

  (B1) The underlying space is a $d$-dimensional Euclidean space, with $d \ge 1$.

  (B2) The observations $(x_i, y_i) \in \mathbb{R}^d \times \{-1, 1\}$ are independent and identically distributed.

  (B3) $E\|x_i\|^2$ is finite, i.e., $E\|x_i\|^2 \le M$ for any $i \ge 1$.

  (B4) We consider $l(\theta) = E[\log(1 + \exp(-y_i\langle x_i, \theta\rangle))]$. We denote by $\theta^* \in \mathbb{R}^d$ a global minimizer of $l$, which we assume to exist. Let $\xi_i = (y_i - \langle\theta^*, x_i\rangle)x_i$ denote the residual. For any $i \ge 1$, we have $E\xi_i = 0$. We also assume that $E\|\xi_i\|^2 \le \sigma^2$ for every $i$, and we write $\bar{\xi}_k = \frac{1}{k}\sum_{i=1}^k\xi_i$.

Let $\theta_0 \in \mathbb{R}^d$ be given, and let $\{\alpha_k\}$ satisfy $\alpha_1 = 1$ and $\alpha_k > 0$ for any $k \ge 2$, $\beta_k > 0$, and $\lambda_k > 0$. The algorithm proceeds as follows.

  (i) Set the initial $\theta_0^{ag} = \theta_0$ and

    (10) $\theta_k^{md} = (1-\alpha_k)\theta_{k-1}^{ag} + \alpha_k\theta_{k-1}$.

  (ii) Set

    (11) $\theta_k = \theta_{k-1} - \lambda_k l'(\theta_k^{md}) = \theta_{k-1} - \lambda_k\, E\!\left[\frac{-y_k\exp\{-y_k\langle x_k, \theta_k^{md}\rangle\}}{1+\exp\{-y_k\langle x_k, \theta_k^{md}\rangle\}}\, x_k\right]$,

    (12) $\theta_k^{ag} = \theta_k^{md} - \beta_k\big(l'(\theta_k^{md}) + \bar{\xi}_k\big) = \theta_k^{md} - \beta_k\left(E\!\left[\frac{-y_k\exp\{-y_k\langle x_k, \theta_k^{md}\rangle\}}{1+\exp\{-y_k\langle x_k, \theta_k^{md}\rangle\}}\, x_k\right] + \bar{\xi}_k\right)$.

  (iii) Set $k \leftarrow k+1$ and go to step (i). A sketch paralleling the least-squares case appears below.
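A Python sketch analogous to the least-squares case can be written for (10)–(12); as before, the per-sample gradient and the residual proxy are assumptions of this illustration, not of the paper, and the stepsizes follow Corollary 2 below.

```python
import numpy as np

def accelerated_sa_logistic(stream, theta0, M, n):
    """Sketch of recursion (10)-(12) for logistic regression (labels in {-1, +1}).

    Assumptions of this sketch: the population gradient l'(theta) is replaced
    by the per-sample gradient -y_k x_k * sigmoid(-y_k <x_k, theta>), and the
    residual average uses the current iterate instead of the unknown theta*.
    Stepsizes follow Corollary 2: alpha_k = 1/(k+1), beta_k = 1/(M(k+1)),
    lambda_k = 1/(2M).
    """
    theta = np.array(theta0, dtype=float)
    theta_ag = np.array(theta0, dtype=float)
    xi_sum = np.zeros_like(theta)
    for k in range(1, n + 1):
        x, y = next(stream)
        alpha, beta, lam = 1.0 / (k + 1), 1.0 / (M * (k + 1)), 1.0 / (2.0 * M)
        theta_md = (1.0 - alpha) * theta_ag + alpha * theta              # (10)
        margin = y * (x @ theta_md)
        grad = -y * x * 0.5 * (1.0 - np.tanh(0.5 * margin))              # stable sigmoid(-margin)
        xi_sum += (y - theta_md @ x) * x                                 # residual proxy
        xi_bar = xi_sum / k
        theta = theta - lam * grad                                       # (11)
        theta_ag = theta_md - beta * (grad + xi_bar)                     # (12)
    return theta_ag
```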

Theorem 2 describes the convergence property of the accelerated gradient algorithm for logistic regression.

Theorem 2

Let $\{\theta_k^{md}, \theta_k^{ag}\}$ be computed by the accelerated gradient algorithm and let $\Gamma_k$ be defined in (4). Assume that (B1)–(B4) hold. If $\{\alpha_k\}$, $\{\beta_k\}$, and $\{\lambda_k\}$ are chosen such that

$\alpha_k\lambda_k \le \beta_k \le \frac{1}{2M},$

$\frac{\alpha_1\lambda_1}{\Gamma_1} \ge \frac{\alpha_2\lambda_2}{\Gamma_2} \ge \cdots,$

then for any $n \ge 1$, we have

$E[l(\theta_n^{ag}) - l(\theta^*)] \le \frac{\Gamma_n}{2\lambda_1}\|\theta_0-\theta^*\|^2 + M\sigma^2\Gamma_n\sum_{k=1}^n\frac{\beta_k^2}{k\Gamma_k}.$

Proof

By the Taylor expansion of the function $l$, there exists a $\vartheta$ such that

(13) $l(\theta_k^{ag}) = l(\theta_k^{md}) + \langle l'(\theta_k^{md}), \theta_k^{ag}-\theta_k^{md}\rangle + (\theta_k^{ag}-\theta_k^{md})^T\nabla^2 l(\vartheta)(\theta_k^{ag}-\theta_k^{md}) = l(\theta_k^{md}) - \beta_k\|l'(\theta_k^{md})\|^2 - \beta_k\langle l'(\theta_k^{md}), \bar{\xi}_k\rangle + (\theta_k^{ag}-\theta_k^{md})^T E\!\left[\frac{\exp\{-y_k\langle x_k,\vartheta\rangle\}\, x_k x_k^T}{1+\exp\{-y_k\langle x_k,\vartheta\rangle\}}\right](\theta_k^{ag}-\theta_k^{md}).$

It is easy to verify that the matrix

$E\!\left[\frac{\exp\{-y_k\langle x_k,\vartheta\rangle\}\, x_k x_k^T}{1+\exp\{-y_k\langle x_k,\vartheta\rangle\}}\right]$

is positive semidefinite and that its largest eigenvalue satisfies

$\lambda_{\max}\!\left(E\!\left[\frac{\exp\{-y_k\langle x_k,\vartheta\rangle\}\, x_k x_k^T}{1+\exp\{-y_k\langle x_k,\vartheta\rangle\}}\right]\right) \le E\|x_k\|^2 \le M.$

Combining (12) and (13), we have

$l(\theta_k^{ag}) \le l(\theta_k^{md}) - \beta_k\|l'(\theta_k^{md})\|^2 - \beta_k\langle l'(\theta_k^{md}), \bar{\xi}_k\rangle + \beta_k^2 M\|l'(\theta_k^{md}) + \bar{\xi}_k\|^2.$

Similar to (13), there exists a $\zeta \in \mathbb{R}^d$ satisfying

$l(\mu) - l(\nu) = \langle l'(\nu), \mu-\nu\rangle + (\mu-\nu)^T E\!\left[\frac{\exp\{-y_k\langle x_k,\zeta\rangle\}\, x_k x_k^T}{1+\exp\{-y_k\langle x_k,\zeta\rangle\}}\right](\mu-\nu), \quad \forall\, \mu, \nu \in \mathbb{R}^d,$

and we have

$l(\nu) - l(\mu) = \langle l'(\nu), \nu-\mu\rangle - (\mu-\nu)^T E\!\left[\frac{\exp\{-y_k\langle x_k,\zeta\rangle\}\, x_k x_k^T}{1+\exp\{-y_k\langle x_k,\zeta\rangle\}}\right](\mu-\nu) \le \langle l'(\nu), \nu-\mu\rangle,$

where the inequality follows from the positive semidefiniteness of the matrix $E\!\left[\frac{\exp\{-y_k\langle x_k,\zeta\rangle\}\, x_k x_k^T}{1+\exp\{-y_k\langle x_k,\zeta\rangle\}}\right]$.

Similar to (5), we have

$l(\theta_k^{md}) - [(1-\alpha_k) l(\theta_{k-1}^{ag}) + \alpha_k l(\theta)] \le \alpha_k\langle l'(\theta_k^{md}), \theta_{k-1}-\theta\rangle.$

So we obtain

$l(\theta_k^{ag}) \le (1-\alpha_k) l(\theta_{k-1}^{ag}) + \alpha_k l(\theta) + \alpha_k\langle l'(\theta_k^{md}), \theta_{k-1}-\theta\rangle - \beta_k\|l'(\theta_k^{md})\|^2 - \beta_k\langle l'(\theta_k^{md}), \bar{\xi}_k\rangle + \beta_k^2 M\|l'(\theta_k^{md}) + \bar{\xi}_k\|^2.$

It follows from (11) that

$\|\theta_k - \theta\|^2 = \|\theta_{k-1} - \lambda_k l'(\theta_k^{md}) - \theta\|^2 = \|\theta_{k-1}-\theta\|^2 - 2\lambda_k\langle l'(\theta_k^{md}), \theta_{k-1}-\theta\rangle + \lambda_k^2\|l'(\theta_k^{md})\|^2.$

Then, we have

(14) $\langle l'(\theta_k^{md}), \theta_{k-1}-\theta\rangle = \frac{1}{2\lambda_k}[\|\theta_{k-1}-\theta\|^2 - \|\theta_k-\theta\|^2] + \frac{\lambda_k}{2}\|l'(\theta_k^{md})\|^2.$

Meanwhile,

(15) $\|l'(\theta_k^{md}) + \bar{\xi}_k\|^2 = \|l'(\theta_k^{md})\|^2 + \|\bar{\xi}_k\|^2 + 2\langle l'(\theta_k^{md}), \bar{\xi}_k\rangle.$

Combining the aforementioned two equalities (14) and (15), we obtain

$l(\theta_k^{ag}) \le (1-\alpha_k) l(\theta_{k-1}^{ag}) + \alpha_k l(\theta) + \frac{\alpha_k}{2\lambda_k}[\|\theta_{k-1}-\theta\|^2 - \|\theta_k-\theta\|^2] - \beta_k\Big(1 - \frac{\lambda_k\alpha_k}{2\beta_k} - \beta_k M\Big)\|l'(\theta_k^{md})\|^2 + M\beta_k^2\|\bar{\xi}_k\|^2 + \langle\bar{\xi}_k, (2\beta_k^2 M - \beta_k) l'(\theta_k^{md})\rangle.$

The aforementioned inequality is equivalent to

$l(\theta_k^{ag}) - l(\theta) \le (1-\alpha_k)[l(\theta_{k-1}^{ag}) - l(\theta)] + \frac{\alpha_k}{2\lambda_k}[\|\theta_{k-1}-\theta\|^2 - \|\theta_k-\theta\|^2] - \beta_k\Big(1 - \frac{\lambda_k\alpha_k}{2\beta_k} - \beta_k M\Big)\|l'(\theta_k^{md})\|^2 + M\beta_k^2\|\bar{\xi}_k\|^2 + \langle\bar{\xi}_k, (2\beta_k^2 M - \beta_k) l'(\theta_k^{md})\rangle.$

By using Lemma 1, we have

$l(\theta_n^{ag}) - l(\theta) \le \Gamma_n\sum_{k=1}^n\frac{\alpha_k}{2\lambda_k\Gamma_k}[\|\theta_{k-1}-\theta\|^2 - \|\theta_k-\theta\|^2] - \Gamma_n\sum_{k=1}^n\frac{\beta_k}{\Gamma_k}\Big(1 - \frac{\lambda_k\alpha_k}{2\beta_k} - \beta_k M\Big)\|l'(\theta_k^{md})\|^2 + \Gamma_n\sum_{k=1}^n\frac{\beta_k^2 M}{\Gamma_k}\|\bar{\xi}_k\|^2 + \Gamma_n\sum_{k=1}^n\frac{1}{\Gamma_k}\langle\bar{\xi}_k, (2\beta_k^2 M - \beta_k) l'(\theta_k^{md})\rangle.$

Since

α 1 λ 1 Γ 1 α 2 λ 2 Γ 2 , α 1 = Γ 1 = 1 ,

then

$\sum_{k=1}^n\frac{\alpha_k}{2\lambda_k\Gamma_k}[\|\theta_{k-1}-\theta\|^2 - \|\theta_k-\theta\|^2] \le \frac{\alpha_1}{2\lambda_1\Gamma_1}\|\theta_0-\theta\|^2 = \frac{1}{2\lambda_1}\|\theta_0-\theta\|^2.$

So we obtain

(16) $l(\theta_n^{ag}) - l(\theta) \le \frac{\Gamma_n}{2\lambda_1}\|\theta_0-\theta\|^2 + \Gamma_n\sum_{k=1}^n\frac{\beta_k^2 M}{\Gamma_k}\|\bar{\xi}_k\|^2 + \Gamma_n\sum_{k=1}^n\frac{1}{\Gamma_k}\langle\bar{\xi}_k, (2\beta_k^2 M - \beta_k) l'(\theta_k^{md})\rangle,$

where the inequality follows from the assumption

$\alpha_k\lambda_k \le \beta_k \le \frac{1}{2M}.$

Under assumption (B4), we have

$E\bar{\xi}_k = \frac{1}{k}\sum_{i=1}^k E\xi_i = 0, \qquad E\|\bar{\xi}_k\|^2 = E\Big\|\frac{1}{k}\sum_{i=1}^k\xi_i\Big\|^2 \le \frac{\sigma^2}{k}.$

Taking expectation on both sides of inequality (16) with respect to $(x_i, y_i)$, we obtain, for any $\theta \in \mathbb{R}^d$,

$E[l(\theta_n^{ag}) - l(\theta)] \le \frac{\Gamma_n}{2\lambda_1}\|\theta_0-\theta\|^2 + M\sigma^2\Gamma_n\sum_{k=1}^n\frac{\beta_k^2}{k\Gamma_k}.$

Now, fixing $\theta = \theta^*$, we have

$E[l(\theta_n^{ag}) - l(\theta^*)] \le \frac{\Gamma_n}{2\lambda_1}\|\theta_0-\theta^*\|^2 + M\sigma^2\Gamma_n\sum_{k=1}^n\frac{\beta_k^2}{k\Gamma_k}.$

This finishes the proof of Theorem 2.□

Similar to Corollary 1, we specialize the result of Theorem 2 to some particular selections of $\{\alpha_k\}$, $\{\beta_k\}$, and $\{\lambda_k\}$.

Corollary 2

Suppose that $\alpha_k$, $\beta_k$, and $\lambda_k$ in the accelerated gradient algorithm for logistic regression are set to

$\alpha_k = \frac{1}{k+1}, \qquad \beta_k = \frac{1}{M(k+1)}, \qquad \lambda_k = \frac{1}{2M}, \quad k \ge 1;$

then for any $n \ge 1$, we have

$E[l(\theta_n^{ag}) - l(\theta^*)] \le \frac{M^2\|\theta_0-\theta^*\|^2 + \sigma^2}{M(n+1)}.$

4 Comparisons with related work

In Sections 2 and 3, we have studied AC-SA type algorithms for the least-square regression and logistic regression problems, respectively. We have derived the upper bounds of the AC-SA learning algorithms by using the convexity of the objective function. In this section, we discuss how our results relate to other recent studies.

4.1 Comparison with convergence rate for stochastic optimization

Our convergence analysis of SA learning algorithms is based on a similar analysis for stochastic composite optimization by Ghadimi and Lan [8]. There are two differences between our work and theirs. First, our convergence analysis of SA algorithms holds for any iteration rather than for a prescribed iteration limit: the parameters $\beta_k$ and $\lambda_k$ in Corollary 3 of [8] depend on the iteration limit $N$, while we do not need this assumption. Second, the two error bounds differ: Ghadimi and Lan obtained a rate of $O(1/\sqrt{n})$ for stochastic composite optimization, while we obtain the rate of $O(1/n)$ for the regression problem.

Our developed accelerated stochastic gradient (AC-SA) algorithm for least-square regression is summarized in (1)–(3). The algorithm takes a stream of data $(x_k, y_k)$ as input, together with an initial guess $\theta_0$ of the parameter. The other requirements are $\{\alpha_k\}$, which satisfies $\alpha_1 = 1$ and $\alpha_k > 0$ for any $k \ge 2$, $\beta_k > 0$, and $\lambda_k > 0$. The algorithm involves two intermediate variables, $\theta_k^{ag}$ (initialized to $\theta_0$) and $\theta_k^{md}$. In (1), $\theta_k^{md}$ is updated as a linear combination of $\theta_{k-1}^{ag}$ and the current estimate $\theta_{k-1}$ of the parameter, with coefficient $\alpha_k$. The parameter $\theta_k$ is then updated in (2), taking $\lambda_k$ as a stepsize. The residual $\xi_k$ and the average $\bar{\xi}_k$ of the residuals up to the $k$th data point (i.e., $\bar{\xi}_k = \frac{1}{k}\sum_{i=1}^k\xi_i$) enter the update (3), in which $\theta_k^{ag}$ is computed from $\theta_k^{md}$ with stepsize $\beta_k$. The process continues whenever a new pair of data is seen.

The unbiased estimate of the gradient, i.e., $\langle\theta_k^{md}, x_k\rangle x_k - y_k x_k$ for each data point $(x_k, y_k)$, is used in (2). From this perspective, the update of $\theta_k$ is actually the same as in the SGD (also called least-mean-square) algorithm if we set $\alpha_k = 1$; a small illustration is given below. During training, the residual $\xi_k$ is computed; all residuals obtained so far are averaged, and the averaged residual enters the update of $\theta_k^{ag}$. This differs from the stochastic accelerated gradient algorithm in [22], where no residual is computed or used during training.
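The reduction mentioned above can be verified directly: with $\alpha_k = 1$, (1) gives $\theta_k^{md} = \theta_{k-1}$, so (2) with the per-sample gradient coincides with the classical LMS/SGD step. The snippet below (ours) checks this on random data.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_prev, theta_ag_prev = rng.normal(size=3), rng.normal(size=3)
x, y, lam = rng.normal(size=3), 1.0, 0.05

theta_md = (1 - 1.0) * theta_ag_prev + 1.0 * theta_prev        # (1) with alpha_k = 1
accel_step = theta_prev - lam * ((theta_md @ x) * x - y * x)   # update (2)
lms_step = theta_prev - lam * ((theta_prev @ x) * x - y * x)   # classical LMS/SGD step
assert np.allclose(accel_step, lms_step)
print("With alpha_k = 1, update (2) equals the LMS/SGD step.")
```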

4.2 Comparison with the work of Bach and Moulines

The work that is perhaps most closely related to ours is that of Bach and Moulines [10], who studied the SA problem in which a convex function is minimized given only unbiased estimates of its gradients at certain points, a framework that includes machine learning methods based on the minimization of the empirical risk. The sample setting considered by Bach and Moulines is similar to ours: the learner is given a sample set $\{(x_i, y_i)\}_{i=1}^n$, and the goal of the regression learning problem is to learn a linear function $\langle\theta, x\rangle$ that forecasts the outputs for other inputs in $X$ according to the random samples. Both our work and that of Bach and Moulines obtain the rate of $O(1/n)$ for the SA algorithm for least-square regression, without strong-convexity assumptions. To our knowledge, the convergence rate $O(1/n)$ is optimal for least-square regression and logistic regression.

Although uniform convergence bounds for regression learning algorithms have relied on assumptions on the input $x_k$ and the residual $\xi_k$, we have obtained the optimal upper bound $O(1/n)$ for stochastic learning algorithms, and the order of the upper bound is independent of the dimension of the input space. There are some important differences between our work and that of [10]. Bach and Moulines considered generalization properties of stochastic learning algorithms under the assumption that the covariance operator $E(x_k \otimes x_k)$ is invertible. However, some covariance operators may not be invertible, such as the covariance operator $E(x_k \otimes x_k)$ in $\mathbb{R}^2$ defined by

$E(x_k \otimes x_k) = \begin{pmatrix} E x_{k,1}^2 & E x_{k,1}x_{k,2} \\ E x_{k,1}x_{k,2} & E x_{k,2}^2 \end{pmatrix}.$

When the two random components $x_{k,1}$ and $x_{k,2}$ of $x_k$ satisfy $x_{k,1} = x_{k,2}$, the determinant of the covariance operator $E(x_k \otimes x_k)$ equals zero. In contrast, under assumptions (a)–(d) alone, the rate of our algorithm still reaches $O(1/n)$; a small numerical illustration is given below.
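For illustration (ours, not from the paper), the snippet below builds exactly such a degenerate input distribution with $x_{k,1} = x_{k,2}$ and confirms that the empirical covariance matrix is singular, so the invertibility assumption of [10] fails while assumption (c) still holds.

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.normal(size=100_000)
x = np.stack([z, z], axis=1)          # two identical components: x_{k,1} = x_{k,2}
cov = x.T @ x / len(x)                # empirical covariance E(x_k x_k^T)
print(np.linalg.det(cov))             # ~ 0 (singular, up to rounding)
print(np.linalg.matrix_rank(cov))     # rank 1, so the operator is not invertible
```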

5 Conclusion

In this article, we have considered two SA algorithms that achieve rates of $O(1/n)$ for least-square regression and logistic regression, respectively, without strong-convexity assumptions. We focus on problems without strong convexity, for which the well-known algorithms achieve a convergence rate for function values of $O(1/\sqrt{n})$. We consider and analyze an accelerated SA algorithm that achieves a rate of $O(1/n)$ for the classical least-square regression and logistic regression problems. Compared with the well-known results, we need fewer conditions to obtain the tight convergence rate for the least-square regression and logistic regression problems. For the accelerated SA algorithm, we provide a nonasymptotic analysis of the generalization error (in expectation) and examine our theoretical analysis experimentally.

  1. Funding information: The authors acknowledge the financial support from the National Natural Science Foundation of China (No. 61573326), the Project for Outstanding Young Talents in Colleges and Universities in Anhui Province (No. gxyq2018076), the Natural Science Research Project of Colleges and Universities in Anhui Province (No. KJ2021A1033), and the Scientific Research Project of Chaohu University (Nos. XLY-202103, XLY-202105, and XLZ-202202).

  2. Author contributions: All authors have accepted responsibility for the entire content of this manuscript and approved its submission.

  3. Conflict of interest: The authors state no conflict of interest.

References

[1] H. Robbins and S. Monro, A stochastic approximation method, The Annals of Mathematical Statistics 22 (1951), 400–407, https://doi.org/10.1007/978-1-4612-5110-1_9.

[2] B. T. Polyak, New stochastic approximation type procedures, Automat. i Telemekh. 7 (1990), 98–107.

[3] B. T. Polyak and A. B. Juditsky, Acceleration of stochastic approximation by averaging, SIAM J. Control Optim. 30 (1992), 838–855, https://doi.org/10.1137/0330046.

[4] L. Bottou and O. Bousquet, The tradeoffs of large scale learning, Adv. Neural Inform. Process. Syst. 20 (2007), 1–8.

[5] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter, Pegasos: Primal estimated sub-gradient solver for SVM, Math. Program. 127 (2011), 3–30, https://doi.org/10.1007/s10107-010-0420-4.

[6] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM J. Optim. 19 (2009), 1574–1609, https://doi.org/10.1137/070704277.

[7] G. H. Lan and R. D. C. Monteiro, Iteration-complexity of first-order penalty methods for convex programming, Math. Program. 138 (2013), 115–139, https://doi.org/10.1007/s10107-012-0588-x.

[8] S. Ghadimi and G. H. Lan, Accelerated gradient methods for nonconvex nonlinear and stochastic programming, Math. Program. 156 (2016), 59–99, https://doi.org/10.1007/s10107-015-0871-8.

[9] F. Bach and E. Moulines, Non-asymptotic analysis of stochastic approximation algorithms for machine learning, Advances in Neural Information Processing Systems (NIPS), 2011, pp. 451–459.

[10] F. Bach and E. Moulines, Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n), 2013, https://doi.org/10.48550/arXiv.1306.2119.

[11] J. C. Duchi, E. Hazan, and Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, J. Mach. Learn. Res. 12 (2011), 2121–2159.

[12] Y. E. Nesterov, A method of solving a convex programming problem with convergence rate O(1/k^2), Dokl. Akad. Nauk 269 (1983), 543–547.

[13] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci. 2 (2009), 183–202, https://doi.org/10.1137/080716542.

[14] P. Tseng and S. Yun, Incrementally updated gradient methods for constrained and regularized optimization, J. Optim. Theory Appl. 160 (2014), 832–853, https://doi.org/10.1007/s10957-013-0409-2.

[15] Y. Nesterov, Smooth minimization of non-smooth functions, Math. Program. 103 (2005), 127–152, https://doi.org/10.1007/s10107-004-0552-5.

[16] Y. Nesterov, Gradient methods for minimizing composite functions, Math. Program. 140 (2013), 125–161, https://doi.org/10.1007/s10107-012-0629-5.

[17] G. H. Lan, An optimal method for stochastic composite optimization, Math. Program. 133 (2012), 365–397, https://doi.org/10.1007/s10107-010-0434-y.

[18] S. Ghadimi and G. H. Lan, Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: A generic algorithmic framework, SIAM J. Optim. 22 (2012), 1469–1492, https://doi.org/10.1137/110848864.

[19] S. Ghadimi and G. H. Lan, Stochastic first- and zeroth-order methods for nonconvex stochastic programming, SIAM J. Optim. 23 (2013), 2341–2368, https://doi.org/10.1137/120880811.

[20] S. Ghadimi, G. H. Lan, and H. C. Zhang, Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization, Math. Program. 155 (2016), 267–305, https://doi.org/10.1007/s10107-014-0846-1.

[21] G. H. Lan, Bundle-level type methods uniformly optimal for smooth and nonsmooth convex optimization, Math. Program. 149 (2015), 1–45, https://doi.org/10.1007/s10107-013-0737-x.

[22] S. Ghadimi and G. H. Lan, Accelerated gradient methods for nonconvex nonlinear and stochastic programming, Math. Program. 156 (2016), 59–99, https://doi.org/10.1007/s10107-015-0871-8.

[23] L. Bottou, Large-scale machine learning with stochastic gradient descent, In: Proceedings of COMPSTAT'2010, Physica-Verlag HD, 2010, pp. 177–186, https://doi.org/10.1007/978-3-7908-2604-3_16.

[24] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, 2014, https://doi.org/10.48550/arXiv.1412.6980.

[25] Z. A. Zhu, Katyusha: The first direct acceleration of stochastic gradient methods, In: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (STOC 2017), Association for Computing Machinery, New York, USA, 2017, https://doi.org/10.1145/3055399.3055448.

[26] X. Cao, BFE and AdaBFE: A new approach in learning rate automation for stochastic optimization, 2022, https://doi.org/10.48550/arXiv.2207.02763.

Received: 2022-01-01
Revised: 2022-07-16
Accepted: 2022-09-11
Published Online: 2022-10-13

© 2022 Xingxing Zha et al., published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
