
On stochastic accelerated gradient with convergence rate

  • Xingxing Zha, Yongquan Zhang, and Yiyuan Cheng
Published/Copyright: October 13, 2022

Abstract

This article studies the regression learning problem from given sample data by using a stochastic approximation (SA) type algorithm, namely, the accelerated SA. We focus on problems without strong convexity, for which all well-known algorithms achieve a convergence rate for function values of $O(1/\sqrt{n})$. We consider and analyze an accelerated SA algorithm that achieves a rate of $O(1/n)$ for the classical least-square regression and logistic regression problems, respectively. Compared with the well-known results, we need fewer conditions to obtain the tight convergence rate for the least-square regression and logistic regression problems.

MSC 2010: 68Q19; 68Q25; 68Q30

1 Introduction

Large-scale machine learning problems are becoming ubiquitous in science, engineering, government, business, and almost all areas. Faced with huge amounts of data, investigators typically prefer algorithms that process each observation only once, or a few times. Stochastic approximation (SA) algorithms such as stochastic gradient descent (SGD), although introduced more than 60 years ago [1], are still widely used and studied in many contexts (see [2–26]).

To our knowledge, Robbins and Monro [1] first proposed the SA approach to the gradient descent method. Since then, SA algorithms have been widely used in stochastic optimization and machine learning. Polyak [2] and Polyak and Juditsky [3] developed an important improvement of the SA method by using longer stepsizes with consequent averaging of the obtained iterates. The mirror-descent SA was demonstrated by Nemirovski et al. [6], who showed that it exhibits an unimprovable expected rate for solving nonstrongly convex programming (CP) problems. Shalev-Shwartz et al. [5] and Nemirovski et al. [6] studied averaged SGD and achieved the rate of $O(1/(\mu n))$ in the strongly convex case, but only $O(1/\sqrt{n})$ in the nonstrongly convex case. Bach and Moulines [10] considered and analyzed SA algorithms that achieve a rate of $O(1/n)$ for least-square regression and logistic regression learning problems in the nonstrongly convex case; this convergence rate is almost optimal for each of the two problems. However, their analysis requires assumptions A1–A6. It is therefore natural to ask whether the $O(1/n)$ convergence rate for least-square regression can be obtained under fewer assumptions. In this article, we consider an accelerated SA type learning algorithm for solving the least-square regression and logistic regression problems and achieve a rate of $O(1/n)$ for least-square regression learning problems under assumptions A1–A4 of [10]. For solving a class of CP problems, Nesterov presented the accelerated gradient method in a celebrated work [12]. The accelerated gradient method has since been generalized by Beck and Teboulle [13], Tseng [14], and Nesterov [15,16] to solve an emerging class of composite CP problems. In 2012, Lan [17] further showed that the accelerated gradient method is optimal for solving not only smooth CP problems but also general nonsmooth and stochastic CP problems. The accelerated stochastic approximation (AC-SA) algorithm was proposed by Ghadimi and Lan [18,19] by properly modifying Nesterov's optimal method for smooth CP. They [20,21] also developed a generic AC-SA algorithmic framework, which can be specialized to yield optimal or nearly optimal methods for solving strongly convex stochastic composite optimization problems. Motivated by the works mentioned above, we aim to consider and analyze an accelerated SA algorithm that achieves a rate of $O(1/n)$ for classical least-square regression and logistic regression problems, respectively.

Zhu [25] introduced Katyusha, a direct, primal-only stochastic gradient method with a provably accelerated convergence rate in convex (offline) stochastic optimization; it can be incorporated into variance-reduction-based algorithms to speed them up, in terms of both sequential and parallel performance. A new gradient-based optimization approach that automatically adjusts the learning rate was proposed by Cao [26]. This approach can be applied to design both nonadaptive and adaptive learning rates, and it offers an alternative way to optimize the learning rate of the SGD algorithm besides the current nonadaptive learning rate methods (e.g., SGD, momentum, and Nesterov) and the adaptive learning rate methods (e.g., AdaGrad, AdaDelta, and Adam).

In this article, we consider minimizing a convex function $f$, defined on a closed convex set in a Euclidean space, given by $f(\theta) = \frac{1}{2}E[\ell(y, \langle\theta, x\rangle)]$, where $(x, y) \in X \times \mathbb{R}$ denotes the sample data and $\ell$ denotes a loss function that is convex with respect to its second variable. This class of loss functions includes least-square regression and logistic regression. In the SA framework, $z = \{z_i\}_{i=1}^n = \{(x_i, y_i)\}_{i=1}^n \in Z^n$ denotes a set of random samples, which are drawn independently according to the unknown probability measure $\rho$, and the predictor defined by $\theta$ is updated after each pair is seen.
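For concreteness, the two losses covered by this formulation are the least-squares loss and the logistic loss; the short Python snippet below (ours, with hypothetical function names) writes them out for a single sample $(x, y)$.

```python
import numpy as np

def least_squares_loss(theta, x, y):
    """Least-squares loss (1/2)(<theta, x> - y)^2 for a single pair (x, y)."""
    return 0.5 * (theta @ x - y) ** 2

def logistic_loss(theta, x, y):
    """Logistic loss log(1 + exp(-y <theta, x>)) for a label y in {-1, +1}."""
    return np.logaddexp(0.0, -y * (theta @ x))
```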

The rest of this article is organized as follows. In Section 2, we give a brief introduction to the accelerated gradient algorithm for least-square regression. In Section 3, we study the accelerated gradient algorithm for logistic regression. In Section 4, we compare our results with the known related work. Finally, we conclude this article with the obtained results.

2 The stochastic accelerated gradient algorithm for least-square regression

In this section, we consider the accelerated gradient algorithm for least-square regression. The novelty of this article is that our convergence result attains a nonasymptotic rate of $O(1/n)$. To give the convergence property of the stochastic accelerated gradient algorithm for the regression problem, we make the following assumptions (a)–(d):

  (a) The underlying space is a $d$-dimensional Euclidean space, with $d \ge 1$.

  (b) Let $(X, d)$ be a compact metric space and let $Y = \mathbb{R}$. Let $\rho$ be a probability distribution on $Z = X \times Y$, and let $(X, Y)$ be a corresponding random variable.

  (c) $E\|x_k\|^2$ is finite, i.e., $E\|x_k\|^2 \le M$ for any $k \ge 1$.

  (d) The global minimum of $f(\theta) = \frac{1}{2}E[\langle\theta, x_k\rangle^2 - 2y_k\langle\theta, x_k\rangle]$ is attained at a certain $\theta^* \in \mathbb{R}^d$. Let $\xi_k = (y_k - \langle\theta^*, x_k\rangle)x_k$ denote the residual. For any $k \ge 1$, we have $E\xi_k = 0$. We also assume that $E\|\xi_k\|^2 \le \sigma^2$ for every $k$, and we write $\bar{\xi}_k = \frac{1}{k}\sum_{i=1}^k \xi_i$.

Assumptions (a)–(d) are standard in SA (see, e.g., [9,10,22]). Compared with the work of Bach and Moulines [10], we do not need the condition that the covariance operator $E(x_k \otimes x_k)$ is invertible for any $k \ge 1$, nor the conditions that $E[\xi_k \otimes \xi_k] \preceq \sigma^2 E(x_k \otimes x_k)$ and $E(\|x_k\|^2\, x_k \otimes x_k) \preceq R^2 E(x_k \otimes x_k)$ for a positive number $R$.

Let $\theta_0 \in \mathbb{R}^d$ be given, and let $\{\alpha_k\}$ satisfy $\alpha_1 = 1$ and $\alpha_k > 0$ for any $k \ge 2$, $\beta_k > 0$, and $\lambda_k > 0$. The algorithm proceeds as follows.

  (i) Set the initial $\theta_0^{ag} = \theta_0$ and

    (1) $\theta_k^{md} = (1-\alpha_k)\theta_{k-1}^{ag} + \alpha_k\theta_{k-1}$.

  (ii) Set

    (2) $\theta_k = \theta_{k-1} - \lambda_k f'(\theta_k^{md}) = \theta_{k-1} - \lambda_k E[\langle\theta_k^{md}, x_k\rangle x_k - y_k x_k]$,

    (3) $\theta_k^{ag} = \theta_k^{md} - \beta_k\big(f'(\theta_k^{md}) + \bar{\xi}_k\big) = \theta_k^{md} - \beta_k\big(E[\langle\theta_k^{md}, x_k\rangle x_k - y_k x_k] + \bar{\xi}_k\big)$.

  (iii) Set $k \leftarrow k+1$ and go to step (i). A minimal sketch of this procedure appears below.
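The following Python sketch implements recursion (1)–(3) on a data stream. It is only an illustration of the update rule, under two assumptions that are ours rather than the paper's: the population gradient $E[\langle\theta, x_k\rangle x_k - y_k x_k]$ is replaced by the per-sample gradient, and the residual $\xi_k$, which involves the unknown $\theta^*$, is approximated with the current iterate; the stepsize schedule is taken from Corollary 1 below.

```python
import numpy as np

def accelerated_sa_least_squares(stream, theta0, M, n):
    """Sketch of recursion (1)-(3) for least-square regression.

    Assumptions of this sketch (not prescribed by the paper): the population
    gradient in (2)-(3) is replaced by the per-sample gradient
    <theta, x_k> x_k - y_k x_k, and the residual xi_k = (y_k - <theta*, x_k>) x_k
    is approximated with the current iterate theta_k^{md}, since theta* is
    unknown in practice.  Stepsizes follow Corollary 1:
    alpha_k = 1/(k+1), beta_k = 1/(M(k+1)), lambda_k = 1/(2M).
    """
    theta = np.array(theta0, dtype=float)      # theta_k
    theta_ag = np.array(theta0, dtype=float)   # theta_k^{ag}, initialized to theta_0
    xi_sum = np.zeros_like(theta)              # running sum of residual proxies
    for k in range(1, n + 1):
        x, y = next(stream)
        alpha, beta, lam = 1.0 / (k + 1), 1.0 / (M * (k + 1)), 1.0 / (2.0 * M)
        theta_md = (1.0 - alpha) * theta_ag + alpha * theta          # (1)
        grad = (theta_md @ x) * x - y * x                            # per-sample gradient
        xi_sum += (y - theta_md @ x) * x                             # residual proxy
        xi_bar = xi_sum / k
        theta = theta - lam * grad                                   # (2)
        theta_ag = theta_md - beta * (grad + xi_bar)                 # (3)
    return theta_ag
```

The averaged output $\theta_n^{ag}$ is the quantity analyzed in Theorem 1 below.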

To establish the convergence rate of the accelerated gradient algorithm, we need the following Lemma (see Lemma 1 of [7]).

Lemma 1

Let $\alpha_k$ be the stepsizes in the accelerated gradient algorithm, and suppose that the sequence $\{\eta_k\}$ satisfies

$\eta_k \le (1-\alpha_k)\eta_{k-1} + \tau_k, \quad k = 1, 2, \ldots,$

where

(4) $\Gamma_k = \begin{cases} 1, & k = 1, \\ (1-\alpha_k)\Gamma_{k-1}, & k \ge 2. \end{cases}$

Then we have $\eta_k \le \Gamma_k \sum_{i=1}^k \frac{\tau_i}{\Gamma_i}$ for any $k \ge 1$.
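As a quick numerical sanity check (ours, not part of the paper), the recursion of Lemma 1 can be run with random stepsizes and compared against the stated bound; when the recursion holds with equality, the bound is attained.

```python
import numpy as np

# Numerical check of Lemma 1: run eta_k = (1 - alpha_k) eta_{k-1} + tau_k with
# random alpha_k in (0, 1] (alpha_1 = 1) and nonnegative tau_k, and compare
# with Gamma_k * sum_{i<=k} tau_i / Gamma_i, where Gamma_k follows (4).
rng = np.random.default_rng(0)
n = 50
alpha = np.concatenate(([1.0], rng.uniform(0.1, 0.9, n - 1)))
tau = rng.uniform(0.0, 1.0, n)

eta, gamma, bound_sum = 0.0, 1.0, 0.0
for k in range(n):
    eta = (1.0 - alpha[k]) * eta + tau[k]
    gamma = 1.0 if k == 0 else (1.0 - alpha[k]) * gamma   # definition (4)
    bound_sum += tau[k] / gamma
    assert eta <= gamma * bound_sum + 1e-8                # conclusion of Lemma 1
print("Lemma 1 bound holds on this random instance.")
```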

We now establish the convergence rate of the developed algorithm. The goal is to bound the expectation $E[f(\theta_n^{ag}) - f(\theta^*)]$. Theorem 1 describes the convergence property of the accelerated gradient algorithm for least-square regression.

Theorem 1

Let $\{\theta_k^{md}, \theta_k^{ag}\}$ be computed by the accelerated gradient algorithm and let $\Gamma_k$ be defined in (4). Assume that (a)–(d) hold. If $\{\alpha_k\}$, $\{\beta_k\}$, and $\{\lambda_k\}$ are chosen such that

$\alpha_k\lambda_k \le \beta_k \le \frac{1}{2M}, \qquad \frac{\alpha_1\lambda_1}{\Gamma_1} \ge \frac{\alpha_2\lambda_2}{\Gamma_2} \ge \cdots,$

then for any $n \ge 1$, we have

$E[f(\theta_n^{ag}) - f(\theta^*)] \le \frac{\Gamma_n}{2\lambda_1}\|\theta_0 - \theta^*\|^2 + M\sigma^2\Gamma_n \sum_{k=1}^n \frac{\beta_k^2}{k\Gamma_k}.$

Proof

By the Taylor expansion of the function $f$ and (3), we have

$f(\theta_k^{ag}) \le f(\theta_k^{md}) + \langle f'(\theta_k^{md}), \theta_k^{ag}-\theta_k^{md}\rangle + (\theta_k^{ag}-\theta_k^{md})^T \nabla^2 f(\theta_k^{md})(\theta_k^{ag}-\theta_k^{md}) \le f(\theta_k^{md}) - \beta_k\|f'(\theta_k^{md})\|^2 - \beta_k\langle f'(\theta_k^{md}), \bar{\xi}_k\rangle + \beta_k^2\, E\|x_k\|^2\, \|f'(\theta_k^{md}) + \bar{\xi}_k\|^2 \le f(\theta_k^{md}) - \beta_k\|f'(\theta_k^{md})\|^2 - \beta_k\langle f'(\theta_k^{md}), \bar{\xi}_k\rangle + \beta_k^2 M\, \|f'(\theta_k^{md}) + \bar{\xi}_k\|^2,$

where the last inequality follows from the assumption (c).

Since

$f(\mu) - f(\nu) = \langle f'(\nu), \mu-\nu\rangle + \frac{1}{2}(\mu-\nu)^T E(x_k x_k^T)(\mu-\nu),$

we have

(5) $f(\nu) - f(\mu) = \langle f'(\nu), \nu-\mu\rangle - \frac{1}{2}(\mu-\nu)^T E(x_k x_k^T)(\mu-\nu) \le \langle f'(\nu), \nu-\mu\rangle,$

where the inequality follows from the positive semidefiniteness of the matrix $E(x_k x_k^T)$.

By (1) and (5), we have

$f(\theta_k^{md}) - [(1-\alpha_k) f(\theta_{k-1}^{ag}) + \alpha_k f(\theta)] = \alpha_k[f(\theta_k^{md}) - f(\theta)] + (1-\alpha_k)[f(\theta_k^{md}) - f(\theta_{k-1}^{ag})] \le \alpha_k\langle f'(\theta_k^{md}), \theta_k^{md}-\theta\rangle + (1-\alpha_k)\langle f'(\theta_k^{md}), \theta_k^{md}-\theta_{k-1}^{ag}\rangle = \langle f'(\theta_k^{md}), \alpha_k(\theta_k^{md}-\theta) + (1-\alpha_k)(\theta_k^{md}-\theta_{k-1}^{ag})\rangle = \alpha_k\langle f'(\theta_k^{md}), \theta_{k-1}-\theta\rangle.$

So we obtain

$f(\theta_k^{ag}) \le (1-\alpha_k) f(\theta_{k-1}^{ag}) + \alpha_k f(\theta) + \alpha_k\langle f'(\theta_k^{md}), \theta_{k-1}-\theta\rangle - \beta_k\|f'(\theta_k^{md})\|^2 - \beta_k\langle f'(\theta_k^{md}), \bar{\xi}_k\rangle + \beta_k^2 M\|f'(\theta_k^{md}) + \bar{\xi}_k\|^2.$

It follows from (2) that

$\|\theta_k - \theta\|^2 = \|\theta_{k-1} - \lambda_k f'(\theta_k^{md}) - \theta\|^2 = \|\theta_{k-1}-\theta\|^2 - 2\lambda_k\langle f'(\theta_k^{md}), \theta_{k-1}-\theta\rangle + \lambda_k^2\|f'(\theta_k^{md})\|^2.$

Then, we have

(6) $\langle f'(\theta_k^{md}), \theta_{k-1}-\theta\rangle = \frac{1}{2\lambda_k}[\|\theta_{k-1}-\theta\|^2 - \|\theta_k-\theta\|^2] + \frac{\lambda_k}{2}\|f'(\theta_k^{md})\|^2,$

and meanwhile,

(7) $\|f'(\theta_k^{md}) + \bar{\xi}_k\|^2 = \|f'(\theta_k^{md})\|^2 + \|\bar{\xi}_k\|^2 + 2\langle f'(\theta_k^{md}), \bar{\xi}_k\rangle.$

Combining the aforementioned two equalities (6) and (7), we obtain

$f(\theta_k^{ag}) \le (1-\alpha_k) f(\theta_{k-1}^{ag}) + \alpha_k f(\theta) + \frac{\alpha_k}{2\lambda_k}[\|\theta_{k-1}-\theta\|^2 - \|\theta_k-\theta\|^2] - \beta_k\Big(1 - \frac{\lambda_k\alpha_k}{2\beta_k} - \beta_k M\Big)\|f'(\theta_k^{md})\|^2 + M\beta_k^2\|\bar{\xi}_k\|^2 + \langle\bar{\xi}_k, (2\beta_k^2 M - \beta_k) f'(\theta_k^{md})\rangle.$

The aforementioned inequality is equivalent to

$f(\theta_k^{ag}) - f(\theta) \le (1-\alpha_k)[f(\theta_{k-1}^{ag}) - f(\theta)] + \frac{\alpha_k}{2\lambda_k}[\|\theta_{k-1}-\theta\|^2 - \|\theta_k-\theta\|^2] - \beta_k\Big(1 - \frac{\lambda_k\alpha_k}{2\beta_k} - \beta_k M\Big)\|f'(\theta_k^{md})\|^2 + M\beta_k^2\|\bar{\xi}_k\|^2 + \langle\bar{\xi}_k, (2\beta_k^2 M - \beta_k) f'(\theta_k^{md})\rangle.$

By using Lemma 1, we have

$f(\theta_n^{ag}) - f(\theta) \le \Gamma_n\sum_{k=1}^n\frac{\alpha_k}{2\lambda_k\Gamma_k}[\|\theta_{k-1}-\theta\|^2 - \|\theta_k-\theta\|^2] - \Gamma_n\sum_{k=1}^n\frac{\beta_k}{\Gamma_k}\Big(1 - \frac{\lambda_k\alpha_k}{2\beta_k} - \beta_k M\Big)\|f'(\theta_k^{md})\|^2 + \Gamma_n\sum_{k=1}^n\frac{\beta_k^2 M}{\Gamma_k}\|\bar{\xi}_k\|^2 + \Gamma_n\sum_{k=1}^n\frac{1}{\Gamma_k}\langle\bar{\xi}_k, (2\beta_k^2 M - \beta_k) f'(\theta_k^{md})\rangle.$

Since

$\frac{\alpha_1\lambda_1}{\Gamma_1} \ge \frac{\alpha_2\lambda_2}{\Gamma_2} \ge \cdots, \qquad \alpha_1 = \Gamma_1 = 1,$

then

$\sum_{k=1}^n\frac{\alpha_k}{2\lambda_k\Gamma_k}[\|\theta_{k-1}-\theta\|^2 - \|\theta_k-\theta\|^2] \le \frac{\alpha_1}{2\lambda_1\Gamma_1}\|\theta_0-\theta\|^2 = \frac{1}{2\lambda_1}\|\theta_0-\theta\|^2.$

So we obtain

(8) $f(\theta_n^{ag}) - f(\theta) \le \frac{\Gamma_n}{2\lambda_1}\|\theta_0-\theta\|^2 + \Gamma_n\sum_{k=1}^n\frac{\beta_k^2 M}{\Gamma_k}\|\bar{\xi}_k\|^2 + \Gamma_n\sum_{k=1}^n\frac{1}{\Gamma_k}\langle\bar{\xi}_k, (2\beta_k^2 M - \beta_k) f'(\theta_k^{md})\rangle,$

where the inequality follows from the assumption

$\alpha_k\lambda_k \le \beta_k \le \frac{1}{2M}.$

Under assumption (d), we have

$E\bar{\xi}_k = \frac{1}{k}\sum_{i=1}^k E\xi_i = 0, \qquad E\|\bar{\xi}_k\|^2 = E\Big\|\frac{1}{k}\sum_{i=1}^k\xi_i\Big\|^2 \le \frac{\sigma^2}{k}.$

Taking expectation on both sides of inequality (8) with respect to $(x_i, y_i)$, we obtain, for any $\theta \in \mathbb{R}^d$,

$E[f(\theta_n^{ag}) - f(\theta)] \le \frac{\Gamma_n}{2\lambda_1}\|\theta_0-\theta\|^2 + M\sigma^2\Gamma_n\sum_{k=1}^n\frac{\beta_k^2}{k\Gamma_k}.$

Now, fixing $\theta = \theta^*$, we have

$E[f(\theta_n^{ag}) - f(\theta^*)] \le \frac{\Gamma_n}{2\lambda_1}\|\theta_0-\theta^*\|^2 + M\sigma^2\Gamma_n\sum_{k=1}^n\frac{\beta_k^2}{k\Gamma_k}.$

This finishes the proof of Theorem 1.□

In the following, we apply the result of Theorem 1 to some particular selections of $\{\alpha_k\}$, $\{\beta_k\}$, and $\{\lambda_k\}$, and obtain the following Corollary 1.

Corollary 1

Suppose that $\alpha_k$, $\beta_k$, and $\lambda_k$ in the accelerated gradient algorithm for regression learning are set to

(9) $\alpha_k = \frac{1}{k+1}, \qquad \beta_k = \frac{1}{M(k+1)}, \qquad \lambda_k = \frac{1}{2M}, \quad k \ge 1;$

then for any $n \ge 1$, we have

$E[f(\theta_n^{ag}) - f(\theta^*)] \le \frac{M^2\|\theta_0-\theta^*\|^2 + \sigma^2}{M(n+1)}.$

Proof

In view of (4) and (9), we have, for $k \ge 2$,

$\Gamma_k = (1-\alpha_k)\Gamma_{k-1} = \frac{k}{k+1}\times\frac{k-1}{k}\times\frac{k-2}{k-1}\times\cdots\times\frac{2}{3}\times\Gamma_1 = \frac{2}{k+1}.$

It is easy to verify

$\alpha_k\lambda_k = \frac{1}{2M(k+1)} \le \beta_k = \frac{1}{M(k+1)} \le \frac{1}{2M}, \qquad \frac{\alpha_1\lambda_1}{\Gamma_1} = \frac{\alpha_2\lambda_2}{\Gamma_2} = \cdots = \frac{1}{4M}.$

Then, we obtain

$M\Gamma_n\sigma^2\sum_{k=1}^n\frac{\beta_k^2}{k\Gamma_k} = \frac{2\sigma^2}{n+1}\sum_{k=1}^n\frac{M}{M^2(k+1)^2}\cdot\frac{k+1}{2k} = \frac{\sigma^2}{M(n+1)}\sum_{k=1}^n\frac{1}{k(k+1)} = \frac{\sigma^2}{M(n+1)}\Big(1 - \frac{1}{2} + \frac{1}{2} - \frac{1}{3} + \cdots + \frac{1}{n} - \frac{1}{n+1}\Big) \le \frac{\sigma^2}{M(n+1)}.$

From the result of Theorem 1, we have

$E[f(\theta_n^{ag}) - f(\theta^*)] \le \frac{M}{n+1}\|\theta_0-\theta^*\|^2 + \frac{\sigma^2}{M(n+1)} = \frac{M^2\|\theta_0-\theta^*\|^2 + \sigma^2}{M(n+1)}.$

The proof of Corollary 1 is completed.□
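The parameter choices (9) can be checked against the hypotheses of Theorem 1 in exact rational arithmetic. The script below (ours; the value of $M$ is hypothetical) verifies $\Gamma_k = 2/(k+1)$, the two stepsize conditions, and the bound on the noise term used in the proof.

```python
from fractions import Fraction

M = Fraction(3)   # any positive bound on E||x_k||^2; hypothetical value
n = 200
gamma = Fraction(1)
series = Fraction(0)
for k in range(1, n + 1):
    alpha = Fraction(1, k + 1)
    beta = 1 / (M * (k + 1))
    lam = 1 / (2 * M)
    gamma = Fraction(1) if k == 1 else (1 - alpha) * gamma   # definition (4)
    assert gamma == Fraction(2, k + 1)                       # Gamma_k = 2/(k+1)
    assert alpha * lam <= beta <= 1 / (2 * M)                # first condition of Theorem 1
    assert alpha * lam / gamma == 1 / (4 * M)                # constant, equal to 1/(4M)
    series += beta ** 2 / (k * gamma)
# gamma now equals Gamma_n; the noise term of Theorem 1 (without sigma^2)
assert M * gamma * series <= 1 / (M * (n + 1))
print("Corollary 1 parameter conditions verified up to n =", n)
```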

Corollary 1 shows that the developed algorithm achieves a convergence rate of $O(1/n)$ without strong convexity or Lipschitz-continuous-gradient assumptions.

3 The stochastic accelerated gradient algorithm for logistic regression

In this section, we consider the convergence property of the accelerated gradient algorithm for logistic regression.

We make the following assumptions (B1)–(B4):

  (B1) The underlying space is a $d$-dimensional Euclidean space, with $d \ge 1$.

  (B2) The observations $(x_i, y_i) \in \mathbb{R}^d \times \{-1, 1\}$ are independent and identically distributed.

  (B3) $E\|x_i\|^2$ is finite, i.e., $E\|x_i\|^2 \le M$ for any $i \ge 1$.

  (B4) We consider $l(\theta) = E[\log(1 + \exp(-y_i\langle x_i, \theta\rangle))]$. We denote by $\theta^* \in \mathbb{R}^d$ a global minimizer of $l$, which we assume to exist. Let $\xi_i = (y_i - \langle\theta^*, x_i\rangle)x_i$ denote the residual. For any $i \ge 1$, we have $E\xi_i = 0$. We also assume that $E\|\xi_i\|^2 \le \sigma^2$ for every $i$, and we write $\bar{\xi}_k = \frac{1}{k}\sum_{i=1}^k\xi_i$.

Let $\theta_0 \in \mathbb{R}^d$ be given, and let $\{\alpha_k\}$ satisfy $\alpha_1 = 1$ and $\alpha_k > 0$ for any $k \ge 2$, $\beta_k > 0$, and $\lambda_k > 0$. The algorithm proceeds as follows.

  (i) Set the initial $\theta_0^{ag} = \theta_0$ and

    (10) $\theta_k^{md} = (1-\alpha_k)\theta_{k-1}^{ag} + \alpha_k\theta_{k-1}$.

  (ii) Set

    (11) $\theta_k = \theta_{k-1} - \lambda_k l'(\theta_k^{md}) = \theta_{k-1} - \lambda_k\, E\!\left[\frac{-y_k\exp\{-y_k\langle x_k, \theta_k^{md}\rangle\}}{1+\exp\{-y_k\langle x_k, \theta_k^{md}\rangle\}}\, x_k\right]$,

    (12) $\theta_k^{ag} = \theta_k^{md} - \beta_k\big(l'(\theta_k^{md}) + \bar{\xi}_k\big) = \theta_k^{md} - \beta_k\left(E\!\left[\frac{-y_k\exp\{-y_k\langle x_k, \theta_k^{md}\rangle\}}{1+\exp\{-y_k\langle x_k, \theta_k^{md}\rangle\}}\, x_k\right] + \bar{\xi}_k\right)$.

  (iii) Set $k \leftarrow k+1$ and go to step (i). A sketch paralleling the least-squares case appears below.
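A Python sketch analogous to the least-squares case can be written for (10)–(12); as before, the per-sample gradient and the residual proxy are assumptions of this illustration, not of the paper, and the stepsizes follow Corollary 2 below.

```python
import numpy as np

def accelerated_sa_logistic(stream, theta0, M, n):
    """Sketch of recursion (10)-(12) for logistic regression (labels in {-1, +1}).

    Assumptions of this sketch: the population gradient l'(theta) is replaced
    by the per-sample gradient -y_k x_k * sigmoid(-y_k <x_k, theta>), and the
    residual average uses the current iterate instead of the unknown theta*.
    Stepsizes follow Corollary 2: alpha_k = 1/(k+1), beta_k = 1/(M(k+1)),
    lambda_k = 1/(2M).
    """
    theta = np.array(theta0, dtype=float)
    theta_ag = np.array(theta0, dtype=float)
    xi_sum = np.zeros_like(theta)
    for k in range(1, n + 1):
        x, y = next(stream)
        alpha, beta, lam = 1.0 / (k + 1), 1.0 / (M * (k + 1)), 1.0 / (2.0 * M)
        theta_md = (1.0 - alpha) * theta_ag + alpha * theta              # (10)
        margin = y * (x @ theta_md)
        grad = -y * x * 0.5 * (1.0 - np.tanh(0.5 * margin))              # stable sigmoid(-margin)
        xi_sum += (y - theta_md @ x) * x                                 # residual proxy
        xi_bar = xi_sum / k
        theta = theta - lam * grad                                       # (11)
        theta_ag = theta_md - beta * (grad + xi_bar)                     # (12)
    return theta_ag
```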

Theorem 2 describes the convergence property of the accelerated gradient algorithm for logistic regression.

Theorem 2

Let $\{\theta_k^{md}, \theta_k^{ag}\}$ be computed by the accelerated gradient algorithm and let $\Gamma_k$ be defined in (4). Assume that (B1)–(B4) hold. If $\{\alpha_k\}$, $\{\beta_k\}$, and $\{\lambda_k\}$ are chosen such that

$\alpha_k\lambda_k \le \beta_k \le \frac{1}{2M},$

$\frac{\alpha_1\lambda_1}{\Gamma_1} \ge \frac{\alpha_2\lambda_2}{\Gamma_2} \ge \cdots,$

then for any $n \ge 1$, we have

$E[l(\theta_n^{ag}) - l(\theta^*)] \le \frac{\Gamma_n}{2\lambda_1}\|\theta_0-\theta^*\|^2 + M\sigma^2\Gamma_n\sum_{k=1}^n\frac{\beta_k^2}{k\Gamma_k}.$

Proof

By the Taylor expansion of the function $l$, there exists a $\vartheta$ such that

(13) $l(\theta_k^{ag}) = l(\theta_k^{md}) + \langle l'(\theta_k^{md}), \theta_k^{ag}-\theta_k^{md}\rangle + (\theta_k^{ag}-\theta_k^{md})^T\nabla^2 l(\vartheta)(\theta_k^{ag}-\theta_k^{md}) = l(\theta_k^{md}) - \beta_k\|l'(\theta_k^{md})\|^2 - \beta_k\langle l'(\theta_k^{md}), \bar{\xi}_k\rangle + (\theta_k^{ag}-\theta_k^{md})^T E\!\left[\frac{\exp\{-y_k\langle x_k,\vartheta\rangle\}\, x_k x_k^T}{1+\exp\{-y_k\langle x_k,\vartheta\rangle\}}\right](\theta_k^{ag}-\theta_k^{md}).$

It is easy to verify that the matrix

$E\!\left[\frac{\exp\{-y_k\langle x_k,\vartheta\rangle\}\, x_k x_k^T}{1+\exp\{-y_k\langle x_k,\vartheta\rangle\}}\right]$

is positive semidefinite and that its largest eigenvalue satisfies

$\lambda_{\max}\!\left(E\!\left[\frac{\exp\{-y_k\langle x_k,\vartheta\rangle\}\, x_k x_k^T}{1+\exp\{-y_k\langle x_k,\vartheta\rangle\}}\right]\right) \le E\|x_k\|^2 \le M.$

Combining (12) and (13), we have

$l(\theta_k^{ag}) \le l(\theta_k^{md}) - \beta_k\|l'(\theta_k^{md})\|^2 - \beta_k\langle l'(\theta_k^{md}), \bar{\xi}_k\rangle + \beta_k^2 M\|l'(\theta_k^{md}) + \bar{\xi}_k\|^2.$

Similar to (13), there exists a $\zeta \in \mathbb{R}^d$ satisfying

$l(\mu) - l(\nu) = \langle l'(\nu), \mu-\nu\rangle + (\mu-\nu)^T E\!\left[\frac{\exp\{-y_k\langle x_k,\zeta\rangle\}\, x_k x_k^T}{1+\exp\{-y_k\langle x_k,\zeta\rangle\}}\right](\mu-\nu), \quad \forall\, \mu, \nu \in \mathbb{R}^d,$

and we have

$l(\nu) - l(\mu) = \langle l'(\nu), \nu-\mu\rangle - (\mu-\nu)^T E\!\left[\frac{\exp\{-y_k\langle x_k,\zeta\rangle\}\, x_k x_k^T}{1+\exp\{-y_k\langle x_k,\zeta\rangle\}}\right](\mu-\nu) \le \langle l'(\nu), \nu-\mu\rangle,$

where the inequality follows from the positive semidefiniteness of the matrix $E\!\left[\frac{\exp\{-y_k\langle x_k,\zeta\rangle\}\, x_k x_k^T}{1+\exp\{-y_k\langle x_k,\zeta\rangle\}}\right]$.

Similar to (5), we have

$l(\theta_k^{md}) - [(1-\alpha_k) l(\theta_{k-1}^{ag}) + \alpha_k l(\theta)] \le \alpha_k\langle l'(\theta_k^{md}), \theta_{k-1}-\theta\rangle.$

So we obtain

$l(\theta_k^{ag}) \le (1-\alpha_k) l(\theta_{k-1}^{ag}) + \alpha_k l(\theta) + \alpha_k\langle l'(\theta_k^{md}), \theta_{k-1}-\theta\rangle - \beta_k\|l'(\theta_k^{md})\|^2 - \beta_k\langle l'(\theta_k^{md}), \bar{\xi}_k\rangle + \beta_k^2 M\|l'(\theta_k^{md}) + \bar{\xi}_k\|^2.$

It follows from (11) that

$\|\theta_k - \theta\|^2 = \|\theta_{k-1} - \lambda_k l'(\theta_k^{md}) - \theta\|^2 = \|\theta_{k-1}-\theta\|^2 - 2\lambda_k\langle l'(\theta_k^{md}), \theta_{k-1}-\theta\rangle + \lambda_k^2\|l'(\theta_k^{md})\|^2.$

Then, we have

(14) $\langle l'(\theta_k^{md}), \theta_{k-1}-\theta\rangle = \frac{1}{2\lambda_k}[\|\theta_{k-1}-\theta\|^2 - \|\theta_k-\theta\|^2] + \frac{\lambda_k}{2}\|l'(\theta_k^{md})\|^2.$

Meanwhile,

(15) $\|l'(\theta_k^{md}) + \bar{\xi}_k\|^2 = \|l'(\theta_k^{md})\|^2 + \|\bar{\xi}_k\|^2 + 2\langle l'(\theta_k^{md}), \bar{\xi}_k\rangle.$

Combining the aforementioned two equalities (14) and (15), we obtain

$l(\theta_k^{ag}) \le (1-\alpha_k) l(\theta_{k-1}^{ag}) + \alpha_k l(\theta) + \frac{\alpha_k}{2\lambda_k}[\|\theta_{k-1}-\theta\|^2 - \|\theta_k-\theta\|^2] - \beta_k\Big(1 - \frac{\lambda_k\alpha_k}{2\beta_k} - \beta_k M\Big)\|l'(\theta_k^{md})\|^2 + M\beta_k^2\|\bar{\xi}_k\|^2 + \langle\bar{\xi}_k, (2\beta_k^2 M - \beta_k) l'(\theta_k^{md})\rangle.$

The aforementioned inequality is equivalent to

$l(\theta_k^{ag}) - l(\theta) \le (1-\alpha_k)[l(\theta_{k-1}^{ag}) - l(\theta)] + \frac{\alpha_k}{2\lambda_k}[\|\theta_{k-1}-\theta\|^2 - \|\theta_k-\theta\|^2] - \beta_k\Big(1 - \frac{\lambda_k\alpha_k}{2\beta_k} - \beta_k M\Big)\|l'(\theta_k^{md})\|^2 + M\beta_k^2\|\bar{\xi}_k\|^2 + \langle\bar{\xi}_k, (2\beta_k^2 M - \beta_k) l'(\theta_k^{md})\rangle.$

By using Lemma 1, we have

$l(\theta_n^{ag}) - l(\theta) \le \Gamma_n\sum_{k=1}^n\frac{\alpha_k}{2\lambda_k\Gamma_k}[\|\theta_{k-1}-\theta\|^2 - \|\theta_k-\theta\|^2] - \Gamma_n\sum_{k=1}^n\frac{\beta_k}{\Gamma_k}\Big(1 - \frac{\lambda_k\alpha_k}{2\beta_k} - \beta_k M\Big)\|l'(\theta_k^{md})\|^2 + \Gamma_n\sum_{k=1}^n\frac{\beta_k^2 M}{\Gamma_k}\|\bar{\xi}_k\|^2 + \Gamma_n\sum_{k=1}^n\frac{1}{\Gamma_k}\langle\bar{\xi}_k, (2\beta_k^2 M - \beta_k) l'(\theta_k^{md})\rangle.$

Since

α 1 λ 1 Γ 1 α 2 λ 2 Γ 2 , α 1 = Γ 1 = 1 ,

then

$\sum_{k=1}^n\frac{\alpha_k}{2\lambda_k\Gamma_k}[\|\theta_{k-1}-\theta\|^2 - \|\theta_k-\theta\|^2] \le \frac{\alpha_1}{2\lambda_1\Gamma_1}\|\theta_0-\theta\|^2 = \frac{1}{2\lambda_1}\|\theta_0-\theta\|^2.$

So we obtain

(16) $l(\theta_n^{ag}) - l(\theta) \le \frac{\Gamma_n}{2\lambda_1}\|\theta_0-\theta\|^2 + \Gamma_n\sum_{k=1}^n\frac{\beta_k^2 M}{\Gamma_k}\|\bar{\xi}_k\|^2 + \Gamma_n\sum_{k=1}^n\frac{1}{\Gamma_k}\langle\bar{\xi}_k, (2\beta_k^2 M - \beta_k) l'(\theta_k^{md})\rangle,$

where the inequality follows from the assumption

$\alpha_k\lambda_k \le \beta_k \le \frac{1}{2M}.$

Under assumption (B4), we have

$E\bar{\xi}_k = \frac{1}{k}\sum_{i=1}^k E\xi_i = 0, \qquad E\|\bar{\xi}_k\|^2 = E\Big\|\frac{1}{k}\sum_{i=1}^k\xi_i\Big\|^2 \le \frac{\sigma^2}{k}.$

Taking expectation on both sides of inequality (16) with respect to $(x_i, y_i)$, we obtain, for any $\theta \in \mathbb{R}^d$,

$E[l(\theta_n^{ag}) - l(\theta)] \le \frac{\Gamma_n}{2\lambda_1}\|\theta_0-\theta\|^2 + M\sigma^2\Gamma_n\sum_{k=1}^n\frac{\beta_k^2}{k\Gamma_k}.$

Now, fixing $\theta = \theta^*$, we have

$E[l(\theta_n^{ag}) - l(\theta^*)] \le \frac{\Gamma_n}{2\lambda_1}\|\theta_0-\theta^*\|^2 + M\sigma^2\Gamma_n\sum_{k=1}^n\frac{\beta_k^2}{k\Gamma_k}.$

This finishes the proof of Theorem 2.□

Similar to Corollary 1, we specialize the result of Theorem 2 to some particular selections of $\{\alpha_k\}$, $\{\beta_k\}$, and $\{\lambda_k\}$.

Corollary 2

Suppose that $\alpha_k$, $\beta_k$, and $\lambda_k$ in the accelerated gradient algorithm for logistic regression are set to

$\alpha_k = \frac{1}{k+1}, \qquad \beta_k = \frac{1}{M(k+1)}, \qquad \lambda_k = \frac{1}{2M}, \quad k \ge 1;$

then for any $n \ge 1$, we have

$E[l(\theta_n^{ag}) - l(\theta^*)] \le \frac{M^2\|\theta_0-\theta^*\|^2 + \sigma^2}{M(n+1)}.$

4 Comparisons with related work

In Sections 2 and 3, we have studied AC-SA type algorithms for the least-square regression and logistic regression problems, respectively. We have derived the upper bounds of the AC-SA learning algorithms by using the convexity of the objective function. In this section, we discuss how our results relate to other recent studies.

4.1 Comparison with convergence rate for stochastic optimization

Our convergence analysis of SA learning algorithms is based on a similar analysis for stochastic composite optimization by Ghadimi and Lan [8]. There are two differences between our work and theirs. First, our convergence analysis of SA algorithms holds for any iteration rather than for a prescribed iteration limit: the parameters $\beta_k$ and $\lambda_k$ in Corollary 3 of [8] depend on the iteration limit $N$, while we do not need this assumption. Second, the two error bounds differ: Ghadimi and Lan obtained a rate of $O(1/\sqrt{n})$ for stochastic composite optimization, while we obtain the rate of $O(1/n)$ for the regression problem.

Our developed accelerated stochastic gradient (AC-SA) algorithm for least-square regression is summarized in (1)–(3). The algorithm takes a stream of data $(x_k, y_k)$ as input, together with an initial guess $\theta_0$ of the parameter. The other requirements are $\{\alpha_k\}$, which satisfies $\alpha_1 = 1$ and $\alpha_k > 0$ for any $k \ge 2$, $\beta_k > 0$, and $\lambda_k > 0$. The algorithm involves two intermediate variables, $\theta_k^{ag}$ (initialized to $\theta_0$) and $\theta_k^{md}$. In (1), $\theta_k^{md}$ is updated as a linear combination of $\theta_{k-1}^{ag}$ and the current estimate $\theta_{k-1}$ of the parameter, with coefficient $\alpha_k$. The parameter $\theta_k$ is then updated in (2), taking $\lambda_k$ as a stepsize. The residual $\xi_k$ and the average $\bar{\xi}_k$ of the residuals up to the $k$th data point (i.e., $\bar{\xi}_k = \frac{1}{k}\sum_{i=1}^k\xi_i$) enter the update (3), in which $\theta_k^{ag}$ is computed from $\theta_k^{md}$ with stepsize $\beta_k$. The process continues whenever a new pair of data is seen.

The unbiased estimate of the gradient, i.e., $\langle\theta_k^{md}, x_k\rangle x_k - y_k x_k$ for each data point $(x_k, y_k)$, is used in (2). From this perspective, the update of $\theta_k$ is actually the same as in the SGD (also called least-mean-square) algorithm if we set $\alpha_k = 1$; a small illustration is given below. During training, the residual $\xi_k$ is computed; all residuals obtained so far are averaged, and the averaged residual enters the update of $\theta_k^{ag}$. This differs from the stochastic accelerated gradient algorithm in [22], where no residual is computed or used during training.
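The reduction mentioned above can be verified directly: with $\alpha_k = 1$, (1) gives $\theta_k^{md} = \theta_{k-1}$, so (2) with the per-sample gradient coincides with the classical LMS/SGD step. The snippet below (ours) checks this on random data.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_prev, theta_ag_prev = rng.normal(size=3), rng.normal(size=3)
x, y, lam = rng.normal(size=3), 1.0, 0.05

theta_md = (1 - 1.0) * theta_ag_prev + 1.0 * theta_prev        # (1) with alpha_k = 1
accel_step = theta_prev - lam * ((theta_md @ x) * x - y * x)   # update (2)
lms_step = theta_prev - lam * ((theta_prev @ x) * x - y * x)   # classical LMS/SGD step
assert np.allclose(accel_step, lms_step)
print("With alpha_k = 1, update (2) equals the LMS/SGD step.")
```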

4.2 Comparison with the work of Bach and Moulines

The work that is perhaps most closely related to ours is that of Bach and Moulines [10], who studied the SA problem in which a convex function is minimized given only unbiased estimates of its gradients at certain points, a framework that includes machine learning methods based on the minimization of the empirical risk. The sample setting considered by Bach and Moulines is similar to ours: the learner is given a sample set $\{(x_i, y_i)\}_{i=1}^n$, and the goal of the regression learning problem is to learn a linear function $\langle\theta, x\rangle$ that forecasts the outputs for other inputs in $X$ according to the random samples. Both our work and that of Bach and Moulines obtain the rate of $O(1/n)$ for the SA algorithm for least-square regression, without strong-convexity assumptions. To our knowledge, the convergence rate $O(1/n)$ is optimal for least-square regression and logistic regression.

Although uniform convergence bounds for regression learning algorithms have relied on assumptions on the input $x_k$ and the residual $\xi_k$, we have obtained the optimal upper bound $O(1/n)$ for stochastic learning algorithms, and the order of the upper bound is independent of the dimension of the input space. There are some important differences between our work and that of [10]. Bach and Moulines considered generalization properties of stochastic learning algorithms under the assumption that the covariance operator $E(x_k \otimes x_k)$ is invertible. However, some covariance operators may not be invertible, such as the covariance operator $E(x_k \otimes x_k)$ in $\mathbb{R}^2$ defined by

$E(x_k \otimes x_k) = \begin{pmatrix} E x_{k,1}^2 & E x_{k,1}x_{k,2} \\ E x_{k,1}x_{k,2} & E x_{k,2}^2 \end{pmatrix}.$

When the two random components $x_{k,1}$ and $x_{k,2}$ of $x_k$ satisfy $x_{k,1} = x_{k,2}$, the determinant of the covariance operator $E(x_k \otimes x_k)$ equals zero. In contrast, under assumptions (a)–(d) alone, the rate of our algorithm still reaches $O(1/n)$; a small numerical illustration is given below.
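For illustration (ours, not from the paper), the snippet below builds exactly such a degenerate input distribution with $x_{k,1} = x_{k,2}$ and confirms that the empirical covariance matrix is singular, so the invertibility assumption of [10] fails while assumption (c) still holds.

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.normal(size=100_000)
x = np.stack([z, z], axis=1)          # two identical components: x_{k,1} = x_{k,2}
cov = x.T @ x / len(x)                # empirical covariance E(x_k x_k^T)
print(np.linalg.det(cov))             # ~ 0 (singular, up to rounding)
print(np.linalg.matrix_rank(cov))     # rank 1, so the operator is not invertible
```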

5 Conclusion

In this article, we have considered two SA algorithms that achieve rates of $O(1/n)$ for least-square regression and logistic regression, respectively, without strong-convexity assumptions. We focus on problems without strong convexity, for which the well-known algorithms achieve a convergence rate for function values of $O(1/\sqrt{n})$. We consider and analyze an accelerated SA algorithm that achieves a rate of $O(1/n)$ for the classical least-square regression and logistic regression problems. Compared with the well-known results, we need fewer conditions to obtain the tight convergence rate for the least-square regression and logistic regression problems. For the accelerated SA algorithm, we provide a nonasymptotic analysis of the generalization error (in expectation) and examine our theoretical analysis experimentally.

  1. Funding information: The authors acknowledge the financial support from the National Natural Science Foundation of China (No. 61573326), the Project for Outstanding Young Talents in Colleges and Universities in Anhui Province (No. gxyq2018076), the Natural Science Research Project of Colleges and Universities in Anhui Province (No. KJ2021A1033), and the Scientific Research Project of Chaohu University (Nos. XLY-202103, XLY-202105, and XLZ-202202).

  2. Author contributions: All authors have accepted responsibility for the entire content of this manuscript and approved its submission.

  3. Conflict of interest: The authors state no conflict of interest.

References

[1] H. Robbins and S. Monro, A stochastic approximation method, The Annals of Mathematical Statistics 22 (1951), 400–407, https://doi.org/10.1007/978-1-4612-5110-1_9.

[2] B. T. Polyak, New stochastic approximation type procedures, Automat. i Telemekh. 7 (1990), 98–107.

[3] B. T. Polyak and A. B. Juditsky, Acceleration of stochastic approximation by averaging, SIAM J. Control Optim. 30 (1992), 838–855, https://doi.org/10.1137/0330046.

[4] L. Bottou and O. Bousquet, The tradeoffs of large scale learning, Adv. Neural Inform. Process. Syst. 20 (2007), 1–8.

[5] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter, Pegasos: Primal estimated sub-gradient solver for SVM, Math. Program. 127 (2011), 3–30, https://doi.org/10.1007/s10107-010-0420-4.

[6] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM J. Optim. 19 (2009), 1574–1609, https://doi.org/10.1137/070704277.

[7] G. H. Lan and R. D. C. Monteiro, Iteration-complexity of first-order penalty methods for convex programming, Math. Program. 138 (2013), 115–139, https://doi.org/10.1007/s10107-012-0588-x.

[8] S. Ghadimi and G. H. Lan, Accelerated gradient methods for nonconvex nonlinear and stochastic programming, Math. Program. 156 (2016), 59–99, https://doi.org/10.1007/s10107-015-0871-8.

[9] F. Bach and E. Moulines, Non-asymptotic analysis of stochastic approximation algorithms for machine learning, Advances in Neural Information Processing Systems (NIPS), 2011, pp. 451–459.

[10] F. Bach and E. Moulines, Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n), 2013, https://doi.org/10.48550/arXiv.1306.2119.

[11] J. C. Duchi, E. Hazan, and Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, J. Mach. Learn. Res. 12 (2011), 2121–2159.

[12] Y. E. Nesterov, A method of solving a convex programming problem with convergence rate O(1/k^2), Dokl. Akad. Nauk 269 (1983), 543–547.

[13] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci. 2 (2009), 183–202, https://doi.org/10.1137/080716542.

[14] P. Tseng and S. Yun, Incrementally updated gradient methods for constrained and regularized optimization, J. Optim. Theory Appl. 160 (2014), 832–853, https://doi.org/10.1007/s10957-013-0409-2.

[15] Y. Nesterov, Smooth minimization of non-smooth functions, Math. Program. 103 (2005), 127–152, https://doi.org/10.1007/s10107-004-0552-5.

[16] Y. Nesterov, Gradient methods for minimizing composite functions, Math. Program. 140 (2013), 125–161, https://doi.org/10.1007/s10107-012-0629-5.

[17] G. H. Lan, An optimal method for stochastic composite optimization, Math. Program. 133 (2012), 365–397, https://doi.org/10.1007/s10107-010-0434-y.

[18] S. Ghadimi and G. H. Lan, Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: A generic algorithmic framework, SIAM J. Optim. 22 (2012), 1469–1492, https://doi.org/10.1137/110848864.

[19] S. Ghadimi and G. H. Lan, Stochastic first- and zeroth-order methods for nonconvex stochastic programming, SIAM J. Optim. 23 (2013), 2341–2368, https://doi.org/10.1137/120880811.

[20] S. Ghadimi, G. H. Lan, and H. C. Zhang, Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization, Math. Program. 155 (2016), 267–305, https://doi.org/10.1007/s10107-014-0846-1.

[21] G. H. Lan, Bundle-level type methods uniformly optimal for smooth and nonsmooth convex optimization, Math. Program. 149 (2015), 1–45, https://doi.org/10.1007/s10107-013-0737-x.

[22] S. Ghadimi and G. H. Lan, Accelerated gradient methods for nonconvex nonlinear and stochastic programming, Math. Program. 156 (2016), 59–99, https://doi.org/10.1007/s10107-015-0871-8.

[23] L. Bottou, Large-scale machine learning with stochastic gradient descent, In: Proceedings of COMPSTAT'2010, Physica-Verlag HD, 2010, pp. 177–186, https://doi.org/10.1007/978-3-7908-2604-3_16.

[24] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, 2014, https://doi.org/10.48550/arXiv.1412.6980.

[25] Z. A. Zhu, Katyusha: The first direct acceleration of stochastic gradient methods, In: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (STOC 2017), Association for Computing Machinery, New York, USA, 2017, https://doi.org/10.1145/3055399.3055448.

[26] X. Cao, BFE and AdaBFE: A new approach in learning rate automation for stochastic optimization, 2022, https://doi.org/10.48550/arXiv.2207.02763.

Received: 2022-01-01
Revised: 2022-07-16
Accepted: 2022-09-11
Published Online: 2022-10-13

© 2022 Xingxing Zha et al., published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
