Article Open Access

Fast iterative regularization by reusing data

  • Cristian Vega, Cesare Molinari, Lorenzo Rosasco and Silvia Villa
Published/Copyright: October 27, 2023

Abstract

Discrete inverse problems correspond to solving a system of equations in a stable way with respect to noise in the data. A typical approach to select a meaningful solution is to introduce a regularizer. While for most applications the regularizer is convex, in many cases it is neither smooth nor strongly convex. In this paper, we propose and study two new iterative regularization methods, based on a primal-dual algorithm, to regularize inverse problems efficiently. Our analysis, in the noise free case, provides convergence rates for the Lagrangian and the feasibility gap. In the noisy case, it provides stability bounds and early stopping rules with theoretical guarantees. The main novelty of our work is the exploitation of some a priori knowledge about the solution set: we show that the linear equations determined by the data can be used more than once along the iterations. We discuss various approaches to reuse linear equations that are at the same time consistent with our assumptions and flexible in the implementation. Finally, we illustrate our theoretical findings with numerical simulations for robust sparse recovery and image reconstruction. We confirm the efficiency of the proposed regularization approaches, comparing the results with state-of-the-art methods.

MSC 2020: 90C25; 65K10; 49M29

1 Introduction

Many applied problems require the estimation of a quantity of interest from noisy linear measurements, for instance compressed sensing [24, 25, 33, 65, 77], image processing [67, 58, 66, 26, 28, 56, 79], matrix completion [19, 22, 23, 49], and various problems in machine learning [70, 53, 64, 34, 6, 79, 80]. In all these problems, we are interested in finding stable solutions to a system of equations in which the accessible data are corrupted by noise. This is classically achieved by regularization. The most popular procedure in the literature is Tikhonov (or variational) regularization [35], which consists in minimizing the sum of a data fidelity term and a regularizer; the latter is explicitly added to the objective function and encodes some a priori knowledge or some desired property of the solutions that we want to select. A trade-off parameter is then introduced to balance the fidelity term and the regularizer. In practice, this implies that the optimization problem has to be solved many times for different values of the parameter. Finally, a parameter (and the corresponding solution) is chosen according to the performance with respect to some criterion, such as the Morozov discrepancy principle in inverse problems [35] or cross-validation on left-out data in machine learning [74, 38].

A computationally efficient alternative to explicit regularization is iterative regularization, also known as implicit regularization [35, 18, 13, 2]. The minimization of the regularizer under the noisy data constraints is considered and a numerical algorithm to solve the optimization problem is chosen and early stopped, to avoid convergence to the noisy solution. In this setting, it is known that the number of iterations plays the role of the regularization parameter [35]. As for Tikhonov regularization, the best performing iterate is chosen according to some a priori criterion and then considered as the regularized solution. Compared to Tikhonov regularization, this procedure is very efficient, since only one optimization problem is solved, and not even until convergence.

The main novelty of this work is the design and analysis of two new iterative regularization methods for convex regularizers, which are not necessarily smooth nor strongly convex. The new iterative regularization methods are based on primal-dual algorithms [29, 31, 78] combined with the idea of reusing the linear equations determined by the data at every iteration [16]. Primal-dual algorithms perform one minimization step on the primal variable followed by one on the dual and are well-suited for the large scale setting, as only matrix-vector multiplications and the calculation of a proximity operator are required. The idea of exploiting redundant information was presented in [16] and turned out to be very effective in practice. The first method that we propose is a primal-dual algorithm (PDA) with additional activations of the linear equations: we propose different variants, depending on the extra activation steps. For instance, we are able to exploit the data constraints more than once at every iteration via gradient descent, with a fixed or an adaptive step size. The second method is a dual-primal algorithm (DPA) where a subset containing the dual solutions is activated at each step. This subset is not affected by the noise in the data and is usually determined by a finite number of constraints.

These additional steps may seem artificial or inefficient. However, while maintaining an easy implementation, our methods achieve better numerical performance and considerable speed-ups with respect to the vanilla primal-dual algorithm. We extend to the noisy case the techniques studied in [17, 16] for the exact case. The assumptions on the noise are the classical ones in inverse problems, see, e.g., [47, 21, 18, 49], and the proposed results generalize those in [49] by including in the primal-dual procedure a diagonal preconditioning and an extra activation step. For the noisy case, we provide an early stopping criterion to recover a stable approximation of an ideal solution, in the same spirit as [47, 21, 18, 63, 83, 80, 12, 5]. The early stopping rule is derived from theoretical stability bounds and feasibility gap rates for both algorithms, yielding implicit regularization properties similar to those stated in [49, 47]. The theoretical results are complemented by numerical experiments for robust sparse recovery and total variation, showing that state-of-the-art performance can be achieved with considerable computational speed-ups.

Related works.

In this section, we briefly discuss the literature on variational and iterative regularization techniques. Tikhonov regularization was introduced in [76]; see also [35, 11] and the references therein for an extensive treatment of the topic. The most famous iterative regularization method is the Landweber algorithm [44, 35], namely gradient descent on the least squares problem. Duality theory in optimization gives another interpretation which sheds light on the regularizing properties of this procedure. Indeed, consider the problem of minimizing the squared norm under linear constraints. Running gradient descent on its dual problem and mapping back to the primal variable, we obtain exactly the Landweber method. This provides another explanation of why the iterates of the Landweber algorithm converge to the minimal norm solution of the linear equation [47]. Stochastic gradient descent on the previous problem is a generalization of the Kaczmarz method [46, 69, 42, 75], which consists in applying cyclic or random projections onto single equations of the linear system. Accelerated and diagonal versions are discussed in [35, 55] and [4, 43, 68], respectively. The regularization properties of other optimization algorithms for more general regularizers have also been studied. If strong convexity is assumed, mirror descent [8, 54] can also be interpreted as gradient descent on the dual problem, and its regularization properties (and those of its accelerated variant) have been studied in [47]. Diagonal approaches [3] with a regularization parameter that vanishes along the iterations have been studied in [37]; see [21] for an accelerated version. Another common approach relies on the linearized Bregman iteration [82, 81, 79, 56], which has found applications in compressed sensing [20, 57, 82] and image deblurring [20]. However, this method requires solving non-trivial minimization problems at each iteration.
For convex, but not strongly convex regularizers, the regularization properties of primal-dual algorithms have been investigated in [49]. Other optimization techniques are available to solve this kind of minimization problem (for instance, [51, 52] and [15, 61, 48], see also [40, 71, 72]), but no iterative regularization properties have been studied so far for these algorithms.

The rest of the paper is organized as follows. In Section 2, we introduce the notation and the necessary mathematical background. In Section 3, we present the main problem and propose five algorithmic variants to solve it numerically. In Section 4, we derive stability and feasibility gap bounds and the related early stopping rules. In Section 5, we discuss implementation details, and in Section 6 we verify the performance of the algorithms on two numerical applications: a robust sparse recovery problem and image reconstruction by total variation. Finally, we provide some conclusions.

2 Notation and background

First, we recall some well-known concepts and properties used in the paper.

Let X, Y be two finite-dimensional real vector spaces equipped with an inner product $\langle \cdot, \cdot \rangle$ and the induced norm $\|\cdot\|$. We denote the set of convex, lower semicontinuous, and proper functions on X by $\Gamma_0(X)$. The subdifferential of $F \in \Gamma_0(X)$ is the set-valued operator defined by

$$\partial F \colon X \to 2^X, \quad x \mapsto \{u \in X : F(x) + \langle y - x, u \rangle \le F(y) \text{ for all } y \in X\}.$$

If the function F is Gâteaux differentiable at the point x, then $\partial F(x) = \{\nabla F(x)\}$ (see [7, Proposition 17.31 (i)]). In general, for $F \in \Gamma_0(X)$, it holds that $(\partial F)^{-1} = \partial F^*$ (see [7, Corollary 16.30]), where $F^* \in \Gamma_0(X)$ is the conjugate function of F, defined by

$$F^*(x) := \sup_{u \in X} \langle x, u \rangle - F(u).$$

For every self-adjoint positive definite matrix Σ, we define the proximity operator of F in the metric induced by $\Sigma^{-1}$, namely $\|x\|_{\Sigma^{-1}}^2 := \langle x, \Sigma^{-1} x \rangle$, as $\operatorname{prox}_F^\Sigma = (\operatorname{Id} + \Sigma\, \partial F)^{-1}$. If $\Sigma = \sigma \operatorname{Id}$ for some real number $\sigma > 0$, it is customary to write $\operatorname{prox}_{\sigma F}$ rather than $\operatorname{prox}_F^\Sigma$. The projection operator onto a nonempty closed convex set $C \subseteq X$ is denoted by $P_C$. If we define the indicator $\iota_C \in \Gamma_0(X)$ as the function that is 0 on C and $+\infty$ otherwise, then $\operatorname{prox}_{\iota_C} = P_C$. Moreover, if C is a singleton, say $C = \{b\}$, we have that $\iota_{\{b\}}^*(u) = \langle u, b \rangle$ (see [7, Example 13.3 (i)]). The relative interior of C is

$$\operatorname{ri}(C) = \{x \in C : \mathbb{R}_{++}(C - x) = \operatorname{span}(C - x)\},$$

where

$$\mathbb{R}_{++} C = \{\lambda y : \lambda > 0,\ y \in C\}$$

and $\operatorname{span}(C)$ is the smallest linear subspace of X containing C.
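As a concrete instance of the proximity operator above: when $F = \|\cdot\|_1$ and $\Sigma = \sigma \operatorname{Id}$, $\operatorname{prox}_{\sigma F}$ is entrywise soft-thresholding, while for $F = \iota_C$ with C a box it reduces to the projection $P_C$. A minimal NumPy sketch (function names are ours):

```python
import numpy as np

def prox_l1(x, sigma):
    """prox of sigma * ||.||_1: entrywise soft-thresholding at level sigma."""
    return np.sign(x) * np.maximum(np.abs(x) - sigma, 0.0)

def proj_box(x, lo, hi):
    """prox of the indicator of the box [lo, hi]^p, i.e. the projection P_C."""
    return np.clip(x, lo, hi)
```

For example, `prox_l1(np.array([1.5, -0.3, 0.0, -2.0]), 0.5)` shrinks each entry towards zero by 0.5 and sets the small ones exactly to zero.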

Given $\alpha \in \left]0, 1\right[$, an operator $T \colon X \to X$ is α-averaged non-expansive if

$$\|Tx - Ty\|^2 \le \|x - y\|^2 - \frac{1 - \alpha}{\alpha} \|(\operatorname{Id} - T)x - (\operatorname{Id} - T)y\|^2 \quad \text{for all } x \in X \text{ and all } y \in X,$$

and it is quasi-non-expansive if

$$\|Tx - y\|^2 \le \|x - y\|^2 \quad \text{for all } x \in X \text{ and all } y \in \operatorname{Fix} T,$$

where $\operatorname{Fix} T = \{x \in X : Tx = x\}$ is the set of fixed points of T. For further results on convex analysis and operator theory, the reader is referred to [7].

The operator norm of a real matrix $A \in \mathbb{R}^{d \times p}$ is denoted by $\|A\|$, and the adjoint of A is $A^*$. We define the Frobenius norm of A by $\|A\|_F^2 := \sum_{i=1}^d \|a_i\|^2$, where, for every $i \in [d] := \{1, \dots, d\}$, $a_i$ denotes the i-th row of A. We also denote by $A_i$ the i-th column of A. We denote by $\operatorname{ran}(A)$ and $\ker(A)$ the range and the kernel of A, respectively.

3 Main problem and algorithm

Many applied problems require estimating a quantity of interest $x \in \mathbb{R}^p$ based on linear measurements $b = Ax$ for some matrix $A \in \mathbb{R}^{d \times p}$. For simplicity, we carry out the analysis in the finite-dimensional case, but note that it can be easily extended to the infinite-dimensional setting. A standard approach to recover the desired solution is to assume that it is a minimizer of the following linearly constrained optimization problem:

$$(\mathcal{P}) \qquad \min_{x \in \mathbb{R}^p} \{J(x) : Ax = b\},$$

where $J \in \Gamma_0(\mathbb{R}^p)$ encodes a priori information on the solution and is usually hand-crafted. Typical choices are the squared norm [35], the elastic net regularization [84, 47, 32, 85, 45, 41], the $\ell^1$-norm [24, 25, 33, 77], and the total variation [67, 58, 66, 26]. Note that, in the previous examples, the first two regularizers are strongly convex, while the last two are just convex and non-smooth.

If we use the indicator function of $\{b\}$, problem ($\mathcal{P}$) can be written equivalently as

$$\min_{x \in \mathbb{R}^p} J(x) + \iota_{\{b\}}(Ax).$$

We denote by μ the optimal value of ($\mathcal{P}$) and by $\mathcal{S}$ the set of its minimizers. We assume that $\mathcal{S} \neq \varnothing$. In order to build our regularization procedure, we consider the Lagrangian functional for problem ($\mathcal{P}$):

$$\mathcal{L} \colon \mathbb{R}^p \times \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}, \quad (x, u) \mapsto J(x) + \langle u, Ax - b \rangle.$$

This approach allows us to split the contribution of the non-smooth term J and that of the linear operator A, without requiring to compute the projection onto the set

$$C := \{x \in \mathbb{R}^p : Ax = b\}.$$

We define the set of saddle points of $\mathcal{L}$ by

$$\mathcal{Z} = \{(x, u) \in \mathbb{R}^p \times \mathbb{R}^d : \mathcal{L}(x, v) \le \mathcal{L}(x, u) \le \mathcal{L}(y, u) \text{ for all } (y, v) \in \mathbb{R}^p \times \mathbb{R}^d\}.$$

The set $\mathcal{Z}$ is characterized by the first-order optimality condition:

$$\mathcal{Z} = \{(x, u) \in \mathbb{R}^p \times \mathbb{R}^d : 0 \in \partial J(x) + A^* u \text{ and } Ax = b\}.$$

In the following, we always assume that $\mathcal{Z} \neq \varnothing$.

Remark 3.1 (Saddle points and primal-dual solutions).

Observe that the objective function of ($\mathcal{P}$) is the sum of two functions in $\Gamma_0(\mathbb{R}^p)$, one of which is composed with a linear operator. This formulation is suitable to apply Fenchel–Rockafellar duality. Recalling that $\iota_{\{b\}}^*(u) = \langle u, b \rangle$, the dual problem of ($\mathcal{P}$) is given by

$$(\mathcal{D}) \qquad \min_{u \in \mathbb{R}^d} J^*(-A^* u) + \langle u, b \rangle.$$

We denote its optimal value by $\mu^*$ and its set of minimizers by $\mathcal{S}^*$. Then $\mathcal{Z} \subseteq \mathcal{S} \times \mathcal{S}^*$, and equality holds if the qualification condition (see [7, Proposition 6.19] for special cases when it holds)

$$(3.1) \quad b \in \operatorname{ri}(A(\operatorname{dom} J))$$

is satisfied [7, Proposition 19.21 (v)]. In addition, condition (3.1) implies that problem ($\mathcal{D}$) has a solution. Then, under (3.1), since we assumed that $\mathcal{S} \neq \varnothing$, we also derive that $\mathcal{Z} \neq \varnothing$.

In practical situations, the exact datum b is unknown and only a noisy version is accessible. Given a noise level $\delta \ge 0$, we consider a worst-case scenario, where the error is deterministic and the accessible data $b^\delta$ is such that

$$\|b^\delta - b\| \le \delta.$$

This is the classical model in inverse problems [35, 43]. The solution set of the inexact linear system $Ax = b^\delta$ is denoted by $C_\delta$. Analogously, we denote by $\mathcal{S}_\delta$ and $\mathcal{S}_\delta^*$ the sets of primal and dual solutions with noisy data, respectively. It is worth pointing out that, if $b^\delta \notin \operatorname{ran}(A)$, then $\mathcal{S}_\delta = C_\delta = \varnothing$, but our analysis and bounds still hold.

3.1 Primal-dual splittings with a priori information

In this section, we propose an iterative regularization procedure to solve problem ($\mathcal{P}$), based on a primal-dual algorithm with preconditioning and arbitrary activations of a predefined set of operators. While the use of primal-dual algorithms [29] as iterative regularization methods is somewhat established [49], in this paper we focus on the possibility of reusing the data constraints along the iterations. This idea was originally introduced in [16], where the authors studied the case in which the exact data are available, and consists in the activation of extra operators, which encode information about the solution set, to improve the feasibility of the updates. In our setting, we reuse the data constraints by projecting, in series or in parallel, onto some of the equations given by the (noisy) linear constraints; but we will show that other interesting choices are possible, such as projections onto the set of dual constraints.

More formally, for $i \in [m]$, we consider a finite number of operators $T_i \colon \mathbb{R}^p \to \mathbb{R}^p$ or $T_i \colon \mathbb{R}^d \to \mathbb{R}^d$ such that the set of noisy primal (or dual) solutions is contained in $\operatorname{Fix} T_i$ for every $i \in [m]$. We refer to this as redundant a priori information. A list of operators suitable to our setting (and with a cheap practical implementation) can be found in Section 5.

The primal-dual algorithms with reuse of data which are given in Table 1 are a preconditioned and deterministic version of the one proposed in [16] applied to the case of linearly constrained minimization.

Table 1

Proposed algorithms for iterative regularization.

Primal-dual splitting with activations
Input: $(\bar p_0, x_0, u_0) \in \mathbb{R}^{2p} \times \mathbb{R}^d$.

For k = 1 , , N :

$$\text{(PDA)} \quad \begin{cases} u_{k+1} = u_k + \Gamma (A \bar p_k - b^\delta), \\ x_{k+1} = \operatorname{prox}_J^\Sigma (p_k - \Sigma A^* u_{k+1}), \\ \text{choose } \epsilon_{k+1} \in [m] \text{ and set } p_{k+1} = T_{\epsilon_{k+1}} x_{k+1}, \\ \bar p_{k+1} = p_{k+1} + x_{k+1} - p_k, \end{cases}$$

End

Dual-primal splitting with activations
Input: $(\bar v_0, u_0, x_0) \in \mathbb{R}^{2d} \times \mathbb{R}^p$.

For k = 1 , , N :

$$\text{(DPA)} \quad \begin{cases} x_{k+1} = \operatorname{prox}_J^\Sigma (x_k - \Sigma A^* \bar v_k), \\ u_{k+1} = v_k + \Gamma (A x_{k+1} - b^\delta), \\ \text{choose } \epsilon_{k+1} \in [m] \text{ and set } v_{k+1} = T_{\epsilon_{k+1}} u_{k+1}, \\ \bar v_{k+1} = v_{k+1} + u_{k+1} - v_k, \end{cases}$$

End

We first focus on the primal-dual splitting. It consists of four steps, performed in series. The first step is the update of the dual variable, in which the residuals of the linear equation $Ax = b^\delta$ are accumulated after preconditioning by the operator Γ. The second step is an implicit prox-step on the primal variable, with function J and the metric induced by $\Sigma^{-1}$. The third is the activation of the operator that reuses the data constraints, applied to the primal variable. Finally, the last step is an extrapolation, again on the primal variable. Notice that, if no operator is activated, the last step reduces to $\bar p_{k+1} = 2 x_{k+1} - x_k$, the classical update in the primal-dual algorithm. The dual-primal splitting, apart from a permutation in the order of the steps, differs from the previous algorithm in that the operator is activated not on the primal variable but on the dual one. Indeed, Lemma A.1 establishes that, without the activation of the operator, the primal variables generated by (PDA) coincide with those generated by (DPA).
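The four steps above can be written in a few lines of NumPy. The following is a minimal sketch, not the authors' implementation: we take $J = \|\cdot\|_1$ (so the prox step is soft-thresholding), scalar preconditioners $\Sigma = \Gamma = (0.99/\|A\|)\operatorname{Id}$ as used later in the experiments, and $T = \operatorname{Id}$ by default; all function names are ours.

```python
import numpy as np

def soft_threshold(x, tau):
    # prox of tau * ||.||_1: entrywise soft-thresholding
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def pda(A, b_delta, n_iter, T=lambda x: x):
    """Sketch of (PDA): dual ascent, primal prox, activation T, extrapolation."""
    d, p = A.shape
    sigma = gamma = 0.99 / np.linalg.norm(A, 2)   # scalar preconditioners
    x = pk = pbar = np.zeros(p)
    u = np.zeros(d)
    for _ in range(n_iter):
        u = u + gamma * (A @ pbar - b_delta)               # dual update
        x = soft_threshold(pk - sigma * (A.T @ u), sigma)  # primal prox step
        p_new = T(x)                                       # reuse-data activation
        pbar = p_new + x - pk                              # extrapolation
        pk = p_new
    return x
```

With the default `T`, the extrapolation reduces to $2 x_{k+1} - x_k$ and the scheme is the vanilla primal-dual algorithm.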

Remark 3.2.

Observe that in the proofs of convergence and stability (Theorems 4.1 and 4.2) we never use the fact that x belongs to a finite-dimensional space. This is in line with previous research on the convergence guarantees of the plain methods in Hilbert and Banach spaces, as outlined in [31, 78, 73]. It follows that the primal-dual algorithms above can be formulated in exactly the same way when the unknown vector x belongs to an infinite-dimensional Hilbert space, and our analysis can be extended to that setting. Another possible extension of the algorithm, which we do not analyze explicitly in this work, is related to the stochastic version of the primal-dual algorithm; see [27, 1, 39].

In the following, we list the assumptions that we require on the parameters and the operators involved in the algorithm.

Assumption 3.3.

Consider the setting of (PDA) or (DPA).

  (A1) The preconditioners $\Sigma \in \mathbb{R}^{p \times p}$ and $\Gamma \in \mathbb{R}^{d \times d}$ are two diagonal positive definite matrices such that

    $$(3.2) \quad 0 < \alpha := 1 - \|\Gamma^{1/2} A \Sigma^{1/2}\|^2.$$

  (A2) For every $k \in \mathbb{N}$, $\epsilon_k \in [m]$.

Consider the setting of (PDA).

  (A3) $\{T_i\}_{i \in [m]}$ is a family of operators from $\mathbb{R}^p$ to $\mathbb{R}^p$ and, for every $i \in [m]$:

    (a) $\operatorname{Fix} T_i \supseteq \mathcal{S}_\delta$;

    (b) there exists $e_i \ge 0$ such that, for every $x \in \mathbb{R}^p$ and $\bar x \in \mathcal{S}$,

      $$(3.3) \quad \|T_i x - \bar x\|_{\Sigma^{-1}}^2 \le \|x - \bar x\|_{\Sigma^{-1}}^2 + e_i \delta^2.$$

      We set $e = \max_{i \in [m]} e_i$.

Now, consider the setting of (DPA).

  (A4) $\{T_i\}_{i \in [m]}$ is a family of operators from $\mathbb{R}^d$ to $\mathbb{R}^d$ and, for every $i \in [m]$:

    (a) $\operatorname{Fix} T_i \supseteq \mathcal{S}_\delta^*$;

    (b) for every $u \in \mathbb{R}^d$ and $\bar u \in \mathcal{S}_\delta^*$,

      $$(3.4) \quad \|T_i u - \bar u\|_{\Gamma^{-1}}^2 \le \|u - \bar u\|_{\Gamma^{-1}}^2.$$

Remark 3.4 (Hypothesis about the operators).

If assumption (A3) (a) holds and δ = 0 , then assumption (A3) (b) is implied by the quasi-nonexpansivity of T i on 𝒮 . This is a weaker condition than the one proposed in [16], where, due to the generality of the setting, α-averaged non-expansive operators were needed. A similar reasoning applies to assumption (A4).

4 Main results

In this section, we present and discuss the main results of the paper. We derive convergence and stability properties of the primal-dual and dual-primal splitting algorithms for linearly constrained optimization with a priori information.

First, we define the averaged iterates and the squared weighted norm induced by Σ and Γ on $\mathbb{R}^p \times \mathbb{R}^d$, namely

$$(X_n, U_n) := \frac{\sum_{k=1}^n z_k}{n} \quad \text{and} \quad V(z) := \|x\|_{\Sigma^{-1}}^2 + \|u\|_{\Gamma^{-1}}^2,$$

where $z_k := (x_k, u_k)$ is the k-th iterate and $z := (x, u)$ is a primal-dual variable. We also recall the definition of the Lagrangian, $\mathcal{L}(x, u) = J(x) + \langle u, Ax - b \rangle$.

The first result establishes the stability properties of the algorithm (PDA), both in terms of the Lagrangian and of the feasibility gap. We recall that here we use activation operators based on the noisy data and the corresponding constraints in the primal space, namely the set $C_\delta$.

Theorem 4.1.

Consider the setting of (PDA) under assumptions (A1), (A2), and (A3). Let $(\bar p_0, x_0, u_0) \in \mathbb{R}^{2p} \times \mathbb{R}^d$ be such that $x_0 = \bar p_0$. Then, for every $z = (x, u) \in \mathcal{Z}$ and for every $N \in \mathbb{N}$, we have

$$(4.1) \quad \mathcal{L}(X_N, u) - \mathcal{L}(x, U_N) \le \frac{V(z_0 - z)}{N} + \frac{2 N \|\Gamma^{1/2}\|^2 \delta^2}{\alpha} + \delta \|\Gamma^{1/2}\| \left(\frac{2 V(z_0 - z)}{\alpha}\right)^{1/2} + \delta \|\Gamma^{1/2}\| \left(\frac{N e \delta^2}{\alpha}\right)^{1/2} + \frac{e \delta^2}{2}$$

and

$$(4.2) \quad \|A X_N - b\|^2 \le \frac{16 N \|\Gamma\| \|\Gamma^{-1}\| \delta^2}{\alpha^2} + 8 \delta \|\Gamma^{-1}\| \left(\frac{2 \|\Gamma\| V(z_0 - z)}{\alpha^3}\right)^{1/2} + 8 \delta^2 \|\Gamma^{-1}\| \left(\frac{\|\Gamma\| e N}{\alpha^3}\right)^{1/2} + \frac{8 \|\Gamma^{-1}\| V(z_0 - z)}{N \alpha} + 2 \delta^2 + \frac{4 \|\Gamma^{-1}\| e \delta^2}{\alpha},$$

where we recall that the constants α and e are defined in assumptions (A1) and (A3), respectively.

The proof of Theorem 4.1 is given in Section A.2. The proof combines and extends the techniques developed in [16, 49], based on the firm non-expansivity of the proximal point operator and discrete Bihari’s lemma to deal with the error, see also [62].

In the next result, we establish upper bounds for the Lagrangian and the feasibility gap analogous to those proposed in Theorem 4.1, but for the algorithm (DPA). The main difference is that now the activation step is based on a priori information in the dual space $\mathbb{R}^d$, and not on $C_\delta$. This set is represented by the intersection of the fixed point sets of a finite number of operators and encodes some knowledge about the dual solution.

Theorem 4.2.

Consider the setting of (DPA) under assumptions (A1), (A2), and (A4). Let $(\bar v_0, u_0, x_0) \in \mathbb{R}^{2d} \times \mathbb{R}^p$ be such that $u_0 = \bar v_0$. Then, for every $z = (x, u) \in \mathcal{Z}$ and for every $N \in \mathbb{N}$, we have that

$$(4.3) \quad \mathcal{L}(X_N, u) - \mathcal{L}(x, U_N) \le \frac{V(z_0 - z)}{N} + 2 \|\Gamma^{1/2}\|^2 N \delta^2 + \|\Gamma^{1/2}\| \delta (2 V(z_0 - z))^{1/2}$$

and

$$(4.4) \quad \|A X_N - b\|^2 \le \frac{8 \|\Gamma^{1/2}\|^2 \|\Gamma^{-1}\| N \delta^2}{\alpha} + \frac{4 \|\Gamma^{1/2}\| \|\Gamma^{-1}\| \delta (2 V(z_0 - z))^{1/2}}{\alpha} + \frac{4 \|\Gamma^{-1}\| V(z_0 - z)}{N \alpha} + 2 \delta^2,$$

where we recall that the constant α is defined in assumption (A1).

The proof is given in Section A.3.

First, we comment on the chosen optimality measures. As discussed in [62, 49, 50], the Lagrangian gap is equivalent to the Bregman distance of the iterates to the solution. If the penalty is strongly convex, the Bregman divergence is an upper bound of the squared norm of the difference between the reconstructed and the ideal solution, while if J is only convex, the Bregman divergence gives only limited information and in general it is a very weak convergence measure. For instance, in the exact case, a vanishing Lagrangian gap does not imply that cluster points of the generated sequence are primal solutions. However, as can be derived from [50], a vanishing Lagrangian gap coupled with vanishing feasibility gap implies that every cluster point of the primal sequence is a solution of the primal problem.

In both theorems, the established result ensures that the two optimality measures can be upper bounded by the sum of two terms. The first one, which can be interpreted as an optimization error, is of the order $\mathcal{O}(N^{-1})$, and so it goes to zero as N tends to $+\infty$. Note that, in the exact case $\delta = 0$, only this term is present and both the Lagrangian and the feasibility gap are indeed vanishing, guaranteeing that every cluster point of the sequence is a primal solution. The second term, which can be interpreted as a stability control, collects all the errors due to the perturbation of the exact datum and also takes into account the presence of the activation operators T when the reused data constraints are noisy. It is an increasing function of the number of iterations and of the noise level δ.

Remark 4.3.

Theorems 4.1 and 4.2 are an extension of [16, Theorem 1], where it is proved that the sequence generated by the algorithms converges to an element in $\mathcal{Z}$ when $\delta = 0$, but neither convergence rates nor stability bounds were given. In this work, we fill this gap for linearly constrained convex optimization problems. Moreover, in the noise-free case, our assumptions on the additional operators T are weaker than those proposed in [16], where α-averagedness is required. For the noisy case, without the activation operators (and so with $e = 0$), our bounds are of the same order as those in [49] in the number of iterations and in the noise level δ.

As mentioned above, in (4.1) and (4.2), when $\delta > 0$ and $N \to +\infty$, the upper bounds for the (PDA) iterates tend to infinity and the iteration may not converge to the desired solution. The same comment applies to the (DPA) iterates, based on (4.3) and (4.4). In both cases, to obtain a minimal reconstruction error, we need to impose a trade-off between convergence and stability. The next corollary introduces an early stopping criterion, depending only on the noise level and leading to a stable reconstruction.

Corollary 4.4 (Early-stopping).

Under the assumptions of Theorem 4.1 or Theorem 4.2, choose $N = c/\delta$ for some $c > 0$. Then, for every $z = (x, u) \in \mathcal{Z}$, there exist constants $C_1$, $C_2$, and $C_3$ such that

$$\mathcal{L}(X_N, u) - \mathcal{L}(x, U_N) \le C_1 \delta \quad \text{and} \quad \|A X_N - b\|^2 \le C_2 \delta + C_3 \delta^2.$$

The early stopping rule prescribed above is computationally efficient, in the sense that the number of iterations is proportional to the inverse of the noise level. In particular, if the error δ is small, then more iterations are useful, while if δ is big, it is convenient to stop sooner. So, the number of iterations plays the role of a regularization parameter. Using the early stopping strategy proposed above, we can see that the error in the data transfers to the error in the solution with the same noise level, which is the best that one can expect for a general operator A.
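In practice, the rule $N = c/\delta$ can be wrapped in a one-line helper; the following sketch (our naming, with rounding up to an integer iteration count) only illustrates the inverse proportionality between the noise level and the iteration budget.

```python
import math

def early_stopping_iterations(c, delta):
    """Early stopping rule N = c / delta: fewer iterations for noisier data."""
    return max(1, math.ceil(c / delta))
```

For instance, halving the noise level doubles the number of iterations the rule allows.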

Remark 4.5 (Comparison with Tikhonov regularization).

The reconstruction properties of the proposed algorithms are comparable to those obtained using Tikhonov regularization [35, 10], with the same dependence on the noise level. We underline that in [10, Theorem 5.1] only the Bregman divergence is considered, and not the feasibility. In addition, iterative regularization is much more efficient from the computational point of view, as it requires the solution of only one optimization problem, while Tikhonov regularization amounts to solving a family of problems indexed by the regularization parameter. Let us also note that, when δ is unknown, any principle used to determine a suitable λ can be used to determine the stopping time.

5 Implementation details

In this section, we discuss some standard choices to construct non-expansive operators T that satisfy our assumptions and encode some redundant information on the solution set. We first present examples for (PDA), and later for (DPA).

To define the operators, we first recall how to compute the projection onto the constraint determined by each datum. For every $j \in [d]$, we denote by $a_j$ the j-th row of A and by $P_j$ the projection onto the j-th linear equation, namely

$$P_j \colon \mathbb{R}^p \to \mathbb{R}^p, \quad x \mapsto x + \frac{b_j - \langle a_j, x \rangle}{\|a_j\|^2}\, a_j^*.$$

Analogously, for every $j \in [d]$, we denote by $P_j^\delta$ the projection operator as in the previous definition, but with the noisy data $b^\delta$ instead of b.
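The single-equation projection admits a direct one-line implementation; a NumPy sketch (function name is ours) that can serve either as $P_j$ or as $P_j^\delta$, depending on whether the exact or the noisy right-hand side is passed:

```python
import numpy as np

def proj_equation(x, a, beta):
    """Projection of x onto the hyperplane {z : <a, z> = beta}."""
    return x + (beta - a @ x) / (a @ a) * a
```

After the call, the projected point satisfies the j-th equation exactly, and applying the projection twice gives the same result as applying it once.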

We proceed to define the four families of operators proposed in this paper for (PDA).

Definition 5.1.

Consider the operator $T \colon \mathbb{R}^p \to \mathbb{R}^p$.

  1. T is a serial projection if

    $$T = P_{\beta_l}^\delta \circ \cdots \circ P_{\beta_1}^\delta,$$

    where $\beta_j \in [d]$ for every $j \in [l]$.

  2. T is a parallel projection if

    $$(5.1) \quad T = \sum_{j=1}^l \alpha_j P_{\beta_j}^\delta,$$

    where, for every $j \in [l]$, $\beta_j \in [d]$ and $(\alpha_j)_{j=1}^l$ are real numbers in $[0, 1]$ such that $\sum_{j=1}^l \alpha_j = 1$.

  3. T is a Landweber operator with parameter α if

    $$(5.2) \quad T \colon \mathbb{R}^p \to \mathbb{R}^p, \quad x \mapsto x - \alpha A^*(Ax - b^\delta),$$

    where $\alpha \in \left]0, \frac{2}{\|A\|^2}\right[$.

  4. T is a Landweber operator with adaptive step and parameter M if

    $$(5.3) \quad T \colon \mathbb{R}^p \to \mathbb{R}^p, \quad x \mapsto \begin{cases} x - \beta(x) A^*(Ax - b^\delta) & \text{if } A^* A x \neq A^* b^\delta, \\ x & \text{otherwise}, \end{cases}$$

    where, for $M > 0$,

    $$\beta(x) = \min\left(\frac{\|Ax - b^\delta\|^2}{\|A^*(Ax - b^\delta)\|^2}, M\right).$$
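The four families above can be sketched directly in NumPy; this is an illustration under our own naming (`rows` plays the role of the indices $\beta_j$), not the authors' code.

```python
import numpy as np

def serial_projection(x, A, b_delta, rows):
    """Serial projection: project onto the listed equations in sequence."""
    for j in rows:
        a = A[j]
        x = x + (b_delta[j] - a @ x) / (a @ a) * a
    return x

def parallel_projection(x, A, b_delta, rows, weights):
    """Parallel projection: convex combination of single-equation projections."""
    return sum(w * (x + (b_delta[j] - A[j] @ x) / (A[j] @ A[j]) * A[j])
               for j, w in zip(rows, weights))

def landweber(x, A, b_delta, alpha):
    """Landweber operator: gradient step on the residual, alpha in ]0, 2/||A||^2[."""
    return x - alpha * (A.T @ (A @ x - b_delta))

def landweber_adaptive(x, A, b_delta, M):
    """Landweber operator with steepest-descent step capped at M, cf. (5.3)."""
    r = A @ x - b_delta
    g = A.T @ r
    if np.allclose(g, 0.0):
        return x
    beta = min((r @ r) / (g @ g), M)
    return x - beta * g
```

On a consistent system, each of these operators does not increase the distance to a solution, in line with assumption (A3).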

The next lemma states that the operators in Definition 5.1 satisfy assumption (A3).

Lemma 5.2.

Let $T \colon \mathbb{R}^p \to \mathbb{R}^p$ be one of the operators given in Definition 5.1. Then assumption (A3) holds with the following error coefficients:

  1. If T is a serial projection, then $e_T = \sum_{j=1}^l \frac{1}{\|a_{\beta_j}\|^2}$.

  2. If T is a parallel projection, then $e_T = \sum_{j=1}^l \frac{\alpha_j}{\|a_{\beta_j}\|^2}$.

  3. If T is the Landweber operator with parameter α, then $e_T = \frac{\alpha}{2 - \alpha \|A\|^2}$.

  4. If T is the Landweber operator with adaptive step and parameter M, then $e_T = M$.

Remark 5.3 (Relationship between parallel projection and Landweber operator).

A particular parallel projection is the one corresponding to $l = d$, $\beta_j = j$, and

$$\alpha_j = \frac{\|a_j\|^2}{\|A\|_F^2}.$$

Then (5.1) reduces to

$$(5.4) \quad T(x) = x - \frac{1}{\|A\|_F^2} A^*(Ax - b^\delta).$$

Observe that, since $\|A\| \le \|A\|_F$, the previous is a special case of the Landweber operator with $\alpha = 1/\|A\|_F^2$.
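As a quick numerical sanity check of this identity (on a random instance of our own, not from the paper), the parallel projection with weights $\alpha_j = \|a_j\|^2 / \|A\|_F^2$ coincides with a Landweber step of size $1/\|A\|_F^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
b_delta = rng.standard_normal(4)
x = rng.standard_normal(3)
fro2 = np.linalg.norm(A, 'fro') ** 2

# parallel projection (5.1) with weights alpha_j = ||a_j||^2 / ||A||_F^2 over all rows
par = sum((A[j] @ A[j]) / fro2
          * (x + (b_delta[j] - A[j] @ x) / (A[j] @ A[j]) * A[j])
          for j in range(4))

# Landweber step (5.4) with alpha = 1 / ||A||_F^2
lw = x - A.T @ (A @ x - b_delta) / fro2
```

The two vectors `par` and `lw` agree up to floating-point accuracy.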

Remark 5.4 (Steepest descent).

Let $\bar x \in \mathbb{R}^p$ be such that $A \bar x = b$. Then, from (5.3), we derive (see also equation (A.37))

$$(5.5) \quad \|Tx - \bar x\|^2 = \|x - \bar x\|^2 - 2 \beta(x) \langle b^\delta - b, Ax - b^\delta \rangle - 2 \beta(x) \|Ax - b^\delta\|^2 + \beta(x)^2 \|A^*(Ax - b^\delta)\|^2.$$

If $\delta = 0$, then the choice of $\beta(x)$ given in (5.3) minimizes the right-hand side of (5.5), provided that the minimizer is smaller than M. In this case, β is chosen in order to maximize the contractivity with respect to a fixed point of T. While we cannot repeat the same procedure for $\delta > 0$, since we do not know b, we still keep the same choice. If $b^\delta \in \operatorname{ran}(A)$, then

$$\sup_{x \in \mathbb{R}^p} \frac{\|Ax - b^\delta\|^2}{\|A^*(Ax - b^\delta)\|^2} < +\infty.$$

However, in general, if $\delta > 0$, this is not true and M is needed to ensure that $\beta(x)$ is bounded.

Remark 5.5.

From a computational point of view, parallel projections and Landweber operators are more efficient than serial projections. In particular, note that the quantity $A x_k - b^\delta$ needs to be computed anyway in the other steps of the algorithm.

While for the primal space the data constraints that we want to reuse are clearly given by the linear constraints, for the dual space there is not always a natural choice. In the following, we present an example related to $\ell^1$-norm regularization. A similar implementation can be extended to the case of 1-homogeneous penalty functions, for which the Fenchel conjugate is the indicator of a closed and convex subset of the dual space [7, Proposition 14.11 (ii)].

Example 5.6.

Consider the noisy version of problem ($\mathcal{P}$) with $J(x) = \|x\|_1$. Then the dual is given by

$$\min_{u \in \mathbb{R}^d} \{\langle b^\delta, u \rangle : |(A^* u)_i| \le 1 \text{ for every } i \in [p]\}.$$

For every $i \in [p]$, set $D_i = \{u \in \mathbb{R}^d : |(A^* u)_i| \le 1\}$ and denote by $T_i$ the projection onto $D_i$. Note that this projection is easy to compute (see, for example, [7, Example 28.17]), since it is the projection onto the intersection of two parallel half-spaces. Clearly, assumption (A4) holds. Differently from the primal case, here we are projecting onto exact constraints which are independent of the noisy data $b^\delta$.
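The projection onto $D_i$ has a closed form: clip the scalar $(A^* u)_i = \langle A_i, u \rangle$ to $[-1, 1]$ and correct u along the i-th column of A. A sketch under our own naming:

```python
import numpy as np

def proj_dual_constraint(u, A, i):
    """Projection onto D_i = {u : |(A^* u)_i| <= 1}, a slab between two hyperplanes."""
    c = A[:, i]                 # (A^* u)_i = <A_i, u>, with A_i the i-th column of A
    t = c @ u
    return u - (t - np.clip(t, -1.0, 1.0)) / (c @ c) * c
```

If u already satisfies the i-th dual constraint, the operator leaves it unchanged; otherwise the projected point satisfies the constraint with equality.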

6 Numerical results

In this section, to test the efficiency of the proposed algorithms, we perform numerical experiments in two settings: sparse reconstruction with $\ell^1$-norm regularization, and image denoising and deblurring with total variation regularization. For the $\ell^1$-norm regularization, we compare our results with other regularization techniques. In the more complex problem of total variation, we explore the properties of different variants of our procedure.

Code statement:

All numerical examples are implemented in MATLAB on a laptop. In the second experiment, we also use the library Numerical tours [59]. The corresponding code can be downloaded at: https://github.com/cristianvega1995/L1-TV-Experiments-of-Fast-iterative-regularization-by-reusing-data-constraints

6.1 ℓ₁-norm regularization

In this section, we apply the routines (PDA) and (DPA) when J is the ℓ₁-norm. We compare the results given by our method with two state-of-the-art regularization procedures: iterative regularization by vanilla primal-dual [49], and Tikhonov explicit regularization, solving each problem with the forward-backward algorithm [30]. In addition, we compare with another classical optimization algorithm for the minimization of the sum of two non-differentiable functions, namely the Douglas–Rachford algorithm [14]. In the noise-free case, this algorithm is very effective in terms of number of iterations, but each iteration requires an explicit projection onto the feasible set. In the noisy case, a stability analysis of the latter is not available.

We use the four variants of the algorithm (PDA) corresponding to the different choices of the operators T in Definition 5.1, and the version of (DPA) described in Example 5.6. Unless otherwise stated, in all experiments we use the preconditioners Σ = Γ = (0.99/‖A‖) Id, which satisfy (3.2).

Let d = 2260, p = 3000, and let A ∈ ℝ^{d×p} be such that every entry of the matrix is an independent sample from 𝒩(0,1), then normalized column by column. We set b := A x*, where x* ∈ ℝ^p is a sparse vector with approximately 300 nonzero entries uniformly distributed in the interval [0,1]. It follows from [36, Theorem 9.18] that x* is the unique minimizer of the problem with probability greater than 0.99. Let b^δ = b + ‖b‖u, where the vector u is distributed, entry-wise, as U[−0.2, 0.2]. In this experiment, to test the reconstruction capabilities of our method, we use the exact datum x* to establish the best stopping time, i.e. the one minimizing ‖x_k − x*‖. The exact solution is also used to tune the other regularization techniques. In a real application, where both x* and δ are unknown, parameter tuning techniques would be needed to select the stopping time, but we do not address this aspect here.
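A minimal sketch of this experimental setup, at a reduced scale and under one plausible reading of the flattened noise formula (b^δ = b + ‖b‖u). This is Python, whereas the paper's code is MATLAB; the sizes and the seed here are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, s = 226, 300, 30           # reduced from the paper's d = 2260, p = 3000, ~300 nonzeros

# Gaussian sensing matrix with unit-norm columns
A = rng.standard_normal((d, p))
A /= np.linalg.norm(A, axis=0)

# sparse ground truth with s nonzero entries uniform in [0, 1]
x_star = np.zeros(p)
support = rng.choice(p, size=s, replace=False)
x_star[support] = rng.uniform(0.0, 1.0, size=s)

b = A @ x_star
# noisy datum: b_delta = b + ||b|| * u with u ~ U[-0.2, 0.2]^d (our reading)
b_delta = b + np.linalg.norm(b) * rng.uniform(-0.2, 0.2, size=d)
```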

We detail the used algorithms and their parameters below:

  1. Tikhonov regularization (Tik). We consider a grid of penalty parameters

    G = { (1 − (l−1)/5) · 10^{1−d} · ‖A‖ ‖b^δ‖ : l ∈ [5], d ∈ [6] }

    and, for each value λ G , the optimization problem

    (6.1) min_{x ∈ ℝ^p} { λ‖x‖₁ + ½‖Ax − b^δ‖² }.

    We solve each of the previous problems with 300 iterations of the forward-backward algorithm, unless the stopping criterion ‖x_{k+1} − x_k‖ ≤ 10^{−3} is satisfied for some k < 300. Moreover, to deal efficiently with the sequence of problems, we use warm restarts [9]: we first solve problem (6.1) for the biggest value of λ in G; then we initialize the algorithm for the next value of λ, in decreasing order, with the solution reached for the previous one, and so on.
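The Tikhonov baseline with warm restarts can be sketched as follows. This is a Python sketch (the paper's code is MATLAB); the function names and the scalar step 1/‖A‖² for the forward-backward iteration are our choices.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximity operator of t*||.||_1 (componentwise soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(A, b, lam, x0, max_iter=300, tol=1e-3):
    """Forward-backward (ISTA) for min_x lam*||x||_1 + 0.5*||Ax - b||^2,
    stopped when ||x_{k+1} - x_k|| <= tol, as in the Tikhonov baseline."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the smooth part
    x = x0.copy()
    for _ in range(max_iter):
        x_new = soft_threshold(x - A.T @ (A @ x - b) / L, lam / L)
        if np.linalg.norm(x_new - x) <= tol:
            return x_new
        x = x_new
    return x

def tikhonov_path(A, b, lambdas):
    """Solve (6.1) on a decreasing grid of lambdas with warm restart:
    each run starts from the solution for the previous (larger) lambda."""
    x = np.zeros(A.shape[1])
    path = []
    for lam in sorted(lambdas, reverse=True):
        x = ista(A, b, lam, x)
        path.append((lam, x.copy()))
    return path
```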

  2. Douglas–Rachford (DR). See [14, Theorem 3.1].

  3. Primal-dual (PD). This corresponds to PDA with m = 1 and T 1 = Id .

  4. Primal-dual with serial projections (PDS). At every iteration, we compute a serial projection using all equations of the noisy system, where the order of the projections is given by a random shuffle.

  5. Primal-dual with parallel projections (PDP). Set m = 1 and

    T₁x = x − (1/‖A‖_F²) A*(Ax − b^δ);

    see Remark 5.3.

  6. Primal-dual Landweber (PDL). Set m = 1 and

    T₁x = x − (2/‖A‖²) A*(Ax − b^δ).

  7. Primal-dual Landweber with adaptive step (PDAL). Set m = 1 and T₁x = x − β(x) A*(Ax − b^δ), where

    β(x) = min( ‖Ax − b^δ‖² / ‖A*(Ax − b^δ)‖², M )  for M = 10⁶.

  8. Dual primal with serial projections (DPS). At every iteration, we compute serial projections over the inequalities |(A*u)_i| ≤ 1, where the order is given by a random shuffle of the rows of A*.
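As an illustration of the serial activation used in (PDS), one randomly shuffled sweep of hyperplane projections (a Kaczmarz-type sweep) over the rows of the noisy system can be sketched as follows (a Python sketch; the function name is ours):

```python
import numpy as np

def serial_projection_sweep(A, b_delta, x, rng):
    """One serial sweep of projections onto the hyperplanes a_j x = b_j
    of the noisy system, in a randomly shuffled order: each step projects
    the current point onto a single row equation (Kaczmarz step)."""
    x = x.copy()
    for j in rng.permutation(A.shape[0]):
        a = A[j]
        x -= (a @ x - b_delta[j]) / (a @ a) * a
    return x
```

Repeating such sweeps drives the iterate toward the (noisy) feasible set; in the algorithm, a single sweep per iteration is used as the activation step T.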

Table 2

Run-time and number of iterations of each method until it reaches the best reconstruction error. We compare the proposed algorithms with Tikhonov regularization (Tik), Douglas–Rachford (DR), and iterative regularization (PD).

Time [in seconds] Iteration Reconstruction error
Tik 1.89 109 3.07
DR 3.08 5 5.01
PD 0.36 14 3.11
PDS 1.41 11 2.58
PDP 0.35 14 3.11
PDL 0.28 12 2.60
PDAL 0.27 11 2.56
DPS 0.54 17 2.83
Figure 1: Graphical representation of early stopping. Note that the reconstruction error decreases and then increases, since the iterates first approach the exact solution and then converge to the noisy solution.

Figure 2: Early stopping with respect to the feasibility gap. Note that its behavior with respect to k is similar to that in Figure 1.

Figure 3: Reconstruction error of Tikhonov method with different penalties.

In Table 2, we also report the number of iterations needed to achieve the best reconstruction error. However, it is important to note that an iteration of each method has a different computational cost, so the run-time is a more appropriate comparison criterion.

Douglas–Rachford with early stopping is the regularization method performing worst in this example, both in terms of time and reconstruction error. This behavior may be explained by the fact that the algorithm converges fast (i.e. in few iterations) to the noisy solution, from which we infer that Douglas–Rachford is not well suited for iterative regularization. Moreover, since we project onto the noisy feasible set at every iteration, the resolution of a linear system is needed at every step, which also explains the cost of each iteration in terms of time. Note in addition that in our example b^δ is in the range of A, so the noisy feasible set is nonempty. Tikhonov regularization performs similarly in terms of time, but it requires many more (cheaper) iterations (see Figure 3). The achieved error is smaller than that of DR, but bigger than the minimal one achieved by the other methods.

Regarding our proposals, we observe in Table 2 that the proposed methods perform better than (PD). This supports the idea that reusing the constraints determined by the data is beneficial with respect to vanilla primal-dual. The benefit is not evident for (PDP), which achieves the worst reconstruction error among them: since ‖A‖_F² is very big, T₁ is very close to the identity. All other methods give better results in terms of reconstruction error. On the other hand, (PDS) is the slowest, since it requires computing several projections at each iteration in a serial manner. We also observe that (PDL) and (PDAL) have the best performance, improving upon (PD) by 16.4% and 17.7% in reconstruction error and by 22.2% and 25.0% in run-time, respectively.

Figure 1 empirically shows the existence of the trade-off between convergence and stability for all algorithms, and therefore the advantage of early stopping. Similar results were obtained for the feasibility gap (see Figure 2).

6.2 Total variation

In this section, we perform several numerical experiments using the proposed algorithms for image denoising and deblurring. As in the classical image denoising method introduced by Rudin, Osher, and Fatemi in [67], we rely on the total variation regularizer; see also [26, 28, 56, 58, 66, 79]. We compare (PD) with the (PDL) and (PDAL) algorithms, which performed best in the previous application. In this section, we also use two different preconditioners, which have proven very efficient in practice [60].

Let x* ∈ ℝ^{N²} represent an image with N × N pixels taking values in [0,1]. We want to recover x* from a blurry and noisy measurement y, i.e. from

y = K x* + ζ,

where K is a linear blurring operator and ζ is a random noise vector. A standard approach is to assume that the original image is well approximated by the solution of the following constrained minimization problem:

(TV) min_{u ∈ ℝ^{N×N}} { ‖Du‖_{1,2} : Ku = y }.

Here,

‖·‖_{1,2} : (ℝ²)^{N×N} → ℝ, p ↦ Σ_{i=1}^N Σ_{j=1}^N ‖p_{ij}‖,

and D : ℝ^{N²} → (ℝ²)^{N²} is the discrete gradient operator for images, which is defined by

(Du)_{ij} = ((D_y u)_{ij}, (D_x u)_{ij})

with

(D_y u)_{ij} = u_{i+1,j} − u_{i,j} if 1 ≤ i ≤ N−1, and (D_y u)_{ij} = 0 if i = N;
(D_x u)_{ij} = u_{i,j+1} − u_{i,j} if 1 ≤ j ≤ N−1, and (D_x u)_{ij} = 0 if j = N.
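A minimal sketch of the discrete gradient defined above, in Python (the paper's code is MATLAB); the (N, N, 2) array layout is our choice.

```python
import numpy as np

def grad(u):
    """Discrete gradient D: forward differences, with the last difference
    in each direction set to 0, as in the boundary cases above.
    Returns an (N, N, 2) array holding (D_y u, D_x u) at each pixel."""
    N = u.shape[0]
    Du = np.zeros((N, N, 2))
    Du[:-1, :, 0] = u[1:, :] - u[:-1, :]   # (D_y u)_{ij} = u_{i+1,j} - u_{i,j}
    Du[:, :-1, 1] = u[:, 1:] - u[:, :-1]   # (D_x u)_{ij} = u_{i,j+1} - u_{i,j}
    return Du
```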

In order to avoid the computation of the proximity operator of ‖D·‖_{1,2}, we introduce an auxiliary variable

v = Du ∈ Y := (ℝ²)^{N²}.

Since the value of each pixel must belong to [0,1], we add the constraint u ∈ X := [0,1]^{N²}. In this way, (TV) becomes

(TV) min_{(u,v) ∈ X×Y} { ‖v‖_{1,2} : Ku = y, Du = v }.

6.2.1 Formulation and algorithms

Problem (TV) is a special instance of ((${\mathcal{P}}$)), with

J : ℝ^{N²} × (ℝ²)^{N²} → ℝ ∪ {+∞},  x := (u, v) ↦ ‖v‖_{1,2} + ι_X(u),
A = [ K 0 ; D −Id ],  b^δ = [ y ; 0 ],  p = d = 3N².

Clearly, A is a nonzero linear operator, and J ∈ Γ₀(ℝ^{N²} × (ℝ²)^{N²}).

Table 3

General form of the algorithms.

Primal-dual for total variation
Input: (p⁰, p⁻¹, x⁰, v⁰) ∈ ℝ^{6N²} × (ℝ²)^{N²} and (q⁰, q⁻¹, z⁰, w⁰) ∈ ℝ^{3N²} × ℝ^{N²}.

For k = 1, …, L:

(6.2)
v^{k+1} = v^k + Γ(K(p^k + x^k − p^{k−1}) − y),
w^{k+1} = w^k − Γ(q^k + z^k − q^{k−1}) + ΓD(p^k + x^k − p^{k−1}),
x^{k+1} = P_X(p^k − ΣK*v^{k+1} + Σw^{k+1}),
z^{k+1} = prox_{Σ‖·‖_{1,2}}(q^k − ΣD*w^{k+1}),
p^{k+1} = x^k − α(x^k)(K*(Kx^k − y) + D*(Dx^k − z^k)),
q^{k+1} = q^k − α(x^k)D*(Dx^k − z^k)

End

We compare the algorithms listed below. Note that all proposed algorithms are different instances of the general routine described in Table 3, and each of them corresponds to a different choice of α ( x k ) :

  1. PD, the vanilla primal-dual algorithm, corresponding to α ( x k ) = 0 .

  2. PPD, the preconditioned primal-dual algorithm, obtained by α ( x k ) = 0 and Σ and Γ as in [60, Lemma 2].

  3. PDL, corresponding to α(x_k) = 1/‖A‖².

  4. PDAL, corresponding to α(x_k) = β(x_k) as in (5.3).

Initializing with p⁰ = p̄⁰ = x⁰ and q⁰ = q̄⁰ = z⁰, we recover the results of Theorem 4.1 and Corollary 4.4.

Remark 6.1.

In order to implement the algorithm in (6.2), we first need to compute some operators:

  1. It follows from [7, Proposition 24.11] and [7, Example 24.20] that

    prox^Σ_{‖·‖_{1,2}}(v) = (prox_{Σ‖·‖}(v_i))_{i=1}^{N²} = ((1 − Σ/max{Σ, ‖v_i‖}) v_i)_{i=1}^{N²},

    where v_i ∈ ℝ². The projection onto X can be computed as

    P_X(u) = (P_{[0,1]}(u_i))_{i=1}^{N²},

    where

    P_{[0,1]}(u_i) = min{1, max{u_i, 0}}.

  2. It follows from [26] that −D*p = div p, with

    (div p)_{i,j} = { (p₁)_{i,j} − (p₁)_{i−1,j} if 1 < i < N; (p₁)_{i,j} if i = 1; −(p₁)_{i−1,j} if i = N }
    + { (p₂)_{i,j} − (p₂)_{i,j−1} if 1 < j < N; (p₂)_{i,j} if j = 1; −(p₂)_{i,j−1} if j = N }.
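The divergence formula above, together with the matching forward-difference gradient, can be checked numerically against the adjoint identity ⟨Du, p⟩ = −⟨u, div p⟩. A self-contained Python sketch (the array layout is our choice, not from the paper's code):

```python
import numpy as np

def grad(u):
    """Forward-difference gradient D, zero at the last row/column."""
    N = u.shape[0]
    Du = np.zeros((N, N, 2))
    Du[:-1, :, 0] = u[1:, :] - u[:-1, :]   # vertical differences
    Du[:, :-1, 1] = u[:, 1:] - u[:, :-1]   # horizontal differences
    return Du

def div(p):
    """Discrete divergence div = -D*, following the boundary cases above."""
    N = p.shape[0]
    d = np.zeros((N, N))
    # vertical component p1 = p[..., 0]
    d[0, :] += p[0, :, 0]
    d[1:-1, :] += p[1:-1, :, 0] - p[:-2, :, 0]
    d[-1, :] += -p[-2, :, 0]
    # horizontal component p2 = p[..., 1]
    d[:, 0] += p[:, 0, 1]
    d[:, 1:-1] += p[:, 1:-1, 1] - p[:, :-2, 1]
    d[:, -1] += -p[:, -2, 1]
    return d
```

This is the operator pair (D, −div) needed to run the algorithm of Table 3.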

6.2.2 Numerical results

Figure 4: Qualitative comparison of the four proposed methods.

Set N = 256 and let x* be the image “boat” from the library Numerical tours [59]. We suppose that K is an operator assigning to every pixel the average of the pixels in a neighborhood of radius 8, and that ζ ~ U([−0.025, 0.025])^{N²}. We use the original image as the exact solution. For denoising and deblurring, we early stop the procedure at the iteration minimizing the mean square error (MSE), namely ‖x_k − x*‖²/N², and we measure the time and the number of iterations needed to reach it. Another option for early stopping could be to select the image with maximal structural similarity (SSIM); numerically, in our experiments, this gives the same results. Additionally, we use the peak signal-to-noise ratio (PSNR) to compare the images. Note that the primal-dual algorithm with preconditioning is the method that needs the least time and fewest iterations among all procedures. Moreover, due to [60, Lemma 2], the condition (3.2) is automatically satisfied, while for the other methods we need to check it explicitly, which is computationally costly. However, (PPD) is the worst in terms of SSIM, PSNR, and MSE. We verify that all other algorithms have superior performance in terms of reconstruction, with a small advantage for the Landweber variants with fixed and adaptive step-sizes, which reduce the MSE by 94% with respect to the noisy image. In addition, compared to (PD), the algorithms (PDL) and (PDAL) require fewer iterations and less time to satisfy the early stopping criterion. We believe that this is due to the fact that the extra Landweber operator improves the feasibility of the primal iterates. A visual assessment of the denoised and deblurred images is shown in Figure 4, which highlights the regularization properties achieved by the addition of the Landweber operator and confirms the previous conclusions.
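The early stopping metrics used here can be sketched as follows (a Python sketch; `peak` defaults to 1 since pixel values lie in [0, 1]):

```python
import numpy as np

def mse(x, x_star):
    """Mean square error ||x - x_star||^2 / N^2 for an N x N image."""
    return np.sum((x - x_star) ** 2) / x.size

def psnr(x, x_star, peak=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, peak]."""
    return 10.0 * np.log10(peak ** 2 / mse(x, x_star))
```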

Table 4

Quantitative comparison of the algorithms in terms of structural similarity (SSIM), peak signal-to-noise ratio (PSNR), mean square error (MSE), time, and iterations to reach the early stopping.

Iterations Time [s] SSIM PSNR MSE
Noisy image 0.4468 21.4801 0.0071
PD 54 8.9773 0.8928 32.3614 0.0006
PPD 5 1.5515 0.8581 27.3753 0.0018
PDL 46 7.1846 0.9066 34.2174 0.0004
PDAL 31 5.4542 0.9112 34.3539 0.0004

7 Conclusion and future work

In this paper, we studied two new iterative regularization methods for solving a linearly constrained minimization problem, based on an extra activation step reusing the data constraints. The analysis was carried out in the context of convex functions and worst-case deterministic noise. We proposed five instances of our algorithm and compared their numerical performance with state-of-the-art methods, and we observed considerable improvement in run-time.

In the future, we would like to extend Theorem 4.1 to structured convex problems and other algorithms. Possible extensions are: (1) the study of problems including, in the objective function, an L-smooth term and a composite linear term; (2) the analysis of random updates in the dual variable (see [27]) and stochastic approximations for the gradient; (3) the theoretical study of the impact of different preconditioners; (4) the improvement of the convergence and stability rates for strongly convex objective functions.

Funding statement: This project has been supported by the TraDE-OPT project, which received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No. 861137. The third author acknowledges the financial support of the European Research Council (grant SLING 819789), the AFOSR projects FA9550-18-1-7009 (European Office of Aerospace Research and Development), the EU H2020-MSCA-RISE project NoMADS-DLV-777826, and the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216. The third and fourth authors acknowledge the support of the AFOSR project FA8655-22-1-7034. The research by the first, second, and fourth authors has been supported by the MIUR Excellence Department Project awarded to Dipartimento di Matematica, Università di Genova, CUP D33C23001110001. The second and fourth authors are members of the Gruppo Nazionale per l’Analisi Matematica, la Probabilità e le loro Applicazioni (GNAMPA) of the Istituto Nazionale di Alta Matematica (INdAM). This work represents only the view of the authors. The European Commission and the other organizations are not responsible for any use that may be made of the information it contains.

A Proofs

A.1 Equivalence between primal-dual and dual-primal algorithms

In the following lemma, we establish that, if T = Id and the initialization is the same, then the k-th primal iterates of (PDA) and (DPA), denoted by x_PD^k and x_DP^k respectively, coincide.

Lemma A.1.

Let

(p̄_PD^0, x_PD^0, u_PD^0) ∈ ℝ^{2p} × ℝ^d  and  (v̄_DP^0, u_DP^0, x_DP^0) ∈ ℝ^{2d} × ℝ^p

be the initializations of (PDA) and (DPA), respectively, in the case when m = 1 and T = Id. Suppose that x_PD^0 = p̄_PD^0, u_DP^0 = v̄_DP^0, u_PD^0 = v_DP^0, and x_PD^1 = x_DP^1. Then, for every k ∈ ℕ, x_PD^k = x_DP^k.

Proof.

Since m = 1 and T = Id in both algorithms, for every k , we have x P D k = p P D k and u D P k = v D P k . On one hand, by the definition of (PDA), we have that

(A.1)

u P D k + 1 = u P D 1 + Γ i = 1 k ( A p ¯ P D i - b δ )
= u P D 1 + i = 1 k Γ A ( p P D i - p P D i - 1 ) + Γ i = 1 k ( A x P D i - b δ )
= u P D 1 + Γ A ( p P D k - p P D 0 ) + Γ i = 1 k ( A x P D i - b δ )
= u P D 0 + Γ ( A x P D k - b δ ) + Γ i = 1 k ( A x P D i - b δ ) ,

where the last equality is obtained since p P D 0 = p ¯ P D 0 . Replacing (A.1) in the definition of x P D k + 1 , we obtain

(A.2) x P D k + 1 = prox J Σ ( x P D k - Σ A * ( u P D 0 + Γ ( A x P D k - b δ ) + Γ i = 1 k ( A x P D i - b δ ) ) ) .

On the other hand, by (DPA) we have that

u D P k + 1 = v D P k + 1 = v D P 0 + Γ i = 1 k + 1 ( A x D P i - b δ )

and

(A.3) v ¯ D P k = v D P k + u D P k - v D P k - 1 = v D P 0 + Γ ( A x D P k - b δ ) + Γ i = 1 k ( A x D P i - b δ ) .

Replacing (A.3) in (DPA), for every k > 1 , we can deduce that

x D P k + 1 = prox J Σ ( x D P k - Σ A * ( v D P 0 + Γ ( A x D P k - b δ ) + Γ i = 1 k ( A x D P i - b δ ) ) ) .

Since u P D 0 = v D P 0 and x P D 1 = x D P 1 , the result follows by induction. ∎

Remark A.2.

An analysis similar to that in the proof of Lemma A.1 shows that

x P D k + 1 = prox J Σ ( x P D k - Σ A * ( u P D 0 + Γ ( A T ϵ k x P D k - b δ ) + Γ i = 1 k ( A x P D i - b δ ) ) ) ,

which implies that the algorithm can be written in one step if we only care about the primal variable.

A.2 Proof of Theorem 4.1

Proof.

From (PDA), we deduce that

Σ - 1 ( p k - x k + 1 ) - A * u k + 1 J ( x k + 1 ) ,
(A.4) Γ - 1 ( u k - u k + 1 ) + A p ¯ k = b δ .

Therefore, we have

(A.5) J ( x k + 1 ) + Σ - 1 ( p k - x k + 1 ) - A * u k + 1 x - x k + 1 J ( x ) for all  x p ,

and (A.5) yields

(A.6)

0 J ( x k + 1 ) - J ( x ) + Σ - 1 ( p k - x k + 1 ) - A * u k + 1 x - x k + 1
= J ( x k + 1 ) - J ( x ) + p k - x k + 1 Σ - 1 2 2 + x k + 1 - x Σ - 1 2 2 - p k - x Σ - 1 2 2 + x k + 1 - x A * u k + 1 .

Analogously, by (A.4), we get

(A.7)

0 = Γ - 1 ( u k - u k + 1 ) + A p ¯ k - b δ u - u k + 1
= u k + 1 - u k Γ - 1 2 2 + u k + 1 - u Γ - 1 2 2 - u k - u Γ - 1 2 2 + b δ - A p ¯ k u k + 1 - u .

Recall that

z := ( x , u ) 𝒵 C × d , z k := ( x k , u k ) , V ( z ) := x Σ - 1 2 2 + u Γ - 1 2 2 .

Summing (A.6) and (A.7), and by assumption (A3), we obtain

(A.8)

J ( x k + 1 ) - J ( x ) + x k + 1 - p k Σ - 1 2 2 + u k + 1 - u k Γ - 1 2 2 + V ( z k + 1 - z ) - V ( z k - z )
+ A ( x k + 1 - x ) u k + 1 + b δ - A p ¯ k u k + 1 - u - e δ 2 2
0 .

Now, compute

(A.9)

J ( x k + 1 ) - J ( x ) + A ( x k + 1 - x ) u k + 1 + b δ - A p ¯ k u k + 1 - u
= ( x k + 1 , u ) - ( x , u k + 1 ) - A x k + 1 - b u + A x - b u k + 1
+ A ( x k + 1 - x ) u k + 1 + b δ - A p ¯ k u k + 1 - u
= ( x k + 1 , u ) - ( x , u k + 1 ) - A x k + 1 u + b u + A x u k + 1 - b u k + 1
+ A x k + 1 u k + 1 - A x u k + 1 + b δ u k + 1 - u - A p ¯ k u k + 1 - u
= ( x k + 1 , u ) - ( x , u k + 1 ) + b δ - b u k + 1 - u + A x k + 1 - A p ¯ k u k + 1 - u
( x k + 1 , u ) - ( x , u k + 1 ) - δ Γ 1 2 u k + 1 - u Γ - 1 + A x k + 1 - A p ¯ k u k + 1 - u .

From (A.9) and (A.8), we obtain

( x k + 1 , u ) - ( x , u k + 1 ) + x k + 1 - p k Σ - 1 2 2 + u k + 1 - u k Γ - 1 2 2
    + V ( z k + 1 - z ) - V ( z k - z ) - δ Γ 1 2 u k + 1 - u Γ - 1 - e δ 2 2
(A.10) - A ( x k + 1 - p ¯ k ) u k + 1 - u
= - A ( x k + 1 - p k ) u k + 1 - u + A ( x k - p k - 1 ) u k - u + A ( x k - p k - 1 ) u k + 1 - u k
= - A ( x k + 1 - p k ) u k + 1 - u + A ( x k - p k - 1 ) u k - u
    + Γ 1 2 A Σ 1 2 Σ - 1 2 ( x k - p k - 1 ) Γ - 1 2 ( u k + 1 - u k )
- A ( x k + 1 - p k ) u k + 1 - u + A ( x k - p k - 1 ) u k - u
(A.11)     + Γ 1 2 A Σ 1 2 2 u k + 1 - u k Γ - 1 2 2 + x k - p k - 1 Σ - 1 2 2 .

Then, recalling that α = 1 - Γ 1 2 A Σ 1 2 2 , we have the following estimate:

( x k + 1 , u ) - ( x , u k + 1 ) + x k + 1 - p k Σ - 1 2 2 - x k - p k - 1 Σ - 1 2 2 + α 2 u k + 1 - u k Γ - 1 2 + V ( z k + 1 - z ) - V ( z k - z )
δ Γ 1 2 u k + 1 - u Γ - 1 - A ( x k + 1 - p k ) u k + 1 - u + A ( x k - p k - 1 ) u k - u + e δ 2 2 .

Summing from 1 to N - 1 , we obtain

(A.12)

k = 1 N - 1 ( ( x k + 1 , u ) - ( x , u k + 1 ) ) + x N - p N - 1 Σ - 1 2 2 - x 1 - p 0 Σ - 1 2 2
+ α 2 k = 1 N - 1 u k + 1 - u k Γ - 1 2 + V ( z N - z ) - V ( z 1 - z ) - A ( x 1 - p 0 ) u 1 - u
δ Γ 1 2 k = 1 N u k - u Γ - 1 - Γ 1 2 A Σ 1 2 Σ - 1 2 ( x N - p N - 1 ) Γ - 1 2 ( u N - u ) + ( N - 1 ) e δ 2 2
δ Γ 1 2 k = 1 N u k - u Γ - 1 + x N - p N - 1 Σ - 1 2 2 + Γ 1 2 A Σ 1 2 2 u N - u Γ - 1 2 2 + ( N - 1 ) e δ 2 2 .

Now, by choosing k = 0 in (A.10), we get

(A.13)

( x 1 , u ) - ( x , u 1 ) + x 1 - p 0 Σ - 1 2 2 + α 2 u 1 - u 0 Γ - 1 2 + V ( z 1 - z ) - V ( z 0 - z ) + A ( x 1 - p ¯ 0 ) u 1 - u
δ Γ 1 2 u 1 - u Γ - 1 + e δ 2 2 .

Adding (A.12) and (A.13), we obtain

(A.14)

k = 0 N - 1 ( ( x k + 1 , u ) - ( x , u k + 1 ) ) + α 2 u N - u Γ - 1 2 + k = 1 N α 2 u k - u k - 1 Γ - 1 2 + x N - x Σ - 1 2 2
δ Γ 1 2 k = 1 N u k - u Γ - 1 + V ( z 0 - z ) + N e δ 2 2 .

Next, by (A.11), we have the following estimate:

x k + 1 - p k Σ - 1 2 2 - A ( x k - p k - 1 ) u k + 1 - u k + u k + 1 - u k Γ - 1 2 2
+ ( x k + 1 , u ) - ( x , u k + 1 ) + V ( z k + 1 - z ) - V ( z k - z )
δ Γ 1 2 u k + 1 - u Γ - 1 - A ( x k + 1 - p k ) u k + 1 - u + A ( x k - p k - 1 ) u k - u + e δ 2 2 .

Summing from 1 to N - 1 , we obtain

(A.15)

k = 1 N - 1 ( x k + 1 - p k Σ - 1 2 2 - A ( x k - p k - 1 ) u k + 1 - u k + u k + 1 - u k Γ - 1 2 2 )
+ k = 1 N - 1 ( ( x k + 1 , u ) - ( x , u k + 1 ) ) + V ( z N - z ) - V ( z 1 - z ) - A ( x 1 - p 0 ) u 1 - u
δ Γ 1 2 k = 1 N - 1 u k + 1 - u Γ - 1 - A ( x N - p N - 1 ) u N - u + ( N - 1 ) e δ 2 2
= δ Γ 1 2 k = 1 N - 1 u k + 1 - u Γ - 1 - Γ 1 2 A Σ 1 2 Σ - 1 2 ( x N - p N - 1 ) Γ - 1 2 ( u N - u ) + ( N - 1 ) e δ 2 2
δ Γ 1 2 k = 1 N - 1 u k + 1 - u Γ - 1 + Γ 1 2 A Σ 1 2 2 2 x N - p N - 1 Σ - 1 2 + u N - u Γ - 1 2 2 + ( N - 1 ) e δ 2 2 .

Now, since

u k + 1 - u k = Γ ( A p ¯ k - b δ ) ,

we derive that

k = 1 N - 1 ( x k + 1 - p k Σ - 1 2 2 - A ( x k - p k - 1 ) u k + 1 - u k + u k + 1 - u k Γ - 1 2 2 )
= k = 1 N - 1 ( x k - p k - 1 Σ - 1 2 2 - A ( x k - p k - 1 ) u k + 1 - u k + u k + 1 - u k Γ - 1 2 2 )
+ x N - p N - 1 Σ - 1 2 2 - x 1 - p 0 Σ - 1 2 2
= k = 1 N - 1 ( Γ 1 2 A ( x k - p k - 1 ) 2 2 - Γ 1 2 A ( x k - p k - 1 ) Γ 1 2 ( A p ¯ k - b δ ) + Γ 1 2 ( A p ¯ k - b δ ) 2 2 )
+ x N - p N - 1 Σ - 1 2 2 - x 1 - p 0 2 2 + k = 1 N - 1 ( x k - p k - 1 Σ - 1 2 2 - Γ 1 2 A ( x k - p k - 1 ) 2 2 )
= k = 1 N - 1 Γ 1 2 ( A p k - b δ ) 2 2 + x N - p N - 1 Σ - 1 2 2 - x 1 - p 0 Σ - 1 2 2
+ k = 1 N - 1 ( x k - p k - 1 Σ - 1 2 2 - Γ 1 2 A ( x k - p k - 1 ) 2 2 ) .

Furthermore, since

α = 1 - Γ 1 2 A Σ 1 2 2 > 0 ,

we obtain

k = 1 N - 1 Γ 1 2 ( A p k - b δ ) 2 2 + x N - p N - 1 Σ - 1 2 2 - x 1 - p 0 Σ - 1 2 2
+ k = 1 N - 1 ( x k - p k - 1 Σ - 1 2 2 - Γ 1 2 A ( x k - p k - 1 ) 2 2 )
k = 1 N - 1 Γ 1 2 ( A p k - b δ ) 2 2 + x N - p N - 1 Σ - 1 2 2 - x 1 - p 0 Σ - 1 2 2 + α 2 k = 1 N - 1 Γ 1 2 A ( x k - p k - 1 ) 2
k = 1 N - 1 Γ 1 2 ( A p k - b δ ) 2 2 + x N - p N - 1 Σ - 1 2 2 - x 1 - p 0 Σ - 1 2 2
- α 2 Γ 1 2 A ( x N - p N - 1 ) 2 + α 2 Γ 1 2 A ( x 1 - p 0 ) 2 + α 2 k = 1 N - 1 Γ 1 2 A ( x k + 1 - p k ) 2 .

In turn, by the convexity of 2 , we obtain

(A.16)

k = 1 N - 1 Γ 1 2 ( A p k - b δ ) 2 2 + x N - p N - 1 Σ - 1 2 2 - x 1 - p 0 Σ - 1 2 2
- α 2 Γ 1 2 A ( x N - p N - 1 ) 2 + α 2 Γ 1 2 A ( x 1 - p 0 ) 2 + α 2 k = 1 N - 1 Γ 1 2 A ( x k + 1 - p k ) 2
α 4 k = 1 N - 1 Γ 1 2 ( A x k + 1 - b δ ) 2 - x 1 - p 0 Σ - 1 2 2 + α 2 Γ 1 2 A ( x 1 - p 0 ) 2 + α 2 + Γ 1 2 A Σ 1 2 2 2 x N - p N - 1 Σ - 1 2
α 4 k = 2 N Γ 1 2 ( A x k - b δ ) 2 - x 1 - p 0 Σ - 1 2 2 + α 2 Γ 1 2 A ( x 1 - p 0 ) 2 + α 2 + Γ 1 2 A Σ 1 2 2 2 x N - p N - 1 Σ - 1 2 .

On the other hand, we get

(A.17) Γ 1 2 ( A x k - b δ ) 2 A x k - b δ 2 Γ - 1 1 Γ - 1 ( A x k - b 2 2 - b δ - b 2 ) .

Combining (A.13), (A.15), (A.16), and (A.17), we have that

(A.18)

k = 0 N - 1 ( ( x k + 1 , u ) - ( x , u k + 1 ) ) + α 2 2 x N - p N - 1 Σ - 1 2 k = 1 N α 8 Γ - 1 A x k + 1 - b 2 + x N - x Σ - 1 2 2
δ Γ 1 2 k = 1 N u k - u Γ - 1 + V ( z 0 - z ) + N e δ 2 2 + N α 4 Γ - 1 δ 2 .

It remains to bound δ Γ 1 2 k = 1 N u k - u Γ - 1 . From (A.14) and since ( x , u ) is a saddle-point of the Lagrangian, we deduce that

(A.19) u N - u 2 2 Γ 1 2 δ α k = 1 N u k - u + 2 V ( z 0 - z ) α + N e δ 2 α .

Applying [62, Lemma A.1] to equation (A.19) with

λ k := 2 Γ 1 2 δ α and S N := 2 V ( z 0 - z ) α + N e δ 2 α ,

we get

u N - u N Γ 1 2 δ α + ( 2 V ( z 0 - z ) α + N e δ 2 α + ( N Γ 1 2 δ α ) 2 ) 1 2
2 N Γ 1 2 δ α + ( 2 V ( z 0 - z ) α ) 1 2 + ( N e δ 2 α ) 1 2 .

Insert the previous in equation (A.14) to obtain

k = 0 N - 1 ( ( x k + 1 , u ) - ( x , u k + 1 ) )
2 ( N Γ 1 2 δ ) 2 α + N Γ 1 2 δ ( V ( z 0 - z ) α ) 1 2 + N Γ 1 2 δ ( N e δ 2 α ) 1 2 + V ( z 0 - z ) + N e δ 2 2 .

Analogously, from (A.18),

k = 1 N A x k - b 2 16 N 2 Γ Γ - 1 δ 2 α 2 + 8 N δ Γ - 1 ( 2 Γ V ( z 0 - z ) α 3 ) 1 2 + 8 N δ 2 Γ - 1 ( Γ e N α 3 ) 1 2
+ 8 Γ - 1 V ( z 0 - z ) α + 2 N δ 2 + 4 N Γ - 1 e δ 2 α ,

and both results are straightforward from Jensen’s inequality. ∎

A.3 Proof of Theorem 4.2

Proof.

It follows from (DPA) that

Σ - 1 ( x k - x k + 1 ) - A * v ¯ k J ( x k + 1 ) ,
(A.20) Γ - 1 ( v k - u k + 1 ) + A x k + 1 = b δ .

Thus,

(A.21) J ( x k + 1 ) + Σ - 1 ( x k - x k + 1 ) - A * v ¯ k x - x k + 1 J ( x ) ,

and (A.21) yields

(A.22)

0 J ( x k + 1 ) - J ( x ) + Σ - 1 ( x k - x k + 1 ) - A * v ¯ k x - x k + 1
= J ( x k + 1 ) - J ( x ) + x k - x k + 1 Σ - 1 2 2 + x k + 1 - x Σ - 1 2 2 - x k - x Σ - 1 2 2 + x k + 1 - x A * v ¯ k .

From (A.20), it follows that

(A.23)

0 = Γ - 1 ( v k - u k + 1 ) + A x k + 1 - b δ u - u k + 1
= u k + 1 - v k Γ - 1 2 2 + u k + 1 - u Γ - 1 2 2 - v k - u Γ - 1 2 2 + b δ - A x k + 1 u k + 1 - u .

Recall that

z := ( x , u ) 𝒵 C × d , z k := ( x k , u k ) , V ( z ) := x Σ - 1 2 2 + u Γ - 1 2 2 .

Summing (A.22) and (A.23), we obtain

(A.24)

J ( x k + 1 ) - J ( x ) + x k + 1 - x k Σ - 1 2 2 + u k + 1 - v k Γ - 1 2 2 + V ( z k + 1 - z ) - V ( z k - z )
+ A ( x k + 1 - x ) v ¯ k + b δ - A x k + 1 u k + 1 - u
0 .

Now, compute

(A.25)

J ( x k + 1 ) - J ( x ) + A ( x k + 1 - x ) v ¯ k + b δ - A x k + 1 u k + 1 - u
= ( x k + 1 , u ) - ( x , u k + 1 ) - A x k + 1 - b u + A x - b u k + 1
+ A ( x k + 1 - x ) v ¯ k + b δ - A x k + 1 u k + 1 - u
= ( x k + 1 , u ) - ( x , u k + 1 ) - A x k + 1 u + b u + A x u k + 1 - b u k + 1
+ A ( x k + 1 - x ) v ¯ k + b δ u k + 1 - u - A x k + 1 u k + 1 + A x k + 1 u
= ( x k + 1 , u ) - ( x , u k + 1 ) + b δ - b u k + 1 - u + A ( x k + 1 - x ) v ¯ k - u k + 1
= ( x k + 1 , u ) - ( x , u k + 1 ) + b δ - b u k + 1 - u + A ( x k + 1 - x ) v k - u k + 1
+ A ( x k + 1 - x ) u k - v k - 1
= ( x k + 1 , u ) - ( x , u k + 1 ) + b δ - b u k + 1 - u + A ( x k + 1 - x ) v k - u k + 1
+ A ( x k - x ) u k - v k - 1 + A ( x k + 1 - x k ) u k - v k - 1
= ( x k + 1 , u ) - ( x , u k + 1 ) + b δ - b u k + 1 - u + A ( x k + 1 - x ) v k - u k + 1
+ A ( x k - x ) u k - v k - 1 + Γ 1 2 A ( x k + 1 - x k ) Γ - 1 2 ( u k - v k - 1 ) .

From (A.25) and (A.24), we obtain

( x k + 1 , u ) - ( x , u k + 1 ) + x k + 1 - x k Σ - 1 2 2 + u k + 1 - v k Γ - 1 2 2 + V ( z k + 1 - z ) - V ( z k - z )
- b δ - b u k + 1 - u - A ( x k + 1 - x ) v k - u k + 1 + A ( x k - x ) v k - 1 - u k
- Γ 1 2 A ( x k + 1 - x k ) Γ - 1 2 ( u k - v k - 1 )
δ Γ 1 2 u k + 1 - u Γ - 1 - A ( x k + 1 - x ) v k - u k + 1 + A ( x k - x ) v k - 1 - u k
+ x k + 1 - x k Σ - 1 2 2 + Γ 1 2 A Σ 1 2 2 u k - v k - 1 Γ - 1 2 2 .

Therefore, we have that

( x k + 1 , u ) - ( x , u k + 1 ) + u k + 1 - v k Γ - 1 2 2 - Γ 1 2 A Σ 1 2 2 u k - v k - 1 Γ - 1 2 2 + V ( z k + 1 - z ) - V ( z k - z )
δ Γ 1 2 u k + 1 - u Γ - 1 - A ( x k + 1 - x ) v k - u k + 1 + A ( x k - x ) v k - 1 - u k .

Summing from 1 to N - 1 , we obtain

(A.26)

k = 1 N - 1 ( ( x k + 1 , u ) - ( x , u k + 1 ) ) + α 2 k = 1 N - 1 u k + 1 - v k Γ - 1 2 + V ( z N - z ) + Γ 1 2 A Σ 1 2 2 u N - v N - 1 Γ - 1 2 2
δ Γ 1 2 k = 1 N - 1 u k + 1 - u - A ( x N - x ) v N - 1 - u N + A ( x 1 - x ) v 0 - u 1 + V ( z 1 - z )
δ Γ 1 2 k = 1 N - 1 u k + 1 - u + Γ 1 2 A Σ 1 2 2 u N - v N - 1 Γ - 1 2 2 + x N - x Σ - 1 2 2
+ A ( x 1 - x ) v 0 - u 1 + V ( z 1 - z ) .

Reordering (A.26), we obtain

(A.27)

k = 1 N - 1 ( ( x k + 1 , u ) - ( x , u k + 1 ) ) + α 2 k = 1 N - 1 u k + 1 - v k Γ - 1 2 + u N - u Γ - 1 2 2
δ Γ 1 2 k = 1 N - 1 u k + 1 - u + A ( x 1 - x ) v 0 - u 1 + V ( z 1 - z ) .

On the other hand, from (A.24) and (A.25) we get

(A.28) ( x 1 , u ) - ( x , u 1 ) + α 2 u 1 - v 0 2 δ u 1 - u - A ( x 1 - x ) v ¯ 0 - u 1 V ( z 0 - z ) - V ( z 1 - z ) .

Summing (A.27) and (A.28) yields

(A.29) k = 1 N ( ( x k , u ) - ( x , u k ) ) + α 2 k = 1 N u k - v k - 1 Γ - 1 2 + u N - u Γ - 1 2 2 δ Γ 1 2 k = 1 N u k - u + V ( z 0 - z ) .

Moreover, since u k + 1 - v k = Γ ( A x k + 1 - b δ ) , we have

(A.30)

u k + 1 - v k Γ - 1 2 = Γ ( A x k + 1 - b δ ) A x k + 1 - b δ
A x k + 1 - b δ 2 Γ - 1
1 Γ - 1 ( A x k + 1 - b 2 2 - b δ - b 2 ) .

From (A.29) and (A.30), we obtain

(A.31)

k = 0 N - 1 ( ( x k + 1 , u ) - ( x , u k + 1 ) ) + α 4 Γ - 1 k = 1 N A x k - b 2 + u N - u Γ - 1 2 2
δ Γ 1 2 k = 1 N u k - u Γ - 1 + V ( z 0 - z ) + α N δ 2 2 Γ - 1 .

From (A.29), it follows that

(A.32) u N - u Γ - 1 2 2 δ Γ 1 2 k = 1 N u k - u Γ - 1 + 2 V ( z 0 - z ) .

Apply [62, Lemma A.1] to equation (A.32) with λ k := 2 δ Γ 1 2 and S k := 2 V ( z 0 - z ) to get

(A.33) u k - u Γ - 1 N Γ 1 2 δ + ( 2 V ( z 0 - z ) + ( N Γ 1 2 δ ) 2 ) 1 2 2 N Γ 1 2 δ + ( 2 V ( z 0 - z ) ) 1 2 .

Insert (A.33) into equation (A.29) to obtain

k = 0 N - 1 ( ( x k + 1 , u ) - ( x , u k + 1 ) ) 2 Γ 1 2 2 N 2 δ 2 + N Γ 1 2 δ ( 2 V ( z 0 - z ) ) 1 2 + V ( z 0 - z ) .

By (A.31) and (A.33), we have

k = 1 N A x k - b 2 4 Γ - 1 α ( 2 Γ 1 2 2 N 2 δ 2 + N Γ 1 2 δ ( 2 V ( z 0 - z ) ) 1 2 + V ( z 0 - z ) + α N δ 2 2 Γ - 1 ) ,

and both results follow from Jensen’s inequality. ∎

A.4 Proof of Lemma 5.2

Proof.

Let us first recall that

(A.34) P_j^δ : x ↦ x + ((b_j^δ − a_j x)/‖a_j‖²) a_j*.

Note that the hyperplanes defined by the j-th equations of C and C^δ are parallel. Then, for every j ∈ [d] and x̄ ∈ C, we get

(A.35)

P j δ x - x ¯ 2 = P j x - x ¯ 2 + 2 P j x - x ¯ P j δ x - P j x + P j x - P j δ x 2
= P j x - x ¯ 2 + P j x - P j δ x 2 .

Analogously, we have that

(A.36) x - x ¯ 2 = x - P j x 2 + P j x - x ¯ 2 .

It follows from (A.35) and (A.36) that

P j δ x - x ¯ 2 + x - P j x 2 = x - x ¯ 2 + P j δ x - P j x 2 .

Hence,

P j δ x - x ¯ 2 x - x ¯ 2 + P j δ x - P j x 2
x - x ¯ 2 + ( b j δ - b j ) 2 a j 2
x - x ¯ 2 + δ 2 a j 2 .

  1. Since T = P β l δ P β 1 δ , it is clear that C δ Fix T and, by induction, we have that

    T x - x ¯ 2 x - x ¯ 2 + e δ 2 ,

    where

    e = l max i = 1 , , d a i .

  2. The proof follows from the convexity of 2 , which is obtained with

    e = 1 max i = 1 , , d a i .

  3. Let $\bar{x}\in C$. By (5.2), we have

     $$\begin{aligned}\|Tx-\bar{x}\|^{2}&=\|x-\bar{x}\|^{2}-2\alpha\langle x-\bar{x},A^{*}(Ax-b^{\delta})\rangle+\alpha^{2}\|A^{*}(Ax-b^{\delta})\|^{2}\\&=\|x-\bar{x}\|^{2}-2\alpha\langle Ax-b,Ax-b^{\delta}\rangle+\alpha^{2}\|A^{*}(Ax-b^{\delta})\|^{2}\\&\leq\|x-\bar{x}\|^{2}-2\alpha\langle b^{\delta}-b,Ax-b^{\delta}\rangle+(\alpha^{2}\|A\|^{2}-2\alpha)\|Ax-b^{\delta}\|^{2}.\end{aligned}$$

     Now, using the Young inequality with parameter $2-\alpha\|A\|^{2}$, we have that

     $$\|Tx-\bar{x}\|^{2}\leq\|x-\bar{x}\|^{2}+\frac{\alpha}{2-\alpha\|A\|^{2}}\|b^{\delta}-b\|^{2}\leq\|x-\bar{x}\|^{2}+\frac{\alpha\delta^{2}}{2-\alpha\|A\|^{2}}.$$

     It remains to prove that, if $C^{\delta}\neq\emptyset$, then $C^{\delta}\subseteq\operatorname{Fix}T$, which is clear from (5.2).

  4. Let $\bar{x}\in C$ and $x\in\mathbb{R}^{p}$. If $A^{*}Ax=A^{*}b^{\delta}$, then (3.3) immediately holds. Otherwise, we have

     (A.37)

     $$\begin{aligned}\|Tx-\bar{x}\|^{2}&=\|x-\bar{x}\|^{2}-2\beta(x)\langle x-\bar{x},A^{*}(Ax-b^{\delta})\rangle+\beta(x)^{2}\|A^{*}(Ax-b^{\delta})\|^{2}\\&=\|x-\bar{x}\|^{2}-2\beta(x)\langle Ax-b,Ax-b^{\delta}\rangle+\beta(x)^{2}\|A^{*}(Ax-b^{\delta})\|^{2}\\&=\|x-\bar{x}\|^{2}-2\beta(x)\langle b^{\delta}-b,Ax-b^{\delta}\rangle-2\beta(x)\|Ax-b^{\delta}\|^{2}+\beta(x)^{2}\|A^{*}(Ax-b^{\delta})\|^{2}.\end{aligned}$$

     Now, using the Young inequality with parameter

     $$2-\beta(x)\frac{\|A^{*}(Ax-b^{\delta})\|^{2}}{\|Ax-b^{\delta}\|^{2}},$$

     we have that

     $$\|Tx-\bar{x}\|^{2}\leq\|x-\bar{x}\|^{2}+\frac{\beta(x)}{2-\beta(x)\bigl(\|A^{*}(Ax-b^{\delta})\|^{2}/\|Ax-b^{\delta}\|^{2}\bigr)}\|b^{\delta}-b\|^{2}\leq\|x-\bar{x}\|^{2}+M\delta^{2}.$$

     Finally, it is clear from (5.3) that, if $C^{\delta}\neq\emptyset$, then $C^{\delta}\subseteq\operatorname{Fix}T$. ∎
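Item 3 above states that the gradient step $Tx=x-\alpha A^{*}(Ax-b^{\delta})$ from (5.2) moves no farther from any $\bar{x}\in C$ than an $O(\delta^{2})$ error, provided $\alpha\|A\|^{2}<2$. The sketch below checks this bound numerically in plain Python (matrix, noise level, and step size are illustrative; we bound $\|A\|^{2}$ from above by the Frobenius norm, which only loosens the constant in the estimate, so the inequality still holds):

```python
import random

random.seed(1)
m, p = 4, 6
A = [[random.gauss(0, 1) for _ in range(p)] for _ in range(m)]

def matvec(M, v):       # A v
    return [sum(Mi[j] * v[j] for j in range(len(v))) for Mi in M]

def matvec_T(M, v):     # A^T v
    return [sum(M[i][j] * v[i] for i in range(len(M))) for j in range(len(M[0]))]

# crude upper bound on ||A||^2 via the Frobenius norm (||A|| <= ||A||_F)
normA2 = sum(Aij ** 2 for Ai in A for Aij in Ai)
alpha = 1.0 / normA2    # guarantees alpha * ||A||^2 < 2

x_bar = [random.gauss(0, 1) for _ in range(p)]
b = matvec(A, x_bar)    # exact data: x_bar belongs to C
delta = 0.05
for _ in range(100):
    x = [random.gauss(0, 1) for _ in range(p)]
    noise = [random.uniform(-1, 1) for _ in range(m)]
    scale = delta / max(sum(n * n for n in noise) ** 0.5, 1e-12)
    b_delta = [bi + scale * ni for bi, ni in zip(b, noise)]  # ||b_delta - b|| <= delta
    r = [ri - bi for ri, bi in zip(matvec(A, x), b_delta)]
    Tx = [xi - alpha * gi for xi, gi in zip(x, matvec_T(A, r))]
    lhs = sum((ti - yi) ** 2 for ti, yi in zip(Tx, x_bar))
    rhs = sum((xi - yi) ** 2 for xi, yi in zip(x, x_bar)) \
        + alpha * delta ** 2 / (2 - alpha * normA2)
    assert lhs <= rhs + 1e-10  # quasi-nonexpansiveness up to O(delta^2)
```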

References

[1] A. Alacaoglu, O. Fercoq and V. Cevher, On the convergence of stochastic primal-dual hybrid gradient, SIAM J. Optim. 32 (2022), no. 2, 1288–1318. 10.1137/19M1296252.

[2] M. Bachmayr and M. Burger, Iterative total variation schemes for nonlinear inverse problems, Inverse Problems 25 (2009), no. 10, Article ID 105004. 10.1088/0266-5611/25/10/105004.

[3] M. A. Bahraoui and B. Lemaire, Convergence of diagonally stationary sequences in convex optimization, Set-Valued Anal. 2 (1994), 49–61. 10.1007/BF01027092.

[4] A. B. Bakushinsky and M. Y. Kokurin, Iterative Methods for Approximate Solution of Inverse Problems, Math. Appl. (New York) 577, Springer, Dordrecht, 2005. 10.1007/978-1-4020-3122-9.

[5] P. L. Bartlett and M. Traskin, AdaBoost is consistent, J. Mach. Learn. Res. 8 (2007), 2347–2368.

[6] F. Bauer, S. Pereverzev and L. Rosasco, On regularization algorithms in learning theory, J. Complexity 23 (2007), no. 1, 52–72. 10.1016/j.jco.2006.07.001.

[7] H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces, Springer, Cham, 2017. 10.1007/978-3-319-48311-5.

[8] A. Beck and M. Teboulle, Mirror descent and nonlinear projected subgradient methods for convex optimization, Oper. Res. Lett. 31 (2003), no. 3, 167–175. 10.1016/S0167-6377(02)00231-6.

[9] S. Becker, J. Bobin and E. J. Candès, NESTA: A fast and accurate first-order method for sparse recovery, SIAM J. Imaging Sci. 4 (2011), no. 1, 1–39. 10.1137/090756855.

[10] M. Benning and M. Burger, Error estimates for general fidelities, Electron. Trans. Numer. Anal. 38 (2011), 44–68.

[11] M. Benning and M. Burger, Modern regularization methods for inverse problems, Acta Numer. 27 (2018), 1–111. 10.1017/S0962492918000016.

[12] G. Blanchard and N. Krämer, Optimal learning rates for kernel conjugate gradient regression, Proceedings of the 23rd International Conference on Neural Information Processing Systems - Volume 1, ACM, New York (2010), 226–234.

[13] R. I. Boţ and T. Hein, Iterative regularization with a general penalty term—theory and application to L^1 and TV regularization, Inverse Problems 28 (2012), no. 10, Article ID 104010. 10.1088/0266-5611/28/10/104010.

[14] L. M. Briceño Arias, A Douglas–Rachford splitting method for solving equilibrium problems, Nonlinear Anal. 75 (2012), no. 16, 6053–6059. 10.1016/j.na.2012.06.014.

[15] L. M. Briceño Arias, Forward-Douglas–Rachford splitting and forward-partial inverse method for solving monotone inclusions, Optimization 64 (2015), no. 5, 1239–1261. 10.1080/02331934.2013.855210.

[16] L. M. Briceño Arias, J. Deride and C. Vega, Random activations in primal-dual splittings for monotone inclusions with a priori information, J. Optim. Theory Appl. 192 (2022), no. 1, 56–81. 10.1007/s10957-021-01944-6.

[17] L. M. Briceño Arias and S. López Rivera, A projected primal-dual method for solving constrained monotone inclusions, J. Optim. Theory Appl. 180 (2019), no. 3, 907–924. 10.1007/s10957-018-1430-2.

[18] M. Burger, E. Resmerita and L. He, Error estimation for Bregman iterations and inverse scale space methods in image restoration, Computing 81 (2007), no. 2–3, 109–135. 10.1007/s00607-007-0245-z.

[19] J.-F. Cai, E. J. Candès and Z. Shen, A singular value thresholding algorithm for matrix completion, SIAM J. Optim. 20 (2010), no. 4, 1956–1982. 10.1137/080738970.

[20] J.-F. Cai, S. Osher and Z. Shen, Linearized Bregman iterations for frame-based image deblurring, SIAM J. Imaging Sci. 2 (2009), no. 1, 226–252. 10.1137/080733371.

[21] L. Calatroni, G. Garrigos, L. Rosasco and S. Villa, Accelerated iterative regularization via dual diagonal descent, SIAM J. Optim. 31 (2021), no. 1, 754–784. 10.1137/19M1308888.

[22] E. J. Candès, Matrix completion with noise, Proc. IEEE 98 (2010), no. 6, 925–936. 10.1109/JPROC.2009.2035722.

[23] E. J. Candès and B. Recht, Exact matrix completion via convex optimization, Found. Comput. Math. 9 (2009), no. 6, 717–772. 10.1007/s10208-009-9045-5.

[24] E. J. Candès, J. Romberg and T. Tao, Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information, IEEE Trans. Inform. Theory 52 (2006), no. 2, 489–509. 10.1109/TIT.2005.862083.

[25] E. J. Candès and T. Tao, Near-optimal signal recovery from random projections: Universal encoding strategies?, IEEE Trans. Inform. Theory 52 (2006), no. 12, 5406–5425. 10.1109/TIT.2006.885507.

[26] A. Chambolle, An algorithm for total variation minimization and applications, J. Math. Imaging Vision 20 (2004), 89–97. 10.1023/B:JMIV.0000011321.19549.88.

[27] A. Chambolle, M. J. Ehrhardt, P. Richtárik and C.-B. Schönlieb, Stochastic primal-dual hybrid gradient algorithm with arbitrary sampling and imaging applications, SIAM J. Optim. 28 (2018), no. 4, 2783–2808. 10.1137/17M1134834.

[28] A. Chambolle and P.-L. Lions, Image recovery via total variation minimization and related problems, Numer. Math. 76 (1997), no. 2, 167–188. 10.1007/s002110050258.

[29] A. Chambolle and T. Pock, A first-order primal-dual algorithm for convex problems with applications to imaging, J. Math. Imaging Vision 40 (2011), no. 1, 120–145. 10.1007/s10851-010-0251-1.

[30] P. L. Combettes and V. R. Wajs, Signal recovery by proximal forward-backward splitting, Multiscale Model. Simul. 4 (2005), no. 4, 1168–1200. 10.1137/050626090.

[31] L. Condat, A primal-dual splitting method for convex optimization involving Lipschitzian, proximable and linear composite terms, J. Optim. Theory Appl. 158 (2013), no. 2, 460–479. 10.1007/s10957-012-0245-9.

[32] C. De Mol, E. De Vito and L. Rosasco, Elastic-net regularization in learning theory, J. Complexity 25 (2009), no. 2, 201–230. 10.1016/j.jco.2009.01.002.

[33] D. L. Donoho, Compressed sensing, IEEE Trans. Inform. Theory 52 (2006), no. 4, 1289–1306. 10.1109/TIT.2006.871582.

[34] J. Duchi and Y. Singer, Efficient online and batch learning using forward backward splitting, J. Mach. Learn. Res. 10 (2009), 2899–2934.

[35] H. W. Engl, M. Hanke and A. Neubauer, Regularization of Inverse Problems, Math. Appl. 375, Kluwer Academic, Dordrecht, 1996. 10.1007/978-94-009-1740-8.

[36] S. Foucart and H. Rauhut, A Mathematical Introduction to Compressive Sensing, Appl. Numer. Harmon. Anal., Birkhäuser/Springer, New York, 2013. 10.1007/978-0-8176-4948-7.

[37] G. Garrigos, L. Rosasco and S. Villa, Iterative regularization via dual diagonal descent, J. Math. Imaging Vision 60 (2018), no. 2, 189–215. 10.1007/s10851-017-0754-0.

[38] G. H. Golub, M. Heath and G. Wahba, Generalized cross-validation as a method for choosing a good ridge parameter, Technometrics 21 (1979), no. 2, 215–223. 10.1080/00401706.1979.10489751.

[39] E. B. Gutiérrez, C. Delplancke and M. J. Ehrhardt, Convergence properties of a randomized primal-dual algorithm with applications to parallel MRI, International Conference on Scale Space and Variational Methods in Computer Vision, Springer, New York (2021), 254–266. 10.1007/978-3-030-75549-2_21.

[40] M. Jaggi, Revisiting Frank-Wolfe: Projection-free sparse convex optimization, International Conference on Machine Learning, Proc. Mach. Learn. Res. (PMLR) 28 (2013), 427–435.

[41] B. Jin, D. A. Lorenz and S. Schiffler, Elastic-net regularization: Error estimates and active set methods, Inverse Problems 25 (2009), no. 11, Article ID 115022. 10.1088/0266-5611/25/11/115022.

[42] S. Kaczmarz, Angenäherte Auflösung von Systemen linearer Gleichungen, Bull. Int. Acad. Polon. Sci. Lett. Cl. Sci. Math. Nat. 35 (1937), 335–357.

[43] B. Kaltenbacher, A. Neubauer and O. Scherzer, Iterative Regularization Methods for Nonlinear Ill-Posed Problems, Radon Ser. Comput. Appl. Math. 6, Walter de Gruyter, Berlin, 2008. 10.1515/9783110208276.

[44] L. Landweber, An iteration formula for Fredholm integral equations of the first kind, Amer. J. Math. 73 (1951), 615–624. 10.2307/2372313.

[45] H. Li, N. Chen and L. Li, Error analysis for matrix elastic-net regularization algorithms, IEEE Trans. Neural Netw. Learn. Syst. 23 (2012), no. 5, 737–748. 10.1109/TNNLS.2012.2188906.

[46] D. A. Lorenz, Convergence rates and source conditions for Tikhonov regularization with sparsity constraints, J. Inverse Ill-Posed Probl. 16 (2008), no. 5, 463–478. 10.1515/JIIP.2008.025.

[47] S. Matet, L. Rosasco, S. Villa and B. L. Vu, Don’t relax: Early stopping for convex regularization, preprint (2017), https://arxiv.org/abs/1707.05422.

[48] C. Molinari, J. Liang and J. Fadili, Convergence rates of forward-Douglas–Rachford splitting method, J. Optim. Theory Appl. 182 (2019), no. 2, 606–639. 10.1007/s10957-019-01524-9.

[49] C. Molinari, M. Massias, L. Rosasco and S. Villa, Iterative regularization for convex regularizers, Proc. Mach. Learn. Res. (PMLR) 130 (2021), 1684–1692.

[50] C. Molinari, M. Massias, L. Rosasco and S. Villa, Iterative regularization for low complexity regularizers, preprint (2022), https://arxiv.org/abs/2202.00420.

[51] C. Molinari and J. Peypouquet, Lagrangian penalization scheme with parallel forward-backward splitting, J. Optim. Theory Appl. 177 (2018), no. 2, 413–447. 10.1007/s10957-018-1265-x.

[52] C. Molinari, J. Peypouquet and F. Roldan, Alternating forward-backward splitting for linearly constrained optimization problems, Optim. Lett. 14 (2020), no. 5, 1071–1088. 10.1007/s11590-019-01388-y.

[53] E. Moulines and F. Bach, Non-asymptotic analysis of stochastic approximation algorithms for machine learning, Advances in Neural Information Processing Systems 24, Morgan Kaufmann, Burlington (2011), 451–459.

[54] A. S. Nemirovsky and D. B. Yudin, Problem Complexity and Method Efficiency in Optimization, John Wiley & Sons, New York, 1983.

[55] A. Neubauer, On Nesterov acceleration for Landweber iteration of linear ill-posed problems, J. Inverse Ill-Posed Probl. 25 (2017), no. 3, 381–390. 10.1515/jiip-2016-0060.

[56] S. Osher, M. Burger, D. Goldfarb, J. Xu and W. Yin, An iterative regularization method for total variation-based image restoration, Multiscale Model. Simul. 4 (2005), no. 2, 460–489. 10.1137/040605412.

[57] S. Osher, Y. Mao, B. Dong and W. Yin, Fast linearized Bregman iteration for compressive sensing and sparse denoising, Commun. Math. Sci. 8 (2010), no. 1, 93–111. 10.4310/CMS.2010.v8.n1.a6.

[58] S. Osher and L. I. Rudin, Feature-oriented image enhancement using shock filters, SIAM J. Numer. Anal. 27 (1990), no. 4, 919–940. 10.1137/0727053.

[59] G. Peyré, The numerical tours of signal processing - advanced computational signal and image processing, IEEE Comput. Sci. Eng. 13 (2011), no. 4, 94–97. 10.1109/MCSE.2011.71.

[60] T. Pock and A. Chambolle, Diagonal preconditioning for first order primal-dual algorithms in convex optimization, 2011 International Conference on Computer Vision, IEEE Press, Piscataway (2011), 1762–1769. 10.1109/ICCV.2011.6126441.

[61] H. Raguet, J. Fadili and G. Peyré, A generalized forward-backward splitting, SIAM J. Imaging Sci. 6 (2013), no. 3, 1199–1226. 10.1137/120872802.

[62] J. Rasch and A. Chambolle, Inexact first-order primal-dual algorithms, Comput. Optim. Appl. 76 (2020), no. 2, 381–430. 10.1007/s10589-020-00186-y.

[63] G. Raskutti, M. J. Wainwright and B. Yu, Early stopping and non-parametric regression: An optimal data-dependent stopping rule, J. Mach. Learn. Res. 15 (2014), 335–366.

[64] L. Rosasco and S. Villa, Learning with incremental iterative regularization, Advances in Neural Information Processing Systems 28, Curran Associates, Red Hook (2015), 1630–1638.

[65] M. Rudelson and R. Vershynin, Geometric approach to error-correcting codes and reconstruction of signals, Int. Math. Res. Not. IMRN 2005 (2005), no. 64, 4019–4041. 10.1155/IMRN.2005.4019.

[66] L. I. Rudin and S. Osher, Total variation based image restoration with free local constraints, Proceedings of 1st International Conference on Image Processing, IEEE Press, Piscataway (1994), 31–35. 10.1109/ICIP.1994.413269.

[67] L. I. Rudin, S. Osher and E. Fatemi, Nonlinear total variation based noise removal algorithms, Phys. D 60 (1992), 259–268. 10.1016/0167-2789(92)90242-F.

[68] O. Scherzer, A modified Landweber iteration for solving parameter estimation problems, Appl. Math. Optim. 38 (1998), no. 1, 45–68. 10.1007/s002459900081.

[69] F. Schöpfer and D. A. Lorenz, Linear convergence of the randomized sparse Kaczmarz method, Math. Program. 173 (2019), no. 1, 509–536. 10.1007/s10107-017-1229-1.

[70] S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning: From Theory to Algorithms, Cambridge University, Cambridge, 2014. 10.1017/CBO9781107298019.

[71] A. Silveti-Falls, C. Molinari and J. Fadili, Generalized conditional gradient with augmented Lagrangian for composite minimization, SIAM J. Optim. 30 (2020), no. 4, 2687–2725. 10.1137/19M1240460.

[72] A. Silveti-Falls, C. Molinari and J. Fadili, Inexact and stochastic generalized conditional gradient with augmented Lagrangian and proximal step, J. Nonsmooth Anal. Optim. 2 (2021), 1–41. 10.46298/jnsao-2021-6480.

[73] A. Silveti-Falls, C. Molinari and J. Fadili, A stochastic Bregman primal-dual splitting algorithm for composite optimization, Pure Appl. Funct. Anal. 8 (2023), no. 3, 921–964.

[74] I. Steinwart and A. Christmann, Support Vector Machines, Inform. Sci. Stat., Springer, New York, 2008.

[75] T. Strohmer and R. Vershynin, A randomized Kaczmarz algorithm with exponential convergence, J. Fourier Anal. Appl. 15 (2009), no. 2, 262–278. 10.1007/s00041-008-9030-4.

[76] A. N. Tikhonov, Solution of incorrectly formulated problems and the regularization method, Soviet Math. 4 (1963), 1035–1038.

[77] Y. Tsaig and D. L. Donoho, Extensions of compressed sensing, Signal Process. 86 (2006), no. 3, 549–571. 10.1016/j.sigpro.2005.05.029.

[78] B. C. Vũ, A splitting algorithm for dual monotone inclusions involving cocoercive operators, Adv. Comput. Math. 38 (2013), no. 3, 667–681. 10.1007/s10444-011-9254-8.

[79] L. Xiao, Dual averaging methods for regularized stochastic learning and online optimization, J. Mach. Learn. Res. 11 (2010), 2543–2596.

[80] Y. Yao, L. Rosasco and A. Caponnetto, On early stopping in gradient descent learning, Constr. Approx. 26 (2007), no. 2, 289–315. 10.1007/s00365-006-0663-2.

[81] W. Yin, Analysis and generalizations of the linearized Bregman model, SIAM J. Imaging Sci. 3 (2010), no. 4, 856–877. 10.1137/090760350.

[82] W. Yin, S. Osher, D. Goldfarb and J. Darbon, Bregman iterative algorithms for l^1-minimization with applications to compressed sensing, SIAM J. Imaging Sci. 1 (2008), no. 1, 143–168. 10.1137/070703983.

[83] T. Zhang and B. Yu, Boosting with early stopping: Convergence and consistency, Ann. Statist. 33 (2005), no. 4, 1538–1579. 10.1214/009053605000000255.

[84] H. Zou and T. Hastie, Regularization and variable selection via the elastic net, J. R. Stat. Soc. Ser. B Stat. Methodol. 67 (2005), no. 2, 301–320. 10.1111/j.1467-9868.2005.00503.x.

[85] H. Zou and H. H. Zhang, On the adaptive elastic-net with a diverging number of parameters, Ann. Statist. 37 (2009), no. 4, 1733–1751. 10.1214/08-AOS625.

Received: 2023-01-27
Revised: 2023-06-23
Accepted: 2023-08-07
Published Online: 2023-10-27
Published in Print: 2024-08-01

© 2024 Walter de Gruyter GmbH, Berlin/Boston

This work is licensed under the Creative Commons Attribution 4.0 International License.
