Article Open Access

Markov decision processes approximation with coupled dynamics via Markov deterministic control systems

Published/Copyright: October 24, 2023

Abstract

This article presents an approximation of discrete-time Markov decision processes with small noise on Borel spaces, with an infinite horizon and an expected total discounted cost, by the corresponding deterministic Markov control process. In both cases, the dynamics evolve through a system consisting of two coupled difference equations, and it is assumed that the difference equations of the system are perturbed by a small noise. Under our assumptions, a bound for the stability index is given, and the convergence rate of the optimal cost is estimated in terms of a small perturbation parameter. Moreover, the convergence of the optimal policy on compact subsets is verified. Finally, two examples are presented to illustrate the developed theory.

MSC 2010: 90C40; 93C55; 93C73; 93E20

1 Introduction

This article deals with the so-called discrete-time Markov decision processes (MDPs) with an infinite horizon and total discounted cost [1–5]. The importance of working with MDPs lies in the wide range of applications in various disciplines, e.g., engineering, computer science, communications, and economics [6,7]. The main problem in MDPs is to determine an optimal policy and the optimal value function. To characterize and determine the solutions of MDPs, the dynamic programming (DP) approach [2,8] is available.

In this work, the MDPs of interest are those that evolve through dynamics consisting of two coupled difference equations, as shown in equations (1) and (2). Equation (1) models the transitions of the system states, where the set of all states is denoted by X and its elements are called x-states. Similarly, equation (2) models the change of the system's parameters; the set of all parameters is denoted by Γ, and its elements are called α-states. Let ε_0 and δ_0 be positive numbers and let ε ∈ [0, ε_0] and δ ∈ [0, δ_0]. We then consider disturbances {ξ_t(ε)} and {η_t(δ)}, which are sequences of independent and identically distributed random elements with values in some Borel spaces (S_1, r_1) and (S_2, r_2) (metric perturbation spaces [9] or noise spaces [10]), respectively. Moreover, suppose that there exist s_1 ∈ S_1 and s_2 ∈ S_2 such that s_1 = ξ_t(0) and s_2 = η_t(0). Each element of the above sequences depends on the numerical parameters ε and δ in such a way that E r_1(ξ(ε), s_1) → 0 as ε → 0 and E r_2(η(δ), s_2) → 0 as δ → 0, where ξ and η are generic elements of {ξ_t(ε)} and {η_t(δ)}, respectively. In this framework, we are interested in the following problems:

  • To study approximations of MDPs by the deterministic control process (see equations (3) and (4)). In particular, we are interested in ensuring that the optimal policy of the deterministic system is asymptotically optimal for the random system (see Theorem 1 and Remark 4).

  • To analyze the convergence of the optimal value function and the optimal policy of the stochastic system as ε → 0 and δ → 0 (see Theorem 2).

The following briefly describes work related to the problems discussed in this manuscript. In a study by Liptser et al. [11], the problem of approximating a continuous-time stochastic control process by a deterministic process was considered. In this article, the authors demonstrate that the stochastic problem can be approximated by a deterministic one when the noise is small and the fluctuations become fast. In this context, it is shown that the optimal control of the deterministic problem is asymptotically optimal for stochastic problems. In the continuous case, Dupuis and Kushner [12] addressed a similar problem, i.e., when the effects of noise in a physical system are small, these authors performed an asymptotic analysis of the diffusion approximation and used it for the desired estimates in the original system. For discrete-time MDPs, these classes of problems were studied by Cruz-Suarez and Ilhuicatzi-Roldan [9] and Cruz-Suarez et al. [13], where the dynamics of the system are described by a single difference equation. Convergence between models was also addressed by Kara and Yuksel [14]. However, convergence is studied using sequences belonging to the set of admissible state-action pairs, which is assumed to be a subset of a given Euclidean space. Moreover, this study is carried out under the assumption that the action space is a compact set and that the cost function is bounded. Now, when considering MDPs that are developed with respect to equations (1) and (2), the results found in the study by Cruz-Suarez et al. [13] are generalized. The approach of using coupled equations can be applied, e.g., by considering a random discount factor [15–18], where the second difference equation refers to the evolution of the random discount factor.

The methodology for solving the problems described above is to impose Lipschitz continuity [19,20] constraints on the components of the control model and to apply DP techniques. Specifically, we assume Lipschitz conditions for the functions c , F , and G involved in the dynamic system composed of two coupled difference equations (see equations (1) and (2)). A direct consequence of this assumption is the Lipschitz continuity of the optimal cost, which corresponds to an additional contribution to the present manuscript. This approach ensures the following three important aspects:

  • The existence of an upper bound for the stability index [10,21,22] when we apply the optimal policy of the deterministic system. Consequently, it results that the optimal policy of the deterministic system is asymptotically optimal for the stochastic system (see Theorem 1 and Remark 4).

  • A convergence rate of the optimal cost function for the random system with respect to the deterministic system.

  • The uniform convergence of the optimal stochastic policy to the deterministic policy, as ε → 0 and δ → 0, on compact subsets of the state space.

This article is structured as follows: Section 2 presents the basic theory of MDPs with states evolving with dynamics consisting of two coupled difference equations; Section 3 provides the approximation problem statement for the value function and the optimal policy; Section 4 presents the results that provide the bound for the stability index Δ_{ε,δ} in terms of the noise parameter δ̂_{ε,δ}, the convergence rate of the optimal cost, and the convergence of the optimal policy on compact subsets; Section 5 illustrates the developed theory with two examples. The first relates to a consumption-investment problem [15,23]. The second example is a control problem with small additive noise. For both problems, the upper bounds for the stability index and the convergence rate of the optimal value function are given explicitly. Finally, in Section 6, concluding remarks are given.

2 Markov control model

Consider the following Markov model:

(X × Γ, A, {A(x, α) : (x, α) ∈ X × Γ}, Q, c),

where X × Γ and A are Borel spaces, called the state space and the action space, respectively; {A(x, α) : (x, α) ∈ X × Γ} is a family of non-empty measurable subsets A(x, α) of A, where A(x, α) denotes the set of feasible actions (controls) when the system is in state (x, α) ∈ X × Γ. The set K of feasible state-action pairs is defined as follows:

K ≔ {(x, α, a) : (x, α) ∈ X × Γ, a ∈ A(x, α)},

which is a measurable subset of X × Γ × A; the next component is a stochastic kernel Q on X × Γ given K, i.e., Q(· | x, α, a) is a probability measure on X × Γ for each (x, α, a) ∈ K, and Q(B | ·) is a measurable function on K for each B ∈ B(X × Γ), where B(X × Γ) denotes the Borel σ-algebra of X × Γ; finally, c : K → R is a measurable function called the one-stage cost function.

Remark 1

In the subsequent development, the metrics of the spaces X, Γ, and A will be denoted by d_x, d_α, and d_2, respectively. Consequently, the following metric is defined on X × Γ:

d_1((x, α), (x′, α′)) = max{d_x(x, x′), d_α(α, α′)}

for all (x, α), (x′, α′) ∈ X × Γ. Furthermore, on K the metric d is defined as follows:

d((x, α, a), (x′, α′, a′)) = max{d_1((x, α), (x′, α′)), d_2(a, a′)}

for all (x, α, a), (x′, α′, a′) ∈ K.

The dynamics of the system are described below. Suppose that at time t, t ∈ {0, 1, …}, the system occupies state (x_t, α_t) = (x, α) ∈ X × Γ. Then, the decision-maker (or controller) chooses a control a_t = a ∈ A(x, α). Consequently, two things happen:

  1. a cost c(x_t, α_t, a_t) is incurred, and

  2. the system jumps to a state (x_{t+1}, α_{t+1}) = (x′, α′) according to the transition law Q(· | x, α, a) (i.e., Q(B | x, α, a) = Pr((x_{t+1}, α_{t+1}) ∈ B | x_t = x, α_t = α, a_t = a), B ∈ B(X × Γ), and (x, α, a) ∈ K).

Then, the system moves to the state (x_{t+1}, α_{t+1}), and the process is repeated.

In this manuscript, the transition law Q is assumed to be induced by a system of difference equations, i.e.,

(1) x_{t+1} = F(x_t, α_t, a_t, ξ_t(ε)),

(2) α_{t+1} = G(α_t, η_t(δ)),

for t = 0, 1, …, with (x_0, α_0) ∈ X × Γ given, where F : K × S_1 → X and G : Γ × S_2 → Γ are measurable functions. Let ε_0 and δ_0 be fixed positive numbers, let ε ∈ [0, ε_0] and δ ∈ [0, δ_0], and let the disturbances {ξ_t(ε)} and {η_t(δ)} be sequences of independent and identically distributed (i.i.d.) random elements with values in some Borel spaces (S_1, r_1) and (S_2, r_2), respectively.

Remark 2

It is assumed that the random variables ξ : Ω_1 → S_1 and η : Ω_2 → S_2 are defined on the probability spaces (Ω_1, F_1, P_1) and (Ω_2, F_2, P_2), where ξ and η are generic elements of {ξ_t(ε)} and {η_t(δ)}, respectively. Moreover, (Ω_1 × Ω_2, F_1 ⊗ F_2, P) denotes the product probability space, where F_1 ⊗ F_2 is the product σ-algebra and P is the product measure induced by the Ionescu-Tulcea theorem [3]. The expected value with respect to the probability measure P will be denoted by E.

In the following, we will consider the space S ≔ S_1 × S_2 with the metric

r((ξ(ω_1), η(ω_2)), (ξ̂(ω_1′), η̂(ω_2′))) = max{r_1(ξ(ω_1), ξ̂(ω_1′)), r_2(η(ω_2), η̂(ω_2′))}

for all (ξ(ω_1), η(ω_2)), (ξ̂(ω_1′), η̂(ω_2′)) ∈ S, where ω_1, ω_1′ ∈ Ω_1 and ω_2, ω_2′ ∈ Ω_2.

Now, consider the random vector χ_t(ε, δ) ≔ (ξ_t(ε), η_t(δ)) for all t ≥ 0; then the difference equations (1) and (2) can be expressed as follows:

(x_{t+1}, α_{t+1}) = H(x_t, α_t, a_t, χ_t(ε, δ)) ≔ (F(x_t, α_t, a_t, ξ_t(ε)), G(α_t, η_t(δ))).
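The coupled transition above can be sketched in code. The following is a minimal illustration in which F, G, and the noise laws are simple stand-ins chosen for the sketch (not the paper's model), and s_1 = s_2 = 0 play the role of the degenerate noise values:

```python
import random

# Sketch of the coupled transition
# (x_{t+1}, alpha_{t+1}) = H(x_t, alpha_t, a_t, chi_t(eps, delta)).
# F, G and the noise laws below are illustrative stand-ins.

def F(x, alpha, a, xi):
    # hypothetical state dynamics
    return x + alpha * a + xi

def G(alpha, eta):
    # hypothetical parameter dynamics
    return 0.5 * alpha + 0.5 + eta

def H(x, alpha, a, chi):
    xi, eta = chi
    return F(x, alpha, a, xi), G(alpha, eta)

def sample_chi(eps, delta, rng):
    # chi(eps, delta) = (xi(eps), eta(delta)); with eps = delta = 0
    # it degenerates to (s1, s2) = (0, 0)
    return (rng.uniform(-eps, eps), rng.uniform(-delta, delta))

rng = random.Random(0)
x, alpha = 1.0, 1.0
for t in range(5):
    a = 0.1 * x                      # some feasible action
    x, alpha = H(x, alpha, a, sample_chi(0.01, 0.01, rng))
print(x, alpha)
```

With ε = δ = 0 the same code runs the deterministic system, since the sampled noise degenerates to (s_1, s_2).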

Suppose that there exist s_1 ∈ S_1 and s_2 ∈ S_2 such that s_1 = ξ(0) and s_2 = η(0). Each element of the above sequences depends on the numerical parameters ε and δ in such a way that E r_1(ξ(ε), s_1) → 0 as ε → 0 and E r_2(η(δ), s_2) → 0 as δ → 0. On the other hand, a deterministic MDP is considered whose dynamics evolve according to the difference equations shown in equations (3) and (4):

(3) x_{t+1} = F(x_t, α_t, a_t, s_1),

(4) α_{t+1} = G(α_t, s_2),

for all t = 0, 1, …. Note that χ(0, 0) = (ξ(0), η(0)) = (s_1, s_2), so the joint dynamics given by equations (3) and (4) is denoted as follows:

(x_{t+1}, α_{t+1}) = H(x_t, α_t, a_t, χ_t(0, 0)) ≔ (F(x_t, α_t, a_t, ξ_t(0)), G(α_t, η_t(0))).

Under this framework, we are interested in the approximation of MDPs that evolve through (1) and (2) by the deterministic control process given by equations (3) and (4).

When the processes of x -states and α -states are specified by the dynamical model given by equations (1) and (2), the transition law takes the form

(5) Q(B | x, α, a) ≔ Pr[(x_{t+1}, α_{t+1}) ∈ B | x_t = x, α_t = α, a_t = a] = ∫_{S_1 × S_2} 1_B(H(x, α, a, s)) μ(ds) = μ({s ∈ S_1 × S_2 : H(x, α, a, s) ∈ B}),

where B ∈ B(X × Γ), 1_B(·) denotes the indicator function of B, and μ is the common distribution of the random vectors χ_t(ε, δ).

On the other hand, when the processes of x -states and α -states are specified by the dynamical model equations (3) and (4), the transition law takes the form

(6) Q_H(B | x, α, a) ≔ 1_B(H(x, α, a, χ(0, 0))),

where B ∈ B(X × Γ) and (x, α, a) ∈ K. Thus, the Markov control model is given by (X × Γ, A, {A(x, α) : (x, α) ∈ X × Γ}, Q_H, c).

A control policy π is a sequence {π_t : t = 0, 1, …}, where, for each t = 0, 1, …, π_t(· | h_t) is a conditional probability on the Borel σ-algebra B(A), given the history h_t ≔ (x_0, α_0, a_0, …, x_{t−1}, α_{t−1}, a_{t−1}, x_t, α_t), such that π_t(A(x_t, α_t) | h_t) = 1. The set of all policies is denoted by Π.

Let F ≔ {ϕ : X × Γ → A | ϕ is measurable and ϕ(x, α) ∈ A(x, α) for all (x, α) ∈ X × Γ}. A sequence π = {ϕ_t : t = 0, 1, …} of functions ϕ_t ∈ F is called a Markov policy. A Markov policy π = {ϕ_t : t = 0, 1, …} is called a stationary policy if ϕ_t = ϕ ∈ F for all t = 0, 1, ….

Given initial states (x_0 = x, α_0 = α) ∈ X × Γ and an arbitrary policy π ∈ Π, there exists a probability measure P^π_{(x,α)} induced by the triplet (x, α, π) on the space Ω = (X × Γ × A)^∞, with F as the product σ-algebra. The existence of this probability measure is verified in a way analogous to that in the study by Gonzalez-Hernandez et al. [18]. The corresponding expectation operator is denoted by E^π_{(x,α)}. The triplet (x, α, π) determines a stochastic process (Ω, F, P^π_{(x,α)}, {(x_t, α_t)}) called the Markov decision process. Subsequently, we write y = (x, α) and Y = X × Γ.

3 Problem statement

Consider a deterministic Markov control model (Y, A, {A(y) : y ∈ Y}, Q_H, c) as presented in Section 2. In addition, consider a stochastic control system with the same state space Y, control space A, admissible sets A(y), y ∈ Y, and cost function c, but with the dynamical system described as follows:

y_{t+1} = H(y_t, a_t, χ_t(ε, δ)), t = 0, 1, ….

Note that, as ε → 0 and δ → 0, the stochastic system with transition law (5) reduces to the deterministic system with transition law (6).

For each policy π ∈ Π and initial state (x, α) ∈ Y, consider the expected total discounted cost, denoted by V̂_{ε,δ}(x, α, π) and defined as follows:

V̂_{ε,δ}(x, α, π) = E^π_{(x,α)} ∑_{t=0}^{∞} β^t c(x_t, α_t, a_t),

where β ∈ (0, 1) is a discount factor.
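For a fixed stationary policy, this expected total discounted cost can be estimated by Monte Carlo, truncating at a horizon where β^T is negligible. The dynamics, one-stage cost, and noise laws in this sketch are illustrative assumptions, not the paper's model:

```python
import random

def cost(x, alpha, a):
    # hypothetical quadratic one-stage cost
    return x * x + a * a

def estimate_cost(x0, alpha0, policy, eps, delta, beta=0.9,
                  horizon=200, n_paths=500, seed=0):
    # averages the truncated discounted cost over n_paths sample paths
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        x, alpha, acc, disc = x0, alpha0, 0.0, 1.0
        for _t in range(horizon):
            a = policy(x, alpha)
            acc += disc * cost(x, alpha, a)
            x = 0.8 * x - a + rng.uniform(-eps, eps)          # hypothetical F
            alpha = 0.5 * alpha + rng.uniform(-delta, delta)  # hypothetical G
            disc *= beta
        total += acc
    return total / n_paths

# with eps = delta = 0 this reduces to the deterministic cost of the policy
print(estimate_cost(1.0, 0.5, lambda x, alpha: 0.2 * x, 0.01, 0.01))
```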

Thus, the optimal control problem is to find a policy π * Π such that

V̂_{ε,δ}(x, α, π^*) = inf_{π ∈ Π} V̂_{ε,δ}(x, α, π),

for all (x, α) ∈ X × Γ. Then, the optimal value function (optimal cost) is defined as V_{ε,δ}(x, α) ≔ inf_{π ∈ Π} V̂_{ε,δ}(x, α, π), (x, α) ∈ X × Γ. The policy π^* is called the optimal policy, while V_{ε,δ}(x, α) is called the optimal value function, for (x, α) ∈ Y. In the deterministic case, i.e., when ε = 0 and δ = 0, V_{ε,δ} will be denoted by V.

In the next section, we establish conditions to perform an asymptotic analysis of the optimal solution for the stochastic system.

4 Conditions and results

In this section, we introduce three blocks of conditions to study the convergence of the stochastic system defined by equations (1) and (2). In addition, a bound is given for the stability index, which depends on a small-noise disturbance parameter δ̂_{ε,δ}. In the following, χ(ε, δ) denotes a generic element of {χ_t(ε, δ)}.

Condition 1

  (a) A(x, α) is a compact set for each (x, α) ∈ Y, and the set-valued mapping (x, α) ↦ A(x, α) is upper semicontinuous with respect to the Hausdorff metric.

  (b) The cost function c(y, ·) is lower semicontinuous on A(y) for every y ∈ Y.

  (c) For every bounded continuous function U : Y → R, the mapping

    (y, a) ↦ E U[H(y, a, χ(ε, δ))], (y, a) ∈ K,

    is continuous on K, where E is introduced in Remark 2.

Condition 1 is necessary to ensure the existence of minimizers in the corresponding optimality equation. Condition 1(a) is similar to Assumption 1 presented in the study by Gordienko et al. [10].

Let Z : X × Γ → [1, ∞) be a measurable function, called a weight function. If U is a real-valued function on X × Γ, then its weighted norm is defined as follows:

‖U‖_Z ≔ sup_{(x,α) ∈ X × Γ} |U(x, α)| / Z(x, α).

Let B_Z be the Banach space of measurable functions U : Y → R such that ‖U‖_Z < ∞.

Condition 2

There exist a constant γ ∈ (β, 1) and a weight function W on Y such that, for all ε ∈ [0, ε_0] and δ ∈ [0, δ_0]:

  (a) |c(y, a)| ≤ W(y), (y, a) ∈ K.

  (b) E W[H(y, a, χ(ε, δ))] ≤ (γ/β) W(y), (y, a) ∈ K.

  (c) For every state y ∈ Y, the function

    a ↦ E W[H(y, a, χ(ε, δ))]

    is continuous on A(y).

Condition 2 is used to provide the existence of solutions of the optimality equation [10]. In addition, under Conditions 1 and 2, the DP approach is valid. Thus, for each ( x , α ) X × Γ , the following relation holds:

V_{ε,δ}(x, α) = inf_{a ∈ A(x,α)} [ c(x, α, a) + β ∫_{X × Γ} V_{ε,δ}(y) Q(dy | x, α, a) ].

One method of approximating the value function is to use value iterations, which are defined as follows:

V^n_{ε,δ}(x, α) = inf_{a ∈ A(x,α)} [ c(x, α, a) + β ∫_{X × Γ} V^{n−1}_{ε,δ}(y) Q(dy | x, α, a) ],

where (x, α) ∈ X × Γ and n = 1, 2, …, with V^0_{ε,δ}(·) ≡ 0.
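For the deterministic case ε = δ = 0, the value iterations reduce to a DP recursion that can be run on a discretized model. In the sketch below, the grids, dynamics, and cost are illustrative placeholders, and a nearest-neighbor projection stands in for the exact transition:

```python
beta = 0.9
xs = [i / 10 for i in range(11)]        # discretized x-states
alphas = [0.5, 1.0]                     # discretized alpha-states
acts = [j / 10 for j in range(11)]      # discretized actions

def F(x, al, a, s1=0.0):
    # hypothetical state dynamics, clipped to the grid range
    return min(max(0.8 * x - 0.1 * a + s1, 0.0), 1.0)

def G(al, s2=0.0):
    return al                            # parameter frozen for simplicity

def cost(x, al, a):
    return x * x + 0.1 * a * a           # hypothetical one-stage cost

def nearest(grid, v):
    return min(grid, key=lambda g: abs(g - v))

V = {(x, al): 0.0 for x in xs for al in alphas}   # V^0 = 0
for _n in range(200):                             # value iterations V^n
    V = {(x, al): min(cost(x, al, a)
                      + beta * V[(nearest(xs, F(x, al, a)),
                                  nearest(alphas, G(al)))]
                      for a in acts)
         for x in xs for al in alphas}

print(V[(1.0, 1.0)])
```

Since β < 1, the iterates converge geometrically, so 200 iterations are far more than enough here.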

Condition 3

There exist constants L_0, L_1, L_{2,x}, and L_{2,α} such that:

  (a) |c(y, a) − c(y′, a)| ≤ L_0 d_1(y, y′) for each (y, a), (y′, a) ∈ K.

  (b) d_1(H(y, a, (s_1, s_2)), H(y′, a, (s_1, s_2))) ≤ L_1 d_1(y, y′) for each (y, a), (y′, a) ∈ K and all (s_1, s_2) ∈ S_1 × S_2, with L_1 ≥ 1.

  (c) The functions F and G satisfy d_x(F(x, α, a, s_1), F(x, α, a, s_1′)) ≤ L_{2,x} r_1(s_1, s_1′) for each (x, α, a) ∈ K and all s_1, s_1′ ∈ S_1.

  (d) d_α(G(α, s_2), G(α, s_2′)) ≤ L_{2,α} r_2(s_2, s_2′) for any α ∈ Γ and all s_2, s_2′ ∈ S_2.

Remark 3

Under Condition 3, the cost function and the function H involved in the dynamics of the states are Lipschitz functions with respect to the variable y Y . Furthermore, the functions F and G are Lipschitz functions with respect to ξ and η , respectively.

If Conditions 1 and 2 are satisfied, then, by arguments similar to those in [4] (taking into account the respective changes), the existence of a stationary optimal policy π_{ε,δ} = {f_{ε,δ}, f_{ε,δ}, …} is guaranteed, where f_{ε,δ} ∈ F. Moreover, the following facts hold:

  (a) V̂_{ε,δ}(x, α, π_{ε,δ}) = V_{ε,δ}(x, α) ∈ B_W.

  (b) E V_{ε,δ}[H(y, a, χ(ε, δ))] < ∞ for each ε ∈ [0, ε_0], δ ∈ [0, δ_0], and (y, a) ∈ K.

Moreover, the optimal policy for the deterministic control problem is denoted by π_0^* = {f^*, f^*, …} with f^* ∈ F.

Let L be the Kantorovich metric defined on (S, B(S)):

(7) L(χ, χ′) = sup{ |E φ(χ) − E φ(χ′)| : φ such that |φ(s) − φ(s′)| ≤ r(s, s′), s, s′ ∈ S }.

On the other hand, the stability index Δ_{ε,δ} is defined as follows:

Δ_{ε,δ}(y, π) ≔ V̂_{ε,δ}(y, π) − V_{ε,δ}(y), y ∈ Y, π ∈ Π.

The index Δ_{ε,δ}(y, π) expresses the excess of the discounted cost incurred when the policy π is applied to the stochastic control process governed by equations (1) and (2), for ε, δ > 0 and y ∈ Y. The quality of the approximation of the stochastic system by the policy π_0^* will be measured by the stability index Δ_{ε,δ}(y, π_0^*) (see [10,13]), i.e.,

Δ_{ε,δ}(y, π_0^*) ≔ V̂_{ε,δ}(y, π_0^*) − V_{ε,δ}(y), y ∈ Y.

In addition, we define a small-noise disturbance parameter δ̂_{ε,δ} as follows:

δ̂_{ε,δ} ≔ E max{ r_1(ξ(ε), ξ(0)), r_2(η(δ), η(0)) }

for ε ∈ [0, ε_0] and δ ∈ [0, δ_0]. Theorem 1 provides an upper bound for Δ_{ε,δ}(y, π_0^*) involving the noise parameter δ̂_{ε,δ}.
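Once the noise laws are fixed, δ̂_{ε,δ} can be estimated by Monte Carlo. In this sketch, ξ(ε) and η(δ) are assumed uniform on [−ε, ε] and [−δ, δ], so that ξ(0) = η(0) = 0 and r_1, r_2 are the usual metric on R; these laws are illustrative, not prescribed by the paper:

```python
import random

def delta_hat(eps, delta, n=100_000, seed=0):
    # Monte Carlo estimate of E max{ r1(xi(eps), xi(0)), r2(eta(delta), eta(0)) }
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n):
        xi = rng.uniform(-eps, eps)       # xi(eps); xi(0) = 0
        eta = rng.uniform(-delta, delta)  # eta(delta); eta(0) = 0
        acc += max(abs(xi), abs(eta))
    return acc / n

# delta_hat shrinks with the noise parameters, which is what drives
# the stability index to zero in Theorem 1
print(delta_hat(0.1, 0.1), delta_hat(0.01, 0.01))
```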

The following lemmas are applied to prove Theorems 1 and 2.

Lemma 1

Under Conditions 1, 2, and 3(a) and (b), for each fixed ε ∈ [0, ε_0] and δ ∈ [0, δ_0], V^n_{ε,δ} is a Lipschitz function for all n = 1, 2, …. Consequently, V_{ε,δ} is a Lipschitz function with Lipschitz constant L_0/(1 − βL_1).

Proof

Let ε ∈ [0, ε_0] and δ ∈ [0, δ_0]. The proof proceeds by induction. For n = 1, note that, for all (x, α), (x′, α′) ∈ Y,

|V^1_{ε,δ}(x, α) − V^1_{ε,δ}(x′, α′)| = | inf_{a ∈ A} c(x, α, a) − inf_{a ∈ A} c(x′, α′, a) | ≤ sup_{a ∈ A} |c(x, α, a) − c(x′, α′, a)| ≤ L_0 d_1((x, α), (x′, α′)).

For n > 1, suppose that V^{n−1}_{ε,δ} is a Lipschitz function with constant L_0 ∑_{i=0}^{n−2} (βL_1)^i. Then,

|V^n_{ε,δ}(x, α) − V^n_{ε,δ}(x′, α′)|
= | inf_{a ∈ A} { c(x, α, a) + β E V^{n−1}_{ε,δ}[H(x, α, a, χ(ε, δ))] } − inf_{a ∈ A} { c(x′, α′, a) + β E V^{n−1}_{ε,δ}[H(x′, α′, a, χ(ε, δ))] } |
≤ sup_{a ∈ A} { |c(x, α, a) − c(x′, α′, a)| + β | E V^{n−1}_{ε,δ}[H(x, α, a, χ(ε, δ))] − E V^{n−1}_{ε,δ}[H(x′, α′, a, χ(ε, δ))] | }
≤ L_0 d_1((x, α), (x′, α′)) + β sup_{a ∈ A} E [ L_0 ∑_{i=0}^{n−2} (βL_1)^i d_1(H(x, α, a, χ(ε, δ)), H(x′, α′, a, χ(ε, δ))) ]
≤ L_0 d_1((x, α), (x′, α′)) + β L_0 ∑_{i=0}^{n−2} (βL_1)^i L_1 d_1((x, α), (x′, α′))
= L_0 [ 1 + ∑_{i=1}^{n−1} (βL_1)^i ] d_1((x, α), (x′, α′))
= L_0 ∑_{i=0}^{n−1} (βL_1)^i d_1((x, α), (x′, α′)).

Therefore, V^n_{ε,δ} is a Lipschitz function with constant L_0 ∑_{i=0}^{n−1} (βL_1)^i for every n ∈ N.

Now, to verify the second part, note that βL_1 < 1; thus ∑_{i=0}^{∞} (βL_1)^i = 1/(1 − βL_1). In addition, since V^n_{ε,δ} → V_{ε,δ} as n → ∞, it follows that V_{ε,δ} is a Lipschitz function with Lipschitz constant L_0/(1 − βL_1).□

Lemma 2

Under Conditions 1 and 2(b), for each ε ∈ [0, ε_0], δ ∈ [0, δ_0], and t ≥ 1, it holds that

(8) E_y^{π_0^*} sup_{a ∈ A(y_{t−1})} { E W[H(y_{t−1}, a, χ_{t−1}(ε, δ))] } ≤ (γ/β)^{t−1} W(y).

Proof

Consider ε ∈ [0, ε_0] and δ ∈ [0, δ_0]. From Condition 2(b), we have that

E W[H(y_{t−1}, a, χ_{t−1}(ε, δ))] ≤ (γ/β) W(y_{t−1})

for any fixed t ≥ 1. Then, it is obtained that

E_y^{π_0^*} sup_{a ∈ A(y_{t−1})} { E W[H(y_{t−1}, a, χ_{t−1}(ε, δ))] } ≤ (γ/β) E_y^{π_0^*} W(y_{t−1})

for any fixed t ≥ 1. Now, consider ĥ_t = {y, a_1, y_1, a_2, …, y_{t−1}, a_t}, the history of the joint process described by equations (1) and (2) under the policy π_0^* = {f^*, f^*, …}; then

E_y^{π_0^*} W(y_{t−1}) = E_y^{π_0^*} W(H(y_{t−2}, a_{t−2}, χ_{t−2}(ε, δ))) = E_y^{π_0^*} [ E[W(H(y_{t−2}, a_{t−2}, χ_{t−2}(ε, δ))) | ĥ_{t−2}] ] ≤ (γ/β) E_y^{π_0^*} [ W(y_{t−2}) | ĥ_{t−2} ] = (γ/β) E_y^{π_0^*} [ W(H(y_{t−3}, a_{t−3}, χ_{t−3}(ε, δ))) | ĥ_{t−2} ] = (γ/β) E_y^{π_0^*} W(H(y_{t−3}, a_{t−3}, χ_{t−3}(ε, δ))).

Thus,

E_y^{π_0^*} sup_{a ∈ A(y_{t−1})} { E W[H(y_{t−1}, a, χ_{t−1}(ε, δ))] } ≤ (γ/β)^2 E_y^{π_0^*} W(H(y_{t−3}, a_{t−3}, χ_{t−3}(ε, δ))).

Continuing with this procedure, it is obtained that

E_y^{π_0^*} sup_{a ∈ A(y_{t−1})} { E W[H(y_{t−1}, a, χ_{t−1}(ε, δ))] } ≤ (γ/β)^{t−1} W(y).□

The proof of the following theorem is based on Theorem 1 in [10].

Theorem 1

Under Conditions 1–3, it holds that

Δ_{ε,δ}(y, π_0^*) ≤ Ĉ(y) δ̂_{ε,δ}, y ∈ Y,

where

Ĉ(y) = (2βL_0 max{L_{2,x}, L_{2,α}} / (1 − βL_1)) [ 1/(1 − β) + (β/(1 − γ)^2) W(y) ],

for each ε ∈ [0, ε_0] and δ ∈ [0, δ_0].

Proof

Note that, for ε ∈ [0, ε_0] and δ ∈ [0, δ_0], V_{ε,δ} and f_{ε,δ} satisfy the following optimality equation:

(9) V_{ε,δ}(y) = inf_{a ∈ A(y)} { c(y, a) + β E V_{ε,δ}[H(y, a, χ(ε, δ))] } = c(y, f_{ε,δ}(y)) + β E V_{ε,δ}[H(y, f_{ε,δ}(y), χ(ε, δ))].

Denote

(10) R_{ε,δ}(y, a) ≔ c(y, a) + β E V_{ε,δ}[H(y, a, χ(ε, δ))], (y, a) ∈ K,

and consider ĥ_t = {y, a_1, y_1, a_2, …, y_{t−1}, a_t} as in the proof of Lemma 2. By the Markov property, it can be shown that

(11) E^{π_0^*}[β V_{ε,δ}(y_t) | ĥ_t] = R_{ε,δ}(y_{t−1}, a_t) − c(y_{t−1}, a_t) − inf_{a ∈ A(y_{t−1})} R_{ε,δ}(y_{t−1}, a) + inf_{a ∈ A(y_{t−1})} R_{ε,δ}(y_{t−1}, a).

Denoting Λ_t^{ε,δ} ≔ R_{ε,δ}(y_{t−1}, a_t) − inf_{a ∈ A(y_{t−1})} R_{ε,δ}(y_{t−1}, a), then, by equation (11), it is obtained that

(12) E^{π_0^*}[β V_{ε,δ}(y_t) | ĥ_t] = Λ_t^{ε,δ} − c(y_{t−1}, a_t) + V_{ε,δ}(y_{t−1}).

If we take the expected value in equation (12), we obtain that

(13) E_y^{π_0^*}[β V_{ε,δ}(y_t)] = E_y^{π_0^*}[V_{ε,δ}(y_{t−1})] − E_y^{π_0^*}[c(y_{t−1}, a_t)] + E_y^{π_0^*}[Λ_t^{ε,δ}].

Summing equation (13) over t = 1, 2, …, n with weights β^{t−1}, we obtain

(14) ∑_{t=1}^{n} β^{t−1} E_y^{π_0^*}[c(y_{t−1}, a_t)] = ∑_{t=1}^{n} β^{t−1} [ E_y^{π_0^*} V_{ε,δ}(y_{t−1}) − β E_y^{π_0^*} V_{ε,δ}(y_t) ] + ∑_{t=1}^{n} β^{t−1} E_y^{π_0^*}[Λ_t^{ε,δ}] = V_{ε,δ}(y) − β^n E_y^{π_0^*} V_{ε,δ}(y_n) + ∑_{t=1}^{n} β^{t−1} E_y^{π_0^*} Λ_t^{ε,δ}.

Since V_{ε,δ} ∈ B_W, lim_{n→∞} β^n E_y^{π_0^*} V_{ε,δ}(y_n) = 0. Thus, letting n → ∞, it follows from equation (14) that

(15) Δ_{ε,δ}(y, π_0^*) = ∑_{t=1}^{∞} β^{t−1} E_y^{π_0^*} Λ_t^{ε,δ} = ∑_{t=1}^{∞} β^{t−1} E_y^{π_0^*} c(y_{t−1}, a_t) − V_{ε,δ}(y).

Now, by equations (9) and (10), it follows that

R_{0,0}(y, f^*(y)) = inf_{a ∈ A(y)} R_{0,0}(y, a);

then, since a_t = f^*(y_{t−1}) under π_0^*,

Λ_t^{ε,δ} = R_{ε,δ}(y_{t−1}, f^*(y_{t−1})) − R_{0,0}(y_{t−1}, f^*(y_{t−1})) + inf_{a ∈ A(y_{t−1})} { R_{0,0}(y_{t−1}, a) } − inf_{a ∈ A(y_{t−1})} { R_{ε,δ}(y_{t−1}, a) },

which implies that

Λ_t^{ε,δ} ≤ | R_{ε,δ}(y_{t−1}, f^*(y_{t−1})) − R_{0,0}(y_{t−1}, f^*(y_{t−1})) | + sup_{a ∈ A(y_{t−1})} { R_{0,0}(y_{t−1}, a) − R_{ε,δ}(y_{t−1}, a) }.

Therefore,

Λ_t^{ε,δ} ≤ 2 sup_{a ∈ A(y_{t−1})} | R_{ε,δ}(y_{t−1}, a) − R_{0,0}(y_{t−1}, a) | ≤ 2β sup_{a ∈ A(y_{t−1})} | E V_{ε,δ}(H(y_{t−1}, a, χ(ε, δ))) − E V(H(y_{t−1}, a, χ(0, 0))) |,

where the expected value in the last term is taken with respect to the random vector χ(ε, δ) at fixed t. It follows from the last inequality that

(16) Λ_t^{ε,δ} ≤ 2β sup_{a ∈ A(y_{t−1})} | E V_{ε,δ}(H(y_{t−1}, a, χ(ε, δ))) − E V_{ε,δ}(H(y_{t−1}, a, χ(0, 0))) | + 2β sup_{a ∈ A(y_{t−1})} | E V_{ε,δ}(H(y_{t−1}, a, χ(0, 0))) − E V(H(y_{t−1}, a, χ(0, 0))) | ≤ 2β μ_1(χ(ε, δ), χ(0, 0)) + 2β ‖V_{ε,δ} − V‖_W sup_{a ∈ A(y_{t−1})} E W(H(y_{t−1}, a, χ(0, 0))),

where

μ_1(χ(ε, δ), χ(0, 0)) = sup_{(y,a) ∈ K} | E V_{ε,δ}(H(y, a, χ(ε, δ))) − E V_{ε,δ}(H(y, a, χ(0, 0))) |.

From Proposition 8.3.9 part (a) of [4], it can be shown that

T_{ε,δ} u(y) ≔ inf_{a ∈ A(y)} { c(y, a) + β E u(H(y, a, χ(ε, δ))) }

is a contraction operator on B_W with modulus γ, for each ε ∈ [0, ε_0] and δ ∈ [0, δ_0]. Since V_{ε,δ} and V are fixed points of the operators T_{ε,δ} and T_{0,0}, respectively, we obtain that

‖V_{ε,δ} − V‖_W ≤ ‖T_{ε,δ} V_{ε,δ} − T_{0,0} V_{ε,δ}‖_W + ‖T_{0,0} V_{ε,δ} − T_{0,0} V‖_W.

This last relation implies that

(17) ‖V_{ε,δ} − V‖_W ≤ (1 − γ)^{−1} ‖T_{ε,δ} V_{ε,δ} − T_{0,0} V_{ε,δ}‖_W ≤ β (1 − γ)^{−1} sup_{y ∈ Y} { W^{−1}(y) sup_{a ∈ A(y)} | E V_{ε,δ}[H(y, a, χ(ε, δ))] − E V_{ε,δ}[H(y, a, χ(0, 0))] | }.

Combining inequality (8) from Lemma 2 with expressions (16) and (17), it follows that

E_y^{π_0^*} Λ_t^{ε,δ} ≤ 2β [ 1 + (β/(1 − γ)) (γ/β)^{t−1} W(y) ] μ_1(χ(ε, δ), χ(0, 0)).

Finally, by equation (15), we obtain that

Δ_{ε,δ}(y, π_0^*) ≤ 2β [ 1/(1 − β) + (β/(1 − γ)^2) W(y) ] μ_1(χ(ε, δ), χ(0, 0)).

By Lemma 1 and Conditions 3(c) and (d), it follows that

(18) Δ_{ε,δ}(y, π_0^*) ≤ (2βL_0 max{L_{2,x}, L_{2,α}} / (1 − βL_1)) [ 1/(1 − β) + (β/(1 − γ)^2) W(y) ] L(χ(ε, δ), χ(0, 0)).

Considering the particular case χ′ = χ(0, 0) in equation (7), it follows that

L(χ(ε, δ), χ(0, 0)) = E r(χ(ε, δ), χ(0, 0)) = δ̂_{ε,δ}.

Therefore, substituting the previous equality into equation (18), the result follows.□
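Once the model constants are known, the bound of Theorem 1 is a closed-form expression and can be evaluated directly. The following sketch computes the constant Ĉ(y); the numerical constants are illustrative placeholders, not taken from any specific model:

```python
def c_hat(beta, gamma, L0, L1, L2x, L2a, W_y):
    # C_hat(y) = (2 beta L0 max{L2x, L2a} / (1 - beta L1))
    #            * (1/(1 - beta) + beta/(1 - gamma)^2 * W(y))
    assert 0 < beta < gamma < 1 and beta * L1 < 1
    lead = 2 * beta * L0 * max(L2x, L2a) / (1 - beta * L1)
    return lead * (1 / (1 - beta) + beta / (1 - gamma) ** 2 * W_y)

# illustrative constants: beta = 0.9, gamma = 0.95,
# all Lipschitz constants equal to 1, and W(y) = 1
print(c_hat(0.9, 0.95, 1.0, 1.0, 1.0, 1.0, 1.0))
```

Note how the bound degrades as γ → 1 (through the (1 − γ)^{−2} factor) and as βL_1 → 1, which is why Condition 3(b) is paired with the contraction requirement.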

Remark 4

Observe that Theorem 1 guarantees that the optimal policy π_0^* ∈ F of the deterministic system (see equations (3) and (4)) is asymptotically optimal for the stochastic system (see equations (1) and (2)), i.e.,

lim_{ε,δ→0} [ V̂_{ε,δ}(y, π_0^*) − V_{ε,δ}(y) ] = 0.

In the following lemma, we verify the continuity of the function f^*, under the assumption that the optimal policy π_0^* is unique. Uniqueness of the optimal policy is a restrictive assumption, but in [24] three blocks of conditions on the components of the decision model are provided that guarantee it. In particular, Cruz-Suarez et al. [24] provided conditions for uniqueness when the state space is a subset of R^n.

Lemma 3

Under Conditions 1 and 2, if, in addition, the stationary optimal policy π_0^* = {f^*, f^*, …} for the deterministic problem is unique, then f^* is a continuous function.

Proof

By contradiction, it will be shown that, for ε = 0 and δ = 0, the optimal policy f^* : Y → A is a continuous function. Under Conditions 1 and 2, we have that

(19) V(x, α) = inf_{a ∈ A} { c(x, α, a) + β V(F(x, α, a, ξ(0)), G(α, η(0))) } = c(x, α, f^*(x, α)) + β V(F(x, α, f^*(x, α), s_1), G(α, s_2)),

for all (x, α) ∈ Y. Suppose there exists (x̂, α̂) ∈ Y at which f^* is not continuous. Then, there exists a sequence {(x_n, α_n)} such that (x_n, α_n) → (x̂, α̂) but d_2(f^*(x_n, α_n), f^*(x̂, α̂)) ↛ 0 as n → ∞. After taking a subsequence, if necessary, we may assume without loss of generality that there exists τ > 0 such that d_2(f^*(x_n, α_n), f^*(x̂, α̂)) ≥ τ. Since A is compact, there exists a subsequence {z_{n_k}} of {z_n = f^*(x_n, α_n)} that converges to some z ∈ A with z ≠ f^*(x̂, α̂). Substituting (x_{n_k}, α_{n_k}) and f^*(x_{n_k}, α_{n_k}) for (x, α) and f^*(x, α) in equation (19), we obtain

V(x_{n_k}, α_{n_k}) = c(x_{n_k}, α_{n_k}, f^*(x_{n_k}, α_{n_k})) + β V(F(x_{n_k}, α_{n_k}, f^*(x_{n_k}, α_{n_k}), ξ(0)), G(α_{n_k}, η(0))).

By the continuity of the functions c, F, G, and V, letting k → ∞, we obtain

V(x̂, α̂) = c(x̂, α̂, z) + β V(F(x̂, α̂, z, ξ(0)), G(α̂, η(0))).

From Conditions 1 and 2, there exists an optimal policy f̄ with z = f̄(x̂, α̂). But f̄(x̂, α̂) ≠ f^*(x̂, α̂), which contradicts the uniqueness of the optimal policy. Therefore, f^* is a continuous function.□

Next, the main theorem is stated and proved.

Theorem 2

Under Conditions 1–3, for each ε ∈ [0, ε_0] and δ ∈ [0, δ_0], the following statements hold:

  (a) ‖V_{ε,δ} − V‖_W ≤ C_1 δ̂_{ε,δ}, where C_1 = (β/(1 − γ)) L_0 max{L_{2,x}, L_{2,α}} / (1 − βL_1).

  (b) Let K be a compact subset of Y. If the stationary optimal policy π_0^* = {f^*, f^*, …} for the deterministic problem is unique, then f_{ε,δ} → f^* uniformly on K as ε and δ go to zero.

Proof

(a) Observe that equation (17) implies that

‖V_{ε,δ} − V‖_W ≤ β (1 − γ)^{−1} sup_{y ∈ Y} { W^{−1}(y) sup_{a ∈ A(y)} | E V_{ε,δ}(H(y, a, χ(ε, δ))) − E V_{ε,δ}(H(y, a, χ(0, 0))) | }.

Now, by Lemma 1 and since W(y) ≥ 1 for all y ∈ Y, it yields that

‖V_{ε,δ} − V‖_W ≤ β (1 − γ)^{−1} (L_0/(1 − βL_1)) sup_{y ∈ Y} sup_{a ∈ A(y)} E d_1(H(y, a, χ(ε, δ)), H(y, a, χ(0, 0))).

On the other hand, the following expressions hold:

E d_1(H(y, a, χ(ε, δ)), H(y, a, χ(0, 0))) = E max{ d_x(F(x, α, a, ξ(ε)), F(x, α, a, ξ(0))), d_α(G(α, η(δ)), G(α, η(0))) } ≤ E [ L_{2,x} r_1(ξ(ε), ξ(0)) 1_{{d_α ≤ d_x}} ] + E [ L_{2,α} r_2(η(δ), η(0)) 1_{{d_x < d_α}} ],

where 1_{{d_α ≤ d_x}} denotes the indicator function of the event in which d_α(G(α, η(δ)), G(α, η(0))) ≤ d_x(F(x, α, a, ξ(ε)), F(x, α, a, ξ(0))), and 1_{{d_x < d_α}} that of the complementary event. Then, we conclude that

‖V_{ε,δ} − V‖_W ≤ β (1 − γ)^{−1} (L_0 max{L_{2,x}, L_{2,α}} / (1 − βL_1)) δ̂_{ε,δ}.

(b) Suppose, by contradiction, that there exist a compact set K ⊂ Y, a real number τ > 0, and sequences {ε_n}, {δ_n} convergent to 0 such that

(20) d_2(f_{ε_n,δ_n}(x_n, α_n), f^*(x_n, α_n)) ≥ τ/2, n = 1, 2, …,

for some convergent sequence {(x_n, α_n)} ⊂ K with (x_n, α_n) → (x, α) ∈ K as n → ∞. Since A is compact, take a subsequence {(x_m, α_m)} of {(x_n, α_n)} such that f_{ε_m,δ_m}(x_m, α_m) → a ∈ A. By the continuity of f^* given by Lemma 3 and by equation (20), we obtain that d_2(a, f^*(x, α)) ≥ τ/2.

Since $f_{\varepsilon_{m},\delta_{m}}$ is an optimal policy, we obtain by equation (19) that

(21) $V_{\varepsilon_{m},\delta_{m}}(x_{m},\alpha_{m})=c(x_{m},\alpha_{m},f_{\varepsilon_{m},\delta_{m}}(x_{m},\alpha_{m}))+\beta E\,V_{\varepsilon_{m},\delta_{m}}(F(x_{m},\alpha_{m},f_{\varepsilon_{m},\delta_{m}}(x_{m},\alpha_{m}),\xi(\varepsilon_{m})),G(\alpha_{m},\eta(\delta_{m})))$

for $m=1,2,\ldots$.

Now, abbreviating $a_{m}\coloneqq f_{\varepsilon_{m},\delta_{m}}(x_{m},\alpha_{m})$, note that

(22) $\begin{aligned}
&|E\,V_{\varepsilon_{m},\delta_{m}}(F(x_{m},\alpha_{m},a_{m},\xi(\varepsilon_{m})),G(\alpha_{m},\eta(\delta_{m})))-V(F(x,\alpha,a,\xi(0)),G(\alpha,\eta(0)))|\\
&\quad\leq|E\,V_{\varepsilon_{m},\delta_{m}}(F(x_{m},\alpha_{m},a_{m},\xi(\varepsilon_{m})),G(\alpha_{m},\eta(\delta_{m})))-V_{\varepsilon_{m},\delta_{m}}(F(x_{m},\alpha_{m},a_{m},\xi(0)),G(\alpha_{m},\eta(0)))|\\
&\qquad+|V_{\varepsilon_{m},\delta_{m}}(F(x_{m},\alpha_{m},a_{m},\xi(0)),G(\alpha_{m},\eta(0)))-V(F(x_{m},\alpha_{m},a_{m},\xi(0)),G(\alpha_{m},\eta(0)))|\\
&\qquad+|V(F(x_{m},\alpha_{m},a_{m},\xi(0)),G(\alpha_{m},\eta(0)))-V(F(x,\alpha,a,\xi(0)),G(\alpha,\eta(0)))|.
\end{aligned}$

By Lemma 1, the first term on the right-hand side of equation (22) is less than or equal to $\frac{L_{0}}{1-\beta L_{1}}\hat{\delta}_{\varepsilon_{m},\delta_{m}}$. The remaining terms converge to $0$ as $m\to\infty$, by part (a) and by the continuity of the functions $F$, $G$, and $V$. Therefore, letting $m\to\infty$ in equation (21) yields

$$V(x,\alpha)=c(x,\alpha,a)+\beta V(F(x,\alpha,a,\xi(0)),G(\alpha,\eta(0))).$$

By arguments similar to those in the proof of Lemma 3, there exists an optimal policy $\bar{f}$ with $a=\bar{f}(x,\alpha)$, but $\bar{f}(x,\alpha)\neq f^{*}(x,\alpha)$. This contradicts the uniqueness of the optimal policy. Therefore, $f_{\varepsilon_{n},\delta_{n}}$ converges uniformly to $f^{*}$ on $K$.□

5 Examples

In this section, we present two examples that illustrate the developed theory, together with two further examples that fail the conditions of Theorem 1 and therefore lead to conclusions quite different from those provided by that result. Throughout this section, $d_{x}$, $d_{\alpha}$, $d_{2}$, $r_{1}$, and $r_{2}$ denote the usual metric on $\mathbb{R}$.

5.1 Consumption-investment problem

We consider a consumption-investment problem [15,23] in which an investor must allocate the current wealth, say $x_{t}$, between consumption $a_{t}$ and investment $x_{t}-a_{t}$ at each stage $t=0,1,2,\ldots$. In addition, at each stage $t$, a discount factor $\exp(-\alpha_{t})$ is imposed, which depends on the real bank interest rate $\alpha_{t}$. The state and action spaces are $X=A=[0,\infty)$. Assuming that borrowing is not allowed, the set of admissible controls takes the form $A(x,\alpha)=[0,x]$. Furthermore, it is assumed that the bank pays at least an interest rate of $\exp(\alpha_{*})-1$ for some $\alpha_{*}>0$; thus, the discount rate space is $\Gamma=[\alpha_{*},\infty)$.

The state process { x t } and the discounting process { α t } satisfy the following difference equations:

(23) $x_{t+1}=\xi_{t}(\varepsilon)(x_{t}-a_{t}),\qquad\alpha_{t+1}=\alpha_{t}+\eta_{t}(\delta),$

for $t=0,1,2,\ldots$, with $(x_{0},\alpha_{0})\in Y$ given, where $\{\xi_{t}\}$ and $\{\eta_{t}\}$ are sequences of independent and identically distributed discrete random variables, independent of $(x_{0},\alpha_{0})$, and $S_{1}=S_{2}=[0,1]$.
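To make the coupled dynamics (23) concrete, the sketch below simulates one trajectory under an arbitrary consumption rule. The specific noise laws (uniform, parametrized so that $\xi_{t}(0)=s_{1}$ and $\eta_{t}(0)=0$), the constant $s_{1}$, and all numerical values are illustrative assumptions, not part of the model.

```python
import random

def simulate(x0, alpha0, policy, T, eps, delta, s1=0.5, seed=0):
    """Simulate the coupled dynamics (23):
    x_{t+1} = xi_t(eps) * (x_t - a_t),  alpha_{t+1} = alpha_t + eta_t(delta).

    Assumed noise laws (for illustration only): xi_t(eps) is uniform on
    [s1 - eps, s1 + eps], clipped to S1 = [0, 1] (so xi_t(0) = s1), and
    eta_t(delta) is uniform on [0, delta] (so eta_t(0) = 0)."""
    rng = random.Random(seed)
    x, alpha = x0, alpha0
    path = [(x, alpha)]
    for _ in range(T):
        a = policy(x, alpha)                  # a in A(x, alpha) = [0, x]
        xi = min(1.0, max(0.0, s1 + eps * (2.0 * rng.random() - 1.0)))
        eta = delta * rng.random()
        x = xi * (x - a)                      # return on the invested wealth
        alpha = alpha + eta                   # interest parameter drifts upward
        path.append((x, alpha))
    return path

# consume half of the current wealth at every stage
path = simulate(10.0, 0.1, lambda x, al: 0.5 * x, T=25, eps=0.1, delta=0.05)
```

Since the noise stays nonnegative and $a\in A(x,\alpha)$, the wealth path remains in $X=[0,\infty)$ and the $\alpha$-path is nondecreasing, as the model requires.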

Remark 5

In particular, if $\eta_{t}(0)=s_{2}=0$ for $t=0,1,\ldots$ in equation (23), the corresponding deterministic MDP has a constant discount factor.

The objective is to maximize the investor's total expected utility of consumption; for each $\pi\in\Pi$,

$$\hat{V}_{\varepsilon,\delta}(x,\alpha,\pi)=E_{(x,\alpha)}^{\pi}\left[\sum_{t=0}^{\infty}e^{-S_{t}}u(x_{t},\alpha_{t},a_{t})\right],$$

where $S_{t}=\alpha_{0}+\alpha_{1}+\cdots+\alpha_{t-1}$ and $u$ is a utility function. In particular, consider the utility function $u$ defined by

$$u(x,\alpha,a)=\frac{b}{\gamma_{1}}a^{\gamma_{1}},\qquad(x,\alpha,a)\in K,$$

where $b>0$ and $\gamma_{1}\in(0,1)$. In addition, suppose that $\mu_{\gamma_{1}}\coloneqq E[\xi^{\gamma_{1}}]<\infty$ with $0<\beta\mu_{\gamma_{1}}<1$, where $\beta=e^{-\alpha_{0}}$. By the definition of the utility function, Conditions 1(b) and 3(a) are immediately satisfied with $L_{0}=1$.
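The moment condition above is easy to probe numerically. The sketch below estimates $\mu_{\gamma_{1}}=E[\xi^{\gamma_{1}}]$ by Monte Carlo; the law of $\xi$ (uniform on $S_{1}=[0,1]$) and the values of $\gamma_{1}$ and $\alpha_{0}$ are assumptions chosen purely for illustration.

```python
import math
import random

# Quick numerical check (not part of the proof) of the moment condition
# 0 < beta * mu_{gamma_1} < 1, with mu_{gamma_1} = E[xi^{gamma_1}] and
# beta = e^{-alpha_0}. Assumed: xi uniform on [0, 1], gamma_1 = 0.5,
# alpha_0 = 0.1.
gamma1, alpha0 = 0.5, 0.1
beta = math.exp(-alpha0)

rng = random.Random(42)
n = 100_000
mu = sum(rng.random() ** gamma1 for _ in range(n)) / n   # close to 2/3 here

print(0.0 < beta * mu < 1.0)
```

For the uniform law, $E[\xi^{1/2}]=2/3$ exactly, so the product $\beta\mu_{\gamma_{1}}\approx 0.9\cdot 0.67$ is safely inside $(0,1)$.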

Note that $A(x,\alpha)$ is compact for all $(x,\alpha)\in X\times\Gamma$. Now, let $H_{a}$ denote the Hausdorff metric; then, for $(x,\alpha),(x^{\prime},\alpha^{\prime})\in X\times\Gamma$, we have that

$$H_{a}(A(x,\alpha),A(x^{\prime},\alpha^{\prime}))=H_{a}([0,x],[0,x^{\prime}])=|x-x^{\prime}|\leq\max\{|x-x^{\prime}|,|\alpha-\alpha^{\prime}|\}=d_{1}((x,\alpha),(x^{\prime},\alpha^{\prime})).$$

Hence, the set-valued mapping $(x,\alpha)\mapsto A(x,\alpha)$ is continuous with respect to the Hausdorff metric, so Condition 1(a) is satisfied. Furthermore, due to the continuity of $H$, the function $u(x,\alpha,a)$ is also continuous, so Condition 1(c) is valid. Consider $W:X\times\Gamma\to[1,\infty)$ defined by

$$W(x,\alpha)=\frac{b\,\mu_{\gamma_{1}}}{\gamma_{1}(1-\beta\mu_{\gamma_{1}})}x^{\gamma_{1}}+1,\qquad(x,\alpha)\in X\times\Gamma.$$

Cruz-Suárez et al. [23] verified that the function $W$ satisfies Conditions 2(a) and (b). In addition, note that

$$E\,W[H(x,\alpha,a,\chi(\varepsilon,\delta))]=\frac{b\,\mu_{\gamma_{1}}}{\gamma_{1}(1-\beta\mu_{\gamma_{1}})}(x-a)^{\gamma_{1}}E[\xi^{\gamma_{1}}(\varepsilon)]+1$$

is continuous on $K$, so Condition 2 holds.

On the other hand, for $(x,\alpha,a),(x^{\prime},\alpha^{\prime},a)\in K$ and for all $\chi(\varepsilon,\delta)(\omega_{1},\omega_{2})\in S_{1}\times S_{2}$, we have that

$$d_{1}(H(x,\alpha,a,\chi(\varepsilon,\delta)(\omega_{1},\omega_{2})),H(x^{\prime},\alpha^{\prime},a,\chi(\varepsilon,\delta)(\omega_{1},\omega_{2})))=\max\{\xi(\omega_{1})|x-x^{\prime}|,|\alpha-\alpha^{\prime}|\}\leq\max\{|x-x^{\prime}|,|\alpha-\alpha^{\prime}|\}=d_{1}((x,\alpha),(x^{\prime},\alpha^{\prime})),$$

so Condition 3(b) is valid with $L_{1}=1$. Finally, Condition 3(c) is satisfied due to the following expressions:

$$d_{x}(F(x,\alpha,a,\xi(\omega_{1})),F(x,\alpha,a,\xi(\omega_{1}^{\prime})))=|\xi(\omega_{1})-\xi(\omega_{1}^{\prime})|(x-a)\leq x\,|\xi(\omega_{1})-\xi(\omega_{1}^{\prime})|=L_{2,x}\,r_{1}(\xi(\omega_{1}),\xi(\omega_{1}^{\prime}))$$

for each $(x,\alpha,a)\in K$ and for all $\xi(\omega_{1}),\xi(\omega_{1}^{\prime})\in S_{1}$, where $L_{2,x}\coloneqq x$.

We also obtain, writing the disturbance in the parametrized form $\eta(\delta)=\delta\eta$, that

$$d_{\alpha}(G(\alpha,\eta(\omega_{2})),G(\alpha,\eta(\omega_{2}^{\prime})))=|(\alpha+\delta\eta(\omega_{2}))-(\alpha+\delta\eta(\omega_{2}^{\prime}))|=\delta|\eta(\omega_{2})-\eta(\omega_{2}^{\prime})|\leq\delta_{0}|\eta(\omega_{2})-\eta(\omega_{2}^{\prime})|=L_{2,\alpha}\,r_{2}(\eta(\omega_{2}),\eta(\omega_{2}^{\prime}))$$

for each $\alpha\in\Gamma$ and for all $\eta(\omega_{2}),\eta(\omega_{2}^{\prime})\in S_{2}$, where $L_{2,\alpha}\coloneqq\delta_{0}$.

By Theorem 1, the following inequality holds:

$$\Delta_{\varepsilon,\delta}((x,\alpha),\pi_{0}^{*})\leq\frac{2\beta\max\{x,\delta_{0}\}}{1-\beta}\left[\frac{1}{1-\beta}+\frac{\beta}{(1-\gamma)^{2}}\left(\frac{b\,\mu_{\gamma_{1}}}{\gamma_{1}(1-\beta\mu_{\gamma_{1}})}x^{\gamma_{1}}+1\right)\right]\hat{R},$$

where $\hat{R}=E\max\{|\xi(\varepsilon)-\xi(0)|,|\eta(\delta)-\eta(0)|\}$. By Theorem 2, the constant in the convergence rate of the optimal value function is

$$C_{1}=\frac{\beta\max\{x,\delta_{0}\}}{(1-\gamma)(1-\beta)}.$$

On the other hand, by Theorem 2(b), we have that

$$\sup_{(x,\alpha)\in K}|f_{\varepsilon,\delta}(x,\alpha)-f^{*}(x,\alpha)|\to 0$$

as $\varepsilon\to 0$ and $\delta\to 0$.

5.2 Control problem with small additive noise

Assume that the dynamics of the system are given by the difference equations:

(24) $x_{t+1}=\frac{1}{2}(\alpha_{t}x_{t}+a_{t}+\xi_{t}(\varepsilon)),\qquad\alpha_{t+1}=h\alpha_{t}+\eta_{t}(\delta),$

$t=0,1,2,\ldots$, where $0<h\leq 1$ and $\{\xi_{t}(\varepsilon)\}$, $\{\eta_{t}(\delta)\}$ are sequences of independent and identically distributed random variables (i.i.d.r.v.) that take values in $S_{1}=[0,\frac{B}{3}]$ and $S_{2}=[0,\frac{1}{2}]$, respectively. The $x$-states space is $X=[0,B]$, where $0<B<6(\beta^{-1}-1)$, with $\beta$ the discount factor, and the $\alpha$-states space is $\Gamma=[0,1]$, i.e., $0\leq\alpha\leq 1$. The control space is $A=[0,\frac{B}{3}]$. The set of feasible controls in the state $(x,\alpha)$ is $A(x,\alpha)=[0,x\alpha]$, and the cost function is

$$c(x,\alpha,a)=x\alpha-a,\qquad(x,\alpha,a)\in K.$$

Remark 6

In particular, if $\eta_{t}(0)=s_{2}=0$ for $t=0,1,\ldots$, and $h=1$ in equation (24), the corresponding deterministic MDP has a constant parameter $\alpha$.

For this example, Condition 1 is verified immediately. Next, Conditions 2 and 3 are checked.

Consider $W:X\times\Gamma\to[1,\infty)$ defined by $W(x,\alpha)=x+1$ for all $(x,\alpha)\in X\times\Gamma$. Then, it follows that

$$c(x,\alpha,a)=x\alpha-a\leq x\alpha\leq x<x+1=W(x,\alpha)$$

for all $(x,\alpha,a)\in K$.

We also obtain that

(25) $\begin{aligned}
E\,W[H(x,\alpha,a,(\xi(\varepsilon),\eta(\delta)))]&=E\left[\tfrac{1}{2}(\alpha x+a+\xi(\varepsilon))+1\right]=\tfrac{1}{2}(\alpha x+a+E\xi(\varepsilon))+1\\
&\leq\tfrac{1}{2}\left(2\alpha x+\tfrac{B}{3}\right)+1\leq\tfrac{1}{2}\left(2x+2+\tfrac{B}{3}\right)=x+1+\tfrac{B}{6}\\
&\leq\left(1+\tfrac{B}{6}\right)(x+1)=\tfrac{\gamma}{\beta}W(x,\alpha)
\end{aligned}$

for all $(x,\alpha,a)\in K$, where $\gamma=\beta\left(1+\frac{B}{6}\right)$. It is clear that $\gamma>\beta$. Moreover, since $B<6(\beta^{-1}-1)$, it follows that $\gamma\in(\beta,1)$.

From the second equality in equation (25), we observe that the function $E\,W[H(x,\alpha,a,\chi(\varepsilon,\delta))]$ is continuous on $K$. Therefore, Condition 2 is satisfied.
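The drift inequality (25) can also be checked numerically. The sketch below samples points of $K$ and verifies $E\,W[H(\cdot)]\leq(\gamma/\beta)W$ pointwise in the worst case $E\xi(\varepsilon)=B/3$; the values of $\beta$ and $B$ are arbitrary choices satisfying $B<6(\beta^{-1}-1)$.

```python
import random

# Numerical sanity check (illustration only) of the drift inequality (25):
# with W(x, alpha) = x + 1 and gamma = beta * (1 + B/6), one should have
# E W(H(x, alpha, a, noise)) <= (gamma / beta) * W(x, alpha) on K.
# Since E xi(eps) <= B/3 for any law supported on S1 = [0, B/3], it suffices
# to test the worst case E xi(eps) = B/3. beta and B are sample values.
beta, B = 0.9, 0.5
assert B < 6.0 * (1.0 / beta - 1.0)       # guarantees gamma < 1
gamma = beta * (1.0 + B / 6.0)

rng = random.Random(1)
ok = True
for _ in range(1000):
    x = rng.uniform(0.0, B)
    alpha = rng.random()
    a = rng.uniform(0.0, x * alpha)        # feasible: A(x, alpha) = [0, x*alpha]
    ew = 0.5 * (alpha * x + a + B / 3.0) + 1.0
    ok = ok and ew <= (gamma / beta) * (x + 1.0) + 1e-12
print(ok)
```

Every sampled point satisfies the bound, mirroring the chain of inequalities in (25).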

Finally, Condition 3 is verified. Note that

$$|c(x,\alpha,a)-c(x^{\prime},\alpha^{\prime},a)|=|(x\alpha-a)-(x^{\prime}\alpha^{\prime}-a)|=|x\alpha-x^{\prime}\alpha^{\prime}|\leq\max\{|x-x^{\prime}|,|\alpha-\alpha^{\prime}|\}=L_{0}\,d_{1}((x,\alpha),(x^{\prime},\alpha^{\prime}))$$

for all $(x,\alpha,a),(x^{\prime},\alpha^{\prime},a)\in K$, where $L_{0}=1$.

The following inequalities are valid for the joint dynamics of the states:

$$d_{1}(H(x,\alpha,a,\chi(\varepsilon,\delta)(\omega_{1},\omega_{2})),H(x^{\prime},\alpha^{\prime},a,\chi(\varepsilon,\delta)(\omega_{1},\omega_{2})))=\max\left\{\tfrac{1}{2}|\alpha x-\alpha^{\prime}x^{\prime}|,h|\alpha-\alpha^{\prime}|\right\}\leq\max\left\{\tfrac{1}{2}|x-x^{\prime}|,h|\alpha-\alpha^{\prime}|\right\}\leq L_{1}\,d_{1}((x,\alpha),(x^{\prime},\alpha^{\prime}))$$

for all $(x,\alpha,a)\in K$ and $\chi(\varepsilon,\delta)(\omega_{1},\omega_{2})\in S_{1}\times S_{2}$, where $L_{1}\coloneqq\max\{\tfrac{1}{2},h\}\leq 1$.

Finally, we verify the Lipschitz conditions for the functions $F$ and $G$ with respect to the disturbance variables:

$$d_{x}(F(x,\alpha,a,\xi(\omega_{1})),F(x,\alpha,a,\xi(\omega_{1}^{\prime})))=\left|\tfrac{1}{2}(\alpha x+a+\xi(\omega_{1}))-\tfrac{1}{2}(\alpha x+a+\xi(\omega_{1}^{\prime}))\right|=\tfrac{1}{2}|\xi(\omega_{1})-\xi(\omega_{1}^{\prime})|\leq L_{2,x}\,r_{1}(\xi(\omega_{1}),\xi(\omega_{1}^{\prime}))$$

for each $(x,\alpha,a)\in K$ and for all $\xi(\omega_{1}),\xi(\omega_{1}^{\prime})\in S_{1}$, where $L_{2,x}\coloneqq\tfrac{1}{2}$, and

$$d_{\alpha}(G(\alpha,\eta(\omega_{2})),G(\alpha,\eta(\omega_{2}^{\prime})))=|(h\alpha+\eta(\omega_{2}))-(h\alpha+\eta(\omega_{2}^{\prime}))|=|\eta(\omega_{2})-\eta(\omega_{2}^{\prime})|\leq L_{2,\alpha}\,r_{2}(\eta(\omega_{2}),\eta(\omega_{2}^{\prime}))$$

for each $\alpha\in\Gamma$ and for all $\eta(\omega_{2}),\eta(\omega_{2}^{\prime})\in S_{2}$, where $L_{2,\alpha}\coloneqq 1$. Therefore, Condition 3 is satisfied.

By Theorem 1, it follows that

$$\Delta_{\varepsilon,\delta}((x,\alpha),\pi_{0}^{*})\leq\frac{2\beta}{1-\beta\max\{\frac{1}{2},h\}}\left[\frac{1}{1-\beta}+\frac{\beta}{\left(1-\beta\left(1+\frac{B}{6}\right)\right)^{2}}(x+1)\right]\hat{R},$$

where $\hat{R}=E\max\{|\xi(\varepsilon)-\xi(0)|,|\eta(\delta)-\eta(0)|\}$. On the other hand, the constant in the convergence rate of the optimal value function given by Theorem 2 is

$$C_{1}=\frac{\beta}{\left(1-\beta\left(1+\frac{B}{6}\right)\right)\left(1-\beta\max\{\frac{1}{2},h\}\right)}.$$

In addition, by part (b) of Theorem 2, we have that $f_{\varepsilon,\delta}(x,\alpha)\to f^{*}(x,\alpha)$ when $\varepsilon\to 0$ and $\delta\to 0$.
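A small grid computation makes the continuity in the noise visible for the model (24). Theorem 2 concerns optimal value functions, but for this particular cost the optimal control $a=x\alpha$ drives every value function to zero, so as an illustration we instead evaluate the (arbitrary, suboptimal) stationary policy $a\equiv 0$ and watch its expected discounted cost approach the deterministic one as the noise parameters shrink. The grids, the two-point noise laws (with $\xi(0)=\eta(0)=0$), and the clipping in the interpolation are all assumptions made for this sketch.

```python
beta, B, h = 0.8, 1.0, 0.5          # sample values with B < 6*(1/beta - 1)
NX, NG, ITERS = 13, 9, 80           # grid sizes and fixed-point iterations

def interp(V, x, al):
    """Bilinear interpolation on the (x, alpha) grid, clipped to [0,B]x[0,1]."""
    fx = min(max(x, 0.0), B) / B * (NX - 1)
    fa = min(max(al, 0.0), 1.0) * (NG - 1)
    i, j = min(int(fx), NX - 2), min(int(fa), NG - 2)
    tx, ta = fx - i, fa - j
    return ((1 - tx) * (1 - ta) * V[i][j] + tx * (1 - ta) * V[i + 1][j]
            + (1 - tx) * ta * V[i][j + 1] + tx * ta * V[i + 1][j + 1])

def evaluate(eps, delta):
    """Iterate V <- c + beta * E V(next) for the policy a = 0 under (24)."""
    xi_vals = [0.0, eps * B / 3.0]   # assumed two-point law on S1 = [0, B/3]
    eta_vals = [0.0, delta / 2.0]    # assumed two-point law on S2 = [0, 1/2]
    V = [[0.0] * NG for _ in range(NX)]
    for _ in range(ITERS):
        V = [[x * al + beta * sum(
                  0.25 * interp(V, 0.5 * (al * x + xi), h * al + eta)
                  for xi in xi_vals for eta in eta_vals)
              for al in (j / (NG - 1) for j in range(NG))]
             for x in (B * i / (NX - 1) for i in range(NX))]
    return V

V0 = evaluate(0.0, 0.0)
err = {e: max(abs(evaluate(e, e)[i][j] - V0[i][j])
              for i in range(NX) for j in range(NG)) for e in (0.05, 0.4)}
print(err[0.05] <= err[0.4])         # smaller noise, smaller deviation
```

Because the assumed noise law for the larger parameter stochastically dominates the smaller one and the system is monotone, the sup-norm deviation from the deterministic value shrinks with the noise, in line with the linear-in-$\hat{\delta}_{\varepsilon,\delta}$ rate above.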

5.3 Importance of conditions

Finally, we present two examples in which Condition 2 and Condition 3, respectively, are not satisfied and, therefore, the conclusions of Theorem 1 are not reached.

Example 1

Let $X=[0,\infty)$, $\Gamma=[0,1]$, $A=[0,\frac{1}{\beta}]$, and $\varepsilon,\delta\in[0,1]$, and let the one-stage cost function be given as follows:

$$c(x,\alpha,0)=1,\ (x,\alpha)\in X\times\Gamma;\qquad
c(x,\alpha,a)=\begin{cases}a,&x\in[0,1],\ \alpha\in\Gamma,\\ a+x-1,&x>1,\ \alpha\in\Gamma,\end{cases}\ \ a\in\left(0,\tfrac{1}{\beta}\right);\qquad
c\left(x,\alpha,\tfrac{1}{\beta}\right)=\begin{cases}0,&x\in[0,1],\ \alpha\in\Gamma,\\ x-1,&x>1,\ \alpha\in\Gamma.\end{cases}$$

Consider the difference equations:

(26) $x_{t+1}=a_{t}x_{t}+\varepsilon\alpha_{t}\xi_{t},\qquad\alpha_{t+1}=k\alpha_{t}+\delta\eta_{t},$

$t=0,1,\ldots$, where $\{\xi_{t}\}$ is a sequence of i.i.d.r.v. with uniform distribution over $(0,1)$, $\eta_{t}=0$ for $t=1,2,\ldots$, and $k<1$. The deterministic approximation of the process (26) is given by the following equations:

(27) $x_{t+1}=a_{t}x_{t},\qquad\alpha_{t+1}=k\alpha_{t},$

$t=0,1,\ldots$. Consider $x_{0}=0$ and $\alpha_{0}=1$; then, for any control policy in (27), we have $(x_{t},\alpha_{t})=(0,k^{t})$, $t=1,2,\ldots$. Therefore, the policy $\pi_{0}^{*}=\{\frac{1}{\beta},\frac{1}{\beta},\ldots\}$ attains the minimum value $\hat{V}((0,1),\pi_{0}^{*})=V((0,1))=0$. Now, if the policy $\pi_{0}^{*}$ is applied in equation (26) with initial state $(0,1)$, it is obtained that

(28) $x_{t}=\frac{1}{\beta^{t}}+\frac{\varepsilon k}{\beta^{t-1}}\sum_{i=0}^{t-1}(\beta k)^{i}\xi_{i+1},$

$t=1,2,\ldots$. Note that the first term on the right-hand side of equation (28) is greater than $1$, so $x_{t}>1$ for $t=1,2,\ldots$. Since $c(x,\alpha,\frac{1}{\beta})=x-1$ for $x>1$, using equation (28) it follows that

$$\beta^{t}E_{(0,1)}^{\pi_{0}^{*}}\left[c\left(x_{t},\alpha_{t},\tfrac{1}{\beta}\right)\right]=\beta^{t}E_{(0,1)}^{\pi_{0}^{*}}\left[\frac{1}{\beta^{t}}+\frac{\varepsilon k}{\beta^{t-1}}\sum_{i=0}^{t-1}(\beta k)^{i}\xi_{i+1}-1\right]=1+\frac{\varepsilon k\beta}{2}\sum_{i=0}^{t-1}(\beta k)^{i}-\beta^{t}\geq\frac{\varepsilon k\beta}{2}\sum_{i=0}^{t-1}(\beta k)^{i}=\frac{\varepsilon k\beta}{2}\cdot\frac{1-(k\beta)^{t}}{1-k\beta},$$

for $t=1,2,\ldots$, where $E\xi_{i+1}=\frac{1}{2}$ and $\beta^{t}\leq 1$ are used. Now, for each $\varepsilon_{1}\in(0,\frac{1}{2})$, choose the stationary policy $\pi^{1}=\{\varepsilon_{1},\varepsilon_{1},\ldots\}$; by equation (26), we obtain that $x_{t}\in[0,1]$, $t=0,1,2,\ldots$. Note that

$$\hat{V}_{\varepsilon,\delta}((0,1),\pi^{1})=E_{(0,1)}^{\pi^{1}}\left[\sum_{t=0}^{\infty}\beta^{t}c(x_{t},\alpha_{t},a_{t})\right]=E_{(0,1)}^{\pi^{1}}\left[\sum_{t=0}^{\infty}\beta^{t}\varepsilon_{1}\right]=\frac{\varepsilon_{1}}{1-\beta}.$$

Thus,

$$0\leq V_{\varepsilon,\delta}((0,1))\leq\hat{V}_{\varepsilon,\delta}((0,1),\pi^{1})=\frac{\varepsilon_{1}}{1-\beta}\to 0$$

as $\varepsilon_{1}\to 0$; since $\varepsilon_{1}$ is arbitrary, $V_{\varepsilon,\delta}((0,1))=0$.

when ε 1 0 . On the other hand, observe that ( k β ) t k β for t 1 , then

Δ ε , δ ( ( 0 , 1 ) , π 1 ) = V ˆ ε , δ ( ( 0 , 1 ) , π 1 ) V ε , δ ( ( 0 , 1 ) ) = E ( 0 , 1 ) π 1 t = 0 β t c ( y t , α t , a t ) ε k β 2 t = 0 1 ( k β ) t 1 k β = ε k β 2 t = 1 1 ( k β ) t 1 k β k β 2 t = 1 1 k β 1 k β = .

Therefore, Δ ε , δ ( ( 0 , 1 ) , π 1 ) = .

In this example, Condition 2 is not satisfied, in particular, there does not exist a continuous function W : Y [ 1 , ) such that c ( y , a ) W ( y ) , for ( y , a ) K . In this case, it happens that Δ ε , δ ( ( 0 , 1 ) , π 1 ) = , for any ε , δ > 0 .
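The divergence in Example 1 is also visible numerically: each discounted one-stage cost under $\pi_{0}^{*}$ is bounded below by $\frac{\varepsilon k\beta}{2}\cdot\frac{1-(k\beta)^{t}}{1-k\beta}$, which tends to a positive constant instead of vanishing, so the partial sums grow without bound. The values of $\beta$, $k$, and $\varepsilon$ below are arbitrary sample choices.

```python
# Partial sums of the lower bound on the discounted costs in Example 1.
# beta, k, eps are illustrative sample values with k*beta < 1.
beta, k, eps = 0.9, 0.5, 0.1

def lower_bound(t):
    # per-stage lower bound derived after equation (28)
    return (eps * k * beta / 2.0) * (1.0 - (k * beta) ** t) / (1.0 - k * beta)

partials = [sum(lower_bound(t) for t in range(1, n + 1)) for n in (10, 100, 1000)]
print(partials[0] < partials[1] < partials[2])   # no sign of convergence
```

The per-stage bound approaches the constant $\frac{\varepsilon k\beta}{2(1-k\beta)}>0$, so the partial sums grow roughly linearly in the horizon, consistent with an infinite stability index.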

Example 2

Let $X=\mathbb{R}$, $\Gamma=[0,\infty)$, $A=\{0,1\}$, and $\varepsilon,\delta\in[0,1]$, and for $i\in\{0,1\}$, define the one-stage cost function as follows:

$$c(x,\alpha,i)=\begin{cases}1,&x>0,\ \alpha\in\Gamma,\\ 3,&\text{otherwise.}\end{cases}$$

In addition, consider the difference equations:

(29) $x_{t+1}=x_{t}\alpha_{t}-(a_{t}-\varepsilon\xi_{t}),\qquad\alpha_{t+1}=h\alpha_{t}-\delta\eta_{t},$

$t=0,1,\ldots$, where $\{\xi_{t}\}$ is a sequence of random variables with standard normal distribution, $\{\eta_{t}\}$ is a sequence of random variables with exponential distribution with parameter $1$, and $h>0$. The deterministic approximation of the process (29) is given by the following equations:

(30) $x_{t+1}=\alpha_{t}x_{t}-a_{t},\qquad\alpha_{t+1}=h\alpha_{t},$

$t=0,1,\ldots$. Consider the initial states $x_{0}=1$ and $\alpha_{0}=1$. Under this framework, the policy $\pi_{0}^{*}=\{0,0,\ldots\}$ is optimal for the deterministic process (30), and for $\varepsilon,\delta>0$, $\hat{V}_{\varepsilon,\delta}((1,1),\pi_{0}^{*})=\frac{1}{1-\beta}$. On the other hand, $\hat{V}_{\varepsilon,\delta}((1,1),\pi^{1})=\frac{3}{1-\beta}$, with $\pi^{1}=\{1,1,\ldots\}$. Therefore,

$$\Delta_{\varepsilon,\delta}((1,1),\pi^{1})=\hat{V}_{\varepsilon,\delta}((1,1),\pi^{1})-V_{\varepsilon,\delta}((1,1))\geq\hat{V}_{\varepsilon,\delta}((1,1),\pi^{1})-\hat{V}_{\varepsilon,\delta}((1,1),\pi_{0}^{*})=\frac{3}{1-\beta}-\frac{1}{1-\beta}=\frac{2}{1-\beta}.$$

In this example, Condition 3 is not satisfied; in particular, the cost function $c$ is not a Lipschitz function. In this case, we conclude that $\Delta_{\varepsilon,\delta}((1,1),\pi^{1})\geq\frac{2}{1-\beta}$ even though $\hat{\delta}_{\varepsilon,\delta}\to 0$ when $\varepsilon\to 0$ and $\delta\to 0$.

Finally, note that if $L_{1}>1$ in Condition 3(b), it is not possible to guarantee the existence of an upper bound for $\Delta_{\varepsilon,\delta}(y,\pi_{0}^{*})$. Moreover, it is also not possible to determine a convergence rate for the optimal value function when $\beta L_{1}>1$.

6 Conclusions

In this article, we established conditions under which the optimal value function and the optimal policy of a family of MDPs indexed by the parameters $\varepsilon$ and $\delta$ converge to the optimal value function and the optimal policy of an adequate deterministic MDP when $\varepsilon\to 0$ and $\delta\to 0$. These MDPs evolve according to two coupled difference equations. The first equation describes the evolution of the $x$-states through the function $F$ appearing in equation (1), while the second describes the evolution of a parameter of the model (see equation (2)). The main results of the article are Theorems 1 and 2. Theorem 1 provides an upper bound for the stability index, while Theorem 2 establishes the convergence of the sequences $\{V_{\varepsilon,\delta}\}$ and $\{f_{\varepsilon,\delta}\}$ to $V$ and $f^{*}$, respectively, as $\varepsilon$ and $\delta$ go to zero. Finally, the developed theory was illustrated with two examples confirming the conclusions of the main results. A direct consequence of Theorem 1 is that the optimal policy of the deterministic problem is asymptotically optimal for the stochastic problem. On the other hand, the results of Theorem 2 allow us to construct approximations for stochastic systems using the perturbation method. Such a methodology is well established in the literature on economic growth models for stochastic systems whose dynamics are described by a single equation of $x$-states; see, e.g., [25]. However, for stochastic systems with two coupled difference equations, the research is still ongoing.

Acknowledgments

The authors are deeply grateful to the reviewers and the Associate Editor for their careful reading of the original manuscript and for their advice to improve the paper.

Conflict of interest: The authors state that there are no conflicts of interest.

References

[1] D. P. Bertsekas and S. E. Shreve, Stochastic Optimal Control: The Discrete-Time Case, Athena Scientific, United States of America, 1996. Search in Google Scholar

[2] E. A. Feinberg and A. Shwartz, Handbook of Markov Decision Processes: Methods and Applications, Springer Science & Business Media, New York, 2012. Search in Google Scholar

[3] O. Hernández-Lerma and J. B. Lasserre, Discrete-Time Markov Control, Processes: Basic Optimality Criteria, Springer-Verlag, New York, 1996. Search in Google Scholar

[4] O. Hernández-Lerma and J. B. Lasserre, Further Topics on Discrete-Time Markov Control Processes, Springer-Verlag, New York, 1999. Search in Google Scholar

[5] M. L. Puterman, Markov Decision Processes, Wiley Interscience, Hoboken, New Jersey, 1994. Search in Google Scholar

[6] R. J. Boucherie and N. M. Van Dijk, Markov Decision Processes in Practice, Springer International Publishing, Cham, Switzerland, 2017. Search in Google Scholar

[7] D. Hernández-Hernández and J. A. Minjárez-Sosa, Optimization, Control, and Applications of Stochastic Systems, Springer Science & Business Media, New York Heidelberg Dordrecht London, 2012. Search in Google Scholar

[8] R. Bellman, Dynamic Programming, Dover Publications, United States of America, 2003. Search in Google Scholar

[9] H. Cruz-Suárez and R. Ilhuicatzi-Roldán, Stochastic optimal control for small noise intensities: The discrete-time case, WSEAS Trans. Math. 9 (2010), no. 2, 120–129. Search in Google Scholar

[10] E. Gordienko, E. Lemus-Rodríguez, and R. Montes-de-Oca, Discounted cost optimality problem: stability with respect to weak metrics, Math. Methods Oper. Res. 68 (2008), no. 1, 77–96, DOI: https://doi.org/10.1007/s00186-007-0171-z. Search in Google Scholar

[11] R. S. Liptser, W. J. Runggaldier, and M. Taksar, Deterministic approximation for stochastic control problems, SIAM J. Control Optim. 34 (1996), no. 1, 161–178, DOI: https://doi.org/10.1137/S0363012993254540. Search in Google Scholar

[12] P. Dupuis and H. J. Kushner, Stochastic systems with small noise, analysis and simulation; a phase locked loop example, SIAM J. Appl. Math. 47 (1987), no. 3, 643–661, https://www.jstor.org/stable/2101805. Search in Google Scholar

[13] H. Cruz-Suárez, E. Gordienko, and R. Montes-de-Oca, A note on deterministic approximation of discounted Markov decision processes, Appl. Math. Lett. 22 (2009), no. 8, 1252–1256, DOI: https://doi.org/10.1016/j.aml.2009.01.039. Search in Google Scholar

[14] A. D. Kara and S. Yüksel, Robustness to incorrect system models in stochastic control, arXiv:1803.06046, 2020, https://doi.org/10.48550/arXiv.1803.06046.Search in Google Scholar

[15] J. González-Hernández, R. R. López-Martínez, and J. A. Minjárez-Sosa, Adaptive policies for stochastic systems under a randomized discount criterion, Bol. Soc. Mat. Mex. 14 (2008), no. 1, 149–163. Search in Google Scholar

[16] J. González-Hernández, R. R. López-Martínez, J. A. Minjárez-Sosa, and J. A. Gabriel-Arguelles, Constrained Markov control processes with randomized discounted cost criteria: occupation measures and extremal points, Risk Decis. Anal. 4 (2013), no. 3, 163–176, DOI: https://doi.org/10.3233/RDA-2012-0063. Search in Google Scholar

[17] J. González-Hernández, R. R. López-Martínez, and J. A. Minjárez-Sosa, Approximation, estimation and control of stochastic systems under a randomized discounted cost criterion, Kybernetika 45 (2009), no. 5, 737–754, http://eudml.org/doc/37698. Search in Google Scholar

[18] J. González-Hernández, R. R. López-Martínez, and J. R. Pérez-Hernández, Markov control processes with randomized discounted cost in Borel space, Math. Methods Oper. Res. 65 (2007), no. 1, 27–44, DOI: https://doi.org/10.1007/s00186-006-0092-2. Search in Google Scholar

[19] K. Hinderer, Lipschitz continuity of value functions in Markovian decision processes, Math. Methods Oper. Res. 62 (2005), 3–22, DOI: https://doi.org/10.1007/s00186-005-0438-1. Search in Google Scholar

[20] R. Miculescu, Approximations by Lipschitz functions generated by extensions, Real Anal. Exchange 28 (2003), no. 1, 33–40, DOI: https://doi.org/10.14321/realanalexch.28.1.0033. Search in Google Scholar

[21] E. I. Gordienko, An estimate of the stability of optimal control of certain stochastic and deterministic systems, J. Sov. Math. 59 (1992), no. 4, 891–899, https://link.springer.com/content/pdf/10.1007/BF01099115.pdf. Search in Google Scholar

[22] E. I. Gordienko and F. S. Salem, Robustness inequality for Markov control processes with unbounded costs, Systems Control Lett. 33 (1998), no. 2, 125–130, DOI: https://doi.org/10.1016/S0167-6911(97)00077-7. Search in Google Scholar

[23] H. Cruz-Suárez, R. Montes-De-Oca, and G. Zacarías, A consumption-investment problem modelled as a discounted Markov decision process, Kybernetika 47 (2011), no. 6, 909–929, http://dml.cz/dmlcz/141734. Search in Google Scholar

[24] D. Cruz-Suárez, R. Montes-de-Oca, and F. Salem-Silva, Conditions for the uniqueness of optimal policies of discounted Markov decision processes, Math. Methods Oper. Res. 60 (2004), no. 3, 415–436, DOI: https://doi.org/10.1007/s001860400372. Search in Google Scholar

[25] K. L. Judd, Numerical Methods in Economics, MIT Press, United States of America, 1998. Search in Google Scholar

Received: 2023-01-31
Revised: 2023-09-13
Accepted: 2023-09-14
Published Online: 2023-10-24

© 2023 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
