
Stability estimation of some Markov controlled processes

  • Evgueni Gordienko and Juan Ruiz de Chavez
Published/Copyright: November 24, 2022

Abstract

We consider a discrete-time Markov controlled process endowed with the expected total discounted reward. We assume that the distribution of the underlying random vectors is unknown and that it is approximated by an appropriate known distribution. We find upper bounds on the decrease in reward that occurs when the policy optimal for the approximating process is applied to control the original process.

MSC 2010: 90B05; 90C31; 90C40; 93E20

1 Introduction

In the theory of discrete-time Markov processes, the term “stability” is used in various meanings. First of all, for uncontrolled processes, it refers to certain recurrence or ergodicity properties of the processes (see, e.g., [1]).

Quite a long time ago, this concept moved into the field of controlled processes, particularly, in the context of adaptive control. (Among the huge number of references, we indicate only a couple of fairly recent ones [2,3].)

The second widely used meaning of the word “stability” is close to “continuity.” Speaking of the quantitative approach to such continuity under perturbations of certain parameters, the deviations of some basic characteristics of the Markov processes (such as the limiting distribution) are estimated.

Using probability metrics, the methods of quantitative continuity of uncontrolled processes have been developed, for instance, in the works [4,5,6].

The quantitative assessment of the stability (or “continuity”) of optimal control of a Markov process has its own peculiarities. Here, the policy that is optimal for a certain “approximating process” is used to control the original (“real”) process. The underlying probability distributions of the latter are unknown and are often evaluated by statistical procedures. Such estimation leads to what we have designated as the “approximating controlled process.”

The problem is posed as finding upper bounds for the stability index, defined in (2.7) in Section 2, which expresses the decrease in the given performance index compared with applying the policy that is optimal for the original process. This problem was probably first considered in [7,8]. Since then, the authors just mentioned and others have been solving this problem for various classes of discrete-time Markov controlled processes and for different performance indexes (optimization criteria).

In this article, we consider Markov control processes with general state and action spaces, choosing the expected total discounted reward as an optimization criterion. Thus, the results given in Section 3 are related to those obtained in the previous articles [9,10,11]. In contrast to the problem setting in these articles, we focus our attention on the controlled processes with bounded one-step rewards. This allows us to obtain new stability inequalities using both the total variation metric and the Dudley metric.

The total variation distance works well under the standard compactness-continuity conditions, but to obtain the corresponding stability inequality in terms of the Dudley metric, we have to impose additional Lipschitz continuity conditions.

The Dudley metric is convenient in an important situation where the nonparametric approach is applied, i.e., when unknown probability distributions are approximated by empirical distributions (see, e.g., [12]).

It should be noted that the problem of estimating the stability of optimal control considered in this article is closely related to the problem of adaptive control of Markov processes. In the adaptive formulation, the control is accompanied by some estimation procedure, and the current control policies should approximate the optimal ones as the distribution (or its parameters) is refined. For the development of adaptive algorithms, quantitative estimates of the “stability of optimal control” can be useful. Among the vast literature, the works [13,14,15,16,17] use the expected total discounted reward as the optimization criterion and discuss the application of nonparametric estimation of “governing distributions.”

2 Setting of the problem

We consider a discrete-time Markov controlled process of the form:

(2.1) $X_t = F(X_{t-1}, a_t, \xi_t), \quad t = 1, 2, \ldots,$

where $X_t \in X$ is the state of the process at time $t$, and $\xi_1, \xi_2, \ldots$ is a sequence of independent and identically distributed (i.i.d.) random vectors with values in a complete separable metric space $(S, \rho)$. Let $A$ be a given action set. If $X_{t-1} = z \in X$, then the control (action) $a_t$ is selected from a designated compact subset $A(z) \subset A$. We assume that $X$ and $A$ are complete separable metric spaces (which are, in particular, Borel spaces). The metric in $X$ will be denoted by $d$. Finally, $F: X \times A \times S \to X$ is a measurable function.
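As a concrete illustration (not part of the original model), the recursion (2.1) can be simulated for any choice of transition map, policy, and noise law; all three below are hypothetical placeholders used only to show the scheme:

```python
import random

def F(x, a, xi):
    # Hypothetical transition map F(x, a, xi): the action a is subtracted from
    # the state, then the noise xi is added. Any measurable F fits scheme (2.1).
    return max(x - a, 0.0) + xi

def simulate(x0, f, sample_noise, T):
    """Generate X_0, ..., X_T via X_t = F(X_{t-1}, a_t, xi_t), a_t = f(X_{t-1})."""
    path = [x0]
    for _ in range(T):
        a = f(path[-1])                      # stationary policy: a_t = f(X_{t-1})
        path.append(F(path[-1], a, sample_noise()))
    return path

random.seed(0)
path = simulate(x0=5.0, f=lambda x: min(x, 1.0),
                sample_noise=lambda: random.uniform(0.0, 1.5), T=10)
print(len(path))  # 11 states: X_0, ..., X_10
```

The stationary policy here is the measurable function $f$ of the current state only, matching the definition that follows.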

A sequence $\pi = (a_1, \ldots, a_t, \ldots)$, where the control $a_t$ at time $t$ is a measurable function of the current state $X_{t-1}$ and may also depend on previous states and actions, is called a control policy, or simply a policy. A policy $\pi$ is called stationary, and denoted by $f$, if there is a measurable function $f: X \to A$ such that $a_t = f(X_{t-1}) \in A(X_{t-1})$, $t = 1, 2, \ldots$

We denote by:

  1. $\Pi$ the set of all policies;

  2. $\mathbb{F}$ the set of all stationary policies.

A policy optimization criterion, in our setting, is the expected total discounted reward:

(2.2) $V(x, \pi) = E_x^{\pi} \sum_{t=1}^{\infty} \alpha^{t-1} r(X_{t-1}, a_t), \quad \pi \in \Pi, \; x \in X,$

where $E_x^{\pi}$ denotes the expectation with respect to the probability measure corresponding to the application of the policy $\pi$ with initial state $x \in X$ (see, e.g., [18] for the construction of the corresponding probability space); $r(z,a)$ is the one-step reward received when the process is in state $z$ and action $a$ is selected, and $\alpha \in (0,1)$ is a given discount factor.

Throughout the article, we will assume that r is a measurable bounded function, that is,

(2.3) $\sup_{(x,a) \in K} |r(x,a)| \le b < \infty.$

In this inequality and further on, $K \stackrel{\mathrm{def}}{=} \{(x,a) \in X \times A : a \in A(x)\}$, which is supposed to be a measurable subset of $X \times A$.

A policy $\pi^*$ is called optimal if, for each $x \in X$,

(2.4) $V(x, \pi^*) = V^*(x) \stackrel{\mathrm{def}}{=} \sup_{\pi \in \Pi} V(x, \pi), \quad x \in X.$

In many applications, all components of the process, except for the distribution G of the random vector ξ 1 in (2.1), are known. For the distribution G , usually some approximation G ˜ is available (e.g., obtained from statistical data).

Although the controller is looking for the optimal policy $\pi^*$, she/he is forced to work with the following approximating controlled process:

(2.5) $\widetilde{X}_t = F(\widetilde{X}_{t-1}, \widetilde{a}_t, \widetilde{\xi}_t), \quad t = 1, 2, \ldots.$

The only difference between this process and the “original” process in (2.1) is that the i.i.d. random vectors $\widetilde{\xi}_1, \widetilde{\xi}_2, \ldots$ have the common distribution $\widetilde{G}$.

The expected total discounted reward $\widetilde{V}(x, \pi)$ for the process (2.5) is defined by formula (2.2), with $X_{t-1}, a_t$ replaced by $\widetilde{X}_{t-1}, \widetilde{a}_t$.

Let $B$ denote the space of all measurable bounded functions $u: X \to \mathbb{R}$, endowed with the uniform norm

$\|u\| \stackrel{\mathrm{def}}{=} \sup_{x \in X} |u(x)|.$

Let $\xi$ and $\widetilde{\xi}$ be generic vectors for $\xi_1, \xi_2, \ldots$ and $\widetilde{\xi}_1, \widetilde{\xi}_2, \ldots$, respectively.

Assumption 1

For each fixed $x \in X$:

  (a) the function $r(x, \cdot)$ is continuous on $A(x)$;

  (b) for every $u \in B$, the maps

    $a \mapsto E\, u[F(x,a,\xi)]$ and $a \mapsto E\, u[F(x,a,\widetilde{\xi})]$

    are continuous on $A(x)$.

The next assertion is well-known (see, e.g., [13, Ch. 2], and [19] for the proof).

Proposition 2.1

Under Assumption 1, there exist stationary policies $\pi^* \equiv f^*$ and $\widetilde{\pi}^* \equiv \widetilde{f}^*$, which are optimal for the “real” process (2.1) and for the approximating process (2.5), respectively.

In other words, (2.4) holds with $f^*$, and also

(2.6) $\widetilde{V}(x, \widetilde{f}^*) = \widetilde{V}^*(x) \stackrel{\mathrm{def}}{=} \sup_{\pi \in \Pi} \widetilde{V}(x, \pi), \quad x \in X.$

Remark 2.1

If we assume that $A$ is compact, $A(x) = A$ for all $x \in X$, and the one-step reward function $r(x,a)$ is continuous on $X \times A$, then Proposition 2.1 remains true if Assumption 1(b) is replaced by the following less restrictive condition:

Assumption 1

(b*): For each $x \in X$ and every continuous and bounded function $u: X \to \mathbb{R}$, the maps

$a \mapsto E\, u[F(x,a,\xi)]$ and $a \mapsto E\, u[F(x,a,\widetilde{\xi})]$

are continuous on $A$. (See [19] for the corresponding proof of Proposition 2.1.)

Assume that the controller can find the policy $\widetilde{f}^*$ and applies it to control the “original” process (2.1). In this way, $\widetilde{f}^*$ is used as a reasonable approximation to the unavailable policy $f^*$. We measure the accuracy of this approximation by evaluating the following stability index:

(2.7) $\Delta(x) \stackrel{\mathrm{def}}{=} V(x, f^*) - V(x, \widetilde{f}^*) \ge 0, \quad x \in X.$

The problem under consideration is to prove stability inequalities of the type:

$\sup_{x \in X} \Delta(x) \le C\, \mu(G, \widetilde{G}),$

where μ is either the total variation metric or the Dudley metric.

3 The results

First, we recall the definitions of two metrics on the space of distributions of random vectors with values in $(S, \mathcal{S})$. Here, $\mathcal{S}$ is the Borel $\sigma$-algebra of subsets of $S$.

The total variation metric $V$ (see, e.g., [20]):

If $\xi$ and $\widetilde{\xi}$ are random vectors with distributions $G$ and $\widetilde{G}$, then

(3.1) $V(G, \widetilde{G}) \stackrel{\mathrm{def}}{=} \sup_{\varphi \in B_1} |E \varphi(\xi) - E \varphi(\widetilde{\xi})|,$

where

$B_1 = \{\varphi: S \to \mathbb{R} : \varphi \text{ is measurable and } \|\varphi\| = \sup_{s \in S} |\varphi(s)| \le 1\}.$

The Dudley metric d (see [21]):

(3.2) $d(G, \widetilde{G}) \stackrel{\mathrm{def}}{=} \sup_{\varphi \in B_{1,L}} |E \varphi(\xi) - E \varphi(\widetilde{\xi})|,$

where

(3.3) $B_{1,L} = \left\{\varphi \in B_1 : \|\varphi\| + \sup_{s \ne s'} \frac{|\varphi(s) - \varphi(s')|}{\rho(s,s')} \le 1\right\},$ where $\rho$ is the metric in $S$.

It is well-known that the convergence in the metric d is equivalent to the weak convergence of distributions (see, e.g., [21]).
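For distributions on a finite support, the supremum in (3.1) is attained and equals $\sum_i |p_i - q_i|$, whereas the supremum in (3.2) is harder to reach exactly; it can, however, be bounded from below by restricting to a tractable one-parameter family inside $B_{1,L}$. A minimal numerical sketch (the support points and weights are arbitrary illustrations):

```python
def total_variation(p, q):
    """V(G, G~) from (3.1) for distributions p, q on a common finite support:
    the supremum over |phi| <= 1 equals sum_i |p_i - q_i|."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def dudley_lower_bound(xs, p, q, n_grid=200):
    """Lower bound on d(G, G~) from (3.2): maximize |E phi - E~ phi| over the
    family phi_c(s) = clip(s - c, -1, 1) / 2, which lies in B_{1,L} because
    its sup-norm (<= 1/2) plus its Lipschitz constant (<= 1/2) is <= 1."""
    best, lo, hi = 0.0, min(xs), max(xs)
    for k in range(n_grid + 1):
        c = lo + (hi - lo) * k / n_grid
        phi = [max(-1.0, min(1.0, x - c)) / 2.0 for x in xs]
        best = max(best, abs(sum(f * (pi - qi) for f, pi, qi in zip(phi, p, q))))
    return best

xs = [0.0, 1.0, 2.0, 3.0]
p  = [0.4, 0.3, 0.2, 0.1]
q  = [0.1, 0.2, 0.3, 0.4]
V = total_variation(p, q)
d_lb = dudley_lower_bound(xs, p, q)
print(V, d_lb)
```

Since $B_{1,L} \subset B_1$, one always has $d(G, \widetilde{G}) \le V(G, \widetilde{G})$, which the printed values respect.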

Theorem 1

Under (2.3) and Assumption 1,

(3.4) $\sup_{x \in X} \Delta(x) \le \frac{2\alpha b}{(1-\alpha)^2}\, V(G, \widetilde{G}).$

Proof

In view of Proposition 2.1, we can write (2.4) and (2.6) as follows ($x \in X$):

(3.5) $V(x, f^*) = V^*(x) = \sup_{f \in \mathbb{F}} V(x, f),$

(3.6) $\widetilde{V}(x, \widetilde{f}^*) = \widetilde{V}^*(x) = \sup_{f \in \mathbb{F}} \widetilde{V}(x, f).$

Then, for arbitrary $x \in X$, by (2.7), (3.5), and (3.6),

(3.7) $\Delta(x) \le |V(x, f^*) - \widetilde{V}(x, \widetilde{f}^*)| + |\widetilde{V}(x, \widetilde{f}^*) - V(x, \widetilde{f}^*)| = \left|\sup_{f \in \mathbb{F}} V(x,f) - \sup_{f \in \mathbb{F}} \widetilde{V}(x,f)\right| + |V(x, \widetilde{f}^*) - \widetilde{V}(x, \widetilde{f}^*)| \le 2 \sup_{f \in \mathbb{F}} |V(x,f) - \widetilde{V}(x,f)|.$

Let us fix an arbitrary stationary policy $f \in \mathbb{F}$ and define two operators $T_f: B \to B$ and $\widetilde{T}_f: B \to B$ as follows ($u \in B$):

(3.8) $T_f u(x) \stackrel{\mathrm{def}}{=} r(x, f(x)) + \alpha E\, u[F(x, f(x), \xi)], \quad x \in X,$

(3.9) $\widetilde{T}_f u(x) \stackrel{\mathrm{def}}{=} r(x, f(x)) + \alpha E\, u[F(x, f(x), \widetilde{\xi})], \quad x \in X.$

The following two facts are well-known (see, e.g., [13, Ch. 2]):

  1. The functions $V_f(\cdot) \stackrel{\mathrm{def}}{=} V(\cdot, f)$ and $\widetilde{V}_f(\cdot) \stackrel{\mathrm{def}}{=} \widetilde{V}(\cdot, f)$ (where “$\cdot$” stands for $x \in X$) belong to $B$, and moreover, they are fixed points of the operators $T_f$ and $\widetilde{T}_f$, that is,

    (3.10) $T_f V_f = V_f$ and $\widetilde{T}_f \widetilde{V}_f = \widetilde{V}_f.$

  2. The operators $T_f$ and $\widetilde{T}_f$ are contractive with modulus $\alpha$, that is ($u, v \in B$):

    (3.11) $\|T_f u - T_f v\| \le \alpha \|u - v\|; \quad \|\widetilde{T}_f u - \widetilde{T}_f v\| \le \alpha \|u - v\|.$

    Therefore,

    $\|V_f - \widetilde{V}_f\| = \|T_f V_f - \widetilde{T}_f \widetilde{V}_f\| \le \|T_f V_f - T_f \widetilde{V}_f\| + \|T_f \widetilde{V}_f - \widetilde{T}_f \widetilde{V}_f\| \le \alpha \|V_f - \widetilde{V}_f\| + \|T_f \widetilde{V}_f - \widetilde{T}_f \widetilde{V}_f\|.$

    Hence,

    (3.12) $\|V_f - \widetilde{V}_f\| \le \frac{1}{1-\alpha} \|T_f \widetilde{V}_f - \widetilde{T}_f \widetilde{V}_f\|.$

Let us estimate the second factor on the right-hand side of (3.12). By (3.8) and (3.9), we have

(3.13) $\|T_f \widetilde{V}_f - \widetilde{T}_f \widetilde{V}_f\| = \alpha \sup_{x \in X} |E\, \widetilde{V}_f[F(x, f(x), \xi)] - E\, \widetilde{V}_f[F(x, f(x), \widetilde{\xi})]|.$

Using the definition of $\widetilde{V}_f$ (i.e., (2.2) with $\widetilde{X}_t, \widetilde{a}_t$), we see that

(3.14) $\sup_{x \in X} |\widetilde{V}_f(x)| \le \sum_{t=1}^{\infty} \alpha^{t-1} b = \frac{b}{1-\alpha}.$

Thus, for each fixed $x$, the function $s \mapsto \widetilde{V}_f[F(x, f(x), s)]$ in (3.13) is bounded by $b(1-\alpha)^{-1}$. Applying definitions (3.1), (3.13), and (3.12), we find that

$\sup_{x \in X} |V_f(x) - \widetilde{V}_f(x)| \le \frac{\alpha b}{(1-\alpha)^2}\, V(G, \widetilde{G}).$

Combining the last inequality with (3.7), we obtain (3.4).□

In a fairly common situation, the unknown distribution $G$ is estimated by the empirical distribution $\widetilde{G}_n$ obtained from a sample $\xi_1, \xi_2, \ldots, \xi_n$. Excluding the case of discrete $G$, $V(G, \widetilde{G}_n)$ fails to approach zero as $n \to \infty$. Thus, in many situations, inequality (3.4) is useless. On the other hand, under mild conditions, we have:

$d(G, \widetilde{G}_n) \to 0$ almost surely, and $E\, d(G, \widetilde{G}_n) \to 0$ as $n \to \infty$

(see the end of this section).
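This convergence can be watched numerically. For a sample from $U(0,1)$ (an illustrative choice), every $\varphi \in B_{1,L}$ is in particular 1-Lipschitz, so $d(G, \widetilde{G}_n)$ is dominated by the Wasserstein distance $W_1(G, \widetilde{G}_n) = \int_0^1 |F_n(t) - t|\,dt$, which is exactly computable from the order statistics:

```python
import random

def w1_vs_uniform(sample):
    """W1 distance between the empirical distribution of `sample` (values in
    [0, 1]) and U(0, 1), via W1 = integral of |F_n(t) - t| over [0, 1].
    Every phi in B_{1,L} is 1-Lipschitz, so this dominates d(G, G_n)."""
    xs = sorted(sample)
    n = len(xs)
    pts = [0.0] + xs + [1.0]
    total = 0.0
    for i in range(n + 1):
        a, b, c = pts[i], pts[i + 1], i / n    # F_n equals c on (a, b)
        if c <= a:                             # integrate |c - t| over [a, b]
            total += ((b - c) ** 2 - (a - c) ** 2) / 2
        elif c >= b:
            total += ((c - a) ** 2 - (c - b) ** 2) / 2
        else:                                  # the diagonal crosses level c
            total += ((c - a) ** 2 + (b - c) ** 2) / 2
    return total

random.seed(0)
results = {}
for n in (10, 100, 1000):
    results[n] = sum(w1_vs_uniform([random.random() for _ in range(n)])
                     for _ in range(20)) / 20
    print(n, round(results[n], 4))   # averages shrink roughly like n**-0.5
```

A sanity check: for a single observation at $0.5$, the integral evaluates to exactly $0.25$.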

To obtain the stability inequality with the Dudley metric d on the right-hand side, we need additional Lipschitz conditions.

Assumption 2

  (a) There exist a constant $L_0$ and a measurable function $\bar{L}_1: S \to [0, \infty)$ such that:

    (3.15) $|r(x,a) - r(y,a)| \le L_0\, d(x,y)$ for all $(x,a), (y,a) \in K$;

    (3.16) $d[F(x,a,\xi), F(y,a,\xi)] \le \bar{L}_1(\xi)\, d(x,y)$ for all $(x,a), (y,a) \in K$,

    where $E\, \bar{L}_1(\xi) = L_1$ and $\alpha L_1 < 1$.

  (b) There is a constant $L < \infty$ such that for each $(x,a) \in K$ and $s, s' \in S$,

    (3.17) $d[F(x,a,s), F(x,a,s')] \le L\, \rho(s,s').$

  (c) $A$ is compact and $A(x) = A$ for all $x \in X$.

Theorem 2

Under Assumptions 1 and 2,

(3.18) $\sup_{x \in X} \Delta(x) \le \frac{2\alpha}{(1-\alpha)^2} \left[ \frac{b}{1-\alpha} + \frac{L_0 L}{1 - \alpha L_1} \right] d(G, \widetilde{G}),$

where $d$ is the Dudley metric defined in (3.2).

Proof

We define the operators $T: B \to B$ and $\widetilde{T}: B \to B$ as follows ($u \in B$):

(3.19) $T u(x) \stackrel{\mathrm{def}}{=} \sup_{a \in A} \{r(x,a) + \alpha E\, u[F(x,a,\xi)]\}, \quad x \in X,$

(3.20) $\widetilde{T} u(x) \stackrel{\mathrm{def}}{=} \sup_{a \in A} \{r(x,a) + \alpha E\, u[F(x,a,\widetilde{\xi})]\}, \quad x \in X.$

In [13, Ch. 2], it was proved that:

(1) (3.21) $V^* = T V^*$ and $\widetilde{V}^* = \widetilde{T} \widetilde{V}^*,$

where $V^*$ and $\widetilde{V}^*$, defined in (3.5) and (3.6), are the value functions of the process (2.1) and of the process (2.5), respectively.

(2) Both operators $T$ and $\widetilde{T}$ are contractive (with respect to $\|\cdot\|$) with modulus $\alpha$.

Let us define the number (generally belonging to $[0, \infty]$):

(3.22) $\mu(\xi, \widetilde{\xi}) \stackrel{\mathrm{def}}{=} \sup_{(x,a) \in K} |E\, V^*[F(x,a,\xi)] - E\, V^*[F(x,a,\widetilde{\xi})]|.$

The first step in the proof is to establish the following inequality:

(3.23) $\sup_{x \in X} \Delta(x) \le \frac{2\alpha}{(1-\alpha)^2}\, \mu(\xi, \widetilde{\xi}).$

For $(x,a) \in K$, let

(3.24) $H(x,a) \stackrel{\mathrm{def}}{=} r(x,a) + \alpha E\, V^*[F(x,a,\xi)],$

(3.25) $\widetilde{H}(x,a) \stackrel{\mathrm{def}}{=} r(x,a) + \alpha E\, \widetilde{V}^*[F(x,a,\widetilde{\xi})],$

and for each $t \ge 1$, let

$\Gamma_t = \{x, a_1; X_1, a_2; \ldots; X_{t-1}, a_t\}$

be the part of a trajectory of the process (2.1) obtained by applying the stationary policy $\widetilde{f}^*$.

By the Markov property of the process (2.1) (when a stationary policy is applied) and (3.24), we have:

$\zeta_t \stackrel{\mathrm{def}}{=} E^{\widetilde{f}^*}[\alpha V^*(X_t) \mid \Gamma_t] = H(X_{t-1}, a_t) - r(X_{t-1}, a_t) = H(X_{t-1}, a_t) - r(X_{t-1}, a_t) - \sup_{a \in A} H(X_{t-1}, a) + \sup_{a \in A} H(X_{t-1}, a).$

We can see from (3.24), (3.19), and (3.21) that

$\sup_{a \in A} H(X_{t-1}, a) = V^*(X_{t-1}).$

Hence,

(3.26) $\zeta_t = H(X_{t-1}, a_t) - \sup_{a \in A} H(X_{t-1}, a) - r(X_{t-1}, a_t) + V^*(X_{t-1}) = -\Lambda_t - r(X_{t-1}, a_t) + V^*(X_{t-1}),$

where

(3.27) $\Lambda_t \stackrel{\mathrm{def}}{=} \sup_{a \in A} H(X_{t-1}, a) - H(X_{t-1}, a_t).$

Now, rewriting (3.26) as

$V^*(X_{t-1}) - r(X_{t-1}, a_t) - \zeta_t = \Lambda_t$

and taking the expectation $E_x^{\widetilde{f}^*}$ of both sides, we obtain:

$E_x^{\widetilde{f}^*} V^*(X_{t-1}) - E_x^{\widetilde{f}^*} r(X_{t-1}, a_t) - \alpha E_x^{\widetilde{f}^*} V^*(X_t) = E_x^{\widetilde{f}^*} \Lambda_t.$

Multiplying the last equality by $\alpha^{t-1}$ and summing over $t = 1, 2, \ldots, n$, we obtain:

(3.28) $V^*(x) - \alpha^n E_x^{\widetilde{f}^*} V^*(X_n) - \sum_{t=1}^n \alpha^{t-1} E_x^{\widetilde{f}^*} r(X_{t-1}, a_t) = \sum_{t=1}^n \alpha^{t-1} E_x^{\widetilde{f}^*} \Lambda_t.$

By the same argument as in (3.14), $V^*$ is a bounded function. So, letting $n \to \infty$ in (3.28), the second term on the left-hand side tends to zero, while the third term approaches $V(x, \widetilde{f}^*)$. Therefore,

(3.29) $\Delta(x) = V^*(x) - V(x, \widetilde{f}^*) = \sum_{t=1}^{\infty} \alpha^{t-1} E_x^{\widetilde{f}^*} \Lambda_t.$

Since $\widetilde{f}^*$ is the optimal policy for the process (2.5), applying (3.20), (3.21), and (3.25), we easily find that

$\sup_{a \in A} \widetilde{H}(X_{t-1}, a) = \widetilde{H}(X_{t-1}, a_t).$

Hence, by (3.27),

$\Lambda_t = \sup_{a \in A} H(X_{t-1}, a) - \sup_{a \in A} \widetilde{H}(X_{t-1}, a) + \widetilde{H}(X_{t-1}, a_t) - H(X_{t-1}, a_t)$

and

$\Lambda_t \le 2 \sup_{a \in A} |H(X_{t-1}, a) - \widetilde{H}(X_{t-1}, a)| \le 2\alpha \sup_{a \in A} |E\, V^*[F(X_{t-1}, a, \xi)] - E\, \widetilde{V}^*[F(X_{t-1}, a, \widetilde{\xi})]|,$

where the expectation in the last term is taken with respect to the random vectors $\xi$ and $\widetilde{\xi}$ (with $X_{t-1}$ fixed). From the last inequality, we obtain:

(3.30) $\Lambda_t \le 2\alpha \sup_{a \in A} |E\, V^*[F(X_{t-1}, a, \xi)] - E\, V^*[F(X_{t-1}, a, \widetilde{\xi})]| + 2\alpha \sup_{a \in A} |E\, V^*[F(X_{t-1}, a, \widetilde{\xi})] - E\, \widetilde{V}^*[F(X_{t-1}, a, \widetilde{\xi})]|.$

The first term on the right-hand side of (3.30) is not greater than $2\alpha\, \mu(\xi, \widetilde{\xi})$ (see (3.22)), and the second term is not greater than $2\alpha \|V^* - \widetilde{V}^*\|$.

Using (3.21) and the contractive property of $\widetilde{T}$, we have

$\|V^* - \widetilde{V}^*\| \le \|\widetilde{T} \widetilde{V}^* - \widetilde{T} V^*\| + \|\widetilde{T} V^* - T V^*\| \le \alpha \|V^* - \widetilde{V}^*\| + \|T V^* - \widetilde{T} V^*\|,$

or (see (3.19), (3.20))

$\|V^* - \widetilde{V}^*\| \le \frac{\alpha}{1-\alpha} \sup_{x \in X} \sup_{a \in A} |E\, V^*[F(x,a,\xi)] - E\, V^*[F(x,a,\widetilde{\xi})]| \le \frac{\alpha}{1-\alpha}\, \mu(\xi, \widetilde{\xi}).$

The last inequality and (3.30) provide that for each $t \ge 1$,

$\Lambda_t \le 2\alpha \left[1 + \frac{\alpha}{1-\alpha}\right] \mu(\xi, \widetilde{\xi}) = \frac{2\alpha}{1-\alpha}\, \mu(\xi, \widetilde{\xi}).$

Substituting this in (3.29), we obtain (3.23).

The second step in the proof of the theorem is to show that under Assumption 2, in (3.23),

(3.31) $\mu(\xi, \widetilde{\xi}) \le \left[ \frac{b}{1-\alpha} + \frac{L_0 L}{1 - \alpha L_1} \right] d(G, \widetilde{G}).$

By (3.14), the function $V^*$ in (3.22) is bounded by $b(1-\alpha)^{-1}$. Now, we will show that for all $(x,a) \in K$ and $s, s' \in S$,

(3.32) $|V^*[F(x,a,s)] - V^*[F(x,a,s')]| \le \widetilde{L}\, \rho(s,s'),$

(3.33) where $\widetilde{L} = \frac{L_0 L}{1 - \alpha L_1}.$

First, we check that the value function $V^*: X \to \mathbb{R}$ satisfies the Lipschitz condition with the constant $L_0 / (1 - \alpha L_1)$.

Let $u_0 \equiv 0$ and let $T$ be the operator defined in (3.19). Also, set $u_1 = T u_0$. Then, for any $x, y \in X$,

(3.34) $|u_1(x) - u_1(y)| = \left|\sup_{a \in A} r(x,a) - \sup_{a \in A} r(y,a)\right| \le \sup_{a \in A} |r(x,a) - r(y,a)| \le L_0\, d(x,y),$

due to (3.15) in Assumption 2.

Now let $u_2 = T u_1$. Then, in view of (3.19),

$|u_2(x) - u_2(y)| \le \sup_{a \in A} \{|r(x,a) - r(y,a)| + \alpha |E\, u_1[F(x,a,\xi)] - E\, u_1[F(y,a,\xi)]|\} \le L_0\, d(x,y) + \alpha L_0 \sup_{a \in A} E\, d(F(x,a,\xi), F(y,a,\xi)) \le L_0 (1 + \alpha L_1)\, d(x,y).$

To obtain the last inequality, we made use of (3.34) and (3.16) in Assumption 2(a).

Letting $u_n = T u_{n-1}$, $n \ge 1$, it is proved by induction that for any $x, y$,

(3.35) $|u_n(x) - u_n(y)| \le L_0 [1 + \alpha L_1 + \cdots + (\alpha L_1)^{n-1}]\, d(x,y).$

Since $V^*$ is a fixed point of the contractive operator $T$, we have $\|V^* - T^n u_0\| \to 0$ as $n \to \infty$. We see from (3.35) that for every $n \ge 1$, the function $u_n = T^n u_0$ is Lipschitz with the constant $\hat{L} = L_0 / (1 - \alpha L_1)$. Consequently, $V^*$ satisfies the Lipschitz condition with the constant $\hat{L}$.

To verify (3.32), observe that by Assumption 2(b), the function $\varphi(s) = V^*[F(x,a,s)]$ is a composition of two Lipschitz functions, hence Lipschitz with the constant $\widetilde{L} = \hat{L} L$ in (3.33).

Note that $\|\varphi\| \le b(1-\alpha)^{-1}$. Therefore, if we divide $\varphi$ by $b(1-\alpha)^{-1} + \widetilde{L}$, we obtain a function from the class $B_{1,L}$ in (3.3). Finally, to obtain inequality (3.31), it suffices to compare (3.22) with the definition of the Dudley metric given in (3.2) and (3.3).□

A natural question arises: how can one evaluate $d(G, \widetilde{G})$ in (3.18) if the distribution $G$ is assumed to be unknown? We can give an answer in one of the most important cases, when $\widetilde{G}$ is the empirical distribution used to estimate $G$.

Now, we assume that the random vectors $\xi_1, \xi_2, \ldots$ in (2.1) are observable, and let $\xi_1, \xi_2, \ldots, \xi_n$ be i.i.d. observations of a random vector $\xi$ with distribution $G$. The empirical distribution $\widetilde{G} \equiv \widetilde{G}_n$ is defined (on $(S, \mathcal{S})$) as follows:

$\widetilde{G}_n = \frac{1}{n} \sum_{k=1}^n \delta_{\xi_k}$, where for $k = 1, 2, \ldots, n$ and $B \in \mathcal{S}$,

$\delta_{\xi_k}(B) = \begin{cases} 1, & \text{if } \xi_k \in B, \\ 0, & \text{otherwise.} \end{cases}$

Assume that $S = \mathbb{R}^k$ and $\rho$ is the Euclidean norm. Also, suppose that there exist constants $K < \infty$ and $h > 0$ such that $E\, e^{h\|\xi\|} \le K$.

Then, there is a calculable constant $C = C(k, K, h)$ such that for each $n = 1, 2, \ldots$,

(3.36) $E\, d(G, \widetilde{G}_n) \le C\, \delta(k, n),$

where

$\delta(k, n) = \begin{cases} \log(1+n)\, n^{-1/2}, & \text{if } k = 1, \\ \log^2(1+n)\, n^{-1/2}, & \text{if } k = 2, \\ \log(1+n)\, n^{-1/k}, & \text{if } k \ge 3. \end{cases}$

Inequality (3.36) was shown in Proposition 2.1 of [10], but it is actually a fairly direct consequence of Proposition 3.4 in [12]. Taking the expectation of both sides of (3.18), one can apply inequality (3.36).
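For orientation, the rate $\delta(k,n)$ in (3.36) can be tabulated directly (reading $\log^2$ as the squared logarithm):

```python
import math

def delta(k, n):
    """Rate delta(k, n) from (3.36)."""
    if k == 1:
        return math.log(1 + n) / math.sqrt(n)
    if k == 2:
        return math.log(1 + n) ** 2 / math.sqrt(n)
    return math.log(1 + n) * n ** (-1.0 / k)

for n in (10 ** 2, 10 ** 4, 10 ** 6):
    print(n, [round(delta(k, n), 4) for k in (1, 2, 3)])
```

The curse of dimensionality is visible: for $k \ge 3$ the sample-size exponent degrades from $n^{-1/2}$ to $n^{-1/k}$.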

Remark 3.1

There is a class of controlled Markov processes with observable “perturbations” $\xi_1, \xi_2, \ldots$ (One representative is discussed in Example 2.) Even more often, the mentioned random vectors are not observable. In such cases, one should either use some indirect method of bounding $d(G, \widetilde{G}_n)$ or look for other treatments. It is worth noting that our setting of the problem, generally speaking, does not require any estimation procedure. The distribution $\widetilde{G}$ can be, for example, some “theoretical simplification” of a known but “too complex” real distribution $G$.

4 Examples

Example 1

In fact, this is a simple counterexample showing that Assumption 2 is essential for inequality (3.18) to hold.

Let $X = [0, \infty)$, $A = \{0, 1\}$, $S = \mathbb{R}^2$, and for $\xi_t = (\xi_t^{(1)}, \xi_t^{(2)})$,

$X_t = \xi_t^{(1)} + X_{t-1}\, a_t\, \xi_t^{(2)}, \quad t = 1, 2, \ldots.$

For $a = 0$ and $a = 1$, the one-step reward function is the same and is given by the following formula:

(4.1) $r(x,a) = \begin{cases} 2, & \text{if } x = 0, \\ x, & \text{if } x \in (0,1], \\ 1, & \text{if } x > 1. \end{cases}$

For an arbitrary but fixed $\varepsilon > 0$, we set $G = \delta_{(0,1)}$, that is,

$P(\xi_t^{(1)} = 0) = 1$ and $P(\xi_t^{(2)} = 1) = 1,$

and also $\widetilde{G} = \delta_{(\varepsilon,1)}$, that is,

$P(\widetilde{\xi}_t^{(1)} = \varepsilon) = 1$ and $P(\widetilde{\xi}_t^{(2)} = 1) = 1.$

Then, the “real” process is

(4.2) $X_t = X_{t-1}\, a_t, \quad t \ge 1,$

and the approximating one is

(4.3) $\widetilde{X}_t = \varepsilon + \widetilde{X}_{t-1}\, a_t, \quad t \ge 1.$

Let α ( 0 , 1 ) be any discount factor.

Let us fix the initial state $X_0 = \widetilde{X}_0 = 1$. From (4.1) and (4.2), we see that the optimal stationary policy for the process (4.2) is $f^* = \{0, 0, \ldots\}$ (i.e., always select the action $f^*(x) = 0$). The corresponding reward is

(4.4) $V(1, f^*) = 1 + \sum_{t=2}^{\infty} \alpha^{t-1} \cdot 2 = \frac{1}{1-\alpha} + \frac{\alpha}{1-\alpha}.$

Since the process (4.3) can never reach the state $x = 0$ and $r(x,a)$ is non-decreasing on $(0, \infty)$, the optimal stationary policy for the process (4.3) is $\widetilde{f}^* = \{1, 1, \ldots\}$. The application of $\widetilde{f}^*$ to the process (4.2) gives

$V(1, \widetilde{f}^*) = \sum_{t=1}^{\infty} \alpha^{t-1} \cdot 1 = \frac{1}{1-\alpha}.$

Comparing this with (4.4), we see that the stability index in (2.7) is

$\Delta(1) = \frac{\alpha}{1-\alpha} > 0.$

On the other hand, it is easy to show that $d(G, \widetilde{G}) \le \varepsilon \to 0$ (as $\varepsilon \to 0$).

Note that $V(G, \widetilde{G}) = 2$ for all $\varepsilon > 0$.
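The gap $\Delta(1) = \alpha/(1-\alpha)$ can be confirmed by directly summing the (here deterministic) discounted rewards; the truncation horizon below is an arbitrary numerical choice:

```python
def reward(x):
    # one-step reward (4.1); it does not depend on the action
    if x == 0.0:
        return 2.0
    return x if x <= 1.0 else 1.0

def value(x0, action, alpha, T=2000):
    """Truncated discounted reward of the 'real' dynamics (4.2),
    X_t = X_{t-1} * a_t, under the constant policy a_t = action."""
    x, total = x0, 0.0
    for t in range(T):
        total += alpha ** t * reward(x)   # alpha**t here plays alpha**(t-1)
        x = x * action                    # dynamics (4.2)
    return total

alpha = 0.9
gap = value(1.0, 0, alpha) - value(1.0, 1, alpha)   # V(1, f*) - V(1, f~*)
print(round(gap, 6), alpha / (1 - alpha))  # both equal 9.0
```

With $\alpha = 0.9$ the truncation error is of order $\alpha^{2000}$ and thus entirely negligible.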

Example 2

(See, e.g., [13, Ch. 1] or [22].) In this model, related to a dam operation, the stocks of water are specified by the equations

(4.5) $X_t = \min\{X_{t-1} - a_t + \xi_t,\ M\}, \quad t = 1, 2, \ldots,$

where $M < \infty$ is the capacity of the water reservoir and $X_{t-1}$ is the stock of water at the beginning of the $t$th period (say, day). The control $a_t$ is the volume of water released during the $t$th period (e.g., for irrigation). Finally, $\xi_t$ is a non-negative random variable representing the water inflow in the $t$th period. We assume that $\xi_1, \xi_2, \ldots$ are i.i.d. random variables having density $g$.

As we see from (4.5), for this controlled process, $X = [0, M]$, $A(x) = [0, x]$ for $x \in [0, M]$, and $S = [0, \infty)$.

Choosing some bounded one-step reward function $r(x,a)$ (which in the simplest case is $(-1) \times$ the cost of a unit of water) and fixing a discount factor $\alpha \in (0,1)$, we are faced with the problem of optimal water management, posed as maximizing the expected long-term total discounted reward.

We assume that the density g (of the water inflow) is unknown, and it is approximated by some known density g ˜ (obtained, for instance, from statistical estimations).

Also, we assume the following:

  1. For each x [ 0 , M ] , the one-step reward r ( x , a ) is a continuous function of a [ 0 , x ] .

  2. Both densities g and g ˜ are bounded and continuous on ( 0 , ) .

The verification of Assumption 1(b) is a matter of simple calculations. Then, according to Proposition 2.1, there exist stationary optimal policies $f^*$ and $\widetilde{f}^*$ for the process (4.5) and, correspondingly, for the following approximating water release process:

$\widetilde{X}_t = \min\{\widetilde{X}_{t-1} - \widetilde{a}_t + \widetilde{\xi}_t,\ M\}, \quad t = 1, 2, \ldots,$

where the i.i.d. random variables $\widetilde{\xi}_1, \widetilde{\xi}_2, \ldots$ have density $\widetilde{g}$. The application, for instance, of the policy $f^*$ means that in the $t$th period, the portion $f^*(X_{t-1})$ of the current stock $X_{t-1}$ is released.

Noting that all conditions of Theorem 1 are satisfied, and that for distributions having densities,

$V(G, \widetilde{G}) = \int_0^{\infty} |g(y) - \widetilde{g}(y)|\, dy,$

by (3.4) we have

$\sup_{x \in [0,M]} \Delta(x) \le \frac{2\alpha b}{(1-\alpha)^2} \int_0^{\infty} |g(y) - \widetilde{g}(y)|\, dy,$

where $b = \sup_{(x,a) \in K} |r(x,a)|$, and $\Delta(x)$ is the stability index defined in (2.7).
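To see the stability index at work here, one can discretize (4.5) and compare the policy computed from a crude inflow model (a deterministic inflow, in the spirit of the "theoretical simplification" of Remark 3.1) with the one computed from a sample of the true inflow. The integer grid, the concave reward $r(s,a) = a - 0.1a^2$, and both inflow laws below are purely illustrative choices, and the expectation over $g$ is replaced by a sample average:

```python
import random

M, alpha = 10, 0.9            # reservoir capacity and discount factor
states = range(M + 1)         # water stocks on an integer grid

def r(s, a):
    return a - 0.1 * a * a    # bounded concave revenue from releasing a units

def step(s, a, xi):
    return min(M, round(s - a + xi))   # dynamics (4.5), projected to the grid

def q(V, s, a, inflows):
    # one-step value of action a in state s, averaging over the inflow sample
    return r(s, a) + alpha * sum(V[step(s, a, xi)] for xi in inflows) / len(inflows)

def optimal_policy(inflows, sweeps=80):
    V = [0.0] * (M + 1)
    for _ in range(sweeps):   # value iteration: V <- TV (contraction, modulus alpha)
        V = [max(q(V, s, a, inflows) for a in range(s + 1)) for s in states]
    return [max(range(s + 1), key=lambda a: q(V, s, a, inflows)) for s in states]

def evaluate(policy, inflows, sweeps=80):
    V = [0.0] * (M + 1)
    for _ in range(sweeps):   # policy evaluation: V <- T_f V
        V = [q(V, s, policy[s], inflows) for s in states]
    return V

random.seed(7)
true_inflow = [random.expovariate(1.0) for _ in range(40)]   # stands in for g
f_star = optimal_policy(true_inflow)          # optimal for the "real" process
f_tilde = optimal_policy([1.0])               # optimal for a crude model g~
V_star = evaluate(f_star, true_inflow)
V_tilde = evaluate(f_tilde, true_inflow)      # f~* applied to the real process
delta = max(vs - vt for vs, vt in zip(V_star, V_tilde))
print(round(delta, 4))        # approximate sup_x Delta(x) on the grid
```

When the crude model is close to the true inflow law, the computed gap is small (it may even vanish if both models yield the same policy), which is exactly the behavior that the stability inequality quantifies.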

Example 3

(Controlled “environmental” stochastic process) The uncontrolled version of this discrete-time stochastic process is defined by the following recurrent equation (see, e.g., [23, Ch. 9]):

(4.6) $X_t = \alpha(\xi_t) X_{t-1} + \varphi(\xi_t), \quad t = 1, 2, \ldots,$

where $\xi_1, \xi_2, \ldots$ are i.i.d. random vectors with values in the Euclidean space $\mathbb{R}^k$, and $X_t \in \mathbb{R}$ ($t = 0, 1, 2, \ldots$).

Processes of type (4.6) are used in modeling some phenomena in environmental science.

We will consider a controlled variant of (4.6), that is, the process

(4.7) $X_t = \alpha(\xi_t) X_{t-1} + \varphi(a_t, \xi_t), \quad t = 1, 2, \ldots,$

where $a_t \in A$, and $A$ is a given compact subset of the Euclidean space $\mathbb{R}^m$.

In this case, $A(x) = A$ for all $x \in X = \mathbb{R}$. In this example, the space $S$ is $\mathbb{R}^k$.

Let $r(x,a)$ be a one-step reward function, bounded by $b$, which is continuous on $\mathbb{R} \times A$ and, moreover, satisfies for some $L_0 < \infty$,

(4.8) $|r(x,a) - r(y,a)| \le L_0 |x - y|$

for all $x, y \in \mathbb{R}$ and $a \in A$.

We assume that

(4.9) $E |\alpha(\xi_1)| \le L_1$ and $\alpha L_1 < 1,$

and, for some $L < \infty$,

(4.10) $|\varphi(a,s) - \varphi(a,s')| \le L \|s - s'\|$

for all $s, s' \in \mathbb{R}^k$ and $a \in A$; also, for each $s \in \mathbb{R}^k$, the map $a \mapsto \varphi(a,s)$ is continuous. Using (4.7)–(4.10), it is easy to check that Assumption 2 is fulfilled. Also, Assumptions 1(a) and (b*) are fulfilled. Indeed, if $u: \mathbb{R} \to \mathbb{R}$ is continuous and bounded, then the map $a \mapsto E\, u[\alpha(\xi) x + \varphi(a, \xi)]$ is continuous by the dominated convergence theorem.

All of the above allows us to apply the stability inequality (3.18). Making use of the known relationship between the Dudley and Wasserstein metrics, for the particular case $k = 1$ (i.e., $\xi_t$ is a random variable), the mentioned inequality can be written as follows:

$\sup_{x \in \mathbb{R}} \Delta(x) \le \frac{2^{3/2} \alpha}{(1-\alpha)^2} \left[ \frac{b}{1-\alpha} + \frac{L_0 L}{1 - \alpha L_1} \right] \left( \int_{-\infty}^{\infty} |F_{\xi}(y) - F_{\widetilde{\xi}}(y)|\, dy \right)^{1/2},$

where $F_{\xi}$ and $F_{\widetilde{\xi}}$ are the distribution functions of $\xi$ and $\widetilde{\xi}$, respectively, and $\widetilde{\xi}$ is generic for the i.i.d. random variables $\widetilde{\xi}_1, \widetilde{\xi}_2, \ldots$ involved in the approximating process

$\widetilde{X}_t = \alpha(\widetilde{\xi}_t) \widetilde{X}_{t-1} + \varphi(\widetilde{a}_t, \widetilde{\xi}_t), \quad t = 1, 2, \ldots.$

Acknowledgement

We thank the reviewers for their careful reading of the manuscript and for their suggestions, which allowed us to correct and improve the presentation of the article.

  1. Author contributions: All authors read and approved the final manuscript.

  2. Conflict of interest: The authors state no conflict of interest.

  3. Data availability statement: No data, models, or code are generated or used during the study.

References

[1] S. Meyn and R. L. Tweedie, Markov Chains and Stochastic Stability, Springer-Verlag, London, 1993, https://doi.org/10.1007/978-1-4471-3267-7.

[2] Ch. Andrieu, V. B. Tadić, and M. Vihola, On the stability of some controlled Markov chains and its applications to stochastic approximation with Markovian dynamic, Ann. Appl. Probab. 25 (2015), no. 1, 1–45, https://doi.org/10.1214/13-AAP953.

[3] Y. F. Atchadé and G. Fort, Limit theorems for some adaptive MCMC algorithms with subgeometric kernels: Part II, Bernoulli 18 (2012), no. 3, 975–1001, https://doi.org/10.3150/11-BEJ360.

[4] V. M. Zolotarev, On the continuity of stochastic sequences generated by recurrent processes, Theory Probab. Appl. 20 (1975), no. 4, 819–832, https://doi.org/10.1137/1120088.

[5] N. V. Kartashov, Inequalities in stability and ergodicity theorems for Markov chains with a general phase space. II, Teor. Veroyatn. Primen. 30 (1985), no. 3, 478–485 (in Russian), https://doi.org/10.1137/1130063.

[6] V. V. Kalasnikov and S. A. Anichkin, Continuity of random sequences and approximation of Markov chains, Adv. Appl. Probab. 13 (1981), no. 2, 402–414, https://doi.org/10.2307/1426691.

[7] N. M. Van Dijk, Perturbation theory for unbounded Markov reward processes with applications to queuing, Adv. Appl. Probab. 20 (1988), no. 1, 99–111, https://doi.org/10.2307/1427272.

[8] E. I. Gordienko, Stability estimates for controlled Markov chains with a minorant. Stability problems of stochastic models, J. Sov. Math. 40 (1988), 481–486, https://doi.org/10.1007/BF01083641.

[9] R. Montes-de-Oca, A. Sakhanenko, and F. Salem-Silva, Estimates for perturbations of general discounted Markov control chains, Appl. Math. 30 (2003), no. 3, 287–304, https://doi.org/10.4064/am30-3-4.

[10] E. Gordienko, E. Lemus-Rodríguez, and R. Montes-de-Oca, Discounted cost optimality problem: Stability with respect to weak metrics, Math. Methods Oper. Res. 68 (2008), no. 1, 77–96, https://doi.org/10.1007/s00186-007-0171-z.

[11] E. I. Gordienko and F. S. Salem, Robustness inequality for Markov control processes with unbounded costs, Systems Control Lett. 33 (1998), no. 2, 125–130, https://doi.org/10.1016/S0167-6911(97)00077-7.

[12] R. M. Dudley, The speed of mean Glivenko-Cantelli convergence, Ann. Math. Statist. 40 (1969), 40–50, https://doi.org/10.1214/aoms/1177697802.

[13] O. Hernández-Lerma, Adaptive Markov Control Processes, Applied Mathematical Sciences, vol. 79, Springer-Verlag, New York, 1989, https://doi.org/10.1007/978-1-4419-8714-3.

[14] M. Schäl, Estimation and control in discounted stochastic dynamic programming, Stochastics 20 (1987), no. 1, 51–71, https://doi.org/10.1080/17442508708833435.

[15] R. Cavazos-Cadena, Nonparametric adaptive control of discounted stochastic systems with compact state space, J. Optim. Theory Appl. 65 (1990), no. 2, 191–207, https://doi.org/10.1007/BF01102341.

[16] O. Hernández-Lerma and S. I. Marcus, Adaptive control of discounted Markov decision chains, J. Optim. Theory Appl. 46 (1985), no. 3, 227–235, https://doi.org/10.1007/BF00938426.

[17] E. I. Gordienko and J. A. Minjárez-Sosa, Adaptive control for discrete-time Markov processes with unbounded costs: Discounted criterion, Kybernetika (Prague) 34 (1998), no. 2, 217–234.

[18] K. Hinderer, Foundations of Non-stationary Dynamic Programming with Discrete Time Parameter, Lecture Notes in Operations Research and Mathematical Systems, vol. 33, Springer-Verlag, Berlin-New York, 1970, https://doi.org/10.1007/978-3-642-46229-0.

[19] O. Hernández-Lerma and M. Muñoz de Özak, Discrete-time Markov control processes with discounted unbounded costs: optimality criteria, Kybernetika (Prague) 28 (1992), no. 3, 191–212.

[20] S. T. Rachev, Probability Metrics and the Stability of Stochastic Models, John Wiley & Sons, Ltd., Chichester, 1991.

[21] R. M. Dudley, Real Analysis and Probability, Cambridge Studies in Advanced Mathematics, vol. 74, Cambridge University Press, Cambridge, 2002 (revised reprint of the 1989 original), https://doi.org/10.1017/CBO9780511755347.

[22] S. Yakowitz, Dynamic programming applications in water resources, Water Resources 18 (1982), 673–696, https://doi.org/10.1029/WR018i004p00673.

[23] S. T. Rachev and L. Rüschendorf, Mass Transportation Problems. Vol. II: Applications, Probability and its Applications, Springer-Verlag, New York, 1998.

Received: 2022-02-11
Revised: 2022-06-13
Accepted: 2022-09-21
Published Online: 2022-11-24

© 2022 Evgueni Gordienko and Juan Ruiz de Chavez, published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
