Risk-Averse Stochastic Shortest Path Planning

Mohamadreza Ahmadi, Anushri Dixit, Joel W. Burdick, and Aaron D. Ames

arXiv:2103.14727v1 [eess.SY] 26 Mar 2021

Abstract— We consider the stochastic shortest path planning problem in MDPs, i.e., the problem of designing policies that ensure reaching a goal state from a given initial state with minimum accrued cost. In order to account for rare but important realizations of the system, we consider a nested dynamic coherent risk total cost functional rather than the conventional risk-neutral total expected cost. Under some assumptions, we show that optimal, stationary, Markovian policies exist and can be found via a special Bellman's equation. We propose a computational technique based on difference convex programs (DCPs) to find the associated value functions and therefore the risk-averse policies. A rover navigation MDP is used to illustrate the proposed methodology with conditional-value-at-risk (CVaR) and entropic-value-at-risk (EVaR) coherent risk measures.

I. INTRODUCTION

Shortest path problems [10], i.e., the problem of reaching a goal state from an initial state with minimum total cost, arise in several real-world applications, such as driving directions on web mapping websites like MapQuest or Google Maps [41] and robotic path planning [18]. In a shortest path problem, if the transitions from one system state to another are subject to stochastic uncertainty, the problem is referred to as a stochastic shortest path (SSP) problem [13], [44]. In this case, we are interested in designing policies such that the total expected cost is minimized. Such planning-under-uncertainty problems are equivalent to undiscounted total-cost Markov decision processes (MDPs) [37] and can be solved efficiently via dynamic programming [13], [12].

However, emerging applications in path planning, such as autonomous navigation in extreme environments, e.g., subterranean [24] and extraterrestrial environments [2], not only require reaching a goal region, but also risk-awareness for mission success. Nonetheless, the conventional total expected cost is only meaningful if the law of large numbers can be invoked, and it ignores important but rare system realizations. In addition, robust planning solutions may give rise to behavior that is extremely conservative.

Risk can be quantified in numerous ways. For example, mission risks can be mathematically characterized in terms of chance constraints [34], [33], utility functions [23], and distributional robustness [48]. Chance constraints often account for Boolean events (such as collision with an obstacle or reaching a goal set) and do not take into consideration the tail of the cost distribution. To account for the latter, risk measures have been advocated for planning and decision making tasks in robotic systems [32]. The preference for one risk measure over another depends on factors such as sensitivity to rare events, ease of estimation from data, and computational tractability. Artzner et al. [8] characterized a set of natural properties that are desirable for a risk measure, called a coherent risk measure, which has since obtained widespread acceptance in finance and operations research, among other fields. Coherent risk measures can be interpreted as a special form of distributional robustness, which will be leveraged later in this paper.

Conditional value-at-risk (CVaR) is an important coherent risk measure that has received significant attention in decision making problems, such as MDPs [20], [19], [36], [9]. General coherent risk measures for MDPs were studied in [39], [3], wherein it was further assumed that the risk measure is time consistent, akin to the dynamic programming property. Following the footsteps of [39], [46] proposed sampling-based algorithms for MDPs with static and dynamic coherent risk measures using policy gradient and actor-critic methods, respectively (also, see a model predictive control technique for linear dynamical systems with coherent risk objectives [45]). A method based on stochastic reachability analysis was proposed in [17] to estimate a CVaR-safe set of initial conditions via the solution to an MDP. A worst-case CVaR SSP planning method was proposed and solved via dynamic programming in [27]. Also, total-cost undiscounted MDPs with static CVaR measures were studied in [16] and solved via a surrogate MDP, whose solution approximates the optimal policy with arbitrary accuracy.

In this paper, we propose a method for designing policies for SSP planning problems such that the total accrued cost in terms of dynamic, coherent risk measures is minimized (a generalization of the problems considered in [27] and [16] to dynamic, coherent risk measures). We begin by showing that, under the assumption that the goal region is reachable in finite time with non-zero probability, the total accumulated risk cost is always bounded. We further show that, if the coherent risk measures satisfy a Markovian property, we can find optimal, stationary, Markovian risk-averse policies via solving a special Bellman's equation. We also propose a computational method based on difference convex programming to solve the Bellman's equation and therefore design risk-averse policies. We elucidate the proposed method via numerical examples involving a rover navigation MDP and the CVaR and entropic-value-at-risk (EVaR) measures.

The rest of the paper is organized as follows. In the next section, we review some definitions and properties used in the sequel. In Section III, we present the problem under study and show its well-posedness under an assumption. In Section IV, we present the main result of the paper, i.e., a special Bellman's equation for the risk-averse SSP problem. In Section V, we describe a computational method to find risk-averse policies. In Section VI, we illustrate the proposed method via a numerical example and, finally, in Section VII, we conclude the paper.

The authors are with the California Institute of Technology, 1200 E. California Blvd., MC 104-44, Pasadena, CA 91125, e-mail: ({mrahmadi, adixit, ames}@caltech.edu, jwb@robotics.caltech.edu).
Notation: We denote by $\mathbb{R}^n$ the $n$-dimensional Euclidean space and by $\mathbb{N}_{\ge 0}$ the set of non-negative integers. We use bold font to denote a vector and $(\cdot)^\top$ for its transpose, e.g., $\boldsymbol{a} = (a_1, \ldots, a_n)^\top$, with $n \in \{1, 2, \ldots\}$. For a vector $\boldsymbol{a}$, we use $\boldsymbol{a} \succeq (\preceq)\, \boldsymbol{0}$ to denote element-wise non-negativity (non-positivity) and $\boldsymbol{a} \equiv \boldsymbol{0}$ to indicate that all elements of $\boldsymbol{a}$ are zero. For a finite set $\mathcal{A}$, we denote its power set by $2^{\mathcal{A}}$, i.e., the set of all subsets of $\mathcal{A}$. For a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ and a constant $p \in [1, \infty)$, $\mathcal{L}_p(\Omega, \mathcal{F}, \mathbb{P})$ denotes the vector space of real-valued random variables $c$ for which $\mathbb{E}|c|^p < \infty$. Superscripts are used to denote indices and subscripts are used to denote time steps (stages), e.g., for $s \in \mathcal{S}$, $s^2_1$ means the value of $s^2 \in \mathcal{S}$ at the 1st stage.

II. PRELIMINARIES

This section briefly reviews notions and definitions used throughout the paper. We are interested in designing policies for a class of finite MDPs (termed transient MDPs in [16]) as shown in Figure 1, which is defined next.

Fig. 1. The transition graph of the particular class of MDPs studied in this paper. The goal state $s_g$ is cost-free and absorbing, i.e., $T(s_g \mid s_g, \alpha) = 1$.

Definition 1 (MDP): An MDP is a tuple $\mathcal{M} = (\mathcal{S}, Act, T, s_0, c, s_g)$, where
- States $\mathcal{S} = \{s_1, \ldots, s_{|\mathcal{S}|}\}$ of the autonomous agent(s) and world model;
- Actions $Act = \{\alpha_1, \ldots, \alpha_{|Act|}\}$ available to the robot;
- A transition probability distribution $T(s_j \mid s_i, \alpha)$, satisfying $\sum_{s \in \mathcal{S}} T(s \mid s_i, \alpha) = 1$, $\forall s_i \in \mathcal{S}$, $\forall \alpha \in Act$;
- An initial state $s_0 \in \mathcal{S}$;
- An immediate cost function $c(s_i, \alpha_i) \ge 0$ for each state $s_i \in \mathcal{S}$ and action $\alpha_i \in Act$;
- A special cost-free goal (termination) state $s_g \in \mathcal{S}$, i.e., $T(s_g \mid s_g, \alpha) = 1$ and $c(s_g, \alpha) = 0$ for all $\alpha \in Act$.

We assume the immediate cost function $c$ is non-negative and upper-bounded by a positive constant $\bar{c}$. Our risk-averse policies for the SSP problem rely on the notion of dynamic coherent risk measures, whose definitions and properties are presented next.

Consider a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, a filtration $\mathcal{F}_0 \subset \cdots \subset \mathcal{F}_T \subset \mathcal{F}$, and an adapted sequence of random variables (stage-wise costs) $c_t$, $t = 0, \ldots, T$, where $T \in \mathbb{N}_{\ge 0} \cup \{\infty\}$. For $t = 0, \ldots, T$, we further define the spaces $\mathcal{C}_t = \mathcal{L}_p(\Omega, \mathcal{F}_t, \mathbb{P})$, $p \in [1, \infty)$, $\mathcal{C}_{t:T} = \mathcal{C}_t \times \cdots \times \mathcal{C}_T$, and $\mathcal{C} = \mathcal{C}_0 \times \mathcal{C}_1 \times \cdots$. In order to describe how one can evaluate the risk of the sub-sequence $c_t, \ldots, c_T$ from the perspective of stage $t$, we require the following definitions.

Definition 2 (Conditional Risk Measure): A mapping $\rho_{t:T} : \mathcal{C}_{t:T} \to \mathcal{C}_t$, where $0 \le t \le T$, is called a conditional risk measure if it has the following monotonicity property:
$$\rho_{t:T}(c) \le \rho_{t:T}(c'), \quad \forall c, c' \in \mathcal{C}_{t:T} \text{ such that } c \preceq c'.$$

Definition 3 (Dynamic Risk Measure): A dynamic risk measure is a sequence of conditional risk measures $\rho_{t:T} : \mathcal{C}_{t:T} \to \mathcal{C}_t$, $t = 0, \ldots, T$.

One fundamental property of dynamic risk measures is their consistency over time [39, Definition 3]. If a risk measure is time-consistent, we can define the one-step conditional risk measure $\rho_t : \mathcal{C}_{t+1} \to \mathcal{C}_t$, $t = 0, \ldots, T-1$, as
$$\rho_t(c_{t+1}) = \rho_{t,t+1}(0, c_{t+1}), \tag{1}$$
and for all $t = 1, \ldots, T$, we obtain
$$\rho_{t,T}(c_t, \ldots, c_T) = \rho_t\Big(c_t + \rho_{t+1}\big(c_{t+1} + \rho_{t+2}(c_{t+2} + \cdots + \rho_{T-1}(c_{T-1} + \rho_T(c_T)) \cdots )\big)\Big). \tag{2}$$
Note that a time-consistent risk measure is completely defined by its one-step conditional risk measures $\rho_t$, $t = 0, \ldots, T-1$, and, in particular, for $t = 0$, (2) defines a risk measure of the entire sequence $c \in \mathcal{C}_{0:T}$. At this point, we are ready to define a coherent risk measure.

Definition 4 (Coherent Risk Measure): We call the one-step conditional risk measures $\rho_t : \mathcal{C}_{t+1} \to \mathcal{C}_t$, $t = 0, \ldots, T-1$, as in (2) coherent risk measures if they satisfy the following conditions:
- Convexity: $\rho_t(\lambda c + (1-\lambda)c') \le \lambda \rho_t(c) + (1-\lambda)\rho_t(c')$ for all $\lambda \in (0,1)$ and all $c, c' \in \mathcal{C}_{t+1}$;
- Monotonicity: if $c \le c'$, then $\rho_t(c) \le \rho_t(c')$ for all $c, c' \in \mathcal{C}_{t+1}$;
- Translational invariance: $\rho_t(c' + c) = \rho_t(c') + c$ for all $c \in \mathcal{C}_t$ and $c' \in \mathcal{C}_{t+1}$;
- Positive homogeneity: $\rho_t(\beta c) = \beta \rho_t(c)$ for all $c \in \mathcal{C}_{t+1}$ and $\beta \ge 0$.

In fact, we can show that there exists a dual (or distributionally robust) representation for any coherent risk measure. Let $m, n \in [1, \infty)$ be such that $1/m + 1/n = 1$ and define
$$\mathcal{P} = \Big\{ q \in \mathcal{L}_n(\mathcal{S}, 2^{\mathcal{S}}, \mathbb{P}) \;\Big|\; \sum_{s' \in \mathcal{S}} q(s') \mathbb{P}(s') = 1, \; q \ge 0 \Big\}.$$

Proposition 1 (Proposition 4.14 in [26]): Let $\mathcal{Q}$ be a closed convex subset of $\mathcal{P}$. The one-step conditional risk measure $\rho_t : \mathcal{C}_{t+1} \to \mathcal{C}_t$, $t = 0, \ldots, T-1$, is a coherent risk measure if and only if
$$\rho_t(c) = \sup_{q \in \mathcal{Q}} \langle c, q \rangle_{\mathcal{Q}}, \quad \forall c \in \mathcal{C}_{t+1}, \tag{3}$$
where $\langle \cdot, \cdot \rangle_{\mathcal{Q}}$ denotes the inner product in $\mathcal{Q}$.

Hereafter, all risk measures are assumed to be coherent.
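To make the dual representation (3) concrete, the following minimal sketch (ours, not from the paper) evaluates a coherent risk measure by maximizing $\langle c, q \rangle$ over a risk envelope $\mathcal{Q}$. As an assumption for illustration, we instantiate $\mathcal{Q}$ with the standard CVaR envelope $\{q : 0 \le q(s) \le 1/\varepsilon,\ \sum_s q(s)\mathbb{P}(s) = 1\}$, where $q$ is a density with respect to the nominal distribution; the cost and probability values are made up.

```python
import cvxpy as cp
import numpy as np

# Illustrative sketch of Proposition 1: rho(c) = sup_{q in Q} <c, q>, with Q
# taken (as an assumption) to be the standard CVaR_eps risk envelope.
def cvar_dual(costs, probs, eps):
    """CVaR_eps of a discrete cost distribution via the dual linear program."""
    q = cp.Variable(len(costs), nonneg=True)          # density w.r.t. the nominal P
    objective = cp.Maximize(cp.sum(cp.multiply(costs * probs, q)))   # <c, q> = E_P[c q]
    constraints = [q <= 1.0 / eps,                     # CVaR risk-envelope bound
                   cp.sum(cp.multiply(probs, q)) == 1.0]
    cp.Problem(objective, constraints).solve()
    return objective.value

costs = np.array([1.0, 1.0, 5.0, 20.0])    # hypothetical stage costs
probs = np.array([0.5, 0.3, 0.15, 0.05])   # hypothetical transition probabilities
print(cvar_dual(costs, probs, eps=0.3))    # larger than the risk-neutral E[c] = 2.55
```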
III. PROBLEM FORMULATION

Next, we formally describe the risk-averse SSP problem. We also demonstrate that, if the goal state is reachable in finite time, the risk-averse SSP problem is well-posed. Let $\pi = \{\pi_0, \pi_1, \ldots\}$ be an admissible policy.

Problem 1: Consider an MDP $\mathcal{M}$ as described in Definition 1. Given an initial state $s_0 \ne s_g$, we are interested in solving
$$\pi^* \in \arg\min_{\pi} J(s_0, \pi), \tag{4}$$
where
$$J(s_0, \pi) = \lim_{T \to \infty} \rho_{0:T}\big(c(s_0, \pi_0), \ldots, c(s_T, \pi_T)\big) \tag{5}$$
is the total risk functional for the admissible policy $\pi$.

In fact, we are interested in reaching the goal state $s_g$ such that the total risk cost is minimized.^1 Note that the risk-averse deterministic shortest path problem is obtained as a special case when the transitions are deterministic. We define the optimal risk value function as
$$J^*(s) = \min_{\pi} J(s, \pi), \quad \forall s \in \mathcal{S},$$
and call a stationary policy $\pi = \{\mu, \mu, \ldots\}$ (denoted $\mu$) optimal if $J(s, \mu) = J^*(s) = \min_{\pi} J(s, \pi)$, $\forall s \in \mathcal{S}$.

We posit the following assumption, which implies that the goal state is eventually reachable under all policies.

Assumption 1 (Goal is Reachable in Finite Time): Regardless of the policy used and the initial state, there exists an integer $\tau$ such that there is a positive probability that the goal state $s_g$ is visited after no more than $\tau$ stages.^2

We then have the following observation with respect to Problem 1.^3

Proposition 2: Let Assumption 1 hold. Then the risk-averse SSP problem, i.e., Problem 1, is well-posed and $J(s_0, \pi)$ is bounded for all policies $\pi$.

Proof: Assumption 1 implies that for each admissible policy $\pi$, we have $p_\pi = \max_{s \in \mathcal{S}} \mathbb{P}(s_\tau \ne s_g \mid s_0 = s, \pi) < 1$. That is, given a policy $\pi$, the probability $p_\pi$ of not visiting the goal state $s_g$ within $\tau$ stages is less than one. Let $p = \max_\pi p_\pi$. Remark that $p_\pi$ depends solely on $\{\pi_1, \pi_2, \ldots, \pi_\tau\}$. Moreover, since $Act$ is finite, the number of $\tau$-stage policies is also finite, which implies finiteness of the set of values $p_\pi$. Hence, $p < 1$ as well. Therefore, for any policy $\pi$ and initial state $s$, we obtain $\mathbb{P}(s_{2\tau} \ne s_g \mid s_0 = s, \pi) = \mathbb{P}(s_{2\tau} \ne s_g \mid s_\tau \ne s_g, s_0 = s, \pi) \times \mathbb{P}(s_\tau \ne s_g \mid s_0 = s, \pi) \le p^2$. Then, by induction, we can show that, for any admissible SSP policy $\pi$,
$$\mathbb{P}(s_{k\tau} \ne s_g \mid s_0 = s, \pi) \le p^k, \quad \forall s \in \mathcal{S},\; k = 1, 2, \ldots \tag{6}$$
Indeed, we can show that the risk-averse cost incurred in the $\tau$ periods between $\tau k$ and $\tau(k+1)-1$ is bounded as follows:
$$\begin{aligned}
\rho_0\big(\cdots \rho_{\tau(k+1)-1}(c_{\tau k} + \cdots + c_{\tau(k+1)-1}) \cdots \big)
&= \tilde{\rho}(c_{\tau k} + c_{\tau k+1} + \cdots + c_{\tau(k+1)-1}) && \text{(7b)}\\
&\le \tilde{\rho}(\bar{c} + \cdots + \bar{c}) && \text{(7c)}\\
&= \sup_{q \in \mathcal{Q}} \langle \tau\bar{c}, q \rangle_{\mathcal{Q}} && \text{(7d)}\\
&\le \langle \tau\bar{c}, q^* \rangle_{\mathcal{Q}} && \text{(7e)}\\
&= \tau\bar{c} \sum_{s \in \mathcal{S}} \mathbb{P}(s_{k\tau} \ne s_g \mid s_0 = s, \pi)\, q^*(s, \pi) && \text{(7f)}\\
&\le \tau\bar{c} \times \sup_{s \in \mathcal{S}} \big(\mathbb{P}(s_{k\tau} \ne s_g \mid s_0 = s, \pi)\big) \sum_{s \in \mathcal{S}} |q^*(s, \pi)| && \text{(7g)}\\
&\le \tau\bar{c}\, p^k, && \text{(7h)}
\end{aligned}$$
where in (7b) we used the translational invariance property of coherent risk measures and defined $\tilde{\rho} = \rho_0 \circ \cdots \circ \rho_{\tau(k+1)-1}$. Since any finite composition of coherent risk measures is a coherent risk measure [42], $\tilde{\rho}$ is also a coherent risk measure. Moreover, since the immediate cost function $c$ is upper-bounded, the monotonicity property of the coherent risk measure $\tilde{\rho}$ yields (7c). Equality (7d) is derived from Proposition 1. Inequality (7e) is obtained by defining $q^* = \arg\sup_{q \in \mathcal{Q}} \langle \tau\bar{c}, q \rangle_{\mathcal{Q}}$. In inequality (7g), we used Hölder's inequality, and finally we used (6) to obtain the last inequality. Thus, the risk-averse total cost $J(s, \pi)$, $s \in \mathcal{S}$, exists and is finite because, given Assumption 1, we have
$$|J(s_0, \pi)| = \lim_{T \to \infty} \rho_0 \circ \cdots \circ \rho_{\tau-1} \circ \cdots \circ \rho_T(c_0 + \cdots + c_T) \le \sum_{k=0}^{\infty} \rho_0\big(\cdots \rho_{\tau(k+1)-1}(c_{\tau k} + \cdots + c_{\tau(k+1)-1}) \cdots \big) \le \sum_{k=0}^{\infty} \tau\bar{c}\, p^k = \frac{\tau\bar{c}}{1-p}, \tag{8}$$
where in the first equality above we used the translational invariance property and in the first inequality we used the sub-additivity property of coherent risk measures. Hence, $J(s_0, \pi)$ is bounded for all $\pi$.

^1 An important class of SSP planning problems is concerned with minimum-time reachability. Indeed, our formulation also encapsulates minimum-time problems, in which, for MDP $\mathcal{M}$, we have $c(s) = 1$ for all $s \in \mathcal{S} \setminus \{s_g\}$.
^2 If, instead of one goal state $s_g$, we were interested in a set of goal states $\mathcal{G} \subset \mathcal{S}$, it suffices to define $\tau = \inf\{t \mid \mathbb{P}(s_t \in \mathcal{G} \mid s_0 \in \mathcal{S}\setminus\mathcal{G}, \pi) > 0\}$. Then all of the paper's derivations apply. For simplicity of presentation, we present the results for a single goal state.
^3 Note that Problem 1 is ill-posed in general. For example, if the induced Markov chain for an admissible policy is periodic, then the limit in (5) may not exist. This is in contrast to risk-averse discounted infinite-horizon MDPs [3], for which we only require non-negativity and boundedness of the immediate costs.
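To make the geometric bound in (8) concrete, the short sketch below (illustrative only; the transition sampler, policy, horizon $\tau$, and cost bound $\bar{c}$ are hypothetical placeholders, not objects from the paper) estimates $p$ by Monte Carlo rollouts and then evaluates $\tau\bar{c}/(1-p)$.

```python
import numpy as np

# Illustrative check of Assumption 1 / Proposition 2: estimate
# p = max_s P(s_tau != s_g | s_0 = s, policy) by sampling, then bound
# the total risk-averse cost by tau * c_bar / (1 - p).
def estimate_p(step, policy, states, s_goal, tau, n_rollouts=10_000, seed=0):
    rng = np.random.default_rng(seed)
    worst = 0.0
    for s0 in states:
        misses = 0
        for _ in range(n_rollouts):
            s = s0
            for _ in range(tau):
                s = step(s, policy(s), rng)   # sample s' ~ T(. | s, policy(s))
            misses += (s != s_goal)
        worst = max(worst, misses / n_rollouts)
    return worst

# If p < 1, Proposition 2 gives the bound:  bound = tau * c_bar / (1.0 - p)
```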
IV. RISK-AVERSE SSP PLANNING

This section presents the paper's main result, which includes a special Bellman's equation for finding the risk value functions for Problem 1. Furthermore, assuming that the coherent risk measures satisfy a Markovian property, we show that the optimal risk-averse policies are stationary and Markovian.

To begin with, note that at any time $t$, the value of $\rho_t$ is $\mathcal{F}_t$-measurable and is allowed to depend on the entire history of the process $\{s_0, s_1, \ldots\}$, so we cannot expect to obtain a Markov optimal policy [35]. In order to obtain Markov optimal policies for Problem 1, we need the following property [39, Section 4] of risk measures.

Definition 5 (Markov Risk Measure [25]): A one-step conditional risk measure $\rho_t : \mathcal{C}_{t+1} \to \mathcal{C}_t$ is a Markov risk measure with respect to MDP $\mathcal{M}$ if there exists a risk transition mapping $\sigma_t : \mathcal{L}_m(\mathcal{S}, 2^{\mathcal{S}}, \mathbb{P}) \times \mathcal{S} \times \mathcal{M} \to \mathbb{R}$ such that, for all $v \in \mathcal{L}_m(\mathcal{S}, 2^{\mathcal{S}}, \mathbb{P})$ and $\alpha_t \in \pi(s_t)$, we have
$$\rho_t(v(s_{t+1})) = \sigma_t\big(v(s_{t+1}), s_t, T(s_{t+1} \mid s_t, \alpha_t)\big). \tag{9}$$
In fact, if $\rho_t$ is a coherent risk measure, $\sigma_t$ also satisfies the properties of a coherent risk measure (Definition 4).

Assumption 2: The one-step coherent risk measure $\rho_t$ is a Markov risk measure.

We can now present the main result of the paper, a form of Bellman's equations for solving the risk-averse SSP problem.

Theorem 1: Consider an MDP $\mathcal{M}$ as described in Definition 1 and let Assumptions 1 and 2 hold. Then the following statements are true for the risk-averse SSP problem:

(i) Given a (non-negative) initial condition $J^0(s)$, $s \in \mathcal{S}$, the sequence generated by the recursive formula (dynamic programming)
$$J^{k+1}(s) = \min_{\alpha \in Act}\Big( c(s, \alpha) + \sigma\big\{J^k(s'), s, T(s' \mid s, \alpha)\big\} \Big), \quad \forall s \in \mathcal{S}, \tag{10}$$
converges to the optimal risk value function $J^*(s)$, $s \in \mathcal{S}$.

(ii) The optimal risk value functions $J^*(s)$, $s \in \mathcal{S}$, are the unique solution to the Bellman equation
$$J^*(s) = \min_{\alpha \in Act}\Big( c(s, \alpha) + \sigma\big\{J^*(s'), s, T(s' \mid s, \alpha)\big\} \Big), \quad \forall s \in \mathcal{S}. \tag{11}$$

(iii) For any stationary Markovian policy $\mu$, the risk-averse value functions $J(s, \mu(s))$, $s \in \mathcal{S}$, are the unique solutions to
$$J(s, \mu) = c(s, \mu(s)) + \sigma\big\{J(s', \mu), s, T(s' \mid s, \mu(s))\big\}, \quad \forall s \in \mathcal{S}. \tag{12}$$

(iv) A stationary Markovian policy $\mu$ is optimal if and only if it attains the minimum in Bellman's equation (11).

Proof: (i) For every positive integer $M$, an initial state $s_0$, and policy $\pi$, we can split the nested risk cost (5), where $\rho_{t,T}$ is defined in (2), at time index $\tau M$ and obtain
$$J(s_0, \pi) = \rho_0\Big(c_0 + \cdots + \rho_{\tau M-1}\big(c_{\tau M-1} + \lim_{T \to \infty} \rho_{\tau M}(c_{\tau M} + \cdots + \rho_T(c_T) \cdots)\big) \cdots \Big), \tag{13}$$
where we used the fact that the one-step coherent risk measures $\rho_t$ are continuous [40, Corollary 3.1], and hence the limiting process and the measure $\rho_t$ commute. The limit term is indeed the total risk cost starting at $s_{\tau M}$, i.e., $J(s_{\tau M}, \pi)$. Next, we show that, under Assumption 1, this term remains bounded. From (8) in the proof of Proposition 2, we have
$$|J(s_{\tau M}, \pi)| = \lim_{T \to \infty} \rho_{\tau M}\big(c_{\tau M} + \cdots + \rho_T(c_T) \cdots\big) \le \sum_{k=M}^{\infty} \rho_{\tau k}\big(c_{\tau k} + \cdots + \rho_{\tau(k+1)-1}(c_{\tau(k+1)-1}) \cdots\big) \le \sum_{k=M}^{\infty} \tau\bar{c}\, p^k = \frac{\tau\bar{c}\, p^M}{1-p}. \tag{14}$$
Substituting the above bound in (13) gives
$$J(s_0, \pi) \le \rho_0\Big(c_0 + \cdots + \rho_{\tau M-1}\big(c_{\tau M-1} + \tfrac{\tau\bar{c}\, p^M}{1-p}\big) \cdots \Big) = \rho_0\big(c_0 + \cdots + \rho_{\tau M-1}(c_{\tau M-1}) \cdots\big) + \frac{\tau\bar{c}\, p^M}{1-p}, \tag{15}$$
where the last equality holds via the translational invariance property of the one-step risk measures and the fact that $\tau\bar{c}\, p^M / (1-p)$ is constant. Similarly, following (14), we can also obtain a lower bound on $J(s_0, \pi)$ as follows:
$$J(s_0, \pi) \ge \rho_0\Big(c_0 + \cdots + \rho_{\tau M-1}\big(c_{\tau M-1} - \tfrac{\tau\bar{c}\, p^M}{1-p}\big) \cdots \Big) = \rho_0\big(c_0 + \cdots + \rho_{\tau M-1}(c_{\tau M-1}) \cdots\big) - \frac{\tau\bar{c}\, p^M}{1-p}. \tag{16}$$
Thus, from (15) and (16), we obtain
$$J(s_0, \pi) - \frac{\tau\bar{c}\, p^M}{1-p} \le \rho_0\big(c_0 + \cdots + \rho_{\tau M-1}(c_{\tau M-1}) \cdots\big) \le J(s_0, \pi) + \frac{\tau\bar{c}\, p^M}{1-p}. \tag{17}$$
Furthermore, Assumption 1 implies that $J^0(s_g) = 0$. If we consider $J^0$ as a terminal risk value function, we can obtain
$$\begin{aligned}
|\rho_{\tau M}(J^0(s_{\tau M}))| &= \Big| \sup_{q \in \mathcal{Q}} \langle J^0(s_{\tau M}), q \rangle_{\mathcal{Q}} \Big| && \text{(18a)}\\
&= \big| \langle J^0(s_{\tau M}), q^* \rangle_{\mathcal{Q}} \big| && \text{(18b)}\\
&= \Big| \sum_{s \in \mathcal{S}} \mathbb{P}(s_{\tau M} = s \mid s_0, \pi)\, q^*(s, \pi)\, J^0(s) \Big| && \text{(18c)}\\
&\le \sum_{s \in \mathcal{S}} \mathbb{P}(s_{\tau M} = s \mid s_0, \pi)\, q^*(s, \pi) \times \max_{s \in \mathcal{S}} |J^0(s)| && \text{(18d)}\\
&\le \sum_{s \in \mathcal{S}} \mathbb{P}(s_{\tau M} = s \mid s_0, \pi) \times \max_{s \in \mathcal{S}} |J^0(s)| && \text{(18e)}\\
&\le p^M \max_{s \in \mathcal{S}} |J^0(s)|, && \text{(18f)}
\end{aligned}$$
where, similar to the derivation in (7), in (18a) we used Proposition 1. Defining $q^* = \arg\sup_{q \in \mathcal{Q}} \langle J^0(s_{\tau M}), q \rangle_{\mathcal{Q}}$, we obtained (18b), and the last inequality is based on the fact that the probability of $s_{\tau M} \ne s_g$ is less than or equal to $p^M$, as in (6). Combining inequalities (17) and (18), we have
$$-p^M \max_{s \in \mathcal{S}} |J^0(s)| + J(s_0, \pi) - \frac{\tau\bar{c}\, p^M}{1-p} \le \rho_0\big(c_0 + \cdots + \rho_{\tau M-1}(c_{\tau M-1} + \rho_{\tau M}(J^0(s_{\tau M}))) \cdots\big) \le p^M \max_{s \in \mathcal{S}} |J^0(s)| + J(s_0, \pi) + \frac{\tau\bar{c}\, p^M}{1-p}. \tag{19}$$
Remark that the middle term in the inequality above is the $\tau M$-stage risk-averse cost of the policy $\pi$ with the terminal cost $J^0(s_{\tau M})$. Given Assumption 2, from [39, Theorem 2], the minimum of this cost is generated by the dynamic programming recursion (10) after $\tau M$ iterations. Taking the minimum over the policy $\pi$ on every side of (19) yields
$$-p^M \max_{s \in \mathcal{S}} |J^0(s)| + J^*(s_0) - \frac{\tau\bar{c}\, p^M}{1-p} \le J^{\tau M}(s_0) \le p^M \max_{s \in \mathcal{S}} |J^0(s)| + J^*(s_0) + \frac{\tau\bar{c}\, p^M}{1-p}, \tag{20}$$
for all $s_0$ and $M$. Finally, let $k = \tau M$. Since the above inequality holds for all $M$, taking the limit $M \to \infty$ gives
$$\lim_{M \to \infty} J^{\tau M}(s_0) = \lim_{k \to \infty} J^k(s_0) = J^*(s_0), \quad \forall s_0 \in \mathcal{S}. \tag{21}$$

(ii) Taking the limit $k \to \infty$ of both sides of (10) yields $\lim_{k \to \infty} J^{k+1}(s) = \lim_{k \to \infty} \min_{\alpha \in Act}\big(c(s, \alpha) + \sigma\{J^k(s'), s, T(s' \mid s, \alpha)\}\big)$, $\forall s \in \mathcal{S}$. Equality (21) in the proof of Part (i) implies that
$$J^*(s) = \lim_{k \to \infty} \min_{\alpha \in Act}\Big( c(s, \alpha) + \sigma\big\{J^k(s'), s, T(s' \mid s, \alpha)\big\} \Big), \quad \forall s \in \mathcal{S}.$$
Since the limit and the minimization commute over a finite number of alternatives, we have
$$J^*(s) = \min_{\alpha \in Act}\Big( c(s, \alpha) + \lim_{k \to \infty} \sigma\big\{J^k(s'), s, T(s' \mid s, \alpha)\big\} \Big), \quad \forall s \in \mathcal{S}.$$
Finally, because $\sigma$ is continuous [40, Corollary 3.1], the limit and $\sigma$ commute as well, and from (21) we obtain $J^*(s) = \min_{\alpha \in Act}\big(c(s, \alpha) + \sigma\{J^*(s'), s, T(s' \mid s, \alpha)\}\big)$, $\forall s \in \mathcal{S}$. To show uniqueness, note that, for any $J(s)$, $s \in \mathcal{S}$, satisfying the above equation, the dynamic programming recursion (10) starting at $J(s)$, $s \in \mathcal{S}$, replicates $J(s)$, $s \in \mathcal{S}$, and from Part (i) we infer $J(s) = J^*(s)$ for all $s \in \mathcal{S}$.

(iii) Given a stationary Markovian policy $\mu$, at every state $s$ we have $\alpha = \mu(s)$; hence, from Item (i), we have
$$J^{k+1}(s, \mu) = \min_{\alpha \in \{\mu(s)\}}\Big( c(s, \alpha) + \sigma\big\{J^k(s', \mu), s, T(s' \mid s, \alpha)\big\} \Big), \quad \forall s \in \mathcal{S}.$$
Since the minimum is only over one element, we have $J^{k+1}(s, \mu) = c(s, \mu(s)) + \sigma\{J^k(s', \mu), s, T(s' \mid s, \mu(s))\}$, $\forall s \in \mathcal{S}$, which, as $k \to \infty$, converges uniquely (see Item (ii)) to $J(s, \mu)$.

(iv) The stationary policy $\mu$ attains the minimum in (11) if
$$J^*(s) = \min_{\alpha \in \{\mu(s)\}}\Big( c(s, \alpha) + \sigma\big\{J^*(s'), s, T(s' \mid s, \alpha)\big\} \Big) = c(s, \mu(s)) + \sigma\big\{J^*(s'), s, T(s' \mid s, \mu(s))\big\}, \quad \forall s \in \mathcal{S}.$$
Then Part (iii) and the above equation imply $J(s, \mu) = J^*(s)$ for all $s \in \mathcal{S}$. Conversely, if $J(s, \mu) = J^*(s)$ for all $s \in \mathcal{S}$, then Items (ii) and (iii) imply that $\mu$ is optimal.

At this point, we should highlight that, in the case of conditional expectation as the coherent risk measure ($\rho_t = \mathbb{E}$ and $\sigma\{J(s'), s, T(s' \mid s, \alpha)\} = \sum_{s' \in \mathcal{S}} T(s' \mid s, \alpha) J(s')$), Theorem 1 simplifies to [11, Proposition 5.2.1] for the risk-neutral SSP problem. In fact, Theorem 1 is a generalization of [11, Proposition 5.2.1] to the risk-averse case.

Recursion (10) represents value iteration (VI) for finding the risk value functions. In general, value iteration converges only in the limit of infinitely many iterations ($k \to \infty$), but it can be shown that, for a stationary policy $\mu$ resulting in an acyclic induced Markov chain, the VI algorithm converges in $|\mathcal{S}|$ steps (see the derivation for the risk-neutral SSP problem in [14]).

Alternatively, one can design risk-averse policies using policy iteration (PI). That is, starting with an initial policy $\mu^0$, we can carry out policy evaluation via (12), followed by a policy improvement step, which calculates an improved policy $\mu^{k+1}$ as $\mu^{k+1}(s) = \arg\min_{\alpha \in Act}\big(c(s, \alpha) + \sigma\{J^{\mu^k}(s'), s, T(s' \mid s, \alpha)\}\big)$, $\forall s \in \mathcal{S}$. This process is repeated until no further improvement is found in terms of the risk value functions: $J^{\mu^{k+1}}(s) = J^{\mu^k}(s)$ for all $s \in \mathcal{S}$.

However, we do not pursue VI or PI approaches further in this work. The main obstacle to using VI and PI is that equations (10)-(12) are nonlinear (and non-smooth) in the risk value functions for a general coherent risk measure. Solving the nonlinear equations (10)-(12) for the risk value functions may require a significant computational burden (see the specialized non-smooth Newton method in [39] for solving similar nonlinear VIs). Instead, the next section presents a computational method based on difference convex programs (DCPs).
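Although the paper does not pursue VI further, a compact sketch helps to see how recursion (10) would be evaluated. The arrays and the choice of one-step mapping below are ours: `sigma` stands for the risk transition mapping $\sigma$, instantiated here (as an assumption, for illustration) with CVaR of the next-state value under the transition distribution.

```python
import numpy as np

# Sketch of the risk-averse value iteration recursion (10).
# T: (|S|, |Act|, |S|) transition tensor, c: (|S|, |Act|) costs, goal: index of s_g.
def cvar_discrete(values, probs, eps):
    """CVaR_eps of a discrete random variable: average of the worst eps probability mass."""
    order = np.argsort(values)[::-1]                 # worst outcomes first
    v, p = values[order], probs[order]
    cum_before = np.concatenate(([0.0], np.cumsum(p)[:-1]))
    tail = np.clip(eps - cum_before, 0.0, None)      # remaining tail mass to allocate
    weights = np.minimum(tail, p)
    return float(weights @ v) / eps

def risk_averse_vi(T, c, goal, eps=0.3, iters=200):
    n_s, n_a, _ = T.shape
    J = np.zeros(n_s)                                # non-negative initial condition J^0
    for _ in range(iters):
        Q = np.empty((n_s, n_a))
        for s in range(n_s):
            for a in range(n_a):
                Q[s, a] = c[s, a] + cvar_discrete(J, T[s, a], eps)   # c + sigma{J}
        J = Q.min(axis=1)
        J[goal] = 0.0                                # goal state is cost-free and absorbing
    return J, Q.argmin(axis=1)                       # risk value functions, greedy policy
```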
V. A DCP COMPUTATIONAL APPROACH

In this section, we propose a computational method based on DCPs to find the risk value functions and, subsequently, policies that minimize the accrued dynamic risk in SSP planning. Before stating the DCP formulation, we show that the Bellman operator in (10) is non-decreasing. Let
$$\mathfrak{D}_\pi J := c(s, \pi(s)) + \sigma\big\{J(s'), s, T(s' \mid s, \pi(s))\big\}, \quad \forall s \in \mathcal{S},$$
$$\mathfrak{D} J := \min_{\alpha \in Act}\Big( c(s, \alpha) + \sigma\big\{J(s'), s, T(s' \mid s, \alpha)\big\} \Big), \quad \forall s \in \mathcal{S}.$$

Lemma 1: Let Assumptions 1 and 2 hold. For all $v, w \in \mathcal{L}_m(\mathcal{S}, 2^{\mathcal{S}}, \mathbb{P})$, if $v \le w$, then $\mathfrak{D}_\pi v \le \mathfrak{D}_\pi w$ and $\mathfrak{D} v \le \mathfrak{D} w$.

Proof: Since $\rho$ is a Markov risk measure, we have $\rho(v) = \sigma(v, s, T)$ for all $v \in \mathcal{L}_m(\mathcal{S}, 2^{\mathcal{S}}, \mathbb{P})$. Furthermore, since $\rho$ is a coherent risk measure, from Proposition 1 we know that (3) holds. Taking the inner product of both sides of $v \le w$ with a probability measure $q \in \mathcal{Q} \subset \mathcal{P}$ and then the supremum over $\mathcal{Q}$ yields $\sup_{q \in \mathcal{Q}} \langle v, q \rangle_{\mathcal{Q}} \le \sup_{q \in \mathcal{Q}} \langle w, q \rangle_{\mathcal{Q}}$. From Proposition 1, we have $\sigma(v, s, T) = \sup_{q \in \mathcal{Q}} \langle v, q \rangle_{\mathcal{Q}}$ and $\sigma(w, s, T) = \sup_{q \in \mathcal{Q}} \langle w, q \rangle_{\mathcal{Q}}$. Therefore, $\sigma(v, s, T) \le \sigma(w, s, T)$. Adding $c$ to both sides of this inequality gives $\mathfrak{D}_\pi v \le \mathfrak{D}_\pi w$. Taking the minimum with respect to $\alpha \in Act$ on both sides of $\mathfrak{D}_\pi v \le \mathfrak{D}_\pi w$ does not change the inequality and gives $\mathfrak{D} v \le \mathfrak{D} w$.

We are now ready to state an optimization formulation of the Bellman equation (10).

Proposition 3: Consider an MDP $\mathcal{M}$ as described in Definition 1 and let the assumptions of Theorem 1 hold. Then the optimal value functions $J^*(s)$, $s \in \mathcal{S}$, are the solutions to the following optimization problem:
$$\sup_{J} \; \sum_{s \in \mathcal{S}} J(s) \quad \text{subject to} \quad J(s) \le c(s, \alpha) + \sigma\big\{J(s'), s, T(s' \mid s, \alpha)\big\}, \quad \forall (s, \alpha) \in \mathcal{S} \times Act. \tag{22}$$

Proof: From Lemma 1, we infer that $\mathfrak{D}_\pi$ and $\mathfrak{D}$ are non-decreasing; i.e., for $v \le w$, we have $\mathfrak{D}_\pi v \le \mathfrak{D}_\pi w$ and $\mathfrak{D} v \le \mathfrak{D} w$. Therefore, if $J \le \mathfrak{D} J$, then $\mathfrak{D} J \le \mathfrak{D}(\mathfrak{D} J)$. By repeated application of $\mathfrak{D}$, we obtain $J \le \mathfrak{D} J \le \mathfrak{D}^2 J \le \cdots \le \mathfrak{D}^\infty J = J^*$. Any feasible solution to (22) must satisfy $J \le \mathfrak{D} J$ and hence must satisfy $J \le J^*$. Thus, $J^*$ is the largest $J$ that satisfies the constraint in optimization (22). Hence, the optimal solution to (22) is the same as that of (10).

Once a solution $J^*$ to optimization problem (22) is found, we can find a corresponding stationary Markovian policy as
$$\mu^*(s) = \arg\min_{\alpha \in Act}\Big( c(s, \alpha) + \sigma\big\{J^*(s'), s, T(s' \mid s, \alpha)\big\} \Big), \quad \forall s \in \mathcal{S}.$$

A. DCPs for Risk-Averse SSP Planning

Assumption 2 implies that each $\rho$ is a coherent, Markov risk measure. Hence, the mapping $v \mapsto \sigma(v, \cdot, \cdot)$ is convex (because $\sigma$ is also a coherent risk measure). We next show that optimization problem (22) is in fact a DCP. Let $f_0 = 0$, $g_0(J) = \sum_{s \in \mathcal{S}} J(s)$, $f_1(J) = J(s)$, $g_1(s, \alpha) = c(s, \alpha)$, and $g_2(J) = \sigma(J, \cdot, \cdot)$. Note that $f_0$ and $g_1$ are convex (constant) functions and that $g_0$, $f_1$, and $g_2$ are convex functions in $J$. Then, (22) can be expressed as the minimization
$$\inf_{J} \; f_0 - g_0(J) \quad \text{subject to} \quad f_1(J) - g_1(s, \alpha) - g_2(J) \le 0, \quad \forall s, \alpha. \tag{23}$$
The above optimization problem is a standard DCP [28]. Many applications require solving DCPs, such as feature selection in machine learning [30] and inverse covariance estimation in statistics [47]. DCPs can be solved globally [28], e.g., using branch-and-bound algorithms [29]. Yet, a locally optimal solution can be obtained more efficiently using techniques from nonlinear optimization [14]. In particular, in this work, we use a variant of the convex-concave procedure [31], [43], wherein the concave terms are replaced by a convex upper bound and the resulting problem is solved. In fact, the disciplined convex-concave programming (DCCP) [43] technique linearizes DCP problems into a (disciplined) convex program (carried out automatically via the DCCP Python package [43]), which is then converted into an equivalent cone program by replacing each function with its graph implementation. The cone program can then be solved readily by available convex programming solvers, such as CVXPY [21]. In the Appendix, we present the specific DCPs required for risk-averse SSP planning with the CVaR and EVaR risk measures used in our numerical experiments in the next section. Note that, for the risk-neutral conditional expectation measure, optimization (23) becomes a linear program, since $\sigma\{J(s'), s, T(s' \mid s, \alpha)\} = \sum_{s' \in \mathcal{S}} T(s' \mid s, \alpha) J(s')$ is linear in the decision variables $J$.
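A minimal sketch of how (22)-(23) could be posed with CVXPY and the DCCP add-on for the CVaR case is given below, following the form of $g_2$ stated in the Appendix. The function and variable names, the per-(s, α) threshold ζ, and the pinning of the goal value are our illustrative choices, not the paper's released code; treat it as a template under those assumptions rather than a definitive implementation.

```python
import cvxpy as cp
import dccp   # registers the "dccp" solve method on cvxpy problems
import numpy as np

# Sketch of the CVaR instance of the DCP (22)-(23): maximize sum_s J(s)
# subject to J(s) <= c(s,a) + zeta + (1/eps) * E_{s'~T}[ (J(s') - zeta)_+ ].
def risk_averse_ssp_cvar(T, c, goal, eps):
    n_s, n_a, _ = T.shape
    J = cp.Variable(n_s)
    zeta = cp.Variable((n_s, n_a))                    # one CVaR threshold per (s, a)
    constraints = [J[goal] == 0]                      # cost-free absorbing goal (our choice)
    for s in range(n_s):
        for a in range(n_a):
            tail = cp.sum(cp.multiply(T[s, a], cp.pos(J - zeta[s, a])))
            rhs = c[s, a] + zeta[s, a] + tail / eps   # c + sigma_CVaR{J}
            constraints.append(J[s] <= rhs)           # difference-of-convex constraint
    prob = cp.Problem(cp.Maximize(cp.sum(J)), constraints)
    prob.solve(method="dccp")                         # convex-concave procedure
    return J.value
```

For the risk-neutral expectation, the same template with `rhs = c[s, a] + T[s, a] @ J` is an ordinary linear program and can be solved without DCCP.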
VI. NUMERICAL EXPERIMENTS

In this section, we evaluate the proposed method for risk-averse SSP planning with a rover navigation MDP (also used in [2], [3]). We consider the traditional total expectation as well as CVaR and EVaR. The experiments were carried out on a MacBook Pro with a 2.8 GHz Quad-Core Intel Core i5 and 16 GB of RAM. The resultant linear programs and DCPs were solved using CVXPY [21] with the DCCP [43] add-on.

An agent (e.g., a rover) must autonomously navigate a 2-dimensional terrain map (e.g., the Mars surface) represented by an $M \times N$ grid with $0.25MN$ obstacles. Thus, the state space is given by $\mathcal{S} = \{s_i \mid i = x + y,\; x \in \{1, \ldots, M\},\; y \in \{1, \ldots, N\}\}$, with $x = 1$, $y = 1$ being the leftmost bottom cell. Since the rover can move from cell to cell, its action set is $Act = \{E, W, N, S\}$. The actions move the robot from its current cell to a neighboring cell, with some uncertainty. The state transition probabilities for various cell types are shown for actions E (East) and N (North) in Figure 2. Other actions lead to similar transitions. Hitting an obstacle incurs an immediate cost of 5, while the goal grid region has zero immediate cost. Any other grid cell has a cost of 1 to represent fuel consumption.

Fig. 2. Grid-world illustration for the rover navigation example. Blue cells denote the obstacles and the yellow cell denotes the goal.

Once the policies are calculated, as a robustness test similar to [20], [2], [3], we included a set of single-grid obstacles that are perturbed in a random direction to one of the neighboring grid cells with probability 0.2 to represent uncertainty in the terrain map. For each risk measure, we run 100 Monte Carlo simulations with the calculated policies and count the number of runs ending in a collision.

In the experiments, we considered three grid-world sizes of $4 \times 5$, $10 \times 10$, and $10 \times 20$, corresponding to 20, 100, and 200 states, respectively. We allocated 2, 4, and 8 uncertain (single-cell) obstacles for the $4 \times 5$, $10 \times 10$, and $10 \times 20$ grids, respectively. In each case, we solve DCP (22) (a linear program in the case of total expectation) with $|\mathcal{S}||Act| = MN \times 4 = 4MN$ constraints and $MN + 1$ variables (the risk value functions $J$ and $\zeta$ for CVaR and EVaR, as discussed in the Appendix). In these experiments, we set the confidence levels to $\varepsilon = 0.3$ (more risk-averse) and $\varepsilon = 0.7$ (less risk-averse) for both the CVaR and EVaR coherent risk measures. The initial condition was chosen as $s_0 = s_1$, i.e., the agent starts at the leftmost grid cell at the bottom, and the goal state was selected as $s_g = s_{MN}$, i.e., the rightmost grid cell at the top.

A summary of our numerical experiments is provided in Table I.

| $(M \times N)_\rho$ | $J^*(s_0)$ | Total Time [s] | # U.O. | F.R. |
|---|---|---|---|---|
| $(4 \times 5)_{\mathbb{E}}$ | 13.25 | 0.56 | 2 | 39% |
| $(10 \times 10)_{\mathbb{E}}$ | 27.31 | 1.04 | 4 | 46% |
| $(10 \times 20)_{\mathbb{E}}$ | 38.35 | 1.30 | 8 | 58% |
| $(4 \times 5)_{\mathrm{CVaR}_{0.7}}$ | 18.76 | 0.58 | 2 | 14% |
| $(10 \times 10)_{\mathrm{CVaR}_{0.7}}$ | 35.72 | 1.12 | 4 | 19% |
| $(10 \times 20)_{\mathrm{CVaR}_{0.7}}$ | 47.36 | 1.36 | 8 | 21% |
| $(4 \times 5)_{\mathrm{CVaR}_{0.3}}$ | 25.69 | 0.57 | 2 | 10% |
| $(10 \times 10)_{\mathrm{CVaR}_{0.3}}$ | 43.86 | 1.16 | 4 | 13% |
| $(10 \times 20)_{\mathrm{CVaR}_{0.3}}$ | 49.03 | 1.34 | 8 | 15% |
| $(4 \times 5)_{\mathrm{EVaR}_{0.7}}$ | 26.67 | 1.83 | 2 | 9% |
| $(10 \times 10)_{\mathrm{EVaR}_{0.7}}$ | 41.31 | 2.02 | 4 | 11% |
| $(10 \times 20)_{\mathrm{EVaR}_{0.7}}$ | 50.79 | 2.64 | 8 | 17% |
| $(4 \times 5)_{\mathrm{EVaR}_{0.3}}$ | 29.05 | 1.79 | 2 | 7% |
| $(10 \times 10)_{\mathrm{EVaR}_{0.3}}$ | 48.73 | 2.01 | 4 | 10% |
| $(10 \times 20)_{\mathrm{EVaR}_{0.3}}$ | 61.28 | 2.78 | 8 | 12% |

TABLE I. Comparison between total expectation, CVaR, and EVaR risk measures. $(M \times N)_\rho$ denotes the grid-world of size $M \times N$ and one-step coherent risk measure $\rho$. Total Time denotes the time taken by the CVXPY solver to solve the associated linear programs or DCPs. # U.O. denotes the number of single-grid uncertain obstacles used for the robustness test. F.R. denotes the failure rate out of 100 Monte Carlo simulations.

Note that the computed values of Problem 1 satisfy $\mathbb{E}(c) \le \mathrm{CVaR}_\varepsilon(c) \le \mathrm{EVaR}_\varepsilon(c)$. This is in accordance with the theory that EVaR is a more conservative coherent risk measure than CVaR [5] (see also our work on EVaR-based model predictive control for dynamically moving obstacles [22]). Furthermore, the total accrued risk cost is higher for $\varepsilon = 0.3$, since this leads to more risk-averse policies. For the total expectation coherent risk measure, the calculations took significantly less time, since they result from solving a set of linear programs. For CVaR and EVaR, a set of DCPs was solved. The EVaR calculations were the most computationally involved, since they require solving exponential cone programs. Note that these calculations can be carried out offline for policy synthesis, and the policy can then be applied for risk-averse robot path planning.

The table also outlines the failure ratios of each risk measure. In this case, EVaR outperformed both CVaR and total expectation in terms of robustness, which is consistent with the fact that EVaR is a more conservative risk measure. Lower failure/collision rates were observed for $\varepsilon = 0.3$, which corresponds to more risk-averse policies. In addition, these results suggest that, although total expectation can be used as a measure of performance over a large number of Monte Carlo simulations, it may not be practical for real-world planning-under-uncertainty scenarios. CVaR and EVaR appear to be more suitable performance metrics for shortest path planning under uncertainty.

VII. CONCLUSIONS

We proposed a method based on dynamic programming for designing risk-averse policies for the SSP problem. We presented a computational approach in terms of difference convex programs for finding the associated risk value functions and hence the risk-averse policies. Future research will extend to risk-averse MDPs with average costs and risk-averse MDPs with linear temporal logic specifications, where the former problem is cast as a special case of the risk-averse SSP problem. In this work, we assumed the states are fully observable; in the future, we will also study SSP problems with partial state observation [4], [2].

REFERENCES

[1] A. Agrawal, R. Verschueren, S. Diamond, and S. Boyd. A rewriting system for convex optimization problems. Journal of Control and Decision, 5(1):42–60, 2018.
[2] M. Ahmadi, M. Ono, M. D. Ingham, R. M. Murray, and A. D. Ames. Risk-averse planning under uncertainty. In 2020 American Control Conference (ACC), pages 3305–3312. IEEE, 2020.
[3] M. Ahmadi, U. Rosolia, M. D. Ingham, R. M. Murray, and A. D. Ames. Constrained risk-averse Markov decision processes. In The 35th AAAI Conference on Artificial Intelligence (AAAI-21), 2021.
[4] M. Ahmadi, R. Sharan, and J. W. Burdick. Stochastic finite state control of POMDPs with LTL specifications. arXiv preprint arXiv:2001.07679, 2020.
[5] A. Ahmadi-Javid. Entropic value-at-risk: A new coherent risk measure. Journal of Optimization Theory and Applications, 155(3):1105–1123, 2012.
[6] A. Ahmadi-Javid and M. Fallah-Tafti. Portfolio optimization with entropic value-at-risk. European Journal of Operational Research, 279(1):225–241, 2019.
[7] A. Ahmadi-Javid and A. Pichler. An analytical study of norms and Banach spaces induced by the entropic value-at-risk. Mathematics and Financial Economics, 11(4):527–550, 2017.
[8] P. Artzner, F. Delbaen, J. Eber, and D. Heath. Coherent measures of risk. Mathematical Finance, 9(3):203–228, 1999.
[9] N. Bäuerle and J. Ott. Markov decision processes with average-value-at-risk criteria. Mathematical Methods of Operations Research, 74(3):361–379, 2011.
[10] R. Bellman. On a routing problem. Quarterly of Applied Mathematics, 16(1):87–90, 1958.
[11] D. Bertsekas. Dynamic Programming and Optimal Control, vol. 1. Athena Scientific, Nashua, NH, 2017.
[12] D. Bertsekas and H. Yu. Stochastic shortest path problems under weak conditions. Lab. for Information and Decision Systems Report LIDS-P-2909, MIT, 2013.
[13] D. P. Bertsekas and J. N. Tsitsiklis. An analysis of stochastic shortest path problems. Mathematics of Operations Research, 16(3):580–595, 1991.
[14] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
[15] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[16] S. Carpin, Y. Chow, and M. Pavone. Risk aversion in finite Markov decision processes using total cost criteria and average value at risk. In IEEE Int. Conf. Robotics and Automation, pages 335–342, 2016.
[17] M. Chapman, J. Lacotte, A. Tamar, D. Lee, K. M. Smith, V. Cheng, J. Fisac, S. Jha, M. Pavone, and C. Tomlin. A risk-sensitive finite-time reachability approach for safety of stochastic dynamic systems. In 2019 American Control Conference (ACC), pages 2958–2963. IEEE, 2019.
[18] D. Z. Chen. Developing algorithms and software for geometric path planning problems. ACM Computing Surveys, 28(4es):18–es, 1996.
[19] Y. Chow and M. Ghavamzadeh. Algorithms for CVaR optimization in MDPs. In Advances in Neural Information Processing Systems, pages 3509–3517, 2014.
[20] Y. Chow, A. Tamar, S. Mannor, and M. Pavone. Risk-sensitive and robust decision-making: a CVaR optimization approach. In Advances in Neural Information Processing Systems, pages 1522–1530, 2015.
[21] S. Diamond and S. Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1–5, 2016.
[22] A. Dixit, M. Ahmadi, and J. W. Burdick. Risk-sensitive motion planning using entropic value-at-risk. In European Control Conference, 2021.
[23] K. Dvijotham, M. Fazel, and E. Todorov. Convex risk averse control design. In IEEE Conf. Decision and Control, pages 4020–4025, 2014.
[24] D. D. Fan, K. Otsu, Y. Kubo, A. Dixit, J. Burdick, and A. A. Agha-Mohammadi. STEP: Stochastic traversability evaluation and planning for safe off-road navigation. arXiv:2103.02828, 2021.
[25] J. Fan and A. Ruszczyński. Process-based risk measures and risk-averse control of discrete-time systems. Mathematical Programming, pages 1–28, 2018.
[26] H. Föllmer and A. Schied. Stochastic Finance: An Introduction in Discrete Time. Walter de Gruyter, 2011.
[27] C. Gavriel, G. Hanasusanto, and D. Kuhn. Risk-averse shortest path problems. In 2012 IEEE 51st Conference on Decision and Control (CDC), pages 2533–2538. IEEE, 2012.
[28] R. Horst and N. V. Thoai. DC programming: overview. Journal of Optimization Theory and Applications, 103(1):1–43, 1999.
[29] E. L. Lawler and D. E. Wood. Branch-and-bound methods: A survey. Operations Research, 14(4):699–719, 1966.
[30] H. A. Le Thi, H. M. Le, T. P. Dinh, et al. A DC programming approach for feature selection in support vector machines learning. Advances in Data Analysis and Classification, 2(3):259–278, 2008.
[31] T. Lipp and S. Boyd. Variations and extension of the convex-concave procedure. Optimization and Engineering, 17(2):263–287, 2016.
[32] A. Majumdar and M. Pavone. How should a robot assess risk? Towards an axiomatic theory of risk in robotics. In Robotics Research, pages 75–84. Springer, 2020.
[33] M. Ono, M. Pavone, Y. Kuwata, and J. Balaram. Chance-constrained dynamic programming with application to risk-aware robotic space exploration. Autonomous Robots, 39(4):555–571, 2015.
[34] M. Ono, B. C. Williams, and L. Blackmore. Probabilistic planning for continuous dynamic systems under bounded risk. Journal of Artificial Intelligence Research, 46:511–577, 2013.
[35] J. T. Ott. A Markov decision model for a surveillance application and risk-sensitive Markov decision processes. 2010.
[36] L. Prashanth. Policy gradients for CVaR-constrained MDPs. In International Conference on Algorithmic Learning Theory, pages 155–169. Springer, 2014.
[37] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, New York, NY, 1st edition, 1994.
[38] R. T. Rockafellar, S. Uryasev, et al. Optimization of conditional value-at-risk. Journal of Risk, 2:21–42, 2000.
[39] A. Ruszczyński. Risk-averse dynamic programming for Markov decision processes. Mathematical Programming, 125(2):235–261, 2010.
[40] A. Ruszczyński and A. Shapiro. Optimization of convex risk functions. Mathematics of Operations Research, 31(3):433–452, 2006.
[41] P. Sanders. Fast route planning. Google Tech Talk, March 23, 2009.
[42] A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory. SIAM, 2014.
[43] X. Shen, S. Diamond, Y. Gu, and S. Boyd. Disciplined convex-concave programming. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 1009–1014. IEEE, 2016.
[44] C. E. Sigal, A. A. B. Pritsker, and J. J. Solberg. The stochastic shortest route problem. Operations Research, 28(5):1122–1129, 1980.
[45] S. Singh, Y. Chow, A. Majumdar, and M. Pavone. A framework for time-consistent, risk-sensitive model predictive control: Theory and algorithms. IEEE Transactions on Automatic Control, 2018.
[46] A. Tamar, Y. Chow, M. Ghavamzadeh, and S. Mannor. Sequential decision making with coherent risk. IEEE Transactions on Automatic Control, 62(7):3323–3338, 2016.
[47] J. Thai, T. Hunter, A. K. Akametalu, C. J. Tomlin, and A. M. Bayen. Inverse covariance estimation from data with missing values using the concave-convex procedure. In IEEE Conf. Decision and Control, pages 5736–5742, 2014.
[48] H. Xu and S. Mannor. Distributionally robust Markov decision processes. In Advances in Neural Information Processing Systems, pages 2505–2513, 2010.

APPENDIX

In this appendix, we present the specific DCPs for finding the risk value functions for the two coherent risk measures studied in our numerical experiments, namely, CVaR and EVaR.

For a given confidence level $\varepsilon \in (0, 1)$, the value-at-risk ($\mathrm{VaR}_\varepsilon$) denotes the $(1-\varepsilon)$-quantile value of the cost variable. $\mathrm{CVaR}_\varepsilon$ is the expected loss in the $(1-\varepsilon)$-tail given that the particular threshold $\mathrm{VaR}_\varepsilon$ has been crossed. $\mathrm{CVaR}_\varepsilon$ is given by
$$\rho_t(c_{t+1}) = \inf_{\zeta \in \mathbb{R}} \Big\{ \zeta + \frac{1}{\varepsilon}\, \mathbb{E}\big[(c_{t+1} - \zeta)_+ \mid \mathcal{F}_t\big] \Big\}, \tag{24}$$
where $(\cdot)_+ = \max\{\cdot, 0\}$. A value of $\varepsilon \approx 1$ corresponds to a risk-neutral policy, whereas a value of $\varepsilon \to 0$ corresponds to a rather risk-averse policy.

In fact, Theorem 1 can be applied to CVaR since it is a coherent risk measure. For MDP $\mathcal{M}$, the risk value functions can be computed by DCP (23), where $g_2(J) = \inf_{\zeta \in \mathbb{R}} \big\{ \zeta + \frac{1}{\varepsilon} \sum_{s' \in \mathcal{S}} (J(s') - \zeta)_+ \, T(s' \mid s, \alpha) \big\}$, and the infimum on the right-hand side can be absorbed into the overall infimum problem, i.e., $\inf_{J, \zeta}$. Note that $g_2(J)$ above is convex in $\zeta$ [38, Theorem 1].

Unfortunately, CVaR ignores the losses below the VaR threshold (since it is only concerned with the average of the losses in the $(1-\varepsilon)$-tail of the cost distribution). EVaR is the tightest upper bound, in the sense of the Chernoff inequality, for the value-at-risk (VaR) and CVaR, and its dual representation is associated with the relative entropy. In fact, it was shown in [7] that $\mathrm{EVaR}_\varepsilon$ and $\mathrm{CVaR}_\varepsilon$ are equal only if there are no losses ($c \to 0$) below the $\mathrm{VaR}_\varepsilon$ threshold. In addition, EVaR is a strictly monotone risk measure, whereas CVaR is only monotone [6] (see Definition 4). $\mathrm{EVaR}_\varepsilon$ is given by
$$\rho_t(c_{t+1}) = \inf_{\zeta > 0} \bigg\{ \log\Big( \frac{\mathbb{E}[e^{\zeta c_{t+1}} \mid \mathcal{F}_t]}{\varepsilon} \Big) \Big/ \zeta \bigg\}. \tag{25}$$
Similar to $\mathrm{CVaR}_\varepsilon$, for $\mathrm{EVaR}_\varepsilon$, $\varepsilon \to 1$ corresponds to a risk-neutral case, whereas $\varepsilon \to 0$ corresponds to a risk-averse case. In fact, it was demonstrated in [5, Proposition 3.2] that $\lim_{\varepsilon \to 0} \mathrm{EVaR}_\varepsilon(c) = \operatorname{ess\,sup}(c) = \bar{c}$ (the worst-case cost).

Since $\mathrm{EVaR}_\varepsilon$ is a coherent risk measure, the conditions of Theorem 1 hold. Since $\zeta > 0$, using the change of variables $\tilde{J} \equiv \zeta J$ (note that this change of variables is monotone increasing in $\zeta$ [1]), we can compute the EVaR value functions by solving (23), where $f_0 = 0$, $g_0(\tilde{J}) = \sum_{s \in \mathcal{S}} \tilde{J}(s)$, $f_1(\tilde{J}) = \tilde{J}(s)$, $g_1(c) = \zeta c$, and $g_2(\tilde{J}) = \log\Big( \sum_{s' \in \mathcal{S}} e^{\tilde{J}(s')} T(s' \mid s, \alpha) \big/ \varepsilon \Big)$. Similar to the CVaR case, the infimum over $\zeta$ can be lumped into the overall infimum problem, i.e., $\inf_{\tilde{J}, \zeta > 0}$. Note that $g_2(\tilde{J})$ is convex in $\tilde{J}$, since the logarithm of a sum of exponentials is convex [15, p. 72].
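To make definitions (24) and (25) concrete, the short sketch below (ours, for illustration only) evaluates both risk measures for a discrete cost distribution by direct scalar minimization over $\zeta$; the cost and probability values and the bracketing bounds for $\zeta$ are arbitrary choices.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative evaluation of CVaR_eps and EVaR_eps from their variational
# definitions (24) and (25) for a discrete cost distribution.
def cvar(costs, probs, eps):
    # CVaR_eps = inf_zeta { zeta + E[(c - zeta)_+] / eps }
    obj = lambda z: z + np.sum(probs * np.maximum(costs - z, 0.0)) / eps
    return minimize_scalar(obj, bounds=(costs.min(), costs.max()), method="bounded").fun

def evar(costs, probs, eps):
    # EVaR_eps = inf_{zeta > 0} log( E[exp(zeta * c)] / eps ) / zeta
    obj = lambda z: (np.log(np.sum(probs * np.exp(z * costs))) - np.log(eps)) / z
    return minimize_scalar(obj, bounds=(1e-6, 2.0), method="bounded").fun

costs = np.array([1.0, 1.0, 5.0, 20.0])    # hypothetical costs
probs = np.array([0.5, 0.3, 0.15, 0.05])   # hypothetical probabilities
for eps in (0.7, 0.3):
    print(eps, cvar(costs, probs, eps), evar(costs, probs, eps))
# The printed values satisfy E[c] <= CVaR_eps <= EVaR_eps, as noted in Section VI.
```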