Risk-Averse Stochastic Shortest Path Planning

Mohamadreza Ahmadi, Anushri Dixit, Joel W. Burdick, and Aaron D. Ames

arXiv:2103.14727v1 [eess.SY] 26 Mar 2021

Abstract— We consider the stochastic shortest path planning problem in MDPs, i.e., the problem of designing policies that ensure reaching a goal state from a given initial state with minimum accrued cost. In order to account for rare but important realizations of the system, we consider a nested dynamic coherent risk total cost functional rather than the conventional risk-neutral total expected cost. Under some assumptions, we show that optimal, stationary, Markovian policies exist and can be found via a special Bellman's equation. We propose a computational technique based on difference convex programs (DCPs) to find the associated value functions and therefore the risk-averse policies. A rover navigation MDP is used to illustrate the proposed methodology with conditional-value-at-risk (CVaR) and entropic-value-at-risk (EVaR) coherent risk measures.

I. INTRODUCTION

Shortest path problems [10], i.e., the problem of reaching a goal state from an initial state with minimum total cost, arise in several real-world applications, such as driving directions on web mapping websites like MapQuest or Google Maps [41] and robotic path planning [18]. In a shortest path problem, if the transitions from one system state to another are subject to stochastic uncertainty, the problem is referred to as a stochastic shortest path (SSP) problem [13], [44]. In this case, we are interested in designing policies such that the total expected cost is minimized. Such planning-under-uncertainty problems are equivalent to undiscounted total-cost Markov decision processes (MDPs) [37] and can be solved efficiently via dynamic programming [13], [12].

However, emerging applications in path planning, such as autonomous navigation in extreme environments, e.g., subterranean [24] and extraterrestrial environments [2], not only require reaching a goal region, but also risk-awareness for mission success. Nonetheless, the conventional total expected cost is only meaningful if the law of large numbers can be invoked, and it ignores important but rare system realizations. In addition, robust planning solutions may give rise to behavior that is extremely conservative.

Risk can be quantified in numerous ways. For example, mission risks can be mathematically characterized in terms of chance constraints [34], [33], utility functions [23], and distributional robustness [48]. Chance constraints often account for Boolean events (such as collision with an obstacle or reaching a goal set) and do not take into consideration the tail of the cost distribution. To account for the latter, risk measures have been advocated for planning and decision making tasks in robotic systems [32]. The preference for one risk measure over another depends on factors such as sensitivity to rare events, ease of estimation from data, and computational tractability. Artzner et al. [8] characterized a set of natural properties that are desirable for a risk measure, called a coherent risk measure, which has since obtained widespread acceptance in finance and operations research, among other fields. Coherent risk measures can be interpreted as a special form of distributional robustness, which will be leveraged later in this paper.

Conditional value-at-risk (CVaR) is an important coherent risk measure that has received significant attention in decision making problems, such as MDPs [20], [19], [36], [9]. General coherent risk measures for MDPs were studied in [39], [3], wherein it was further assumed that the risk measure is time consistent, akin to the dynamic programming property. Following the footsteps of [39], [46] proposed sampling-based algorithms for MDPs with static and dynamic coherent risk measures using policy gradient and actor-critic methods, respectively (also, see a model predictive control technique for linear dynamical systems with coherent risk objectives [45]). A method based on stochastic reachability analysis was proposed in [17] to estimate a CVaR-safe set of initial conditions via the solution to an MDP. A worst-case CVaR SSP planning method was proposed and solved via dynamic programming in [27]. Also, total-cost undiscounted MDPs with static CVaR measures were studied in [16] and solved via a surrogate MDP, whose solution approximates the optimal policy with arbitrary accuracy.

In this paper, we propose a method for designing policies for SSP planning problems such that the total accrued cost in terms of dynamic, coherent risk measures is minimized (a generalization of the problems considered in [27] and [16] to dynamic, coherent risk measures). We begin by showing that, under the assumption that the goal region is reachable in finite time with non-zero probability, the total accumulated risk cost is always bounded. We further show that, if the coherent risk measures satisfy a Markovian property, we can find optimal, stationary, Markovian risk-averse policies via solving a special Bellman's equation. We also propose a computational method based on difference convex programming to solve the Bellman's equation and therefore design risk-averse policies. We elucidate the proposed method via numerical examples involving a rover navigation MDP and the CVaR and entropic-value-at-risk (EVaR) measures.

The rest of the paper is organized as follows. In the next section, we review some definitions and properties used in the sequel. In Section III, we present the problem under study and show its well-posedness under an assumption. In Section IV, we present the main result of the paper, i.e., a special Bellman's equation for the risk-averse SSP problem. In Section V, we describe a computational method to find risk-averse policies. In Section VI, we illustrate the proposed method via a numerical example and, finally, in Section VII, we conclude the paper.

The authors are with the California Institute of Technology, 1200 E. California Blvd., MC 104-44, Pasadena, CA 91125, e-mail: ({mrahmadi, adixit, ames}@caltech.edu, jwb@robotics.caltech.edu).
Notation: We denote by $\mathbb{R}^n$ the $n$-dimensional Euclidean space and by $\mathbb{N}_{\ge 0}$ the set of non-negative integers. We use bold font to denote a vector and $(\cdot)^\top$ for its transpose, e.g., $\boldsymbol{a} = (a_1, \ldots, a_n)^\top$, with $n \in \{1, 2, \ldots\}$. For a vector $\boldsymbol{a}$, we use $\boldsymbol{a} \succeq (\preceq)\, \boldsymbol{0}$ to denote element-wise non-negativity (non-positivity) and $\boldsymbol{a} \equiv \boldsymbol{0}$ to indicate that all elements of $\boldsymbol{a}$ are zero. For a finite set $\mathcal{A}$, we denote its power set by $2^{\mathcal{A}}$, i.e., the set of all subsets of $\mathcal{A}$. For a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ and a constant $p \in [1, \infty)$, $\mathcal{L}_p(\Omega, \mathcal{F}, \mathbb{P})$ denotes the vector space of real-valued random variables $c$ for which $\mathbb{E}|c|^p < \infty$. Superscripts are used to denote indices and subscripts are used to denote time steps (stages), e.g., for $s \in \mathcal{S}$, $s^2_1$ means the value of $s^2 \in \mathcal{S}$ at the 1st stage.

II. PRELIMINARIES

This section briefly reviews notions and definitions used throughout the paper. We are interested in designing policies for a class of finite MDPs (termed transient MDPs in [16]) as shown in Figure 1, which is defined next.

Fig. 1. The transition graph of the particular class of MDPs studied in this paper. The goal state $s_g$ is cost-free and absorbing, i.e., $T(s_g \mid s_g, \alpha) = 1$.

Definition 1 (MDP): An MDP is a tuple $\mathcal{M} = (\mathcal{S}, Act, T, s_0, c, s_g)$, where
- States $\mathcal{S} = \{s_1, \ldots, s_{|\mathcal{S}|}\}$ of the autonomous agent(s) and world model;
- Actions $Act = \{\alpha_1, \ldots, \alpha_{|Act|}\}$ available to the robot;
- A transition probability distribution $T(s_j \mid s_i, \alpha)$, satisfying $\sum_{s \in \mathcal{S}} T(s \mid s_i, \alpha) = 1$, $\forall s_i \in \mathcal{S}$, $\forall \alpha \in Act$;
- An initial state $s_0 \in \mathcal{S}$;
- An immediate cost function $c(s_i, \alpha_i) \ge 0$ for each state $s_i \in \mathcal{S}$ and action $\alpha_i \in Act$;
- A special cost-free goal (termination) state $s_g \in \mathcal{S}$, i.e., $T(s_g \mid s_g, \alpha) = 1$ and $c(s_g, \alpha) = 0$ for all $\alpha \in Act$.

We assume the immediate cost function $c$ is non-negative and upper-bounded by a positive constant $\bar{c}$. Our risk-averse policies for the SSP problem rely on the notion of dynamic coherent risk measures, whose definitions and properties are presented next.

Consider a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, a filtration $\mathcal{F}_0 \subset \cdots \subset \mathcal{F}_T \subset \mathcal{F}$, and an adapted sequence of random variables (stage-wise costs) $c_t$, $t = 0, \ldots, T$, where $T \in \mathbb{N}_{\ge 0} \cup \{\infty\}$. For $t = 0, \ldots, T$, we further define the spaces $\mathcal{C}_t = \mathcal{L}_p(\Omega, \mathcal{F}_t, \mathbb{P})$, $p \in [1, \infty)$, $\mathcal{C}_{t:T} = \mathcal{C}_t \times \cdots \times \mathcal{C}_T$, and $\mathcal{C} = \mathcal{C}_0 \times \mathcal{C}_1 \times \cdots$. In order to describe how one can evaluate the risk of the sub-sequence $c_t, \ldots, c_T$ from the perspective of stage $t$, we require the following definitions.

Definition 2 (Conditional Risk Measure): A mapping $\rho_{t:T} : \mathcal{C}_{t:T} \to \mathcal{C}_t$, where $0 \le t \le T$, is called a conditional risk measure if it has the following monotonicity property:
$$\rho_{t:T}(c) \le \rho_{t:T}(c'), \quad \forall c, c' \in \mathcal{C}_{t:T} \text{ such that } c \preceq c'.$$

Definition 3 (Dynamic Risk Measure): A dynamic risk measure is a sequence of conditional risk measures $\rho_{t:T} : \mathcal{C}_{t:T} \to \mathcal{C}_t$, $t = 0, \ldots, T$.

One fundamental property of dynamic risk measures is their consistency over time [39, Definition 3]. If a risk measure is time-consistent, we can define the one-step conditional risk measure $\rho_t : \mathcal{C}_{t+1} \to \mathcal{C}_t$, $t = 0, \ldots, T-1$, as
$$\rho_t(c_{t+1}) = \rho_{t,t+1}(0, c_{t+1}), \tag{1}$$
and for all $t = 1, \ldots, T$, we obtain
$$\rho_{t,T}(c_t, \ldots, c_T) = \rho_t\Big(c_t + \rho_{t+1}\big(c_{t+1} + \rho_{t+2}(c_{t+2} + \cdots + \rho_{T-1}(c_{T-1} + \rho_T(c_T)) \cdots )\big)\Big). \tag{2}$$
Note that a time-consistent risk measure is completely defined by its one-step conditional risk measures $\rho_t$, $t = 0, \ldots, T-1$, and, in particular, for $t = 0$, (2) defines a risk measure of the entire sequence $c \in \mathcal{C}_{0:T}$. At this point, we are ready to define a coherent risk measure.

Definition 4 (Coherent Risk Measure): We call the one-step conditional risk measures $\rho_t : \mathcal{C}_{t+1} \to \mathcal{C}_t$, $t = 0, \ldots, T-1$, as in (2) coherent risk measures if they satisfy the following conditions:
- Convexity: $\rho_t(\lambda c + (1-\lambda)c') \le \lambda \rho_t(c) + (1-\lambda)\rho_t(c')$ for all $\lambda \in (0,1)$ and all $c, c' \in \mathcal{C}_{t+1}$;
- Monotonicity: if $c \le c'$, then $\rho_t(c) \le \rho_t(c')$ for all $c, c' \in \mathcal{C}_{t+1}$;
- Translational invariance: $\rho_t(c' + c) = \rho_t(c') + c$ for all $c \in \mathcal{C}_t$ and $c' \in \mathcal{C}_{t+1}$;
- Positive homogeneity: $\rho_t(\beta c) = \beta \rho_t(c)$ for all $c \in \mathcal{C}_{t+1}$ and $\beta \ge 0$.

In fact, we can show that there exists a dual (or distributionally robust) representation for any coherent risk measure. Let $m, n \in [1, \infty)$ be such that $1/m + 1/n = 1$ and define
$$\mathcal{P} = \Big\{ q \in \mathcal{L}_n(\mathcal{S}, 2^{\mathcal{S}}, \mathbb{P}) \;\Big|\; \sum_{s' \in \mathcal{S}} q(s') \mathbb{P}(s') = 1, \; q \ge 0 \Big\}.$$

Proposition 1 (Proposition 4.14 in [26]): Let $\mathcal{Q}$ be a closed convex subset of $\mathcal{P}$. The one-step conditional risk measure $\rho_t : \mathcal{C}_{t+1} \to \mathcal{C}_t$, $t = 0, \ldots, T-1$, is a coherent risk measure if and only if
$$\rho_t(c) = \sup_{q \in \mathcal{Q}} \langle c, q \rangle_{\mathcal{Q}}, \quad \forall c \in \mathcal{C}_{t+1}, \tag{3}$$
where $\langle \cdot, \cdot \rangle_{\mathcal{Q}}$ denotes the inner product in $\mathcal{Q}$.

Hereafter, all risk measures are assumed to be coherent.
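To make the dual representation (3) concrete, the following minimal sketch (ours, not from the paper) evaluates a coherent risk measure by maximizing $\langle c, q \rangle$ over a risk envelope $\mathcal{Q}$. As an assumption for illustration, we instantiate $\mathcal{Q}$ with the standard CVaR envelope $\{q : 0 \le q(s) \le 1/\varepsilon,\ \sum_s q(s)\mathbb{P}(s) = 1\}$, where $q$ is a density with respect to the nominal distribution; the cost and probability values are made up.

```python
import cvxpy as cp
import numpy as np

# Illustrative sketch of Proposition 1: rho(c) = sup_{q in Q} <c, q>, with Q
# taken (as an assumption) to be the standard CVaR_eps risk envelope.
def cvar_dual(costs, probs, eps):
    """CVaR_eps of a discrete cost distribution via the dual linear program."""
    q = cp.Variable(len(costs), nonneg=True)          # density w.r.t. the nominal P
    objective = cp.Maximize(cp.sum(cp.multiply(costs * probs, q)))   # <c, q> = E_P[c q]
    constraints = [q <= 1.0 / eps,                     # CVaR risk-envelope bound
                   cp.sum(cp.multiply(probs, q)) == 1.0]
    cp.Problem(objective, constraints).solve()
    return objective.value

costs = np.array([1.0, 1.0, 5.0, 20.0])    # hypothetical stage costs
probs = np.array([0.5, 0.3, 0.15, 0.05])   # hypothetical transition probabilities
print(cvar_dual(costs, probs, eps=0.3))    # larger than the risk-neutral E[c] = 2.55
```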
III. PROBLEM FORMULATION

Next, we formally describe the risk-averse SSP problem. We also demonstrate that, if the goal state is reachable in finite time, the risk-averse SSP problem is well-posed. Let $\pi = \{\pi_0, \pi_1, \ldots\}$ be an admissible policy.

Problem 1: Consider an MDP $\mathcal{M}$ as described in Definition 1. Given an initial state $s_0 \ne s_g$, we are interested in solving
$$\pi^* \in \arg\min_{\pi} J(s_0, \pi), \tag{4}$$
where
$$J(s_0, \pi) = \lim_{T \to \infty} \rho_{0:T}\big(c(s_0, \pi_0), \ldots, c(s_T, \pi_T)\big) \tag{5}$$
is the total risk functional for the admissible policy $\pi$.

In fact, we are interested in reaching the goal state $s_g$ such that the total risk cost is minimized.^1 Note that the risk-averse deterministic shortest path problem is obtained as a special case when the transitions are deterministic. We define the optimal risk value function as
$$J^*(s) = \min_{\pi} J(s, \pi), \quad \forall s \in \mathcal{S},$$
and call a stationary policy $\pi = \{\mu, \mu, \ldots\}$ (denoted $\mu$) optimal if $J(s, \mu) = J^*(s) = \min_{\pi} J(s, \pi)$, $\forall s \in \mathcal{S}$.

We posit the following assumption, which implies that the goal state is eventually reachable under all policies.

Assumption 1 (Goal is Reachable in Finite Time): Regardless of the policy used and the initial state, there exists an integer $\tau$ such that there is a positive probability that the goal state $s_g$ is visited after no more than $\tau$ stages.^2

We then have the following observation with respect to Problem 1.^3

Proposition 2: Let Assumption 1 hold. Then the risk-averse SSP problem, i.e., Problem 1, is well-posed and $J(s_0, \pi)$ is bounded for all policies $\pi$.

Proof: Assumption 1 implies that for each admissible policy $\pi$, we have $p_\pi = \max_{s \in \mathcal{S}} \mathbb{P}(s_\tau \ne s_g \mid s_0 = s, \pi) < 1$. That is, given a policy $\pi$, the probability $p_\pi$ of not visiting the goal state $s_g$ within $\tau$ stages is less than one. Let $p = \max_\pi p_\pi$. Remark that $p_\pi$ depends solely on $\{\pi_1, \pi_2, \ldots, \pi_\tau\}$. Moreover, since $Act$ is finite, the number of $\tau$-stage policies is also finite, which implies finiteness of the set of values $p_\pi$. Hence, $p < 1$ as well. Therefore, for any policy $\pi$ and initial state $s$, we obtain $\mathbb{P}(s_{2\tau} \ne s_g \mid s_0 = s, \pi) = \mathbb{P}(s_{2\tau} \ne s_g \mid s_\tau \ne s_g, s_0 = s, \pi) \times \mathbb{P}(s_\tau \ne s_g \mid s_0 = s, \pi) \le p^2$. Then, by induction, we can show that, for any admissible SSP policy $\pi$,
$$\mathbb{P}(s_{k\tau} \ne s_g \mid s_0 = s, \pi) \le p^k, \quad \forall s \in \mathcal{S},\; k = 1, 2, \ldots \tag{6}$$
Indeed, we can show that the risk-averse cost incurred in the $\tau$ periods between $\tau k$ and $\tau(k+1)-1$ is bounded as follows:
$$\begin{aligned}
\rho_0\big(\cdots \rho_{\tau(k+1)-1}(c_{\tau k} + \cdots + c_{\tau(k+1)-1}) \cdots \big)
&= \tilde{\rho}(c_{\tau k} + c_{\tau k+1} + \cdots + c_{\tau(k+1)-1}) && \text{(7b)}\\
&\le \tilde{\rho}(\bar{c} + \cdots + \bar{c}) && \text{(7c)}\\
&= \sup_{q \in \mathcal{Q}} \langle \tau\bar{c}, q \rangle_{\mathcal{Q}} && \text{(7d)}\\
&\le \langle \tau\bar{c}, q^* \rangle_{\mathcal{Q}} && \text{(7e)}\\
&= \tau\bar{c} \sum_{s \in \mathcal{S}} \mathbb{P}(s_{k\tau} \ne s_g \mid s_0 = s, \pi)\, q^*(s, \pi) && \text{(7f)}\\
&\le \tau\bar{c} \times \sup_{s \in \mathcal{S}} \big(\mathbb{P}(s_{k\tau} \ne s_g \mid s_0 = s, \pi)\big) \sum_{s \in \mathcal{S}} |q^*(s, \pi)| && \text{(7g)}\\
&\le \tau\bar{c}\, p^k, && \text{(7h)}
\end{aligned}$$
where in (7b) we used the translational invariance property of coherent risk measures and defined $\tilde{\rho} = \rho_0 \circ \cdots \circ \rho_{\tau(k+1)-1}$. Since any finite composition of coherent risk measures is a coherent risk measure [42], $\tilde{\rho}$ is also a coherent risk measure. Moreover, since the immediate cost function $c$ is upper-bounded, the monotonicity property of the coherent risk measure $\tilde{\rho}$ yields (7c). Equality (7d) is derived from Proposition 1. Inequality (7e) is obtained by defining $q^* = \arg\sup_{q \in \mathcal{Q}} \langle \tau\bar{c}, q \rangle_{\mathcal{Q}}$. In inequality (7g), we used Hölder's inequality, and finally we used (6) to obtain the last inequality. Thus, the risk-averse total cost $J(s, \pi)$, $s \in \mathcal{S}$, exists and is finite because, given Assumption 1, we have
$$|J(s_0, \pi)| = \lim_{T \to \infty} \rho_0 \circ \cdots \circ \rho_{\tau-1} \circ \cdots \circ \rho_T(c_0 + \cdots + c_T) \le \sum_{k=0}^{\infty} \rho_0\big(\cdots \rho_{\tau(k+1)-1}(c_{\tau k} + \cdots + c_{\tau(k+1)-1}) \cdots \big) \le \sum_{k=0}^{\infty} \tau\bar{c}\, p^k = \frac{\tau\bar{c}}{1-p}, \tag{8}$$
where in the first equality above we used the translational invariance property and in the first inequality we used the sub-additivity property of coherent risk measures. Hence, $J(s_0, \pi)$ is bounded for all $\pi$.

^1 An important class of SSP planning problems is concerned with minimum-time reachability. Indeed, our formulation also encapsulates minimum-time problems, in which, for MDP $\mathcal{M}$, we have $c(s) = 1$ for all $s \in \mathcal{S} \setminus \{s_g\}$.
^2 If, instead of one goal state $s_g$, we were interested in a set of goal states $\mathcal{G} \subset \mathcal{S}$, it suffices to define $\tau = \inf\{t \mid \mathbb{P}(s_t \in \mathcal{G} \mid s_0 \in \mathcal{S}\setminus\mathcal{G}, \pi) > 0\}$. Then all of the paper's derivations apply. For simplicity of presentation, we present the results for a single goal state.
^3 Note that Problem 1 is ill-posed in general. For example, if the induced Markov chain for an admissible policy is periodic, then the limit in (5) may not exist. This is in contrast to risk-averse discounted infinite-horizon MDPs [3], for which we only require non-negativity and boundedness of the immediate costs.
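To make the geometric bound in (8) concrete, the short sketch below (illustrative only; the transition sampler, policy, horizon $\tau$, and cost bound $\bar{c}$ are hypothetical placeholders, not objects from the paper) estimates $p$ by Monte Carlo rollouts and then evaluates $\tau\bar{c}/(1-p)$.

```python
import numpy as np

# Illustrative check of Assumption 1 / Proposition 2: estimate
# p = max_s P(s_tau != s_g | s_0 = s, policy) by sampling, then bound
# the total risk-averse cost by tau * c_bar / (1 - p).
def estimate_p(step, policy, states, s_goal, tau, n_rollouts=10_000, seed=0):
    rng = np.random.default_rng(seed)
    worst = 0.0
    for s0 in states:
        misses = 0
        for _ in range(n_rollouts):
            s = s0
            for _ in range(tau):
                s = step(s, policy(s), rng)   # sample s' ~ T(. | s, policy(s))
            misses += (s != s_goal)
        worst = max(worst, misses / n_rollouts)
    return worst

# If p < 1, Proposition 2 gives the bound:  bound = tau * c_bar / (1.0 - p)
```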
IV. RISK-AVERSE SSP PLANNING

This section presents the paper's main result, which includes a special Bellman's equation for finding the risk value functions for Problem 1. Furthermore, assuming that the coherent risk measures satisfy a Markovian property, we show that the optimal risk-averse policies are stationary and Markovian.

To begin with, note that at any time $t$, the value of $\rho_t$ is $\mathcal{F}_t$-measurable and is allowed to depend on the entire history of the process $\{s_0, s_1, \ldots\}$, so we cannot expect to obtain a Markov optimal policy [35]. In order to obtain Markov optimal policies for Problem 1, we need the following property [39, Section 4] of risk measures.

Definition 5 (Markov Risk Measure [25]): A one-step conditional risk measure $\rho_t : \mathcal{C}_{t+1} \to \mathcal{C}_t$ is a Markov risk measure with respect to MDP $\mathcal{M}$ if there exists a risk transition mapping $\sigma_t : \mathcal{L}_m(\mathcal{S}, 2^{\mathcal{S}}, \mathbb{P}) \times \mathcal{S} \times \mathcal{M} \to \mathbb{R}$ such that, for all $v \in \mathcal{L}_m(\mathcal{S}, 2^{\mathcal{S}}, \mathbb{P})$ and $\alpha_t \in \pi(s_t)$, we have
$$\rho_t(v(s_{t+1})) = \sigma_t\big(v(s_{t+1}), s_t, T(s_{t+1} \mid s_t, \alpha_t)\big). \tag{9}$$
In fact, if $\rho_t$ is a coherent risk measure, $\sigma_t$ also satisfies the properties of a coherent risk measure (Definition 4).

Assumption 2: The one-step coherent risk measure $\rho_t$ is a Markov risk measure.

We can now present the main result of the paper, a form of Bellman's equations for solving the risk-averse SSP problem.

Theorem 1: Consider an MDP $\mathcal{M}$ as described in Definition 1 and let Assumptions 1 and 2 hold. Then the following statements are true for the risk-averse SSP problem:

(i) Given a (non-negative) initial condition $J^0(s)$, $s \in \mathcal{S}$, the sequence generated by the recursive formula (dynamic programming)
$$J^{k+1}(s) = \min_{\alpha \in Act}\Big( c(s, \alpha) + \sigma\big\{J^k(s'), s, T(s' \mid s, \alpha)\big\} \Big), \quad \forall s \in \mathcal{S}, \tag{10}$$
converges to the optimal risk value function $J^*(s)$, $s \in \mathcal{S}$.

(ii) The optimal risk value functions $J^*(s)$, $s \in \mathcal{S}$, are the unique solution to the Bellman equation
$$J^*(s) = \min_{\alpha \in Act}\Big( c(s, \alpha) + \sigma\big\{J^*(s'), s, T(s' \mid s, \alpha)\big\} \Big), \quad \forall s \in \mathcal{S}. \tag{11}$$

(iii) For any stationary Markovian policy $\mu$, the risk-averse value functions $J(s, \mu(s))$, $s \in \mathcal{S}$, are the unique solutions to
$$J(s, \mu) = c(s, \mu(s)) + \sigma\big\{J(s', \mu), s, T(s' \mid s, \mu(s))\big\}, \quad \forall s \in \mathcal{S}. \tag{12}$$

(iv) A stationary Markovian policy $\mu$ is optimal if and only if it attains the minimum in Bellman's equation (11).

Proof: (i) For every positive integer $M$, an initial state $s_0$, and policy $\pi$, we can split the nested risk cost (5), where $\rho_{t,T}$ is defined in (2), at time index $\tau M$ and obtain
$$J(s_0, \pi) = \rho_0\Big(c_0 + \cdots + \rho_{\tau M-1}\big(c_{\tau M-1} + \lim_{T \to \infty} \rho_{\tau M}(c_{\tau M} + \cdots + \rho_T(c_T) \cdots)\big) \cdots \Big), \tag{13}$$
where we used the fact that the one-step coherent risk measures $\rho_t$ are continuous [40, Corollary 3.1], and hence the limiting process and the measure $\rho_t$ commute. The limit term is indeed the total risk cost starting at $s_{\tau M}$, i.e., $J(s_{\tau M}, \pi)$. Next, we show that, under Assumption 1, this term remains bounded. From (8) in the proof of Proposition 2, we have
$$|J(s_{\tau M}, \pi)| = \lim_{T \to \infty} \rho_{\tau M}\big(c_{\tau M} + \cdots + \rho_T(c_T) \cdots\big) \le \sum_{k=M}^{\infty} \rho_{\tau k}\big(c_{\tau k} + \cdots + \rho_{\tau(k+1)-1}(c_{\tau(k+1)-1}) \cdots\big) \le \sum_{k=M}^{\infty} \tau\bar{c}\, p^k = \frac{\tau\bar{c}\, p^M}{1-p}. \tag{14}$$
Substituting the above bound in (13) gives
$$J(s_0, \pi) \le \rho_0\Big(c_0 + \cdots + \rho_{\tau M-1}\big(c_{\tau M-1} + \tfrac{\tau\bar{c}\, p^M}{1-p}\big) \cdots \Big) = \rho_0\big(c_0 + \cdots + \rho_{\tau M-1}(c_{\tau M-1}) \cdots\big) + \frac{\tau\bar{c}\, p^M}{1-p}, \tag{15}$$
where the last equality holds via the translational invariance property of the one-step risk measures and the fact that $\tau\bar{c}\, p^M / (1-p)$ is constant. Similarly, following (14), we can also obtain a lower bound on $J(s_0, \pi)$ as follows:
$$J(s_0, \pi) \ge \rho_0\Big(c_0 + \cdots + \rho_{\tau M-1}\big(c_{\tau M-1} - \tfrac{\tau\bar{c}\, p^M}{1-p}\big) \cdots \Big) = \rho_0\big(c_0 + \cdots + \rho_{\tau M-1}(c_{\tau M-1}) \cdots\big) - \frac{\tau\bar{c}\, p^M}{1-p}. \tag{16}$$
Thus, from (15) and (16), we obtain
$$J(s_0, \pi) - \frac{\tau\bar{c}\, p^M}{1-p} \le \rho_0\big(c_0 + \cdots + \rho_{\tau M-1}(c_{\tau M-1}) \cdots\big) \le J(s_0, \pi) + \frac{\tau\bar{c}\, p^M}{1-p}. \tag{17}$$
Furthermore, Assumption 1 implies that $J^0(s_g) = 0$. If we consider $J^0$ as a terminal risk value function, we can obtain
$$\begin{aligned}
|\rho_{\tau M}(J^0(s_{\tau M}))| &= \Big| \sup_{q \in \mathcal{Q}} \langle J^0(s_{\tau M}), q \rangle_{\mathcal{Q}} \Big| && \text{(18a)}\\
&= \big| \langle J^0(s_{\tau M}), q^* \rangle_{\mathcal{Q}} \big| && \text{(18b)}\\
&= \Big| \sum_{s \in \mathcal{S}} \mathbb{P}(s_{\tau M} = s \mid s_0, \pi)\, q^*(s, \pi)\, J^0(s) \Big| && \text{(18c)}\\
&\le \sum_{s \in \mathcal{S}} \mathbb{P}(s_{\tau M} = s \mid s_0, \pi)\, q^*(s, \pi) \times \max_{s \in \mathcal{S}} |J^0(s)| && \text{(18d)}\\
&\le \sum_{s \in \mathcal{S}} \mathbb{P}(s_{\tau M} = s \mid s_0, \pi) \times \max_{s \in \mathcal{S}} |J^0(s)| && \text{(18e)}\\
&\le p^M \max_{s \in \mathcal{S}} |J^0(s)|, && \text{(18f)}
\end{aligned}$$
where, similar to the derivation in (7), in (18a) we used Proposition 1. Defining $q^* = \arg\sup_{q \in \mathcal{Q}} \langle J^0(s_{\tau M}), q \rangle_{\mathcal{Q}}$, we obtained (18b), and the last inequality is based on the fact that the probability of $s_{\tau M} \ne s_g$ is less than or equal to $p^M$, as in (6). Combining inequalities (17) and (18), we have
$$-p^M \max_{s \in \mathcal{S}} |J^0(s)| + J(s_0, \pi) - \frac{\tau\bar{c}\, p^M}{1-p} \le \rho_0\big(c_0 + \cdots + \rho_{\tau M-1}(c_{\tau M-1} + \rho_{\tau M}(J^0(s_{\tau M}))) \cdots\big) \le p^M \max_{s \in \mathcal{S}} |J^0(s)| + J(s_0, \pi) + \frac{\tau\bar{c}\, p^M}{1-p}. \tag{19}$$
Remark that the middle term in the inequality above is the $\tau M$-stage risk-averse cost of the policy $\pi$ with the terminal cost $J^0(s_{\tau M})$. Given Assumption 2, from [39, Theorem 2], the minimum of this cost is generated by the dynamic programming recursion (10) after $\tau M$ iterations. Taking the minimum over the policy $\pi$ on every side of (19) yields
$$-p^M \max_{s \in \mathcal{S}} |J^0(s)| + J^*(s_0) - \frac{\tau\bar{c}\, p^M}{1-p} \le J^{\tau M}(s_0) \le p^M \max_{s \in \mathcal{S}} |J^0(s)| + J^*(s_0) + \frac{\tau\bar{c}\, p^M}{1-p}, \tag{20}$$
for all $s_0$ and $M$. Finally, let $k = \tau M$. Since the above inequality holds for all $M$, taking the limit $M \to \infty$ gives
$$\lim_{M \to \infty} J^{\tau M}(s_0) = \lim_{k \to \infty} J^k(s_0) = J^*(s_0), \quad \forall s_0 \in \mathcal{S}. \tag{21}$$

(ii) Taking the limit $k \to \infty$ of both sides of (10) yields $\lim_{k \to \infty} J^{k+1}(s) = \lim_{k \to \infty} \min_{\alpha \in Act}\big(c(s, \alpha) + \sigma\{J^k(s'), s, T(s' \mid s, \alpha)\}\big)$, $\forall s \in \mathcal{S}$. Equality (21) in the proof of Part (i) implies that
$$J^*(s) = \lim_{k \to \infty} \min_{\alpha \in Act}\Big( c(s, \alpha) + \sigma\big\{J^k(s'), s, T(s' \mid s, \alpha)\big\} \Big), \quad \forall s \in \mathcal{S}.$$
Since the limit and the minimization commute over a finite number of alternatives, we have
$$J^*(s) = \min_{\alpha \in Act}\Big( c(s, \alpha) + \lim_{k \to \infty} \sigma\big\{J^k(s'), s, T(s' \mid s, \alpha)\big\} \Big), \quad \forall s \in \mathcal{S}.$$
Finally, because $\sigma$ is continuous [40, Corollary 3.1], the limit and $\sigma$ commute as well, and from (21) we obtain $J^*(s) = \min_{\alpha \in Act}\big(c(s, \alpha) + \sigma\{J^*(s'), s, T(s' \mid s, \alpha)\}\big)$, $\forall s \in \mathcal{S}$. To show uniqueness, note that, for any $J(s)$, $s \in \mathcal{S}$, satisfying the above equation, the dynamic programming recursion (10) starting at $J(s)$, $s \in \mathcal{S}$, replicates $J(s)$, $s \in \mathcal{S}$, and from Part (i) we infer $J(s) = J^*(s)$ for all $s \in \mathcal{S}$.

(iii) Given a stationary Markovian policy $\mu$, at every state $s$ we have $\alpha = \mu(s)$; hence, from Item (i), we have
$$J^{k+1}(s, \mu) = \min_{\alpha \in \{\mu(s)\}}\Big( c(s, \alpha) + \sigma\big\{J^k(s', \mu), s, T(s' \mid s, \alpha)\big\} \Big), \quad \forall s \in \mathcal{S}.$$
Since the minimum is only over one element, we have $J^{k+1}(s, \mu) = c(s, \mu(s)) + \sigma\{J^k(s', \mu), s, T(s' \mid s, \mu(s))\}$, $\forall s \in \mathcal{S}$, which, as $k \to \infty$, converges uniquely (see Item (ii)) to $J(s, \mu)$.

(iv) The stationary policy $\mu$ attains the minimum in (11) if
$$J^*(s) = \min_{\alpha \in \{\mu(s)\}}\Big( c(s, \alpha) + \sigma\big\{J^*(s'), s, T(s' \mid s, \alpha)\big\} \Big) = c(s, \mu(s)) + \sigma\big\{J^*(s'), s, T(s' \mid s, \mu(s))\big\}, \quad \forall s \in \mathcal{S}.$$
Then Part (iii) and the above equation imply $J(s, \mu) = J^*(s)$ for all $s \in \mathcal{S}$. Conversely, if $J(s, \mu) = J^*(s)$ for all $s \in \mathcal{S}$, then Items (ii) and (iii) imply that $\mu$ is optimal.

At this point, we should highlight that, in the case of conditional expectation as the coherent risk measure ($\rho_t = \mathbb{E}$ and $\sigma\{J(s'), s, T(s' \mid s, \alpha)\} = \sum_{s' \in \mathcal{S}} T(s' \mid s, \alpha) J(s')$), Theorem 1 simplifies to [11, Proposition 5.2.1] for the risk-neutral SSP problem. In fact, Theorem 1 is a generalization of [11, Proposition 5.2.1] to the risk-averse case.

Recursion (10) represents value iteration (VI) for finding the risk value functions. In general, value iteration converges only in the limit of infinitely many iterations ($k \to \infty$), but it can be shown that, for a stationary policy $\mu$ resulting in an acyclic induced Markov chain, the VI algorithm converges in $|\mathcal{S}|$ steps (see the derivation for the risk-neutral SSP problem in [14]).

Alternatively, one can design risk-averse policies using policy iteration (PI). That is, starting with an initial policy $\mu^0$, we can carry out policy evaluation via (12), followed by a policy improvement step, which calculates an improved policy $\mu^{k+1}$ as $\mu^{k+1}(s) = \arg\min_{\alpha \in Act}\big(c(s, \alpha) + \sigma\{J^{\mu^k}(s'), s, T(s' \mid s, \alpha)\}\big)$, $\forall s \in \mathcal{S}$. This process is repeated until no further improvement is found in terms of the risk value functions: $J^{\mu^{k+1}}(s) = J^{\mu^k}(s)$ for all $s \in \mathcal{S}$.

However, we do not pursue VI or PI approaches further in this work. The main obstacle to using VI and PI is that equations (10)-(12) are nonlinear (and non-smooth) in the risk value functions for a general coherent risk measure. Solving the nonlinear equations (10)-(12) for the risk value functions may require a significant computational burden (see the specialized non-smooth Newton method in [39] for solving similar nonlinear VIs). Instead, the next section presents a computational method based on difference convex programs (DCPs).
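Although the paper does not pursue VI further, a compact sketch helps to see how recursion (10) would be evaluated. The arrays and the choice of one-step mapping below are ours: `sigma` stands for the risk transition mapping $\sigma$, instantiated here (as an assumption, for illustration) with CVaR of the next-state value under the transition distribution.

```python
import numpy as np

# Sketch of the risk-averse value iteration recursion (10).
# T: (|S|, |Act|, |S|) transition tensor, c: (|S|, |Act|) costs, goal: index of s_g.
def cvar_discrete(values, probs, eps):
    """CVaR_eps of a discrete random variable: average of the worst eps probability mass."""
    order = np.argsort(values)[::-1]                 # worst outcomes first
    v, p = values[order], probs[order]
    cum_before = np.concatenate(([0.0], np.cumsum(p)[:-1]))
    tail = np.clip(eps - cum_before, 0.0, None)      # remaining tail mass to allocate
    weights = np.minimum(tail, p)
    return float(weights @ v) / eps

def risk_averse_vi(T, c, goal, eps=0.3, iters=200):
    n_s, n_a, _ = T.shape
    J = np.zeros(n_s)                                # non-negative initial condition J^0
    for _ in range(iters):
        Q = np.empty((n_s, n_a))
        for s in range(n_s):
            for a in range(n_a):
                Q[s, a] = c[s, a] + cvar_discrete(J, T[s, a], eps)   # c + sigma{J}
        J = Q.min(axis=1)
        J[goal] = 0.0                                # goal state is cost-free and absorbing
    return J, Q.argmin(axis=1)                       # risk value functions, greedy policy
```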
V. A DCP COMPUTATIONAL APPROACH

In this section, we propose a computational method based on DCPs to find the risk value functions and, subsequently, policies that minimize the accrued dynamic risk in SSP planning. Before stating the DCP formulation, we show that the Bellman operator in (10) is non-decreasing. Let
$$\mathfrak{D}_\pi J := c(s, \pi(s)) + \sigma\big\{J(s'), s, T(s' \mid s, \pi(s))\big\}, \quad \forall s \in \mathcal{S},$$
$$\mathfrak{D} J := \min_{\alpha \in Act}\Big( c(s, \alpha) + \sigma\big\{J(s'), s, T(s' \mid s, \alpha)\big\} \Big), \quad \forall s \in \mathcal{S}.$$

Lemma 1: Let Assumptions 1 and 2 hold. For all $v, w \in \mathcal{L}_m(\mathcal{S}, 2^{\mathcal{S}}, \mathbb{P})$, if $v \le w$, then $\mathfrak{D}_\pi v \le \mathfrak{D}_\pi w$ and $\mathfrak{D} v \le \mathfrak{D} w$.

Proof: Since $\rho$ is a Markov risk measure, we have $\rho(v) = \sigma(v, s, T)$ for all $v \in \mathcal{L}_m(\mathcal{S}, 2^{\mathcal{S}}, \mathbb{P})$. Furthermore, since $\rho$ is a coherent risk measure, from Proposition 1 we know that (3) holds. Taking the inner product of both sides of $v \le w$ with a probability measure $q \in \mathcal{Q} \subset \mathcal{P}$ and then the supremum over $\mathcal{Q}$ yields $\sup_{q \in \mathcal{Q}} \langle v, q \rangle_{\mathcal{Q}} \le \sup_{q \in \mathcal{Q}} \langle w, q \rangle_{\mathcal{Q}}$. From Proposition 1, we have $\sigma(v, s, T) = \sup_{q \in \mathcal{Q}} \langle v, q \rangle_{\mathcal{Q}}$ and $\sigma(w, s, T) = \sup_{q \in \mathcal{Q}} \langle w, q \rangle_{\mathcal{Q}}$. Therefore, $\sigma(v, s, T) \le \sigma(w, s, T)$. Adding $c$ to both sides of this inequality gives $\mathfrak{D}_\pi v \le \mathfrak{D}_\pi w$. Taking the minimum with respect to $\alpha \in Act$ on both sides of $\mathfrak{D}_\pi v \le \mathfrak{D}_\pi w$ does not change the inequality and gives $\mathfrak{D} v \le \mathfrak{D} w$.

We are now ready to state an optimization formulation of the Bellman equation (10).

Proposition 3: Consider an MDP $\mathcal{M}$ as described in Definition 1 and let the assumptions of Theorem 1 hold. Then the optimal value functions $J^*(s)$, $s \in \mathcal{S}$, are the solutions to the following optimization problem:
$$\sup_{J} \; \sum_{s \in \mathcal{S}} J(s) \quad \text{subject to} \quad J(s) \le c(s, \alpha) + \sigma\big\{J(s'), s, T(s' \mid s, \alpha)\big\}, \quad \forall (s, \alpha) \in \mathcal{S} \times Act. \tag{22}$$

Proof: From Lemma 1, we infer that $\mathfrak{D}_\pi$ and $\mathfrak{D}$ are non-decreasing; i.e., for $v \le w$, we have $\mathfrak{D}_\pi v \le \mathfrak{D}_\pi w$ and $\mathfrak{D} v \le \mathfrak{D} w$. Therefore, if $J \le \mathfrak{D} J$, then $\mathfrak{D} J \le \mathfrak{D}(\mathfrak{D} J)$. By repeated application of $\mathfrak{D}$, we obtain $J \le \mathfrak{D} J \le \mathfrak{D}^2 J \le \cdots \le \mathfrak{D}^\infty J = J^*$. Any feasible solution to (22) must satisfy $J \le \mathfrak{D} J$ and hence must satisfy $J \le J^*$. Thus, $J^*$ is the largest $J$ that satisfies the constraint in optimization (22). Hence, the optimal solution to (22) is the same as that of (10).

Once a solution $J^*$ to optimization problem (22) is found, we can find a corresponding stationary Markovian policy as
$$\mu^*(s) = \arg\min_{\alpha \in Act}\Big( c(s, \alpha) + \sigma\big\{J^*(s'), s, T(s' \mid s, \alpha)\big\} \Big), \quad \forall s \in \mathcal{S}.$$

A. DCPs for Risk-Averse SSP Planning

Assumption 2 implies that each $\rho$ is a coherent, Markov risk measure. Hence, the mapping $v \mapsto \sigma(v, \cdot, \cdot)$ is convex (because $\sigma$ is also a coherent risk measure). We next show that optimization problem (22) is in fact a DCP. Let $f_0 = 0$, $g_0(J) = \sum_{s \in \mathcal{S}} J(s)$, $f_1(J) = J(s)$, $g_1(s, \alpha) = c(s, \alpha)$, and $g_2(J) = \sigma(J, \cdot, \cdot)$. Note that $f_0$ and $g_1$ are convex (constant) functions and that $g_0$, $f_1$, and $g_2$ are convex functions in $J$. Then, (22) can be expressed as the minimization
$$\inf_{J} \; f_0 - g_0(J) \quad \text{subject to} \quad f_1(J) - g_1(s, \alpha) - g_2(J) \le 0, \quad \forall s, \alpha. \tag{23}$$
The above optimization problem is a standard DCP [28]. Many applications require solving DCPs, such as feature selection in machine learning [30] and inverse covariance estimation in statistics [47]. DCPs can be solved globally [28], e.g., using branch-and-bound algorithms [29]. Yet, a locally optimal solution can be obtained more efficiently using techniques from nonlinear optimization [14]. In particular, in this work, we use a variant of the convex-concave procedure [31], [43], wherein the concave terms are replaced by a convex upper bound and the resulting problem is solved. In fact, the disciplined convex-concave programming (DCCP) [43] technique linearizes DCP problems into a (disciplined) convex program (carried out automatically via the DCCP Python package [43]), which is then converted into an equivalent cone program by replacing each function with its graph implementation. The cone program can then be solved readily by available convex programming solvers, such as CVXPY [21]. In the Appendix, we present the specific DCPs required for risk-averse SSP planning with the CVaR and EVaR risk measures used in our numerical experiments in the next section. Note that, for the risk-neutral conditional expectation measure, optimization (23) becomes a linear program, since $\sigma\{J(s'), s, T(s' \mid s, \alpha)\} = \sum_{s' \in \mathcal{S}} T(s' \mid s, \alpha) J(s')$ is linear in the decision variables $J$.
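A minimal sketch of how (22)-(23) could be posed with CVXPY and the DCCP add-on for the CVaR case is given below, following the form of $g_2$ stated in the Appendix. The function and variable names, the per-(s, α) threshold ζ, and the pinning of the goal value are our illustrative choices, not the paper's released code; treat it as a template under those assumptions rather than a definitive implementation.

```python
import cvxpy as cp
import dccp   # registers the "dccp" solve method on cvxpy problems
import numpy as np

# Sketch of the CVaR instance of the DCP (22)-(23): maximize sum_s J(s)
# subject to J(s) <= c(s,a) + zeta + (1/eps) * E_{s'~T}[ (J(s') - zeta)_+ ].
def risk_averse_ssp_cvar(T, c, goal, eps):
    n_s, n_a, _ = T.shape
    J = cp.Variable(n_s)
    zeta = cp.Variable((n_s, n_a))                    # one CVaR threshold per (s, a)
    constraints = [J[goal] == 0]                      # cost-free absorbing goal (our choice)
    for s in range(n_s):
        for a in range(n_a):
            tail = cp.sum(cp.multiply(T[s, a], cp.pos(J - zeta[s, a])))
            rhs = c[s, a] + zeta[s, a] + tail / eps   # c + sigma_CVaR{J}
            constraints.append(J[s] <= rhs)           # difference-of-convex constraint
    prob = cp.Problem(cp.Maximize(cp.sum(J)), constraints)
    prob.solve(method="dccp")                         # convex-concave procedure
    return J.value
```

For the risk-neutral expectation, the same template with `rhs = c[s, a] + T[s, a] @ J` is an ordinary linear program and can be solved without DCCP.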
VI. NUMERICAL EXPERIMENTS

In this section, we evaluate the proposed method for risk-averse SSP planning with a rover navigation MDP (also used in [2], [3]). We consider the traditional total expectation as well as CVaR and EVaR. The experiments were carried out on a MacBook Pro with a 2.8 GHz Quad-Core Intel Core i5 and 16 GB of RAM. The resultant linear programs and DCPs were solved using CVXPY [21] with the DCCP [43] add-on.

An agent (e.g., a rover) must autonomously navigate a 2-dimensional terrain map (e.g., the Mars surface) represented by an $M \times N$ grid with $0.25MN$ obstacles. Thus, the state space is given by $\mathcal{S} = \{s_i \mid i = x + y,\; x \in \{1, \ldots, M\},\; y \in \{1, \ldots, N\}\}$, with $x = 1$, $y = 1$ being the leftmost bottom cell. Since the rover can move from cell to cell, its action set is $Act = \{E, W, N, S\}$. The actions move the robot from its current cell to a neighboring cell, with some uncertainty. The state transition probabilities for various cell types are shown for actions E (East) and N (North) in Figure 2. Other actions lead to similar transitions. Hitting an obstacle incurs an immediate cost of 5, while the goal grid region has zero immediate cost. Any other grid cell has a cost of 1 to represent fuel consumption.

Fig. 2. Grid-world illustration for the rover navigation example. Blue cells denote the obstacles and the yellow cell denotes the goal.

Once the policies are calculated, as a robustness test similar to [20], [2], [3], we included a set of single-grid obstacles that are perturbed in a random direction to one of the neighboring grid cells with probability 0.2 to represent uncertainty in the terrain map. For each risk measure, we run 100 Monte Carlo simulations with the calculated policies and count the number of runs ending in a collision.

In the experiments, we considered three grid-world sizes of $4 \times 5$, $10 \times 10$, and $10 \times 20$, corresponding to 20, 100, and 200 states, respectively. We allocated 2, 4, and 8 uncertain (single-cell) obstacles for the $4 \times 5$, $10 \times 10$, and $10 \times 20$ grids, respectively. In each case, we solve DCP (22) (a linear program in the case of total expectation) with $|\mathcal{S}||Act| = MN \times 4 = 4MN$ constraints and $MN + 1$ variables (the risk value functions $J$ and $\zeta$ for CVaR and EVaR, as discussed in the Appendix). In these experiments, we set the confidence levels to $\varepsilon = 0.3$ (more risk-averse) and $\varepsilon = 0.7$ (less risk-averse) for both the CVaR and EVaR coherent risk measures. The initial condition was chosen as $s_0 = s_1$, i.e., the agent starts at the leftmost grid cell at the bottom, and the goal state was selected as $s_g = s_{MN}$, i.e., the rightmost grid cell at the top.

A summary of our numerical experiments is provided in Table I.

| $(M \times N)_\rho$ | $J^*(s_0)$ | Total Time [s] | # U.O. | F.R. |
|---|---|---|---|---|
| $(4 \times 5)_{\mathbb{E}}$ | 13.25 | 0.56 | 2 | 39% |
| $(10 \times 10)_{\mathbb{E}}$ | 27.31 | 1.04 | 4 | 46% |
| $(10 \times 20)_{\mathbb{E}}$ | 38.35 | 1.30 | 8 | 58% |
| $(4 \times 5)_{\mathrm{CVaR}_{0.7}}$ | 18.76 | 0.58 | 2 | 14% |
| $(10 \times 10)_{\mathrm{CVaR}_{0.7}}$ | 35.72 | 1.12 | 4 | 19% |
| $(10 \times 20)_{\mathrm{CVaR}_{0.7}}$ | 47.36 | 1.36 | 8 | 21% |
| $(4 \times 5)_{\mathrm{CVaR}_{0.3}}$ | 25.69 | 0.57 | 2 | 10% |
| $(10 \times 10)_{\mathrm{CVaR}_{0.3}}$ | 43.86 | 1.16 | 4 | 13% |
| $(10 \times 20)_{\mathrm{CVaR}_{0.3}}$ | 49.03 | 1.34 | 8 | 15% |
| $(4 \times 5)_{\mathrm{EVaR}_{0.7}}$ | 26.67 | 1.83 | 2 | 9% |
| $(10 \times 10)_{\mathrm{EVaR}_{0.7}}$ | 41.31 | 2.02 | 4 | 11% |
| $(10 \times 20)_{\mathrm{EVaR}_{0.7}}$ | 50.79 | 2.64 | 8 | 17% |
| $(4 \times 5)_{\mathrm{EVaR}_{0.3}}$ | 29.05 | 1.79 | 2 | 7% |
| $(10 \times 10)_{\mathrm{EVaR}_{0.3}}$ | 48.73 | 2.01 | 4 | 10% |
| $(10 \times 20)_{\mathrm{EVaR}_{0.3}}$ | 61.28 | 2.78 | 8 | 12% |

TABLE I. Comparison between total expectation, CVaR, and EVaR risk measures. $(M \times N)_\rho$ denotes the grid-world of size $M \times N$ and one-step coherent risk measure $\rho$. Total Time denotes the time taken by the CVXPY solver to solve the associated linear programs or DCPs. # U.O. denotes the number of single-grid uncertain obstacles used for the robustness test. F.R. denotes the failure rate out of 100 Monte Carlo simulations.

Note that the computed values of Problem 1 satisfy $\mathbb{E}(c) \le \mathrm{CVaR}_\varepsilon(c) \le \mathrm{EVaR}_\varepsilon(c)$. This is in accordance with the theory that EVaR is a more conservative coherent risk measure than CVaR [5] (see also our work on EVaR-based model predictive control for dynamically moving obstacles [22]). Furthermore, the total accrued risk cost is higher for $\varepsilon = 0.3$, since this leads to more risk-averse policies. For the total expectation coherent risk measure, the calculations took significantly less time, since they result from solving a set of linear programs. For CVaR and EVaR, a set of DCPs was solved. The EVaR calculations were the most computationally involved, since they require solving exponential cone programs. Note that these calculations can be carried out offline for policy synthesis, and the policy can then be applied for risk-averse robot path planning.

The table also outlines the failure ratios of each risk measure. In this case, EVaR outperformed both CVaR and total expectation in terms of robustness, which is consistent with the fact that EVaR is a more conservative risk measure. Lower failure/collision rates were observed for $\varepsilon = 0.3$, which corresponds to more risk-averse policies. In addition, these results suggest that, although total expectation can be used as a measure of performance over a large number of Monte Carlo simulations, it may not be practical for real-world planning-under-uncertainty scenarios. CVaR and EVaR appear to be more suitable performance metrics for shortest path planning under uncertainty.

VII. CONCLUSIONS

We proposed a method based on dynamic programming for designing risk-averse policies for the SSP problem. We presented a computational approach in terms of difference convex programs for finding the associated risk value functions and hence the risk-averse policies. Future research will extend to risk-averse MDPs with average costs and risk-averse MDPs with linear temporal logic specifications, where the former problem is cast as a special case of the risk-averse SSP problem. In this work, we assumed the states are fully observable; in the future, we will also study SSP problems with partial state observation [4], [2].

REFERENCES

[1] A. Agrawal, R. Verschueren, S. Diamond, and S. Boyd. A rewriting system for convex optimization problems. Journal of Control and Decision, 5(1):42–60, 2018.
[2] M. Ahmadi, M. Ono, M. D. Ingham, R. M. Murray, and A. D. Ames. Risk-averse planning under uncertainty. In 2020 American Control Conference (ACC), pages 3305–3312. IEEE, 2020.
[3] M. Ahmadi, U. Rosolia, M. D. Ingham, R. M. Murray, and A. D. Ames. Constrained risk-averse Markov decision processes. In The 35th AAAI Conference on Artificial Intelligence (AAAI-21), 2021.
[4] M. Ahmadi, R. Sharan, and J. W. Burdick. Stochastic finite state control of POMDPs with LTL specifications. arXiv preprint arXiv:2001.07679, 2020.
[5] A. Ahmadi-Javid. Entropic value-at-risk: A new coherent risk measure. Journal of Optimization Theory and Applications, 155(3):1105–1123, 2012.
[6] A. Ahmadi-Javid and M. Fallah-Tafti. Portfolio optimization with entropic value-at-risk. European Journal of Operational Research, 279(1):225–241, 2019.
[7] A. Ahmadi-Javid and A. Pichler. An analytical study of norms and Banach spaces induced by the entropic value-at-risk. Mathematics and Financial Economics, 11(4):527–550, 2017.
[8] P. Artzner, F. Delbaen, J. Eber, and D. Heath. Coherent measures of risk. Mathematical Finance, 9(3):203–228, 1999.
[9] N. Bäuerle and J. Ott. Markov decision processes with average-value-at-risk criteria. Mathematical Methods of Operations Research, 74(3):361–379, 2011.
[10] R. Bellman. On a routing problem. Quarterly of Applied Mathematics, 16(1):87–90, 1958.
[11] D. Bertsekas. Dynamic Programming and Optimal Control, vol. 1. Athena Scientific, Nashua, NH, 2017.
[12] D. Bertsekas and H. Yu. Stochastic shortest path problems under weak conditions. Lab. for Information and Decision Systems Report LIDS-P-2909, MIT, 2013.
[13] D. P. Bertsekas and J. N. Tsitsiklis. An analysis of stochastic shortest path problems. Mathematics of Operations Research, 16(3):580–595, 1991.
[14] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
[15] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[16] S. Carpin, Y. Chow, and M. Pavone. Risk aversion in finite Markov decision processes using total cost criteria and average value at risk. In IEEE Int. Conf. Robotics and Automation, pages 335–342, 2016.
[17] M. Chapman, J. Lacotte, A. Tamar, D. Lee, K. M. Smith, V. Cheng, J. Fisac, S. Jha, M. Pavone, and C. Tomlin. A risk-sensitive finite-time reachability approach for safety of stochastic dynamic systems. In 2019 American Control Conference (ACC), pages 2958–2963. IEEE, 2019.
[18] D. Z. Chen. Developing algorithms and software for geometric path planning problems. ACM Computing Surveys, 28(4es):18–es, 1996.
[19] Y. Chow and M. Ghavamzadeh. Algorithms for CVaR optimization in MDPs. In Advances in Neural Information Processing Systems, pages 3509–3517, 2014.
[20] Y. Chow, A. Tamar, S. Mannor, and M. Pavone. Risk-sensitive and robust decision-making: a CVaR optimization approach. In Advances in Neural Information Processing Systems, pages 1522–1530, 2015.
[21] S. Diamond and S. Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1–5, 2016.
[22] A. Dixit, M. Ahmadi, and J. W. Burdick. Risk-sensitive motion planning using entropic value-at-risk. In European Control Conference, 2021.
[23] K. Dvijotham, M. Fazel, and E. Todorov. Convex risk averse control design. In IEEE Conf. Decision and Control, pages 4020–4025, 2014.
[24] D. D. Fan, K. Otsu, Y. Kubo, A. Dixit, J. Burdick, and A. A. Agha-Mohammadi. STEP: Stochastic traversability evaluation and planning for safe off-road navigation. arXiv:2103.02828, 2021.
[25] J. Fan and A. Ruszczyński. Process-based risk measures and risk-averse control of discrete-time systems. Mathematical Programming, pages 1–28, 2018.
[26] H. Föllmer and A. Schied. Stochastic Finance: An Introduction in Discrete Time. Walter de Gruyter, 2011.
[27] C. Gavriel, G. Hanasusanto, and D. Kuhn. Risk-averse shortest path problems. In 2012 IEEE 51st Conference on Decision and Control (CDC), pages 2533–2538. IEEE, 2012.
[28] R. Horst and N. V. Thoai. DC programming: overview. Journal of Optimization Theory and Applications, 103(1):1–43, 1999.
[29] E. L. Lawler and D. E. Wood. Branch-and-bound methods: A survey. Operations Research, 14(4):699–719, 1966.
[30] H. A. Le Thi, H. M. Le, T. P. Dinh, et al. A DC programming approach for feature selection in support vector machines learning. Advances in Data Analysis and Classification, 2(3):259–278, 2008.
[31] T. Lipp and S. Boyd. Variations and extension of the convex-concave procedure. Optimization and Engineering, 17(2):263–287, 2016.
[32] A. Majumdar and M. Pavone. How should a robot assess risk? Towards an axiomatic theory of risk in robotics. In Robotics Research, pages 75–84. Springer, 2020.
[33] M. Ono, M. Pavone, Y. Kuwata, and J. Balaram. Chance-constrained dynamic programming with application to risk-aware robotic space exploration. Autonomous Robots, 39(4):555–571, 2015.
[34] M. Ono, B. C. Williams, and L. Blackmore. Probabilistic planning for continuous dynamic systems under bounded risk. Journal of Artificial Intelligence Research, 46:511–577, 2013.
[35] J. T. Ott. A Markov decision model for a surveillance application and risk-sensitive Markov decision processes. 2010.
[36] L. Prashanth. Policy gradients for CVaR-constrained MDPs. In International Conference on Algorithmic Learning Theory, pages 155–169. Springer, 2014.
[37] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, New York, NY, 1st edition, 1994.
[38] R. T. Rockafellar, S. Uryasev, et al. Optimization of conditional value-at-risk. Journal of Risk, 2:21–42, 2000.
[39] A. Ruszczyński. Risk-averse dynamic programming for Markov decision processes. Mathematical Programming, 125(2):235–261, 2010.
[40] A. Ruszczyński and A. Shapiro. Optimization of convex risk functions. Mathematics of Operations Research, 31(3):433–452, 2006.
[41] P. Sanders. Fast route planning. Google Tech Talk, March 23, 2009.
[42] A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory. SIAM, 2014.
[43] X. Shen, S. Diamond, Y. Gu, and S. Boyd. Disciplined convex-concave programming. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 1009–1014. IEEE, 2016.
[44] C. E. Sigal, A. A. B. Pritsker, and J. J. Solberg. The stochastic shortest route problem. Operations Research, 28(5):1122–1129, 1980.
[45] S. Singh, Y. Chow, A. Majumdar, and M. Pavone. A framework for time-consistent, risk-sensitive model predictive control: Theory and algorithms. IEEE Transactions on Automatic Control, 2018.
[46] A. Tamar, Y. Chow, M. Ghavamzadeh, and S. Mannor. Sequential decision making with coherent risk. IEEE Transactions on Automatic Control, 62(7):3323–3338, 2016.
[47] J. Thai, T. Hunter, A. K. Akametalu, C. J. Tomlin, and A. M. Bayen. Inverse covariance estimation from data with missing values using the concave-convex procedure. In IEEE Conf. Decision and Control, pages 5736–5742, 2014.
[48] H. Xu and S. Mannor. Distributionally robust Markov decision processes. In Advances in Neural Information Processing Systems, pages 2505–2513, 2010.

APPENDIX

In this appendix, we present the specific DCPs for finding the risk value functions for the two coherent risk measures studied in our numerical experiments, namely, CVaR and EVaR.

For a given confidence level $\varepsilon \in (0, 1)$, the value-at-risk ($\mathrm{VaR}_\varepsilon$) denotes the $(1-\varepsilon)$-quantile value of the cost variable. $\mathrm{CVaR}_\varepsilon$ is the expected loss in the $(1-\varepsilon)$-tail given that the particular threshold $\mathrm{VaR}_\varepsilon$ has been crossed. $\mathrm{CVaR}_\varepsilon$ is given by
$$\rho_t(c_{t+1}) = \inf_{\zeta \in \mathbb{R}} \Big\{ \zeta + \frac{1}{\varepsilon}\, \mathbb{E}\big[(c_{t+1} - \zeta)_+ \mid \mathcal{F}_t\big] \Big\}, \tag{24}$$
where $(\cdot)_+ = \max\{\cdot, 0\}$. A value of $\varepsilon \approx 1$ corresponds to a risk-neutral policy, whereas a value of $\varepsilon \to 0$ corresponds to a rather risk-averse policy.

In fact, Theorem 1 can be applied to CVaR since it is a coherent risk measure. For MDP $\mathcal{M}$, the risk value functions can be computed by DCP (23), where $g_2(J) = \inf_{\zeta \in \mathbb{R}} \big\{ \zeta + \frac{1}{\varepsilon} \sum_{s' \in \mathcal{S}} (J(s') - \zeta)_+ \, T(s' \mid s, \alpha) \big\}$, and the infimum on the right-hand side can be absorbed into the overall infimum problem, i.e., $\inf_{J, \zeta}$. Note that $g_2(J)$ above is convex in $\zeta$ [38, Theorem 1].

Unfortunately, CVaR ignores the losses below the VaR threshold (since it is only concerned with the average of the losses in the $(1-\varepsilon)$-tail of the cost distribution). EVaR is the tightest upper bound, in the sense of the Chernoff inequality, for the value-at-risk (VaR) and CVaR, and its dual representation is associated with the relative entropy. In fact, it was shown in [7] that $\mathrm{EVaR}_\varepsilon$ and $\mathrm{CVaR}_\varepsilon$ are equal only if there are no losses ($c \to 0$) below the $\mathrm{VaR}_\varepsilon$ threshold. In addition, EVaR is a strictly monotone risk measure, whereas CVaR is only monotone [6] (see Definition 4). $\mathrm{EVaR}_\varepsilon$ is given by
$$\rho_t(c_{t+1}) = \inf_{\zeta > 0} \bigg\{ \log\Big( \frac{\mathbb{E}[e^{\zeta c_{t+1}} \mid \mathcal{F}_t]}{\varepsilon} \Big) \Big/ \zeta \bigg\}. \tag{25}$$
Similar to $\mathrm{CVaR}_\varepsilon$, for $\mathrm{EVaR}_\varepsilon$, $\varepsilon \to 1$ corresponds to a risk-neutral case, whereas $\varepsilon \to 0$ corresponds to a risk-averse case. In fact, it was demonstrated in [5, Proposition 3.2] that $\lim_{\varepsilon \to 0} \mathrm{EVaR}_\varepsilon(c) = \operatorname{ess\,sup}(c) = \bar{c}$ (the worst-case cost).

Since $\mathrm{EVaR}_\varepsilon$ is a coherent risk measure, the conditions of Theorem 1 hold. Since $\zeta > 0$, using the change of variables $\tilde{J} \equiv \zeta J$ (note that this change of variables is monotone increasing in $\zeta$ [1]), we can compute the EVaR value functions by solving (23), where $f_0 = 0$, $g_0(\tilde{J}) = \sum_{s \in \mathcal{S}} \tilde{J}(s)$, $f_1(\tilde{J}) = \tilde{J}(s)$, $g_1(c) = \zeta c$, and $g_2(\tilde{J}) = \log\Big( \sum_{s' \in \mathcal{S}} e^{\tilde{J}(s')} T(s' \mid s, \alpha) \big/ \varepsilon \Big)$. Similar to the CVaR case, the infimum over $\zeta$ can be lumped into the overall infimum problem, i.e., $\inf_{\tilde{J}, \zeta > 0}$. Note that $g_2(\tilde{J})$ is convex in $\tilde{J}$, since the logarithm of a sum of exponentials is convex [15, p. 72].
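To make definitions (24) and (25) concrete, the short sketch below (ours, for illustration only) evaluates both risk measures for a discrete cost distribution by direct scalar minimization over $\zeta$; the cost and probability values and the bracketing bounds for $\zeta$ are arbitrary choices.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative evaluation of CVaR_eps and EVaR_eps from their variational
# definitions (24) and (25) for a discrete cost distribution.
def cvar(costs, probs, eps):
    # CVaR_eps = inf_zeta { zeta + E[(c - zeta)_+] / eps }
    obj = lambda z: z + np.sum(probs * np.maximum(costs - z, 0.0)) / eps
    return minimize_scalar(obj, bounds=(costs.min(), costs.max()), method="bounded").fun

def evar(costs, probs, eps):
    # EVaR_eps = inf_{zeta > 0} log( E[exp(zeta * c)] / eps ) / zeta
    obj = lambda z: (np.log(np.sum(probs * np.exp(z * costs))) - np.log(eps)) / z
    return minimize_scalar(obj, bounds=(1e-6, 2.0), method="bounded").fun

costs = np.array([1.0, 1.0, 5.0, 20.0])    # hypothetical costs
probs = np.array([0.5, 0.3, 0.15, 0.05])   # hypothetical probabilities
for eps in (0.7, 0.3):
    print(eps, cvar(costs, probs, eps), evar(costs, probs, eps))
# The printed values satisfy E[c] <= CVaR_eps <= EVaR_eps, as noted in Section VI.
```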