On-Policy Deep Reinforcement Learning for the Average-Reward Criterion
Yiming Zhang (New York University), Keith W. Ross (New York University Shanghai). Correspondence to: Yiming Zhang.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

We develop theory and algorithms for average-reward on-policy Reinforcement Learning (RL). We first consider bounding the difference of the long-term average reward for two policies. We show that previous work based on the discounted return (Schulman et al., 2015; Achiam et al., 2017) results in a non-meaningful bound in the average-reward setting. By addressing the average-reward criterion directly, we then derive a novel bound which depends on the average divergence between the two policies and on Kemeny's constant. Based on this bound, we develop an iterative procedure which produces a sequence of monotonically improved policies for the average-reward criterion. This iterative procedure can then be combined with classic Deep Reinforcement Learning (DRL) methods, resulting in practical DRL algorithms that target the long-run average-reward criterion. In particular, we demonstrate that Average-Reward TRPO (ATRPO), which adapts the on-policy TRPO algorithm to the average-reward criterion, significantly outperforms TRPO in the most challenging MuJoCo environments.

1. Introduction

The goal of Reinforcement Learning (RL) is to build agents that can learn high-performing behaviors through trial-and-error interactions with the environment. Broadly speaking, modern RL tackles two kinds of problems: episodic tasks and continuing tasks. In episodic tasks, the agent-environment interaction can be broken into separate distinct episodes, and the performance of the agent is simply the sum of the rewards accrued within an episode. Examples of episodic tasks include training an agent to learn to play Go (Silver et al., 2016; 2018), where the episode terminates when the game ends. In continuing tasks, such as robotic locomotion (Peters & Schaal, 2008; Schulman et al., 2015; Haarnoja et al., 2018) or in a queuing scenario (Tadepalli & Ok, 1994; Sutton & Barto, 2018), there is no natural separation of episodes and the agent-environment interaction continues indefinitely. The performance of an agent in a continuing task is more difficult to quantify since the total sum of rewards is typically infinite.

One way of making the long-term reward objective meaningful for continuing tasks is to apply discounting, so that the infinite-horizon return is guaranteed to be finite for any bounded reward function. However, the discounted objective biases the optimal policy toward actions that lead to high near-term performance rather than high long-term performance. Such an objective is not appropriate when the goal is to optimize long-term behavior, i.e., when the natural objective underlying the task at hand is non-discounted. In particular, we note that for the vast majority of reinforcement learning benchmarks, such as Atari games (Mnih et al., 2013) and MuJoCo (Todorov et al., 2012), a non-discounted performance measure is used to evaluate the trained policies.

Although in many circumstances non-discounted criteria are more natural, most of today's successful DRL algorithms have been designed to optimize a discounted criterion during training. One possible work-around for this mismatch is to simply train with a discount factor that is very close to one. Indeed, from the Blackwell optimality theory of MDPs (Blackwell, 1962), we know that if the discount factor is very close to one, then an optimal policy for the infinite-horizon discounted criterion is also optimal for the long-run average-reward criterion. However, although Blackwell's result suggests we can simply use a large discount factor to optimize non-discounted criteria, problems with large discount factors are in general more difficult to solve (Petrik & Scherrer, 2008; Jiang et al., 2015; 2016; Lehnert et al., 2018). Researchers have also observed that state-of-the-art DRL algorithms typically break down when the discount factor gets too close to one (Schulman et al., 2016; Andrychowicz et al., 2020).

In this paper we seek to develop algorithms for finding high-performing policies for average-reward DRL problems. Instead of simply using standard discounted DRL algorithms with large discount factors, we attack the problem head-on, seeking to directly optimize the average-reward criterion. While the average-reward setting has been
extensively studied in the classical Markov Decision Process literature (Howard, 1960; Blackwell, 1962; Veinott, 1966; Bertsekas et al., 1995), and has to some extent been studied for tabular RL (Schwartz, 1993; Mahadevan, 1996; Abounadi et al., 2001; Wan et al., 2020), it has received relatively little attention in the DRL community. In this paper, our focus is on developing average-reward on-policy DRL algorithms.

One major source of difficulty with modern on-policy DRL algorithms lies in controlling the step-size for policy updates. In order to have better control over step-sizes, Schulman et al. (2015) constructed a lower bound on the difference between the expected discounted returns of two arbitrary policies π and π′ by building upon the work of Kakade & Langford (2002). The bound is a function of the divergence between these two policies and the discount factor. Schulman et al. (2015) showed that iteratively maximizing this lower bound generates a sequence of policies with monotonically improving discounted return.

In this paper, we first show that the policy improvement theorem from Schulman et al. (2015) results in a non-meaningful bound in the average-reward case. We then derive a novel result which lower bounds the difference of the long-run average rewards. The bound depends on the average divergence between the policies and on the so-called Kemeny constant, which measures to what degree the irreducible Markov chains associated with the policies are "well-mixed". We show that iteratively maximizing this lower bound guarantees monotonic average-reward policy improvement.

Similar to the discounted case, the problem of maximizing the lower bound can be approximated with DRL algorithms which can be optimized using samples collected in the environment. In particular, we describe in detail the Average-Reward TRPO (ATRPO) algorithm, which is the average-reward variant of the TRPO algorithm (Schulman et al., 2015). Using the MuJoCo simulated robotic benchmark, we carry out extensive experiments demonstrating the effectiveness of ATRPO compared to its discounted counterpart, in particular on the most challenging MuJoCo tasks. Notably, we show that ATRPO can significantly outperform TRPO on a set of high-dimensional continuing control tasks.

Our main contributions can be summarized as follows:

• We extend the policy improvement bound from Schulman et al. (2015) and Achiam et al. (2017) to the average-reward setting. We demonstrate that our new bound depends on the average divergence between the two policies and on the mixing time of the underlying Markov chain.

• We use the aforementioned policy improvement bound to derive novel on-policy deep reinforcement learning algorithms for optimizing the average reward.

• Most modern DRL algorithms introduce a discount factor during training even when the natural objective of interest is undiscounted. This leads to a discrepancy between the evaluation and training objectives. We demonstrate that optimizing the average reward directly can effectively address this mismatch and lead to much stronger performance.

2. Preliminaries

Consider a Markov Decision Process (MDP) (Sutton & Barto, 2018) (S, A, P, r, μ), where the state space S and action space A are assumed to be finite. The transition probability is denoted by P : S × A × S → [0, 1], the bounded reward function by r : S × A → [r_min, r_max], and μ : S → [0, 1] is the initial state distribution. Let π : S → Δ(A) be a stationary policy, where Δ(A) is the probability simplex over A, and let Π be the set of all stationary policies. We consider two classes of MDPs:

Assumption 1 (Ergodic). For every stationary policy, the induced Markov chain is irreducible and aperiodic.

Assumption 2 (Aperiodic Unichain). For every stationary policy, the induced Markov chain contains a single aperiodic recurrent class and a finite but possibly empty set of transient states.

By definition, any MDP which satisfies Assumption 1 is also unichain. We note that most MDPs of practical interest belong to these two classes. We mostly focus on MDPs which satisfy Assumption 1 in the main text; the aperiodic unichain case is addressed in the supplementary material. Here we present the two objective formulations for continuing control tasks: the average-reward criterion and the discounted-reward criterion.

Average Reward Criterion

The average-reward objective is defined as

    ρ(π) := lim_{N→∞} (1/N) E_{τ∼π}[ Σ_{t=0}^{N−1} r(s_t, a_t) ] = E_{s∼d_π, a∼π}[ r(s, a) ],    (1)

where d_π(s) := lim_{N→∞} (1/N) Σ_{t=0}^{N−1} P_{τ∼π}(s_t = s) is the stationary state distribution under policy π, and τ = (s_0, a_0, . . .) is a sample trajectory. The limits in ρ(π) and d_π(s) are guaranteed to exist under our assumptions. Since the MDP is aperiodic, it can also be shown that d_π(s) = lim_{t→∞} P_{τ∼π}(s_t = s). In the unichain case, the average reward ρ(π) does not depend on the initial state for any policy π (Bertsekas et al., 1995).
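For concreteness, the objective in Eq. (1) can be estimated by simply averaging rewards along one long trajectory. The sketch below is illustrative only and assumes a gym-style `env` with a `step` method and a `policy` callable that samples actions; neither is part of the paper.

```python
import numpy as np

def estimate_average_reward(env, policy, num_steps=100_000):
    """Monte Carlo estimate of rho(pi) in Eq. (1): the empirical mean reward
    along a single long trajectory of a continuing (never-terminating) task."""
    s = env.reset()
    rewards = np.empty(num_steps)
    for t in range(num_steps):
        a = policy(s)                      # sample a ~ pi(.|s)
        s, r, _, _ = env.step(a)           # continuing task: ignore `done`
        rewards[t] = r
    return rewards.mean()                  # converges to rho(pi) under ergodicity
```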
We express the average-reward bias function as

    V̄^π(s) := E_{τ∼π}[ Σ_{t=0}^{∞} ( r(s_t, a_t) − ρ(π) ) | s_0 = s ]

and the average-reward action-bias function as

    Q̄^π(s, a) := E_{τ∼π}[ Σ_{t=0}^{∞} ( r(s_t, a_t) − ρ(π) ) | s_0 = s, a_0 = a ].

We define the average-reward advantage function as Ā^π(s, a) := Q̄^π(s, a) − V̄^π(s).

Discounted Reward Criterion

For some discount factor γ ∈ (0, 1), the discounted-reward objective is defined as

    ρ_γ(π) := E_{τ∼π}[ Σ_{t=0}^{∞} γ^t r(s_t, a_t) ] = (1/(1−γ)) E_{s∼d_{π,γ}, a∼π}[ r(s, a) ],    (2)

where d_{π,γ}(s) := (1−γ) Σ_{t=0}^{∞} γ^t P_{τ∼π}(s_t = s) is known as the future discounted state visitation distribution under policy π. Note that, unlike the average-reward objective, the discounted objective depends on the initial state distribution μ. It can be easily shown that d_{π,γ}(s) → d_π(s) for all s as γ → 1. The discounted value function is defined as V_γ^π(s) := E_{τ∼π}[ Σ_{t=0}^{∞} γ^t r(s_t, a_t) | s_0 = s ] and the discounted action-value function as Q_γ^π(s, a) := E_{τ∼π}[ Σ_{t=0}^{∞} γ^t r(s_t, a_t) | s_0 = s, a_0 = a ]. Finally, the discounted advantage function is defined as A_γ^π(s, a) := Q_γ^π(s, a) − V_γ^π(s).

It is well known that lim_{γ→1} (1−γ) ρ_γ(π) = ρ(π), implying that the discounted and average-reward objectives are equivalent in the limit as γ approaches 1 (Blackwell, 1962). We further discuss the relationship between the discounted and average-reward criteria in Appendix A and prove that lim_{γ→1} A_γ^π(s, a) = Ā^π(s, a) (see Corollary A.1). The proofs of all results in the subsequent sections, if not given, can be found in the supplementary material.

3. Monotonic Improvement Guarantees for Discounted RL

In much of the on-policy DRL literature (Schulman et al., 2015; 2017; Wu et al., 2017; Vuong et al., 2019; Song et al., 2020), algorithms iteratively update policies by maximizing them within a local region, i.e., at iteration k we find a policy π_{k+1} by maximizing ρ_γ(π) within some region D(π, π_k) ≤ δ for some divergence measure D. By using different choices of D and δ, this approach allows us to control the step-size of each update, which can lead to better sample efficiency (Peters & Schaal, 2008). Schulman et al. (2015) derived a policy improvement bound based on a specific choice of D:

    ρ_γ(π_{k+1}) − ρ_γ(π_k) ≥ (1/(1−γ)) E_{s∼d_{π_k,γ}, a∼π_{k+1}}[ A_γ^{π_k}(s, a) ] − C · max_s D_TV(π_{k+1} ∥ π_k)[s],    (3)

where D_TV(π′ ∥ π)[s] := (1/2) Σ_a |π′(a|s) − π(a|s)| is the total variation divergence and C = 4εγ/(1−γ)², where ε is some constant. Schulman et al. (2015) showed that by choosing π_{k+1} to maximize the right-hand side of (3), we are guaranteed to have ρ_γ(π_{k+1}) ≥ ρ_γ(π_k). This provided the theoretical foundation for an entire class of on-policy DRL algorithms (Schulman et al., 2015; 2017; Wu et al., 2017; Vuong et al., 2019; Song et al., 2020).

A natural question is whether the iterative procedure described by Schulman et al. (2015) also guarantees improvement for the average reward. Since the discounted and average-reward objectives become equivalent as γ → 1, one may conjecture that we can also lower bound the policy performance difference for the average-reward objective by simply letting γ → 1 in the bounds of Schulman et al. (2015). Unfortunately, this results in a non-meaningful bound (see the supplementary material for a proof).

Proposition 1. Consider the bounds in Theorem 1 of Schulman et al. (2015) and Corollary 1 of Achiam et al. (2017). The right-hand side of both bounds times 1 − γ goes to negative infinity as γ → 1.

Since lim_{γ→1} (1−γ)(ρ_γ(π′) − ρ_γ(π)) = ρ(π′) − ρ(π), Proposition 1 says that the policy improvement guarantee from Schulman et al. (2015) and Achiam et al. (2017) becomes trivial when γ → 1 and thus does not generalize to the average-reward setting. In the next section, we derive a novel policy improvement bound for the average-reward objective, which in turn can be used to generate monotonically improved policies w.r.t. the average reward.

4. Main Results

4.1. Average Reward Policy Improvement

Let d_π ∈ R^{|S|} be the probability column vector whose components are d_π(s). Let P_π ∈ R^{|S|×|S|} be the transition matrix under policy π, whose (s, s′) component is P_π(s′|s) = Σ_a P(s′|s, a) π(a|s), and let P_π* := lim_{N→∞} (1/N) Σ_{t=0}^{N} P_π^t be the limiting matrix of the transition matrix. For aperiodic unichain MDPs, P_π* = lim_{t→∞} P_π^t = 1 d_π^T.
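For a small tabular MDP, the objects defined above can be formed explicitly. The following sketch (an illustration under the stated assumptions, not part of the paper) builds P_π from a tensor P[s, a, s′] and a policy table π[s, a], and recovers d_π as the left eigenvector of P_π with eigenvalue 1.

```python
import numpy as np

def policy_transition_matrix(P, pi):
    """P_pi(s'|s) = sum_a P(s'|s,a) pi(a|s), with P of shape (S, A, S)
    and pi of shape (S, A)."""
    return np.einsum('sap,sa->sp', P, pi)

def stationary_distribution(P_pi):
    """Stationary distribution d_pi of an ergodic chain: the left eigenvector
    of P_pi associated with eigenvalue 1, normalized to sum to one."""
    evals, evecs = np.linalg.eig(P_pi.T)
    v = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    return v / v.sum()

# Limiting matrix from the text: P_pi_star = 1 d_pi^T
# P_pi_star = np.outer(np.ones(P_pi.shape[0]), stationary_distribution(P_pi))
```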
Suppose we have a new policy π′ obtained via some update rule from the current policy π. Similar to the discounted case, we would like to measure the performance difference ρ(π′) − ρ(π) using an expression which depends on π and some divergence metric between the two policies. The following identity shows that ρ(π′) − ρ(π) can be expressed using the average-reward advantage function of π.

Lemma 1. Under Assumption 2,

    ρ(π′) − ρ(π) = E_{s∼d_{π′}, a∼π′}[ Ā^π(s, a) ]    (4)

for any two stochastic policies π and π′.

Lemma 1 is an extension of the well-known policy difference lemma from Kakade & Langford (2002) to the average-reward case. A similar result was proven by Even-Dar et al. (2009) and Neu et al. (2010). For completeness, we provide a simple proof in the supplementary material. Note that this expression depends on samples drawn from π′. However, we can show through the following lemma that when d_π and d_{π′} are "close" w.r.t. the TV divergence, we can evaluate ρ(π′) using samples from d_π (see the supplementary material for a proof).

Lemma 2. Under Assumption 2, the following bound holds for any two stochastic policies π and π′:

    | ρ(π′) − ρ(π) − E_{s∼d_π, a∼π′}[ Ā^π(s, a) ] | ≤ 2 ε D_TV(d_{π′} ∥ d_π),    (5)

where ε = max_s | E_{a∼π′(·|s)}[ Ā^π(s, a) ] |.

Lemma 2 implies that

    ρ(π′) ≈ ρ(π) + E_{s∼d_π, a∼π′}[ Ā^π(s, a) ]    (6)

when d_π and d_{π′} are "close". However, in order to study how policy improvement is connected to changes in the actual policies themselves, we need to analyze the relationship between changes in the policies and changes in the stationary distributions. It turns out that the sensitivity of the stationary distributions to the policies is related to the structure of the underlying Markov chain.

Let M^π ∈ R^{|S|×|S|} be the mean first passage time matrix whose element M^π(s, s′) is the expected number of steps it takes to reach state s′ from s under policy π. Under Assumption 1, the matrix M^π can be calculated via (see Theorem 4.4.7 of Kemeny & Snell (1960))

    M^π = (I − Z^π + E Z^π_dg) D_π,    (7)

where Z^π = (I − P_π + P_π*)^{−1} is known as the fundamental matrix of the Markov chain (Kemeny & Snell, 1960) and E is a square matrix consisting of all ones. The subscript 'dg' on a square matrix refers to taking the diagonal of that matrix and placing zeros everywhere else. D_π ∈ R^{|S|×|S|} is a diagonal matrix whose elements are 1/d_π(s).

One important property of mean first passage times is that, for any MDP which satisfies Assumption 1, the quantity

    κ_π = Σ_{s′} d_π(s′) M^π(s, s′) = trace(Z^π)    (8)

is a constant independent of the starting state for any policy π (Theorem 4.4.10 of Kemeny & Snell (1960)). The constant κ_π is sometimes referred to as Kemeny's constant (Grinstead & Snell, 2012). This constant can be interpreted as the mean number of steps it takes to reach any goal state, weighted by the steady-state distribution of the goal states; as just noted, this weighted mean does not depend on the starting state. It can be shown that the value of Kemeny's constant is also related to the mixing time of the Markov chain, i.e., how fast the chain converges to the stationary distribution (see Appendix C for additional details).

The following result connects the sensitivity of the stationary distribution to changes in the policy.

Lemma 3. Under Assumption 1, the divergence between the stationary distributions d_π and d_{π′} can be upper bounded by the average divergence between policies π and π′:

    D_TV(d_{π′} ∥ d_π) ≤ (κ* − 1) E_{s∼d_π}[ D_TV(π′ ∥ π)[s] ],    (9)

where κ* = max_π κ_π.

For Markov chains with a small mixing time, where an agent can quickly get to any state, Kemeny's constant is relatively small, and Lemma 3 shows that the stationary distributions are not highly sensitive to small changes in the policy. On the other hand, for Markov chains that have high mixing times, the factor can become very large. In this case Lemma 3 shows that small changes in the policy can have a large impact on the resulting stationary distributions.

Combining the bounds in Lemma 2 and Lemma 3 gives us the following result.

Theorem 1. Under Assumption 1, the following bounds hold for any two stochastic policies π and π′:

    D_π^−(π′) ≤ ρ(π′) − ρ(π) ≤ D_π^+(π′),    (10)

where

    D_π^±(π′) = E_{s∼d_π, a∼π′}[ Ā^π(s, a) ] ± 2ξ E_{s∼d_π}[ D_TV(π′ ∥ π)[s] ]

and ξ = (κ* − 1) max_s E_{a∼π′} |Ā^π(s, a)|.

The bounds in Theorem 1 are guaranteed to be finite. Analogous to the discounted case, the multiplicative factor ξ provides guidance on the step-sizes for policy updates. Note that Theorem 1 holds for MDPs satisfying Assumption 1; in Appendix D we discuss how a similar result can be derived for the more general aperiodic unichain case.
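For small ergodic chains, the quantities in Eqs. (7) and (8) can be computed directly from P_π. The sketch below is purely illustrative (the algorithms in this paper never form these matrices explicitly) and assumes P_π is given as a NumPy array.

```python
import numpy as np

def kemeny_and_mfpt(P_pi):
    """Kemeny's constant kappa_pi = trace(Z_pi) (Eq. 8) and the mean first
    passage time matrix M_pi (Eq. 7) for an ergodic chain P_pi."""
    n = P_pi.shape[0]
    # stationary distribution: left eigenvector of P_pi for eigenvalue 1
    evals, evecs = np.linalg.eig(P_pi.T)
    d = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    d = d / d.sum()
    P_star = np.outer(np.ones(n), d)              # limiting matrix 1 d_pi^T
    Z = np.linalg.inv(np.eye(n) - P_pi + P_star)  # fundamental matrix
    E = np.ones((n, n))
    M = (np.eye(n) - Z + E @ np.diag(np.diag(Z))) @ np.diag(1.0 / d)  # Eq. (7)
    return np.trace(Z), M                         # kappa_pi, M_pi
```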
The bound in Theorem 1 is given in terms of the TV divergence; however, the KL divergence is more commonly used in practice. The relationship between the TV and KL divergences is given by Pinsker's inequality (Tsybakov, 2008), which says that for any two distributions p and q, D_TV(p ∥ q) ≤ sqrt( D_KL(p ∥ q) / 2 ). We can then show that

    E_{s∼d_π}[ D_TV(π′ ∥ π)[s] ] ≤ E_{s∼d_π}[ sqrt( D_KL(π′ ∥ π)[s] / 2 ) ] ≤ sqrt( E_{s∼d_π}[ D_KL(π′ ∥ π)[s] ] / 2 ),    (11)

where the second inequality comes from Jensen's inequality. The inequality in (11) shows that the bounds in Theorem 1 still hold when E_{s∼d_π}[ D_TV(π′ ∥ π)[s] ] is substituted with sqrt( E_{s∼d_π}[ D_KL(π′ ∥ π)[s] ] / 2 ).

4.2. Approximate Policy Iteration

One direct consequence of Theorem 1 is that iteratively maximizing the D_π^−(π′) term in the bound generates a monotonically improving sequence of policies w.r.t. the average-reward objective. Algorithm 1 gives an approximate policy iteration algorithm that produces such a sequence of policies.

Algorithm 1 Approximate Average Reward Policy Iteration
1: Input: π_0
2: for k = 0, 1, 2, . . . do
3:    Policy Evaluation: Evaluate Ā^{π_k}(s, a) for all s, a
4:    Policy Improvement:
          π_{k+1} = argmax_π D_{π_k}^−(π),    (12)
      where
          D_{π_k}^−(π) = E_{s∼d_{π_k}, a∼π}[ Ā^{π_k}(s, a) ] − ξ sqrt( 2 E_{s∼d_{π_k}}[ D_KL(π ∥ π_k)[s] ] )
      and ξ = (κ* − 1) max_s E_{a∼π} |Ā^{π_k}(s, a)|
5: end for

Proposition 2. Given an initial policy π_0, Algorithm 1 is guaranteed to generate a sequence of policies π_1, π_2, . . . such that ρ(π_0) ≤ ρ(π_1) ≤ ρ(π_2) ≤ · · · .

Proof. At iteration k, for π = π_k we have E_{s∼d_{π_k}, a∼π}[ Ā^{π_k}(s, a) ] = 0 and E_{s∼d_{π_k}}[ D_KL(π ∥ π_k)[s] ] = 0, so D_{π_k}^−(π_k) = 0 and the maximization in (12) gives D_{π_k}^−(π_{k+1}) ≥ 0. By Theorem 1, (11), and (12), ρ(π_{k+1}) − ρ(π_k) ≥ 0.

However, Algorithm 1 is difficult to implement in practice since it requires exact knowledge of Ā^{π_k}(s, a) and the transition matrix. Furthermore, calculating the term ξ is impractical for high-dimensional problems. In the next section, we introduce a sample-based algorithm which approximates the update rule in Algorithm 1.

5. Practical Algorithm

As noted in the previous section, Algorithm 1 is not practical for problems with large state and action spaces. In this section, we discuss how Algorithm 1 and Theorem 1 can be used in practice to create algorithms which can effectively solve high-dimensional DRL problems with the use of trust region methods.

In Appendix F, we also discuss how Theorem 1 can be used to solve DRL problems with average-cost safety constraints. RL with safety constraints is an important class of problems with practical implications (Amodei et al., 2016). Trust region methods have been successfully applied to this class of problems as they provide worst-case constraint violation guarantees for evaluating the cost constraint values for policy updates (Achiam et al., 2017; Yang et al., 2020; Zhang et al., 2020). However, the aforementioned theoretical guarantees were only shown to apply to discounted cost constraints. Tessler et al. (2019) pointed out that trust-region based methods such as the Constrained Policy Optimization (CPO) algorithm (Achiam et al., 2017) cannot be used for average cost constraints. Contrary to this belief, in Appendix F we demonstrate that Theorem 1 provides a worst-case constraint violation guarantee for average costs, and that trust-region-based constrained RL methods can easily be modified to accommodate average cost constraints.

5.1. Average Reward Trust Region Methods

For DRL problems, it is common to consider some parameterized policy class Π_Θ = {π_θ : θ ∈ Θ}. Our goal is to devise a computationally tractable version of Algorithm 1 for policies in Π_Θ. We can rewrite the unconstrained optimization problem in (12) as a constrained problem:

    maximize_{π_θ ∈ Π_Θ}  E_{s∼d_{π_{θ_k}}, a∼π_θ}[ Ā^{π_{θ_k}}(s, a) ]
    subject to  D̄_KL(π_θ ∥ π_{θ_k}) ≤ δ,    (13)

where D̄_KL(π_θ ∥ π_{θ_k}) := E_{s∼d_{π_{θ_k}}}[ D_KL(π_θ ∥ π_{θ_k})[s] ]. Importantly, the advantage function Ā^{π_{θ_k}}(s, a) appearing in (13) is the average-reward advantage function, defined as the action-bias minus the bias, and not the discounted advantage function. The constraint set {π_θ ∈ Π_Θ : D̄_KL(π_θ ∥ π_{θ_k}) ≤ δ} is called the trust region set. The problem (13) can be regarded as an average-reward variant of the trust region problem from Schulman et al. (2015). The step-size δ is treated as a hyperparameter in practice and should ideally be tuned for each specific task. However, we note that in the average-reward setting the choice of step-size is related to the mixing time of the underlying Markov chain (since it is related to the multiplicative factor ξ in Theorem 1). When the mixing time is small, a larger step-size can be chosen, and vice versa. While it is impractical to calculate the optimal step-size, in certain applications domain knowledge of the mixing time can be used as a guide for tuning δ.
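To make the connection to implementations concrete, the objective and constraint in (13) are typically estimated from states sampled under π_{θ_k}, with the objective written as an importance-sampled surrogate. The sketch below is a minimal illustration under the assumption that `policy` and `old_policy` map a batch of states to `torch.distributions` objects; the actual ATRPO update (Taylor approximation plus Lagrange duality) is described in Appendix E.

```python
import torch

def surrogate_and_avg_kl(policy, old_policy, states, actions, advantages):
    """Sample estimates of the objective and constraint in problem (13)."""
    dist, old_dist = policy(states), old_policy(states)
    logp, old_logp = dist.log_prob(actions), old_dist.log_prob(actions)
    # importance-sampled surrogate for E_{s~d_k, a~pi_theta}[A_bar^{pi_k}(s, a)]
    surrogate = (torch.exp(logp - old_logp.detach()) * advantages).mean()
    # average KL divergence D_bar_KL(pi_theta || pi_theta_k) over sampled states
    avg_kl = torch.distributions.kl_divergence(dist, old_dist).mean()
    return surrogate, avg_kl
```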
When we set π_{θ_{k+1}} to be the optimal solution to (13), similar to the discounted case, the policy improvement guarantee no longer holds. However, we can show that π_{θ_{k+1}} has the following worst-case performance degradation guarantee:

Proposition 3. Let π_{θ_{k+1}} be the optimal solution to (13) for some π_{θ_k} ∈ Π_Θ. The policy performance difference between π_{θ_{k+1}} and π_{θ_k} can be lower bounded by

    ρ(π_{θ_{k+1}}) − ρ(π_{θ_k}) ≥ −ξ^{π_{θ_{k+1}}} sqrt(2δ),    (14)

where ξ^{π_{θ_{k+1}}} = (κ^{π_{θ_{k+1}}} − 1) max_s E_{a∼π_{θ_{k+1}}} |Ā^{π_{θ_k}}(s, a)|.

Proof. Since D̄_KL(π_{θ_k} ∥ π_{θ_k}) = 0, π_{θ_k} is feasible. The objective value is 0 for π_θ = π_{θ_k}. The bound follows from (10) and (11), where the average KL is bounded by δ.

Several algorithms have been proposed for efficiently solving the discounted version of (13): Schulman et al. (2015) and Wu et al. (2017) convert (13) into a convex problem via Taylor approximations; another approach is to first solve (13) in the non-parametric policy space and then project the result back into the parameter space (Vuong et al., 2019; Song et al., 2020). These algorithms can also be adapted to the average-reward case and are theoretically justified via Theorem 1 and Proposition 3. In the next section, we show as a specific example how this can be done for one such algorithm.

5.2. Average Reward TRPO (ATRPO)

In this section, we introduce ATRPO, an average-reward modification of the TRPO algorithm (Schulman et al., 2015). Similar to TRPO, we apply Taylor approximations to (13). This gives us a new optimization problem which can be solved exactly using Lagrange duality (Boyd et al., 2004). The solution to this approximate problem gives an explicit update rule for the policy parameters, which then allows us to perform policy updates using an actor-critic framework. More details can be found in Appendix E. Algorithm 2 provides a basic outline of ATRPO.

Algorithm 2 Average Reward TRPO (ATRPO)
1: Input: Policy parameters θ_0, critic net parameters φ_0, learning rate α, trajectory truncation parameter N.
2: for k = 0, 1, 2, . . . do
3:    Collect a truncated trajectory {s_t, a_t, s_{t+1}, r_t}, t = 1, . . . , N, from the environment using π_{θ_k}.
4:    Calculate the sample average reward of π_{θ_k} via ρ = (1/N) Σ_{t=1}^{N} r_t.
5:    for t = 1, 2, . . . , N do
6:       Get target: V̄_t^target = r_t − ρ + V̄_{φ_k}(s_{t+1})
7:       Get advantage estimate: Â(s_t, a_t) = r_t − ρ + V̄_{φ_k}(s_{t+1}) − V̄_{φ_k}(s_t)
8:    end for
9:    Update the critic by φ_{k+1} ← φ_k − α ∇_φ L(φ_k), where L(φ_k) = (1/N) Σ_{t=1}^{N} ( V̄_{φ_k}(s_t) − V̄_t^target )²
10:   Use Â(s_t, a_t) to update θ_k using the TRPO policy update (Schulman et al., 2015).
11: end for

The major differences between ATRPO and TRPO are as follows:

i.   The critic network in Algorithm 2 approximates the average-reward bias rather than the discounted value function.

ii.  ATRPO must estimate the average return ρ of the current policy.

iii. The targets for the bias and the advantage are calculated without discount factors, and the average return ρ is subtracted from the reward. Simply setting the discount factor to 1 in TRPO does not lead to Algorithm 2.

iv.  ATRPO also assumes that the underlying task is a continuing infinite-horizon task. But since in practice we cannot run infinitely long trajectories, all trajectories are truncated at some large truncation value N. Unlike TRPO, during training we do not allow for episodic tasks where episodes terminate early (before N). For the MuJoCo environments, we address this by having the agent not only resume locomotion after falling but also incur a penalty for falling (see Section 6).

In Algorithm 2, for illustrative purposes, we use the average-reward one-step bootstrapped estimate for the target of the critic and the advantage function (see the sketch below). In practice, we instead develop and use an average-reward version of the Generalized Advantage Estimator (GAE) from Schulman et al. (2016). In Appendix G we provide more details on how GAE can be generalized to the average-reward case.
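The sketch below illustrates the one-step targets and advantages from lines 4-8 of Algorithm 2. The second function is only a plausible guess at the average-reward GAE, since Appendix G is not reproduced here: it replaces the discounted TD residual with δ_t = r_t − ρ + V̄(s_{t+1}) − V̄(s_t) and drops the discount factor from the exponential weighting.

```python
import numpy as np

def one_step_targets_and_advantages(rewards, values, next_values, rho):
    """Average-reward one-step critic targets and advantage estimates
    (Algorithm 2, lines 6-7): values[t] ~ V_bar(s_t), next_values[t] ~ V_bar(s_{t+1})."""
    rewards, values, next_values = map(np.asarray, (rewards, values, next_values))
    targets = rewards - rho + next_values       # critic regression targets
    advantages = targets - values               # A_hat(s_t, a_t)
    return targets, advantages

def average_reward_gae(rewards, values, next_values, rho, lam=0.95):
    """Assumed form of the average-reward GAE: exponentially weighted sums of
    the undiscounted residuals delta_t = r_t - rho + V(s_{t+1}) - V(s_t)."""
    deltas = np.asarray(rewards) - rho + np.asarray(next_values) - np.asarray(values)
    adv = np.zeros_like(deltas, dtype=float)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + lam * running
        adv[t] = running
    return adv
```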
6. Experiments

We conducted experiments comparing the performance of ATRPO and TRPO on continuing control tasks. We consider three tasks (Ant, HalfCheetah, and Humanoid) from the MuJoCo physics simulator (Todorov et al., 2012) implemented using OpenAI Gym (Brockman et al., 2016), where the natural goal is to train the agents to run as fast as possible without falling.

Figure 1. Comparing performance of ATRPO and TRPO with different discount factors. The x-axis is the number of agent-environment interactions and the y-axis is the total return averaged over 10 seeds. The solid line represents the agents' performance on evaluation trajectories of maximum length 1,000 (top row) and 10,000 (bottom row). The shaded region represents one standard deviation.

6.1. Evaluation Protocol

Even though the MuJoCo benchmark is commonly trained using the discounted objective (see e.g. Schulman et al. (2015), Wu et al. (2017), Lillicrap et al. (2016), Schulman et al. (2017), Haarnoja et al. (2018), Vuong et al. (2019)), it is always evaluated without discounting. Similarly, we also evaluate performance using the undiscounted total-reward objective for both TRPO and ATRPO.

Specifically, for each environment we train a policy for 10 million environment steps. During training, every 100,000 steps, we run 10 separate evaluation trajectories with the current policy without exploration (i.e., the policy is kept fixed and deterministic). For each evaluation trajectory we calculate the undiscounted return of the trajectory until the agent falls or until 1,000 steps, whichever comes first. We then report the average undiscounted return over the 10 trajectories. Note that this is the standard evaluation metric for the MuJoCo environments. In order to understand the performance of the agent over long time horizons, we also report the performance of the agent evaluated on trajectories of maximum length 10,000.

6.2. Comparing ATRPO and TRPO

To simulate an infinite-horizon setting during training, we do the following: when the agent falls, the trajectory does not terminate; instead the agent incurs a large reset cost for falling and then continues the trajectory from a random start state. The reset cost is set to 100; however, we show in the supplementary material (Appendix I.2) that the results are largely insensitive to the choice of reset cost. We note that this modification does not change the underlying goal of the task. We also point out that the reset cost is only applied during training and is not used in the evaluation phase described in the previous section (a sketch of this training-time modification is given at the end of this subsection). Hyperparameter settings and other additional details can be found in Appendix H.

We plot the performance of ATRPO and TRPO trained with different discount factors in Figure 1. We see that TRPO with its best discount factor can perform as well as ATRPO on the simplest environment, HalfCheetah. But ATRPO provides dramatic improvements on Ant and Humanoid. In particular, for the most challenging environment, Humanoid, ATRPO performs on average 50.1% better than TRPO with its best discount factor when evaluated on trajectories of maximum length 1,000. The improvement is even greater when the agents are evaluated on trajectories of maximum length 10,000, where the performance boost jumps to 913%. In Appendix I.1, we provide an additional set of experiments demonstrating that ATRPO also significantly outperforms TRPO when TRPO is trained without the reset scheme described at the beginning of this section (i.e., the standard MuJoCo setting).
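As an illustration of the training-time reset scheme described above, a simple environment wrapper might look as follows. It assumes the classic Gym step API returning `(obs, reward, done, info)` and is a sketch of the idea, not the authors' code.

```python
import gym

class ResetCostWrapper(gym.Wrapper):
    """When the agent falls (the underlying episode would terminate), charge a
    fixed reset cost and continue from a random start state instead of ending
    the trajectory. Used only during training, never during evaluation."""

    def __init__(self, env, reset_cost=100.0):
        super().__init__(env)
        self.reset_cost = reset_cost

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if done:                         # the agent fell
            reward -= self.reset_cost    # penalty for falling
            obs = self.env.reset()       # resume from a random initial state
            done = False                 # the trajectory continues
        return obs, reward, done, info
```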
Figure 2. Speed-time plot of a single trajectory (maximum length 10,000) for ATRPO and discounted TRPO in the Humanoid-v3 environment. The solid line represents the speed of the agent at the corresponding timesteps.

We make two observations regarding discounting. First, we note that increasing the discount factor does not necessarily lead to better performance for TRPO. A larger discount factor in principle enables the algorithm to seek a policy that performs well for the average-reward criterion (Blackwell, 1962). Unfortunately, a larger discount factor can also increase the variance of the gradient estimator (Zhao et al., 2011; Schulman et al., 2016), increase the complexity of the policy space (Jiang et al., 2015), lead to slower convergence (Bertsekas et al., 1995; Agarwal et al., 2020), and degrade generalization in limited-data settings (Amit et al., 2020). Moreover, algorithms with discounting are known to become unstable as γ → 1 (Naik et al., 2019). Secondly, for TRPO the best discount factor is different for each environment (0.99 for HalfCheetah and Ant, 0.95 for Humanoid). The discount factor therefore serves as a hyperparameter which can be tuned to improve performance; choosing a suboptimal discount factor can have significant consequences. Both of these observations are consistent with what has been seen in the literature (Andrychowicz et al., 2020). We have shown here that using the average-reward criterion directly not only delivers superior performance but also obviates the need to tune the discount factor.

6.3. Understanding Long Run Performance

Next, we demonstrate that agents trained using the average-reward criterion are better at optimizing for long-term returns. Here, we first train Humanoid for 10 million samples with ATRPO and with TRPO with a discount factor of 0.95 (shown to be the best discount factor in the previous experiments). Then, for evaluation, we run the trained ATRPO and TRPO policies for a trajectory of 10,000 timesteps (or until the agent falls). We use the same random seeds for the two algorithms. Figure 2 is a plot of the speed of the agent at each time step of the trajectory, using the seed that gives the best performance for discounted TRPO. We see in Figure 2 that the discounted algorithm gives a higher initial speed at the beginning of the trajectory. However, its overall speed is much more erratic throughout the trajectory, resulting in the agent falling over after approximately 5,000 steps. This coincides with the notion of discounting, where more emphasis is placed on the beginning of the trajectory and longer-term behavior is ignored. On the other hand, the average-reward policy, while having a slightly lower velocity overall throughout its trajectory, is able to sustain the trajectory much longer, thus giving it a higher total return. In fact, we observed that for all 10 random seeds we tested, the average-reward agent is able to finish the entire 10,000-time-step trajectory without falling. In Table 1 we present summary statistics of trajectory length for all trajectories using discounted TRPO; we note that the median trajectory length for the discounted TRPO agent is 452.5, meaning that on average TRPO performs significantly worse than what is reported in Figure 2.

Table 1. Summary statistics of trajectory length for all 10 trajectories using a Humanoid-v3 agent trained with TRPO.

    Min: 108    Max: 4806    Average: 883.1    Median: 452.5    Std: 1329.902

7. Related Work

Dynamic programming algorithms for finding optimal average-reward policies have been well studied (Howard, 1960; Blackwell, 1962; Veinott, 1966). Several tabular Q-learning-like algorithms for problems with unknown dynamics have been proposed, such as R-Learning (Schwartz, 1993), RVI Q-Learning (Abounadi et al., 2001), CSV-Learning (Yang et al., 2016), and Differential Q-Learning (Wan et al., 2020). Mahadevan (1996) conducted a thorough empirical analysis of the R-Learning algorithm.
We note that much of the previous work on average-reward RL focuses on the tabular setting without function approximation, and the theoretical properties of many of these Q-learning-based algorithms are not well understood (in particular R-Learning). More recently, POLITEX updates policies using a Boltzmann distribution over the sum of action-value function estimates of the previous policies (Abbasi-Yadkori et al., 2019), and Wei et al. (2020) introduced a model-free algorithm for optimizing the average reward of weakly-communicating MDPs.

For policy gradient methods, Baxter & Bartlett (2001) showed that if 1/(1−γ) is large compared to the mixing time of the Markov chain induced by the MDP, then the gradient of ρ_γ(π) can accurately approximate the gradient of ρ(π). Kakade (2001a) extended this result and provided an error bound on using an optimal discounted policy to maximize the average reward. In contrast, our work directly deals with the average-reward objective and provides theoretical guidance on the optimal step size for each policy update.

Policy improvement bounds have been extensively explored in the discounted case. The results from Schulman et al. (2015) are extensions of Kakade & Langford (2002). Pirotta et al. (2013) also proposed an alternative generalization of Kakade & Langford (2002). Achiam et al. (2017) improved upon Schulman et al. (2015) by replacing the maximum divergence with the average divergence.

8. Conclusion

In this paper, we introduce a novel policy improvement bound for the average-reward criterion. The bound is based on the average divergence between two policies and on Kemeny's constant, or equivalently the mixing time of the Markov chain. We show that previously existing policy improvement bounds for the discounted case result in a non-meaningful bound for the average-reward objective. Our work provides the theoretical justification and the means to generalize popular trust-region based algorithms to the average-reward setting. Based on this theory, we propose ATRPO, a modification of the TRPO algorithm for on-policy DRL. We demonstrate through a series of experiments that ATRPO is highly effective on high-dimensional continuing control tasks.

Acknowledgements

We would like to extend our gratitude to Quan Vuong and the anonymous reviewers for their constructive comments and suggestions. We also thank Shuyang Ling, Che Wang, Zining (Lily) Wang, and Yanqiu Wu for the insightful discussions on this work.

References

Abbasi-Yadkori, Y., Bartlett, P., Bhatia, K., Lazic, N., Szepesvari, C., and Weisz, G. POLITEX: Regret bounds for policy iteration using expert prediction. In International Conference on Machine Learning, pp. 3692-3702, 2019.

Abounadi, J., Bertsekas, D., and Borkar, V. S. Learning algorithms for Markov decision processes with average cost. SIAM Journal on Control and Optimization, 40(3):681-698, 2001.

Achiam, J., Held, D., Tamar, A., and Abbeel, P. Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning, pp. 22-31. JMLR.org, 2017.

Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. Optimality and approximation with policy gradient methods in Markov decision processes. In Conference on Learning Theory, pp. 64-66. PMLR, 2020.

Altman, E. Constrained Markov Decision Processes, volume 7. CRC Press, 1999.

Amit, R., Meir, R., and Ciosek, K. Discount factor as a regularizer in reinforcement learning. In International Conference on Machine Learning, 2020.

Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., and Mané, D. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.

Andrychowicz, M., Raichuk, A., Stańczyk, P., Orsini, M., Girgin, S., Marinier, R., Hussenot, L., Geist, M., Pietquin, O., Michalski, M., et al. What matters in on-policy reinforcement learning? A large-scale empirical study. arXiv preprint arXiv:2006.05990, 2020.

Baxter, J. and Bartlett, P. L. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319-350, 2001.

Bertsekas, D. P. Dynamic Programming and Optimal Control, volumes 1-2. Athena Scientific, Belmont, MA, 1995.

Blackwell, D. Discrete dynamic programming. The Annals of Mathematical Statistics, pp. 719-726, 1962.

Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.

Brémaud, P. Markov Chains: Gibbs Fields, Monte Carlo Simulation and Queues. Springer, 2nd edition, 2020.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym, 2016.
Cho, G. E. and Meyer, C. D. Comparison of perturbation bounds for the stationary distribution of a Markov chain. Linear Algebra and its Applications, 335(1-3):137-150, 2001.

Even-Dar, E., Kakade, S. M., and Mansour, Y. Online Markov decision processes. Mathematics of Operations Research, 34(3):726-736, 2009.

Grinstead, C. M. and Snell, J. L. Introduction to Probability. American Mathematical Society, 2012.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. International Conference on Machine Learning (ICML), 2018.

Horn, R. A. and Johnson, C. R. Matrix Analysis. Cambridge University Press, 2012.

Howard, R. A. Dynamic Programming and Markov Processes. John Wiley, 1960.

Hunter, J. J. Stationary distributions and mean first passage times of perturbed Markov chains. Linear Algebra and its Applications, 410:217-243, 2005.

Jiang, N., Kulesza, A., Singh, S., and Lewis, R. The dependence of effective planning horizon on model accuracy. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pp. 1181-1189, 2015.

Jiang, N., Singh, S. P., and Tewari, A. On structural properties of MDPs that bound loss due to shallow planning. In IJCAI, pp. 1640-1647, 2016.

Kakade, S. Optimizing average reward using discounted rewards. In International Conference on Computational Learning Theory, pp. 605-615. Springer, 2001a.

Kakade, S. M. A natural policy gradient. Advances in Neural Information Processing Systems, 14, 2001b.

Kakade, S. and Langford, J. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, volume 2, pp. 267-274, 2002.

Kallenberg, L. Linear Programming and Finite Markovian Control Problems. Centrum voor Wiskunde en Informatica, 1983.

Kemeny, J. and Snell, I. Finite Markov Chains. Van Nostrand, New Jersey, 1960.

Lehmann, E. L. and Casella, G. Theory of Point Estimation. Springer Science & Business Media, 2006.

Lehnert, L., Laroche, R., and van Seijen, H. On value function representation of long horizon problems. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

Levin, D. A. and Peres, Y. Markov Chains and Mixing Times, volume 107. American Mathematical Society, 2017.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. International Conference on Learning Representations (ICLR), 2016.

Mahadevan, S. Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning, 22(1-3):159-195, 1996.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Atari with deep reinforcement learning. NIPS Deep Learning Workshop, 2013.

Naik, A., Shariff, R., Yasui, N., and Sutton, R. S. Discounted reinforcement learning is not an optimization problem. NeurIPS Optimization Foundations for Reinforcement Learning Workshop, 2019.

Neu, G., Antos, A., György, A., and Szepesvári, C. Online Markov decision processes under bandit feedback. In Advances in Neural Information Processing Systems, pp. 1804-1812, 2010.

Peters, J. and Schaal, S. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682-697, 2008.

Petrik, M. and Scherrer, B. Biasing approximate dynamic programming with a lower discount factor. In Twenty-Second Annual Conference on Neural Information Processing Systems (NIPS), 2008.

Pirotta, M., Restelli, M., Pecorino, A., and Calandriello, D. Safe policy iteration. In International Conference on Machine Learning, pp. 307-315, 2013.

Ross, K. W. Constrained Markov decision processes with queueing applications. Dissertation Abstracts International Part B: Science and Engineering, 46(4), 1985.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889-1897, 2015.

Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. International Conference on Learning Representations (ICLR), 2016.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Schwartz, A. A reinforcement learning method for maximizing undiscounted rewards. In Proceedings of the Tenth International Conference on Machine Learning, volume 298, pp. 298-305, 1993.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140-1144, 2018.

Song, H. F., Abdolmaleki, A., Springenberg, J. T., Clark, A., Soyer, H., Rae, J. W., Noury, S., Ahuja, A., Liu, S., Tirumala, D., et al. V-MPO: On-policy maximum a posteriori policy optimization for discrete and continuous control. International Conference on Learning Representations, 2020.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057-1063, 2000.

Tadepalli, P. and Ok, D. H-learning: A reinforcement learning method to optimize undiscounted average reward. Technical Report 94-30-01, Oregon State University, 1994.

Tessler, C., Mankowitz, D. J., and Mannor, S. Reward constrained policy optimization. International Conference on Learning Representations (ICLR), 2019.

Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026-5033. IEEE, 2012.

Tsybakov, A. B. Introduction to Nonparametric Estimation. Springer Science & Business Media, 2008.

Veinott, A. F. On finding optimal policies in discrete dynamic programming with no discounting. The Annals of Mathematical Statistics, 37(5):1284-1294, 1966.

Vuong, Q., Zhang, Y., and Ross, K. W. Supervised policy update for deep reinforcement learning. In International Conference on Learning Representations (ICLR), 2019.

Wan, Y., Naik, A., and Sutton, R. S. Learning and planning in average-reward Markov decision processes. arXiv preprint arXiv:2006.16318, 2020.

Wei, C.-Y., Jafarnia-Jahromi, M., Luo, H., Sharma, H., and Jain, R. Model-free reinforcement learning in infinite-horizon average-reward Markov decision processes. In International Conference on Machine Learning, 2020.

Wu, Y., Mansimov, E., Grosse, R. B., Liao, S., and Ba, J. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Advances in Neural Information Processing Systems (NIPS), pp. 5285-5294, 2017.

Yang, S., Gao, Y., An, B., Wang, H., and Chen, X. Efficient average reward reinforcement learning using constant shifting values. In AAAI, pp. 2258-2264, 2016.

Yang, T.-Y., Rosca, J., Narasimhan, K., and Ramadge, P. J. Projection-based constrained policy optimization. In International Conference on Learning Representations (ICLR), 2020.

Zhang, Y., Vuong, Q., and Ross, K. First order constrained optimization in policy space. Advances in Neural Information Processing Systems, 33, 2020.

Zhao, T., Hachiya, H., Niu, G., and Sugiyama, M. Analysis and improvement of policy gradient estimation. In Advances in Neural Information Processing Systems, pp. 262-270, 2011.