Evaluating task-agnostic exploration for fixed-batch learning of arbitrary future tasks

Vibhavari Dasagi1, Robert Lee1, Jake Bruce1,2, and Jürgen Leitner1
1 Queensland University of Technology (QUT), Brisbane, Australia
2 DeepMind, London, UK
Contact: vibhavari.dasagi@hdr.qut.edu.au

arXiv:1911.08666v1 [cs.LG] 20 Nov 2019

Abstract

Deep reinforcement learning has been shown to solve challenging tasks where large amounts of training experience are available, usually obtained online while learning the task. Robotics is a significant potential application domain for many of these algorithms, but generating robot experience in the real world is expensive, especially when each task requires a lengthy online training procedure. Off-policy algorithms can in principle learn arbitrary tasks from a diverse enough fixed dataset. In this work, we evaluate popular exploration methods by generating robotics datasets for the purpose of learning to solve tasks completely offline without any further interaction in the real world. We present results on three popular continuous control tasks in simulation, as well as continuous control of a high-dimensional real robot arm. Code documenting all algorithms, experiments, and hyperparameters is available at https://github.com/qutrobotlearning/batchlearning.

Figure 1: (a) Exploration phase; (b) Learning phase. In this work, we separate the phases of data gathering and policy learning. We evaluate the performance of state-of-the-art exploration methods by using the data they collect to learn to solve arbitrary tasks completely offline.

1 Introduction

Recent research in the field of model-free deep reinforcement learning (RL) has enabled complex, expressive policies to be learned from experience for many challenging simulated and virtual task domains [Mnih et al., 2015; Lillicrap et al., 2015]. The success of these methods suggests potential applications to robotics, and some progress has been made in this direction [Frank et al., 2014; Zhang et al., 2015; Rusu et al., 2016; Levine et al., 2016; Gu et al., 2017; Večerík et al., 2017]. A key limitation in learning robotics skills with deep reinforcement learning is the cost of gathering new experience.
Since different control tasks with the same robot often involve similar observations, actions, and dynamics, it would be convenient to gather a single dataset with diversity sufficient for agents to learn to solve arbitrary future tasks completely offline; this setting is known as batch reinforcement learning [Watkins and Dayan, 1992; Ernst et al., 2005; Lange et al., 2012; Liu et al., 2015], a well-known classical paradigm that has been relatively under-explored in the context of modern robotics research.

While many contemporary approaches can be described as on-policy, in which all the training data in each update is generated directly by the current version of the agent being trained [Mnih et al., 2016], off-policy algorithms can in principle learn from the data generated by any arbitrary source of behavior, making them ideal candidates for batch learning. However, these algorithms have been demonstrated to be unstable when learning from fixed datasets with insufficient coverage, due to overestimation bias in unfamiliar states [Fujimoto et al., 2018b; Kalashnikov et al., 2018].

Batch RL has the potential to represent a significant step forward for robot learning, allowing robotics practitioners to collect powerful calibration datasets of robot experience without requiring detailed task knowledge in advance, while enabling completely offline training on arbitrary tasks that were not known at exploration time [Bruce et al., 2017]. In this work, we call attention to this relatively under-explored paradigm and aim to take a step toward a solution by evaluating the performance of various state-of-the-art exploration approaches for diverse, task-agnostic experience collection, for offline learning of arbitrary tasks that were not known at dataset generation time.
2 Related Work

In this section we review the literature relating to off-policy learning on a fixed dataset, and the state of the art in exploration methods that might be used to produce a maximally diverse training dataset without knowing the ultimate tasks of interest in advance.

2.1 Off-Policy Learning

Reinforcement learning algorithms can be broadly classified on a spectrum from on-policy, in which training data always comes from the current version of the agent, to off-policy, in which the agent is able to learn from arbitrary experience; we are motivated by learning completely offline, so we focus our attention on the latter. Off-policy reinforcement learning is particularly applicable in our situation because it opens up the possibility of learning from many sources of experience beyond that collected by the current policy [Gu et al., 2017]. Impressive results have been achieved in domains such as robotic grasping by making use of task-relevant datasets from previous policies and even from entirely different experimental runs, in which diverse data collection was identified as an essential requirement for offline learning [Kalashnikov et al., 2018]. These approaches can be susceptible to overestimation bias in unfamiliar states due to optimistic backup of estimated future value; this bias can be mitigated to some degree with pessimism in the face of uncertainty, by trusting the minimum of two independent estimates [Hasselt, 2010; Van Hasselt et al., 2016; Fujimoto et al., 2018a].

In the case of far-off-policy learning, where the distribution mismatch between the training dataset and the state-action visitation induced by the policy grows very large, this overestimation bias can lead to instability and complete learning failure, as estimation errors compound indefinitely without the possibility of correction by actively visiting those overestimated states online. When the desired task is sufficiently similar to the task performed by the data-gathering policy, this issue can be mitigated by Batch-Constrained Q-learning (BCQ), in which the policy is constrained to avoid distribution mismatch between the training and on-policy data distributions. This approach is reminiscent of imitation learning, but with the benefit of being able to optimize arbitrary reward functions at offline training time [Fujimoto et al., 2018b].

A method known as Goal Exploration Processes (GEP) has been proposed for the paradigm of explicit separation of task-agnostic exploration and task-aware exploitation phases in reinforcement learning [Colas et al., 2018], in which randomized linear policies are used to generate a bootstrap sample, followed by a random perturbation procedure to encourage diversity of state visitation as measured by hand-specified task-relevant features. The data gathered by GEP is then used to initialize the experience memory of a state-of-the-art continuous RL agent [Lillicrap et al., 2015], after which on-task training proceeds as usual.

In this paper, we consider a related but different paradigm in which the dataset is collected entirely in advance, with no knowledge of the ultimate tasks the agent will be trained on and no task-relevant features known ahead of time. Given this fixed dataset, we then initialize a state-dependent reward function and train a task policy completely offline, without any further interaction with the environment. The need for extremely diverse data in advance in order to cover arbitrary future tasks puts extra pressure on the exploration method, which forms the main focus of our evaluations in this paper.
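As a concrete illustration of the pessimism-in-the-face-of-uncertainty trick discussed in this subsection (and used later by the TD3 learner we evaluate), the sketch below forms a bootstrap target from the minimum of two independent critic estimates. This is a minimal rendering of the idea in PyTorch; the network and variable names are illustrative, not taken from any particular implementation.

```python
import torch

def pessimistic_target(q1_target, q2_target, policy_target,
                       reward, next_obs, done, gamma=0.99):
    """Clipped double-Q target: bootstrap from the minimum of two critics.

    q1_target, q2_target: target critics mapping (state, action) -> value
    policy_target: target policy mapping state -> action
    All module names are placeholders for illustration.
    """
    with torch.no_grad():
        next_action = policy_target(next_obs)
        q1 = q1_target(next_obs, next_action)
        q2 = q2_target(next_obs, next_action)
        # Trusting the minimum of two independent estimates counteracts the
        # optimistic backup of overestimated values in unfamiliar states.
        next_value = torch.min(q1, q2)
        return reward + gamma * (1.0 - done) * next_value
```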
2.2 Exploration

Learning generally requires exposure to diverse training data. In reinforcement learning, generating diverse training data is typically achieved by an exploration mechanism internal to the agent in question, and exploration techniques have been an active area of research in the field from early on [Thrun, 1992]. Exploration is usually considered in the context of online learning, in which the agent must not only optimize its objective but also take unexplored actions in order to learn the consequences thereof. In this work, we are interested in exploration from a slightly different angle: how to generate diverse datasets in the absence of any task feedback whatsoever.

Classical results show that the Q-learning algorithm is provably convergent in the tabular case, given complete exploration of the problem [Jaakkola et al., 1994]. Although tabular guarantees no longer hold in the context of modern function approximation, it is intuitive that effectively covering the space of the problem is important for convergent offline training.

The simplest and most common exploration techniques involve simply adding noise to the policy. The standard approach in discrete Q-learning is known as ε-greedy, in which, a fraction of the time, rather than acting optimally the agent chooses a random action [Watkins and Dayan, 1992; Mnih et al., 2015]. In continuous control tasks, similar noise-based exploration techniques are often used, including directly adding i.i.d. or correlated noise to the actions [Lillicrap et al., 2015; Fujimoto et al., 2018a].
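For concreteness, the following sketch shows the two noise-based schemes just mentioned: ε-greedy action selection for discrete Q-learning, and additive i.i.d. Gaussian noise on a continuous policy output. Function names, noise scales, and action bounds are illustrative assumptions, not values used in the paper.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1, rng=np.random.default_rng()):
    """Discrete exploration: with probability epsilon, ignore the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def noisy_continuous_action(policy_action, noise_std=0.1,
                            action_low=-1.0, action_high=1.0,
                            rng=np.random.default_rng()):
    """Continuous exploration: add i.i.d. Gaussian noise to the policy output."""
    noise = rng.normal(0.0, noise_std, size=np.shape(policy_action))
    return np.clip(policy_action + noise, action_low, action_high)
```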
Pure noise as a source of exploration behavior, while simple and requiring few assumptions, has difficulty reaching distant states: the expected exploration time for a random policy to reach a given state grows exponentially with its distance, leading to the proposal of deep exploration [Osband et al., 2016], in which an ensemble of policies is trained independently while sharing experience, resulting in consistent behavior policies that nonetheless produce diverse coverage of the problem space. GEP [Colas et al., 2018], described above, involves a similar technique in which a large number of randomized linear policies form the basis of the exploration behavior.

Another approach to exploration involves intrinsic motivation [Chentanez et al., 2005], in which the reward function of the problem is augmented with an additive bonus that rewards the agent for visiting states in proportion to their novelty. In count-based exploration methods [Bellemare et al., 2016; Tang et al., 2017; Ostrovski et al., 2017], the novelty of states is approximated directly in inverse proportion to their visitation frequency. An indirect way to measure familiarity is in terms of the prediction error of a model being trained in parallel with the RL agent, such as forward predictive models [Pathak et al., 2017; Pathak et al., 2019] or the error in predicting the state-dependent output of another arbitrary network [Burda et al., 2018b]. Diversity of states can also be used directly as an objective to optimize, by training a maximum-entropy RL agent to optimize its distinctiveness from other agents as measured by a state-dependent classification network trained in parallel [Eysenbach et al., 2018]. Intrinsic motivation has even been shown to achieve impressive task performance in the complete absence of task reward, in situations in which pure exploration correlates with the task objective [Burda et al., 2018a].

In self-driven learning more generally, model-based approaches can be trained purely self-supervised while providing analytic gradients to train the policy directly [Deisenroth and Rasmussen, 2011; Gal et al.; Heess et al., 2015]. Model ensembles can be leveraged to provide an estimate of uncertainty in addition to diversity of experience [Kurutach et al., 2018], and uncertainty-aware methods like these could be used to backpropagate directly into a learning agent for either seeking or avoiding uncertainty [Henaff et al., 2019] as the need arises.

In this work, we are primarily interested in task-agnostic exploration. We consider the state-of-the-art exploration methods Random Network Distillation (RND) [Burda et al., 2018b], Diversity Is All You Need (DIAYN) [Eysenbach et al., 2018], and GEP [Colas et al., 2018], for the purpose of generating diverse datasets with no task knowledge, evaluated according to the performance of a separate off-policy agent learning to optimize entirely new tasks unknown at exploration time.

3 Approach

In this work, we consider the problem of generating a static dataset of robot experience without task knowledge, in order to prepare for learning to solve arbitrary tasks in the future completely offline. We decompose the problem into two phases: exploration, in which we execute a state-of-the-art exploration algorithm from the literature for a fixed number of timesteps; and offline learning, in which we train an off-policy RL algorithm to solve arbitrary tasks that were not known to the exploration agent when the dataset was gathered.

3.1 Exploration

In this phase, we execute an exploration algorithm on the robotic platform for a fixed number of timesteps in order to generate a dataset of diverse exploration data with no prior knowledge of the task. We describe three popular exploration algorithms from the literature (RND, DIAYN, and GEP), as well as a simple baseline and a novel exploration algorithm adapted from the literature.

Random Network Distillation

In RND [Burda et al., 2018b], a randomly-initialized and fixed encoding function f_teacher(x) → φ is used to encode observations from the environment into fixed-length feature vectors. These feature vectors are used as the labels of a supervised learning procedure to train another function f_student(x) → φ̃. The reward given to the exploration policy is the same objective that the supervised process is minimizing:

R_RND(x_t) = ‖φ_t − φ̃_t‖    (1)
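As a rough sketch of how the RND bonus in Equation (1) can be computed, a fixed random teacher network provides targets for a trained student, and the per-observation prediction error serves as both the supervised loss and the exploration reward. This is our own minimal PyTorch rendering, with assumed network sizes, not the authors' implementation.

```python
import torch
import torch.nn as nn

obs_dim, feat_dim = 14, 64  # illustrative sizes

teacher = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
student = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
for p in teacher.parameters():
    p.requires_grad_(False)  # the teacher stays randomly initialized and fixed
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

def rnd_reward_and_update(obs_batch):
    """Equation (1): the reward is the norm of the student's prediction error,
    which is also the supervised objective being minimized."""
    phi = teacher(obs_batch)
    phi_hat = student(obs_batch)
    error = (phi - phi_hat).norm(dim=-1)   # per-observation intrinsic reward
    loss = error.pow(2).mean()             # train the student on the same signal
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return error.detach()
```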
Diversity Is All You Need

DIAYN [Eysenbach et al., 2018] is a reinforcement learning algorithm that trains an ensemble of diverse skills by rewarding each policy for being distinct, as measured by a learned classification algorithm f_d(x) → P(skill | x). DIAYN trains an ensemble of maximum-entropy RL agents to maximize the following reward while acting as randomly as possible:

R_DIAYN(x_t) = log P(skill_t | x_t)    (2)
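A minimal sketch of the DIAYN-style reward in Equation (2), assuming a discriminator network that outputs logits over a fixed number of skills; the module names and sizes are ours, chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, num_skills = 14, 10  # illustrative sizes
discriminator = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                              nn.Linear(128, num_skills))

def diayn_reward(obs_batch, skill_ids):
    """Equation (2): reward each skill for visiting states that identify it,
    i.e. the discriminator's log-probability of the active skill."""
    with torch.no_grad():
        log_probs = F.log_softmax(discriminator(obs_batch), dim=-1)
    return log_probs.gather(1, skill_ids.unsqueeze(1)).squeeze(1)
```

Note that the original DIAYN objective also subtracts a log-prior term over skills; it is omitted here to match Equation (2) as written.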
Goal Exploration Processes

GEP [Colas et al., 2018] attempts to gather diverse data by generating a set of randomized linear policies and executing them to collect experience, which is stored in memory in the form of a task-dependent descriptor extracted from each trajectory. Since we do not allow the exploration phase any knowledge of the ultimate tasks, we simply store the element-wise mean of the states along the trajectory as its descriptor. Once a number of randomized policies have been executed (N = 50 in our case, as in the original work), random "goal" states are sampled from the state space, and the policy in memory associated with the nearest state to the goal is perturbed with random noise and executed again, adding its new experience to the memory. We continue this procedure until our exploration dataset is the desired size.

Random Policies

To measure the importance of the goal-sampling step in GEP, we also evaluate a simple baseline in which only the randomized-policy step is applied. Rather than sampling goals and perturbing policies from memory, this baseline randomly initializes a new policy every episode.
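The following sketch illustrates the task-agnostic GEP variant described above, using the element-wise mean of visited states as the trajectory descriptor; the bootstrap loop on its own corresponds to the random-policies baseline. The environment API, state bounds, and constants are assumptions made for illustration, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_linear_policy(obs_dim, act_dim, scale=1.0):
    return scale * rng.standard_normal((act_dim, obs_dim))

def rollout(env, weights, max_steps=1000):
    """Run one episode with a linear policy a = clip(W x); assumed env API:
    env.reset() -> obs, env.step(action) -> (next_obs, done)."""
    obs, states, transitions = env.reset(), [], []
    for _ in range(max_steps):
        action = np.clip(weights @ obs, -1.0, 1.0)
        next_obs, done = env.step(action)
        transitions.append((obs, action, next_obs))
        states.append(next_obs)
        obs = next_obs
        if done:
            break
    # Element-wise mean of states as a task-agnostic trajectory descriptor.
    return transitions, np.mean(states, axis=0)

def gep_explore(env, obs_dim, act_dim, n_bootstrap=50,
                target_size=200_000, noise=0.1):
    dataset, memory = [], []  # memory holds (descriptor, policy weights) pairs
    for _ in range(n_bootstrap):                    # bootstrap with random linear policies
        w = random_linear_policy(obs_dim, act_dim)
        transitions, descriptor = rollout(env, w)
        dataset += transitions
        memory.append((descriptor, w))
    while len(dataset) < target_size:               # goal sampling + perturbation phase
        goal = rng.uniform(-1.0, 1.0, size=obs_dim) # random "goal" (assumed state bounds)
        nearest = min(memory, key=lambda m: np.linalg.norm(m[0] - goal))
        w = nearest[1] + noise * rng.standard_normal(nearest[1].shape)
        transitions, descriptor = rollout(env, w)
        dataset += transitions
        memory.append((descriptor, w))
    return dataset
```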
We take inspiration from [Pathak more, noise is added to the output of the policy during et al., 2019], but rather than assume the next state does training to encourage smoothness of estimation in small not depend on the action, we train a pair of forward mod- regions around observed experience. Finally, as in the els f{1,2} (xt , at ) → x̃t+1 and optimize for their divergence original work, the policy is trained half as frequently as as a proxy for novelty. Because the sum of future rewards the value networks.
3.2 Offline Learning

Given a fixed dataset of robot experience, we are interested in learning to solve arbitrary tasks completely offline, with no further interaction with the environment. In this section, we describe the two off-policy algorithms that we evaluate for learning tasks offline on fixed data, both of which are based on the deep deterministic policy gradient (DDPG) algorithm for continuous Q-learning [Lillicrap et al., 2015].

Twin Delayed DDPG

TD3 [Fujimoto et al., 2018a] is an improvement to DDPG that aims to reduce the overestimation bias that is common when training off-policy value functions. Two Q-networks are trained simultaneously, and the minimum is chosen when evaluating the Q-value for the purpose of bootstrapping (as in [Hasselt, 2010; Van Hasselt et al., 2016]), which corresponds to pessimistic estimation in the face of uncertainty. Furthermore, noise is added to the output of the policy during training to encourage smoothness of estimation in small regions around observed experience. Finally, as in the original work, the policy is trained half as frequently as the value networks.

Batch-Constrained Deep Q-learning

BCQ [Fujimoto et al., 2018b] achieves improved offline learning by training a state-conditional generative model of the actions in the batch, which can then be used to sample actions that reflect the actions present in the dataset. Keeping the policy's actions close to the buffer distribution reduces the extrapolation error that would otherwise accumulate due to distribution mismatch and overestimation.
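As a hedged illustration of the batch-constrained action selection described above (a simplification: full BCQ also trains a VAE and a perturbation network, both omitted here), candidate actions are sampled from a generative model fit to the batch and the highest-valued candidate is executed. The `action_sampler` and `q_network` interfaces are placeholders for illustration.

```python
import torch

def bcq_select_action(obs, action_sampler, q_network, num_candidates=10):
    """Sample candidate actions that resemble the batch, then pick the best by Q-value.

    action_sampler: state-conditional generative model of batch actions (e.g. a VAE decoder)
    q_network: learned action-value function
    """
    with torch.no_grad():
        obs_rep = obs.unsqueeze(0).repeat(num_candidates, 1)  # evaluate many candidates
        candidates = action_sampler(obs_rep)                  # actions close to the data
        values = q_network(obs_rep, candidates).squeeze(-1)
        return candidates[values.argmax()]
```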
4 Experiments

In this work, we consider exploration methods to generate diverse datasets of robot experience for learning arbitrary tasks completely offline. We first conduct a thorough evaluation of the approaches described in Section 3 on three standard simulated continuous control tasks, as in [Fujimoto et al., 2018b]. We then evaluate the best performing approaches to explore and then learn reaching tasks offline on a physical FrankaEmika Panda robot arm with 7 degrees of freedom. All environments are shown in Figure 2.

Figure 2: Experimental environments: HalfCheetah-v1, Hopper-v1, Walker2d-v1, and the real FrankaEmika Panda arm with 7 degrees of freedom.

4.1 Simulation Experiments

We first evaluate on a standard benchmark suite of 3 OpenAI Gym MuJoCo environments: HalfCheetah-v1, Hopper-v1, and Walker2d-v1 [Todorov et al., 2012; Brockman et al., 2016]. As described in Section 3, we partition the experiments into a task-agnostic exploration phase followed by a task-aware offline training phase. In the exploration phase, each exploration method generates 1 million transitions of experience in the simulated domain in the form of (x_t, a_t, x_{t+1}), and this data is saved as a static dataset.

In the offline training phase, we choose a task with a state-dependent reward function that was not known to the exploration agent, and train an off-policy RL algorithm to solve the task on the data in the static dataset for 300,000 training steps. For task rewards, we use the standard locomotion tasks of maximizing forward velocity, as provided by the three environments. We evaluate RND, DIAYN, GEP, random policies, and our proposed SSE as exploration methods, and TD3 and BCQ as off-policy learning algorithms. In all cases we use the default hyperparameters from the original papers that proposed the algorithms, and code for our experiments is freely available at https://github.com/qutrobotlearning/batchlearning. Results for the simulation experiments are shown in Figure 3.
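To make the two-phase protocol concrete, the sketch below shows how a task reward defined only at offline-training time can be applied to the static exploration dataset before running an off-policy learner. The reward function and learner interface are illustrative assumptions, not the paper's code.

```python
import numpy as np

def relabel_with_task_reward(dataset, reward_fn):
    """Attach a state-dependent task reward to transitions gathered task-agnostically.

    dataset: list of (obs, action, next_obs) tuples from the exploration phase
    reward_fn: state-dependent reward chosen after collection, e.g. forward velocity
    """
    return [(obs, act, reward_fn(next_obs), next_obs)
            for obs, act, next_obs in dataset]

def train_offline(learner, labeled_dataset, steps=300_000, batch_size=256,
                  rng=np.random.default_rng(0)):
    """Train purely from the fixed batch: no environment interaction occurs here."""
    for _ in range(steps):
        idx = rng.integers(len(labeled_dataset), size=batch_size)
        learner.update([labeled_dataset[i] for i in idx])  # e.g. a TD3 or BCQ update step
    return learner
```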
4.2 Physical Robot

In order to validate the approach on a real robot, we consider the FrankaEmika Panda robot arm platform with 7 degrees of freedom. Agents observed joint angles in radians and joint velocities in radians per second, resulting in observation vectors of length 14. Actions were sent at 20 Hz in the form of joint velocities in radians per second, clipped to the range [−0.5, 0.5], and episodes were reset after 1000 timesteps or when the robot violated physical safety limits such as self-collision. We collected data on the physical robot in a manner similar to the simulated domains, but we limited the dataset size to 200K transitions due to the additional time cost of physical experiments. Policies were trained offline using TD3 and BCQ to solve reaching tasks from a deterministic "home" configuration to one of four different goals, specified differently for each task. Results for the real robot experiments are shown in Table 1, comparing the distance to the target point for each exploration method and offline learning algorithm.

                      Exploration Method
Learning Method    Rand    DIAYN    RND    SSE
TD3                0.46    0.51     0.49   0.30
BCQ                0.47    0.30     0.52   0.44

Table 1: Distances of closest tooltip positions in meters for each evaluated method on the real robot.

Figure 3: Simulation results. (a) TD3; (b) BCQ.

5 Discussion

Somewhat counter-intuitively, the state-of-the-art RL exploration methods we evaluated did not perform particularly well in our experiments, as shown in the HalfCheetah-v1 results in Figure 3. Particularly surprising is the result that randomly generating a new linear policy every episode seems to outperform many of the other baselines by a wide margin. This suggests that despite achieving impressive results during online training, current methods of exploration are not well suited to the pure exploration paradigm described in this paper. Furthermore, BCQ did not perform as well as expected, but this is reasonable as it was not designed to learn from purely task-agnostic data. Also of interest is that the best performing algorithms on the real robot did correspond to the best performance on the simulation tasks. Note, however, that we did not engage in heavy parameter tuning of the offline learning algorithms we used, and Rainbow-style improvements [Hessel et al., 2018] to off-policy algorithms may improve the results regardless of the exploration method used.

During the exploration phase on the real robot, we observed that the randomly generated linear policies seemed to generate vastly diverse actions that in totality covered a larger portion of the state space compared to the other exploration methods. This may imply that while systematically covering the state space might be useful for exploration given an unlimited dataset, with the restrictions of a limited static dataset it is vital to explore regions of the state space that are far apart. The greater diversity in the dataset may increase the generalization capabilities of the agent to nearby previously-unexplored states while reducing the chances of visiting states completely different from those in the dataset.

6 Conclusion

In this work, we consider the paradigm of task-agnostic exploration for generating datasets that are diverse enough to train policies to solve arbitrary tasks with no further interaction with the environment. This is an important goal for robotics, potentially enabling a single diverse dataset to train robots with a lifetime's worth of skills on demand.

Our experiments showed interesting and unexpected results for the state-of-the-art exploration methods and off-policy algorithms in this setting. Since exploration has been shown to be an important component of RL performance, we were expecting the established exploration algorithms to generate diverse enough data to train tasks offline, but in domains such as Hopper-v1 and Walker2d-v1, purely self-directed exploration without a task seems to be very challenging. We believe that this justifies further research in this paradigm, given the potential benefits to robotics from single-dataset offline training.
We make our algorithms, experiments, and hyperparameters freely available at https://github.com/qutrobotlearning/batchlearning.

References

[Bellemare et al., 2016] Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, 2016.
[Brockman et al., 2016] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
[Bruce et al., 2017] Jake Bruce, Niko Sünderhauf, Piotr Mirowski, Raia Hadsell, and Michael Milford. One-shot reinforcement learning for robot navigation with interactive replay. arXiv preprint arXiv:1711.10137, 2017.
[Burda et al., 2018a] Yuri Burda, Harri Edwards, Deepak Pathak, Amos Storkey, Trevor Darrell, and Alexei A Efros. Large-scale study of curiosity-driven learning. arXiv preprint arXiv:1808.04355, 2018.
[Burda et al., 2018b] Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018.
[Chentanez et al., 2005] Nuttapong Chentanez, Andrew G Barto, and Satinder P Singh. Intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems, 2005.
[Colas et al., 2018] Cédric Colas, Olivier Sigaud, and Pierre-Yves Oudeyer. GEP-PG: Decoupling exploration and exploitation in deep reinforcement learning algorithms. In International Conference on Machine Learning (ICML), 2018.
[Deisenroth and Rasmussen, 2011] Marc Deisenroth and Carl E Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011.
[Ernst et al., 2005] Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503–556, 2005.
[Eysenbach et al., 2018] Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018.
[Frank et al., 2014] Mikhail Frank, Jürgen Leitner, Marijn Stollenga, Alexander Förster, and Jürgen Schmidhuber. Curiosity driven reinforcement learning for motion planning on humanoids. Frontiers in Neurorobotics, 2014.
[Fujimoto et al., 2018a] Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pages 1582–1591, 2018.
[Fujimoto et al., 2018b] Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. arXiv preprint arXiv:1812.02900, 2018.
[Gal et al.] Yarin Gal, Rowan McAllister, and Carl Edward Rasmussen. Improving PILCO with Bayesian neural network dynamics models.
[Gu et al., 2017] Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 3389–3396. IEEE, 2017.
[Haarnoja et al., 2018] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
[Hasselt, 2010] Hado V Hasselt. Double Q-learning. In Advances in Neural Information Processing Systems, 2010.
[Heess et al., 2015] Nicolas Heess, Gregory Wayne, David Silver, Timothy Lillicrap, Tom Erez, and Yuval Tassa. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, 2015.
[Henaff et al., 2019] Mikael Henaff, Alfredo Canziani, and Yann LeCun. Model-predictive policy learning with uncertainty regularization for driving in dense traffic. arXiv preprint arXiv:1901.02705, 2019.
[Hessel et al., 2018] Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[Jaakkola et al., 1994] Tommi Jaakkola, Michael I Jordan, and Satinder P Singh. Convergence of stochastic iterative dynamic programming algorithms. In Advances in Neural Information Processing Systems, pages 703–710, 1994.
[Kalashnikov et al., 2018] Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293, 2018.
[Kurutach et al., 2018] Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, and Pieter Abbeel. Model-ensemble trust-region policy optimization. arXiv preprint arXiv:1802.10592, 2018.
[Lange et al., 2012] Sascha Lange, Thomas Gabel, and Martin Riedmiller. Batch reinforcement learning. In Reinforcement Learning, pages 45–73. Springer, 2012.
[Levine et al., 2016] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
[Lillicrap et al., 2015] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
[Liu et al., 2015] De-Rong Liu, Hong-Liang Li, and Ding Wang. Feature selection and feature learning for high-dimensional batch reinforcement learning: A survey. International Journal of Automation and Computing, 2015.
[Mnih et al., 2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
[Mnih et al., 2016] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
[Osband et al., 2016] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems 29, pages 4026–4034. Curran Associates, Inc., 2016.
[Ostrovski et al., 2017] Georg Ostrovski, Marc G Bellemare, Aäron van den Oord, and Rémi Munos. Count-based exploration with neural density models. In Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 2017.
[Pathak et al., 2017] Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 16–17, 2017.
[Pathak et al., 2019] Deepak Pathak, Dhiraj Gandhi, and Abhinav Gupta. Beyond games: Bringing exploration to robots in real-world, 2019.
[Rusu et al., 2016] Andrei A Rusu, Mel Vecerik, Thomas Rothörl, Nicolas Heess, Razvan Pascanu, and Raia Hadsell. Sim-to-real robot learning from pixels with progressive nets. arXiv preprint arXiv:1610.04286, 2016.
[Tang et al., 2017] Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, OpenAI Xi Chen, Yan Duan, John Schulman, Filip DeTurck, and Pieter Abbeel. #Exploration: A study of count-based exploration for deep reinforcement learning. In Advances in Neural Information Processing Systems, 2017.
[Thrun, 1992] Sebastian B Thrun. Efficient exploration in reinforcement learning. 1992.
[Todorov et al., 2012] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.
[Van Hasselt et al., 2016] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[Večerík et al., 2017] Matej Večerík, Todd Hester, Jonathan Scholz, Fumin Wang, Olivier Pietquin, Bilal Piot, Nicolas Heess, Thomas Rothörl, Thomas Lampe, and Martin Riedmiller. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.
[Watkins and Dayan, 1992] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
[Zhang et al., 2015] Fangyi Zhang, Jürgen Leitner, Michael Milford, Ben Upcroft, and Peter Corke. Towards vision-based deep reinforcement learning for robotic motion control. arXiv preprint arXiv:1511.03791, 2015.