Evaluating task-agnostic exploration for fixed-batch learning of arbitrary future tasks

Page created by Kurt Juarez

Careers

English

Like
Share
Embed
Fullscreen
Slides
Download HTML
Download PDF
Abuse

←

→

Page content transcription

If your browser does not render page correctly, please read the page content below

Evaluating task-agnostic exploration for fixed-batch learning of arbitrary future tasks

Evaluating task-agnostic exploration for fixed-batch learning of
                                                                     arbitrary future tasks
                                                      Vibhavari Dasagi1 , Robert Lee1 , Jake Bruce1,2 , and Jürgen Leitner1
                                                          1
                                                            Queensland University of Technology (QUT), Brisbane, Australia
                                                                              2
                                                                                DeepMind, London, UK
                                                                      Contact: vibhavari.dasagi@hdr.qut.edu.au
arXiv:1911.08666v1 [cs.LG] 20 Nov 2019

                                                                Abstract
                                             Deep reinforcement learning has been shown
                                             to solve challenging tasks where large amounts
                                             of training experience is available, usually ob-
                                             tained online while learning the task. Robotics
                                             is a significant potential application domain for
                                             many of these algorithms, but generating robot
                                             experience in the real world is expensive, espe-
                                             cially when each task requires a lengthy online
                                             training procedure. Off-policy algorithms can                    (a) Exploration phase        (b) Learning phase
                                             in principle learn arbitrary tasks from a diverse
                                             enough fixed dataset. In this work, we evalu-               Figure 1: In this work, we separate the phases of data
                                             ate popular exploration methods by generating               gathering and policy learning. We evaluate the perfor-
                                             robotics datasets for the purpose of learning to            mance of state-of-the-art exploration methods by using
                                             solve tasks completely offline without any fur-             the data they collect to learn to solve arbitrary tasks
                                             ther interaction in the real world. We present              completely offline.
                                             results on three popular continuous control tasks
                                             in simulation, as well as continuous control of             learning [Watkins and Dayan, 1992; Ernst et al., 2005;
                                             a high-dimensional real robot arm. Code docu-               Lange et al., 2012; Liu et al., 2015]: a well known classical
                                             menting all algorithms, experiments, and hyper-             paradigm that has been relatively under-explored in the
                                             parameters is available at https://github.                  context of modern robotics research.
                                             com/qutrobotlearning/batchlearning.                            While many contemporary approaches can be described
                                                                                                         as on-policy in which all the training data in each update
                                         1    Introduction                                               is generated directly by the current version of the agent
                                         Recent research in the field of model-free deep reinforce-      being trained [Mnih et al., 2016], off-policy algorithms
                                         ment learning (RL) has enabled complex, expressive poli-        can in principle learn from the data generated by any
                                         cies to be learned from experience for many challenging         arbitrary source of behavior, making them ideal candi-
                                         simulated and virtual task domains [Mnih et al., 2015;          dates for batch learning. However, these algorithms have
                                         Lillicrap et al., 2015]. The success of these methods sug-      been demonstrated to be unstable when learning from
                                         gests potential applications to robotics, and some progress     fixed datasets with insufficient coverage due to overesti-
                                         has been made in this direction [Frank et al., 2014;            mation bias in unfamiliar states [Fujimoto et al., 2018b;
                                         Zhang et al., 2015; Rusu et al., 2016; Levine et al., 2016;     Kalashnikov et al., 2018].
                                         Gu et al., 2017; Večerík et al., 2017]. A key limitation in        Batch RL has the potential to represent a significant
                                         learning robotics skills with deep reinforcement learning       step forward for robot learning, allowing robotics practi-
                                         is the cost of gathering new experience. Since different        tioners to collect powerful calibration datasets of robot
                                         control tasks with the same robot often involve similar         experience without requiring detailed task knowledge
                                         observations, actions, and dynamics, it would be conve-         in advance, while enabling completely offline training
                                         nient to gather a single dataset with diversity sufficient      on arbitrary tasks that were not known at exploration
                                         for agents to learn to solve arbitrary future tasks com-        time [Bruce et al., 2017]. In this work, we would like to
                                         pletely offline; this setting is known as batch reinforcement   call attention to this relatively under-explored paradigm,

and aim to take a step toward a solution by evaluating         of task-agnostic exploration and task-aware exploitation
the performance of various state-of-the-art exploration        phases in reinforcement learning [Colas et al., 2018], in
approaches for diverse task-agnostic experience collection,    which randomized linear policies are used to generate
for offline learning of arbitrary tasks that were not known    a bootstrap sample followed by a random perturbation
at dataset generation time.                                    procedure to encourage diversity of state visitation as
                                                               measured by hand-specified task-relevant features. The
2     Related Work                                             data gathered by GEP is then used to initialize the ex-
In this section we review the literature relating to off-      perience memory of a state-of-the-art continuous RL
policy learning on a fixed dataset, and the state-of-the-art   agent [Lillicrap et al., 2015], after which on-task training
in exploration methods that might be used to produce a         proceeded as usual.
maximally diverse training dataset without knowing the            In this paper, we consider a related but different
ultimate tasks of interest in advance.                         paradigm in which the dataset is collected entirely in
                                                               advance with no knowledge of the ultimate tasks the
2.1    Off-Policy Learning                                     agent will be trained on, and with no task-relevant fea-
Reinforcement learning algorithms can be broadly clas-         tures known ahead of time. Given this fixed dataset,
sified on a spectrum from on-policy in which training          we then initialize a state-dependent reward function and
data always comes from the current version of the agent,       train a task policy completely offline without any further
to off-policy in which the agent is able to learn from         interaction with the environment. The need for extremely
arbitrary experience; we are motivated by learning com-        diverse data in advance in order to cover arbitrary fu-
pletely offline, so we focus our attention on the latter.      ture tasks puts extra pressure on the exploration method,
Off-policy reinforcement learning has the potential to be      which forms the main focus of our evaluations in this
particularly applicable in our situation, because it opens     paper.
up the possibility of learning from many sources of expe-
                                                               2.2    Exploration
rience beyond that collected by the current policy [Gu
et al., 2017]. Impressive results have been achieved in        Learning generally requires exposure to diverse train-
domains such as robotic grasping by making use of task-        ing data. In reinforcement learning, generating diverse
relevant datasets from previous policies and even from         training data is typically achieved by an exploration mech-
entirely different experimental runs, in which diverse         anism internal to the agent in question, and exploration
data collection was identified as an essential requirement     techniques have been an active area of research in the
for offline learning [Kalashnikov et al., 2018]. These ap-     field from early on [Thrun, 1992]. Exploration is usually
proaches can be susceptible to overestimation bias in          considered in the context of online learning, in which the
unfamiliar states due to optimistic backup of estimated        agent must not only optimize its objective, but also take
future value; this bias can be mitigated to some degree        unexplored actions in order to learn the consequences
with pessimism in the face of uncertainty by trusting the      thereof. In this work, we are interested in exploration
minimum of two independent estimates [Hasselt, 2010;           from a slightly different angle: how to generate diverse
Van Hasselt et al., 2016; Fujimoto et al., 2018a].             datasets in the absence of any task feedback whatsoever.
   In the case of far-off-policy learning, where the distri-      Classical results show that the Q-learning algorithm is
bution mismatch grows very large between the training          provably convergent in the tabular case, given complete
dataset and the state-action visitation induced by the         exploration of the problem [Jaakkola et al., 1994]. Al-
policy, this overestimation bias can lead to instability       though tabular guarantees no longer hold in the context
and complete learning failure as estimation errors com-        of modern function approximation, it is intuitive that
pound indefinitely without the possibility of correction       effectively covering the space of the problem is important
by actively visiting those overestimated states online.        for convergent offline training.
When the desired task is sufficiently similar to the task         The simplest and most common exploration techniques
performed by the data-gathering policy, this issue can         involve simply adding noise to the policy. The standard
be mitigated by Batch-Constrained Q-learning (BCQ),            approach in discrete Q-learning is known as -greedy, in
in which the policy is constrained to avoid distribution       which a fraction  of the time rather than acting optimally,
mismatch between the training and on-policy data dis-          the agent chooses a random action [Watkins and Dayan,
tributions. This approach is reminiscent of imitation          1992; Mnih et al., 2015]. In continuous control tasks
learning, but with the benefit of being able to optimize       similar noise-based exploration techniques are often used,
arbitrary reward functions at offline training time [Fuji-     including directly adding iid or correlated noise to the
moto et al., 2018b].                                           actions [Lillicrap et al., 2015; Fujimoto et al., 2018a].
   A method known as Goal Exploration Processes (GEP)             Pure noise as a source of exploration behavior, while
has been proposed for the paradigm of explicit separation      simple and requiring few assumptions, has difficulty reach-

ing distant states: the expected exploration time for a          3     Approach
random policy to reach a given state grows exponen-              In this work, we consider the problem of generating a
tially with its distance, leading to the proposal of deep        static dataset of robot experience without task knowledge,
exploration [Osband et al., 2016] in which an ensemble           in order to prepare for learning to solve arbitrary tasks in
of policies are trained independently while sharing their        the future completely offline. We decompose the problem
experience, resulting in consistent behavior policies that       into two phases: exploration, in which we execute a state-
nonetheless result in diverse coverage of the problem            of-the-art exploration algorithm from the literature for a
space. GEP [Colas et al., 2018], described above, in-            fixed number of timesteps; and offline learning, in which
volves a similar technique in which a large number of            we train an off-policy RL algorithm to solve arbitrary
randomized linear policies form the basis of the explo-          tasks that were not known to the exploration agent when
ration behavior.                                                 the dataset was gathered.
   Another approach to exploration involves intrinsic mo-        3.1    Exploration
tivation [Chentanez et al., 2005], in which the reward func-
                                                                 In this phase, we execute an exploration algorithm on
tion of the problem is augmented with an additive bonus
                                                                 the robotic platform for a fixed number of timesteps in
that rewards the agent for visiting states in proportion to
                                                                 order to generate a dataset of diverse exploration data
their novelty. In count-based exploration methods [Belle-
                                                                 with no prior knowledge of the task. We describe three
mare et al., 2016; Tang et al., 2017; Ostrovski et al., 2017],
                                                                 popular exploration algorithms from the literature (RND,
the novelty of states is approximated directly in inverse
                                                                 DIAYN, and GEP) as well as a simple baseline and a
proportion to their visitation frequency. An indirect way
                                                                 novel exploration algorithm adapted from the literature.
to measure familiarity is in terms of prediction error of
a model being trained in parallel with the RL agent,             Random Network Distillation
such as forward predictive models [Pathak et al., 2017;          In RND [Burda et al., 2018b], a randomly-initialized
Pathak et al., 2019] or the error in predicting the state-       and fixed encoding function fteacher (x) → φ is used to
dependent output of another arbitrary network [Burda et          encode observations from the environment into fixed-
al., 2018b]. Diversity of states can be used directly as an      length feature vectors. These feature vectors are used
objective to optimize, by training a maximum-entropy             as the labels of a supervised learning procedure to train
RL agent to optimize its distinctiveness from other agents       another function fstudent (x) → φ̃. The reward given to
as measured by a state-dependent classification network          the exploration policy is the same objective that the
trained in parallel [Eysenbach et al., 2018]. Intrinsic mo-      supervised process is minimizing:
tivation has even been shown to achieve impressive task
performance in the complete absence of task reward, in                            RRND (xt ) = k φt − φ̃t k              (1)
situations in which pure exploration correlates with the
task objective [Burda et al., 2018a].                            Diversity Is All You Need
                                                                 DIAYN [Eysenbach et al., 2018] is a reinforcement learn-
  In self-driven learning more generally, model-based            ing algorithm that trains an ensemble of diverse skills by
approaches can be trained purely self-supervised while           rewarding each policy for being distinct as measured by
providing analytic gradients to train the policy di-             a learned classification algorithm fd (x) → P (skill|x). DI-
rectly [Deisenroth and Rasmussen, 2011; Gal et al., ;            AYN trains an ensemble of maximum-entropy RL agents
Heess et al., 2015]. Model ensembles can be leveraged to         to maximize the following reward while acting as ran-
provide an estimate of uncertainty in addition to diversity      domly as possible:
of experience [Kurutach et al., 2018], and uncertainty-
aware methods like these could be used to backpropagate                        RDIAYN (xt ) = log P (skillt |xt )        (2)
directly into a learning agent for either seeking or avoiding
uncertainty [Henaff et al., 2019] as the need arises.            Goal Exploration Processes
                                                                 GEP [Colas et al., 2018] attempts to gather diverse data
  In this work, we are primarily interested in task-             by generating a set of randomized linear policies and
agnostic exploration.      We consider state-of-the-art          executing them to collect experience, which is stored
exploration methods Random Network Distillation                  in memory in the form of a task-dependent descriptor
(RND) [Burda et al., 2018b], Diversity Is All You Need           extracted from each trajectory. Since we do not allow the
(DIAYN) [Eysenbach et al., 2018], and GEP [Colas et al.,         exploration phase any knowledge of the ultimate tasks,
2018], for the purpose of generating diverse datasets with       we simply store the element-wise mean of the states
no task knowledge, evaluated according to the perfor-            along the trajectory as its descriptor. Once a number of
mance of a separate off-policy agent learning to optimize        randomized policies have been executed (N = 50 in our
entirely new tasks unknown at exploration time.                  case, as in the original work), random “goal” states are

Figure 2: Experimental environments: HalfCheetah-v1, Hopper-v1, Walker2d-v1, and real FrankaEmika Panda arm
with 7 degrees of freedom.

sampled from the state space, and the policy in memory            also requires the episode termination variable, we also
associated with the nearest state to the goal is perturbed        learn a termination prediction model fdone (xt ) → D̃t on-
with random noise and executed again, adding its new              line from data during exploration. We train a Soft Actor-
experience to the memory. We continue this procedure              Critic (SAC)-style maximum-entropy RL agent [Haarnoja
until our exploration dataset is the desired size.                et al., 2018] to directly maximize the following quantity,
                                                                  representing the sum of intrinsic rewards plus entropy
Random Policies
                                                                  terms:
To measure the importance of the goal-sampling step in
GEP, we also evaluate a simple baseline in which only the                 RSSE (xt , at ) = k f1 (xt , at ) − f2 (xt , at ) k   (5)
randomized policy step is applied. Rather than sampling
goals and perturbing policies from memory, this baseline                    T
                                                                            X
randomly initializes a new policy every episode.                   VSSE =         γ t (1 − D̃t )(RSSE (xt , at ) − log(P (at ))) (6)
                                                                            t=0
Self-Supervised Exploration
In addition to the existing baselines, we evaluate a novel          When trained on simulated forward rollouts from the
exploration approach obtained by turning the RL ob-               model and backpropagating through time, we obtain
jective typically optimized by prediction-error-based in-         a model-augmented supervised version of intrinsically-
trinsic motivation algorithms into a supervised objec-            motivated SAC that we refer to as Self-Supervised Ex-
tive through use of the forward model fforward (xt , at ) →       ploration (SSE).
x̃t+1 that is often trained as a byproduct of these ap-
                                                                  3.2   Offline Learning
proaches [Pathak et al., 2017; Pathak et al., 2019]. The
typical prediction-error-based reward being maximized is          Given a fixed dataset of robot experience, we are in-
of the form:                                                      terested in learning to solve arbitrary tasks completely
                                                                  offline with no further interaction with the environment.
         Rintrinsic (xt , at , xt+1 ) = k xt+1 − x̃t+1 k   (3)    In this section, we describe the two off-policy algorithms
                                                                  that we evaluate for learning tasks offline on fixed data,
  RL algorithms usually account for the difficulty of
                                                                  both of which are based on the deep deterministic policy
predicting the long-term future by optimizing discounted
                                                                  gradient (DDPG) [Lillicrap et al., 2015] algorithm for
rewards:
                                                                  continuous Q-learning.
              XT
        VR =      γ t (1 − Dt )Rintrinsic (xt , at , xt+1 ) (4)   Twin Delayed DDPG
               t=0                                                TD3 [Fujimoto et al., 2018a] is an improvement to
   where γ ∈ [0, 1) is the discount factor, and Dt is an          DDPG that aims to reduce the overestimation bias
indicator variable describing whether or not the episode          that is common when training off-policy value func-
has ended by time t. Because Rt and x̃t+1 depend on the           tions. Two Q-networks are trained simultaneously, and
forward model that is a byproduct of these prediction-            the minimum is chosen when evaluating the Q-value
error based methods, we can implement an exploration              for the purpose of bootstrapping (as in [Hasselt, 2010;
technique in which the gradient of the agent’s policy is          Van Hasselt et al., 2016]), which corresponds to pes-
estimated by backpropagation directly through our pre-            simistic estimation in the face of uncertainty. Further-
dictions of the future. We take inspiration from [Pathak          more, noise is added to the output of the policy during
et al., 2019], but rather than assume the next state does         training to encourage smoothness of estimation in small
not depend on the action, we train a pair of forward mod-         regions around observed experience. Finally, as in the
els f{1,2} (xt , at ) → x̃t+1 and optimize for their divergence   original work, the policy is trained half as frequently as
as a proxy for novelty. Because the sum of future rewards         the value networks.

Batch-Constrained Deep Q-learning                               Learning Method            Exploration Method
BCQ [Fujimoto et al., 2018b] achieves improved offline                                 Rand     DIAYN      RND     SSD
learning by training a state-conditional generative model
of the actions in the batch, which can then be used to                 TD3               0.46      0.51     0.49   0.30
sample actions that reflect the actions present in the                 BCQ               0.47      0.30     0.52   0.44
dataset. Keeping the policy action close to the buffer
distribution reduces the extrapolation error that would         Table 1: Distances of closest tooltip positions in meters
otherwise accumulate due to distribution mismatch and           for each evaluated method on the real robot.
overestimation.
                                                                in radians, and joint velocities in radians per second, re-
                                                                sulting in observation vectors of length 14. Actions were
4     Experiments                                               sent at 20Hz in the form of joint velocities in radians
In this work, we consider exploration methods to generate       per second, clipped in the range [−0.5, 0.5], and episodes
diverse datasets of robot experience for learning arbitrary     were reset after 1000 timesteps or when the robot violated
tasks completely offline. We first conduct a thorough           physical safety limits such as self-collision. We collected
evaluation of the approaches described in Section 3 on          data on the physical robot in a manner similar to the sim-
three standard simulated continuous control tasks as            ulated domains, but we limited the dataset size to 200K
in [Fujimoto et al., 2018b]. We then evaluate the best          transitions due to the additional time cost of physical
performing approaches to explore and then learn reaching        experiments. Policies were trained using TD3 and BCQ
tasks offline on a physical FrankaEmika Panda robot arm         offline to solve reaching tasks starting from a determin-
with 7 degrees of freedom. All environments are shown           istic “home” configuration to one of four different goals,
in Figure 2.                                                    specified differently for each task. Results for the real
                                                                robot experiments are shown in Table 1, comparing the
4.1    Simulation Experiments                                   distance to the target point for each exploration method
We first evaluate on a standard benchmark suite of 3            and offline learning algorithm.
OpenAI Gym MuJoCo environments: HalfCheetah-v1,
Hopper-v1, and Walker2d-v1 [Todorov et al., 2012;               5   Discussion
Brockman et al., 2016]. As described in Section 3, we           Somewhat counter-intuitively, the state-of-the-art RL
partition the experiments into a task-agnostic exploration      exploration methods we evaluated did not perform par-
phase followed by a task-aware offline training phase. In       ticularly well in our experiments, as shown in the
the exploration phase, each exploration method generates        HalfCheetah-v1 results in Figure 3. Particularly sur-
1 million transitions of experience in the simulated do-        prising is the result that randomly generating a new
main in the form of (xt , at , xt+1 ), and this data is saved   linear policy every episode seems to outperform many
as a static dataset.                                            of the other baselines by a wide margin. This suggests
   In the offline training phase, we choose a task with a       that despite achieving impressive results during online
state-dependent reward function that was not known to           training, current methods of exploration are not well
the exploration agent, and train an off-policy RL algo-         suited to the pure exploration paradigm, as described
rithm to solve the task on the data in the static dataset       in this paper. Furthermore, BCQ did not perform as
for 300,000 training steps. For task rewards, we use the        well as expected, but this is reasonable as it was not
standard locomotion tasks of maximizing forward veloc-          designed to learn from purely task-agnostic data. Also
ity, as provided by the three environments. We evaluate         of interest is that the best performing algorithms on the
RND, DIAYN, GEP, random networks, and our proposal              real robot did correspond to the best performance on the
SSE as exploration methods, and TD3 and BCQ as off-             simulation tasks. Note however that we did not engage in
policy learning algorithms. In all cases we use the default     heavy parameter tuning of the offline learning algorithms
hyperparameters from the original papers that proposed          we used, and Rainbow-style improvements [Hessel et al.,
the algorithms, and code for our experiments is freely          2018] to off-policy algorithms may provide improvements
available at https://github.com/qutrobotlearning/               to the result regardless of the exploration method used.
batchlearning. Results for the simulation experiments              During the exploration phase on the real robot we ob-
are shown in Figure 3.                                          served that the randomly generated linear policies seemed
                                                                to generate vastly diverse actions that in totality cov-
4.2    Physical Robot                                           ered a larger portion of the state space compared to the
In order to validate the approach on a real robot, we           other exploration methods. This may imply that while
consider the FrankaEmika Panda robot arm platform               systematically covering the state space might be useful
with 7 degrees of freedom. Agents observed joint angles         for exploration given an unlimited dataset, with the re-

(a) TD3                                                      (b) BCQ
                                               Figure 3: Simulation Results

strictions of a limited static dataset it is vital to explore   dataset to train robots with a lifetime worth of skills on
regions in the state space that are far apart. The greater      demand.
diversity in the dataset may increase generalization ca-           Our experiments showed interesting and unexpected
pabilities of the agent to nearby previously-unexplored         results for the state-of-the-art exploration methods and
states while reducing the chances of visiting states com-       off-policy algorithms in this setting. Since exploration
pletely different from those in the dataset.                    has been shown to be an important component of RL per-
                                                                formance, we were expecting the established exploration
6    Conclusion                                                 algorithms to generate diverse enough data to train tasks
                                                                offline, but in domains such as Hopper-v1 and Walker-2d,
In this work, we consider the paradigm of task-agnostic ex-     purely self-directed exploration without a task seems to
ploration for generating datasets that are diverse enough       be very challenging. We believe that this justifies further
to train policies to solve arbitrary tasks with no further      research in this paradigm given the potential benefits to
interaction with the environment. This is an important          robotics from single-dataset offline training.
goal for robotics, potentially enabling a single diverse           We make our algorithms, experiments, and hyper-

parameters freely available on https://github.com/               motion planning on humanoids. Frontiers in neuro-
qutrobotlearning/batchlearning.                                 robotics, 2014.
                                                              [Fujimoto et al., 2018a] Scott Fujimoto, Herke Hoof, and
References                                                       David Meger. Addressing function approximation error
[Bellemare et al., 2016] Marc Bellemare, Sriram Srini-           in actor-critic methods. In International Conference
  vasan, Georg Ostrovski, Tom Schaul, David Saxton,             on Machine Learning, pages 1582–1591, 2018.
   and Remi Munos. Unifying count-based exploration           [Fujimoto et al., 2018b] Scott Fujimoto, David Meger,
   and intrinsic motivation. In Advances in Neural Infor-        and Doina Precup.          Off-policy deep reinforce-
  mation Processing Systems, 2016.                               ment learning without exploration. arXiv preprint
[Brockman et al., 2016] Greg Brockman, Vicki Cheung,            arXiv:1812.02900, 2018.
   Ludwig Pettersson, Jonas Schneider, John Schulman,         [Gal et al., ] Yarin Gal, Rowan McAllister, and Carl Ed-
  Jie Tang, and Wojciech Zaremba. Openai gym. arXiv             ward Rasmussen. Improving pilco with bayesian neural
   preprint arXiv:1606.01540, 2016.                              network dynamics models.
[Bruce et al., 2017] Jake Bruce, Niko Sünderhauf, Piotr       [Gu et al., 2017] Shixiang Gu, Ethan Holly, Timothy Lil-
   Mirowski, Raia Hadsell, and Michael Milford. One-             licrap, and Sergey Levine. Deep reinforcement learning
   shot reinforcement learning for robot navigation with         for robotic manipulation with asynchronous off-policy
   interactive replay. arXiv preprint arXiv:1711.10137,          updates. In 2017 IEEE International Conference on
  2017.                                                         Robotics and Automation (ICRA), pages 3389–3396.
[Burda et al., 2018a] Yuri Burda, Harri Edwards,                 IEEE, 2017.
   Deepak Pathak, Amos Storkey, Trevor Darrell, and           [Haarnoja et al., 2018] Tuomas Haarnoja, Aurick Zhou,
  Alexei A Efros. Large-scale study of curiosity-driven          Pieter Abbeel, and Sergey Levine.           Soft actor-
   learning. arXiv preprint arXiv:1808.04355, 2018.              critic: Off-policy maximum entropy deep reinforce-
[Burda et al., 2018b] Yuri Burda, Harrison Edwards,              ment learning with a stochastic actor. arXiv preprint
  Amos Storkey, and Oleg Klimov.              Exploration       arXiv:1801.01290, 2018.
   by random network distillation. arXiv preprint             [Hasselt, 2010] Hado V Hasselt. Double q-learning. In
  arXiv:1810.12894, 2018.                                       Advances in Neural Information Processing Systems,
[Chentanez et al., 2005] Nuttapong Chentanez, An-               2010.
   drew G Barto, and Satinder P Singh. Intrinsically          [Heess et al., 2015] Nicolas Heess, Gregory Wayne,
   motivated reinforcement learning. In Advances in              David Silver, Timothy Lillicrap, Tom Erez, and Yuval
  neural information processing systems, 2005.                  Tassa. Learning continuous control policies by stochas-
[Colas et al., 2018] Cédric Colas, Olivier Sigaud, and           tic value gradients. In Advances in Neural Information
   Pierre-Yves Oudeyer. Gep-pg: Decoupling exploration          Processing Systems, 2015.
   and exploitation in deep reinforcement learning al-        [Henaff et al., 2019] Mikael Henaff, Alfredo Canziani,
   gorithms. In International Conference on Machine              and Yann LeCun. Model-predictive policy learning
  Learning (ICML), 2018.                                        with uncertainty regularization for driving in dense
[Deisenroth and Rasmussen, 2011] Marc Deisenroth and             traffic. arXiv preprint arXiv:1901.02705, 2019.
   Carl E Rasmussen. Pilco: A model-based and data-           [Hessel et al., 2018] Matteo Hessel, Joseph Modayil,
   efficient approach to policy search. In Proceedings of        Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will
  the 28th International Conference on machine learning          Dabney, Dan Horgan, Bilal Piot, Mohammad Azar,
  (ICML-11), 2011.                                               and David Silver. Rainbow: Combining improvements
[Ernst et al., 2005] Damien Ernst, Pierre Geurts, and            in deep reinforcement learning. In Thirty-Second AAAI
   Louis Wehenkel. Tree-based batch mode reinforce-             Conference on Artificial Intelligence, 2018.
   ment learning. Journal of Machine Learning Research,       [Jaakkola et al., 1994] Tommi Jaakkola, Michael I Jor-
   6(Apr):503–556, 2005.                                         dan, and Satinder P Singh. Convergence of stochastic
[Eysenbach et al., 2018] Benjamin Eysenbach, Abhishek            iterative dynamic programming algorithms. In Ad-
   Gupta, Julian Ibarz, and Sergey Levine. Diversity is         vances in neural information processing systems, pages
   all you need: Learning skills without a reward function.     703–710, 1994.
  arXiv preprint arXiv:1802.06070, 2018.                      [Kalashnikov et al., 2018] Dmitry Kalashnikov, Alex Ir-
[Frank et al., 2014] Mikhail Frank, Jürgen Leitner, Mar-         pan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric
   ijn Stollenga, Alexander Förster, and Jürgen Schmid-         Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrish-
   huber. Curiosity driven reinforcement learning for            nan, Vincent Vanhoucke, et al. Qt-opt: Scalable deep

reinforcement learning for vision-based robotic manip-      ings of the IEEE Conference on Computer Vision and
  ulation. arXiv preprint arXiv:1806.10293, 2018.             Pattern Recognition Workshops, pages 16–17, 2017.
[Kurutach et al., 2018] Thanard Kurutach, Ignasi Clav-      [Pathak et al., 2019] Deepak Pathak, Dhiraj Gandhi,
  era, Yan Duan, Aviv Tamar, and Pieter Abbeel. Model-        and Abhinav Gupta. Beyond games: Bringing ex-
  ensemble trust-region policy optimization. arXiv            ploration to robots in real-world, 2019.
  preprint arXiv:1802.10592, 2018.                          [Rusu et al., 2016] Andrei A Rusu, Mel Vecerik, Thomas
[Lange et al., 2012] Sascha Lange, Thomas Gabel, and          Rothörl, Nicolas Heess, Razvan Pascanu, and Raia
  Martin Riedmiller. Batch reinforcement learning. In         Hadsell. Sim-to-real robot learning from pixels with
  Reinforcement learning, pages 45–73. Springer, 2012.        progressive nets. arXiv preprint arXiv:1610.04286,
                                                              2016.
[Levine et al., 2016] Sergey Levine, Chelsea Finn, Trevor
  Darrell, and Pieter Abbeel. End-to-end training of deep   [Tang et al., 2017] Haoran Tang, Rein Houthooft, Davis
  visuomotor policies. The Journal of Machine Learning        Foote, Adam Stooke, OpenAI Xi Chen, Yan Duan,
  Research, 17(1):1334–1373, 2016.                            John Schulman, Filip DeTurck, and Pieter Abbeel. #
                                                              exploration: A study of count-based exploration for
[Lillicrap et al., 2015] Timothy P Lillicrap, Jonathan J      deep reinforcement learning. In Advances in neural
   Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez,          information processing systems, 2017.
  Yuval Tassa, David Silver, and Daan Wierstra. Contin-
                                                            [Thrun, 1992] Sebastian B Thrun. Efficient exploration
   uous control with deep reinforcement learning. arXiv
   preprint arXiv:1509.02971, 2015.                           in reinforcement learning. 1992.
                                                            [Todorov et al., 2012] Emanuel Todorov, Tom Erez, and
[Liu et al., 2015] De-Rong Liu, Hong-Liang Li, and Ding
                                                              Yuval Tassa. Mujoco: A physics engine for model-based
  Wang. Feature selection and feature learning for high-
                                                              control. In 2012 IEEE/RSJ International Conference
   dimensional batch reinforcement learning: A survey.
                                                              on Intelligent Robots and Systems, pages 5026–5033.
  International Journal of Automation and Computing,
                                                              IEEE, 2012.
  2015.
                                                            [Van Hasselt et al., 2016] Hado Van Hasselt, Arthur
[Mnih et al., 2015] Volodymyr       Mnih,         Koray       Guez, and David Silver. Deep reinforcement learning
  Kavukcuoglu, David Silver, Andrei A Rusu, Joel              with double q-learning. In Thirtieth AAAI Conference
  Veness, Marc G Bellemare, Alex Graves, Martin               on Artificial Intelligence, 2016.
  Riedmiller, Andreas K Fidjeland, Georg Ostrovski,
  et al. Human-level control through deep reinforcement     [Večerík et al., 2017] Matej Večerík, Todd Hester,
  learning. Nature, 518(7540):529, 2015.                      Jonathan Scholz, Fumin Wang, Olivier Pietquin, Bilal
                                                              Piot, Nicolas Heess, Thomas Rothörl, Thomas Lampe,
[Mnih et al., 2016] Volodymyr Mnih, Adria Puig-               and Martin Riedmiller. Leveraging demonstrations
  domenech Badia, Mehdi Mirza, Alex Graves, Timothy           for deep reinforcement learning on robotics problems
  Lillicrap, Tim Harley, David Silver, and Koray              with sparse rewards. arXiv preprint arXiv:1707.08817,
  Kavukcuoglu.      Asynchronous methods for deep             2017.
  reinforcement learning. In International conference on
                                                            [Watkins and Dayan, 1992] Christopher JCH Watkins
  machine learning, pages 1928–1937, 2016.
                                                              and Peter Dayan. Q-learning. Machine learning, 8(3-
[Osband et al., 2016] Ian Osband, Charles Blundell,           4):279–292, 1992.
  Alexander Pritzel, and Benjamin Van Roy. Deep
                                                            [Zhang et al., 2015] Fangyi Zhang, Jürgen Leitner,
  exploration via bootstrapped dqn. In D. D. Lee,
                                                              Michael Milford, Ben Upcroft, and Peter Corke.
  M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Gar-
                                                              Towards vision-based deep reinforcement learn-
  nett, editors, Advances in Neural Information Process-
                                                              ing for robotic motion control.   arXiv preprint
  ing Systems 29, pages 4026–4034. Curran Associates,
                                                              arXiv:1511.03791, 2015.
  Inc., 2016.
[Ostrovski et al., 2017] Georg Ostrovski, Marc G Belle-
  mare, Aäron van den Oord, and Rémi Munos. Count-
  based exploration with neural density models. In Pro-
  ceedings of the 34th International Conference on Ma-
  chine Learning-Volume 70. JMLR. org, 2017.
[Pathak et al., 2017] Deepak Pathak, Pulkit Agrawal,
  Alexei A Efros, and Trevor Darrell. Curiosity-driven
  exploration by self-supervised prediction. In Proceed-