Reset-Free Reinforcement Learning via Multi-Task Learning: Learning Dexterous Manipulation Behaviors without Human Intervention
Abhishek Gupta*1, Justin Yu*1, Tony Z. Zhao*1, Vikash Kumar*2, Aaron Rovinsky1, Kelvin Xu1, Thomas Devlin1, Sergey Levine1
1 UC Berkeley, 2 University of Washington
* Authors contributed equally to this work. https://sites.google.com/view/mtrf
arXiv:2104.11203v1 [cs.LG] 22 Apr 2021

Abstract— Reinforcement Learning (RL) algorithms can in principle acquire complex robotic skills by learning from large amounts of data in the real world, collected via trial and error. However, most RL algorithms use a carefully engineered setup in order to collect data, requiring human supervision and intervention to provide episodic resets. This is particularly evident in challenging robotics problems, such as dexterous manipulation. To make data collection scalable, such applications require reset-free algorithms that are able to learn autonomously, without explicit instrumentation or human intervention. Most prior work in this area handles single-task learning. However, we might also want robots that can perform large repertoires of skills. At first, this would appear to only make the problem harder. However, the key observation we make in this work is that an appropriately chosen multi-task RL setting actually alleviates the reset-free learning challenge, with minimal additional machinery required. In effect, solving a multi-task problem can directly solve the reset-free problem since different combinations of tasks can serve to perform resets for other tasks. By learning multiple tasks together and appropriately sequencing them, we can effectively learn all of the tasks together reset-free. This type of multi-task learning can effectively scale reset-free learning schemes to much more complex problems, as we demonstrate in our experiments. We propose a simple scheme for multi-task learning that tackles the reset-free learning problem, and show its effectiveness at learning to solve complex dexterous manipulation tasks in both hardware and simulation without any explicit resets. This work shows the ability to learn dexterous manipulation behaviors in the real world with RL without any human intervention.

Fig. 1: Reset-free learning of dexterous manipulation behaviors by leveraging multi-task learning. When multiple tasks are learned together, different tasks can serve to reset each other, allowing for uninterrupted continuous learning of all of the tasks. This allows for the learning of dexterous manipulation tasks like in-hand manipulation and pipe insertion with a 4-fingered robotic hand, without any human intervention, with over 60 hours of uninterrupted training.

I. INTRODUCTION

Reinforcement learning algorithms have shown promise in enabling robotic tasks in simulation [1], [2], and even some tasks in the real world [3], [4]. RL algorithms in principle can learn generalizable behaviors with minimal manual engineering, simply by collecting their own data via trial and error. This approach works particularly well in simulation, where data is cheap and abundant. Success in the real world has been restricted to settings where significant environment instrumentation and engineering is available to enable autonomous reward calculation and episodic resets [5], [6], [7], [8]. To fully realize the promise of robotic RL in the real world, we need to be able to learn even in the absence of environment instrumentation.

In this work, we focus specifically on reset-free learning for dexterous manipulation, which presents an especially clear lens on the reset-free learning problem. For instance, a dexterous hand performing in-hand manipulation, as shown in Figure 1 (right), must delicately balance the forces on the object to keep it in position. Early on in training, the policy will frequently drop the object, necessitating a particularly complex reset procedure. Prior work has addressed this manually by having a human involved in the training process [9], [10], [11], instrumenting a reset in the environment [12], [13], or even by programming a separate robot to place the object back in the hand [7]. Though some prior techniques have sought to learn some form of a "reset" controller [14], [15], [16], [8], [17], [18], none of these are able to successfully scale to solve a complex dexterous manipulation problem without hand-designed reset systems, due to the challenge of learning robust reset behaviors.

However, general-purpose robots deployed in real-world settings will likely be tasked with performing many different behaviors. While multi-task learning algorithms in these settings have typically been studied in the context of improving sample efficiency and generalization [19], [20], [21], [22], [23], [24], in this work we make the observation that multi-task
algorithms naturally lend themselves to the reset-free learning problem. We hypothesize that the reset-free RL problem can be addressed by reformulating it as a multi-task problem, and appropriately sequencing the tasks commanded and learned during online reinforcement learning. As outlined in Figure 6, solving a collection of tasks simultaneously presents the possibility of using some tasks as resets for others, thereby removing the need for explicit per-task resets. For instance, if we consider the problem of learning a variety of dexterous hand manipulation behaviors, such as in-hand reorientation, then learning and executing behaviors such as recenter and pickup can naturally reset the other tasks in the event of failure (as we describe in Section IV and Figure 6). We show that by learning multiple different tasks simultaneously and appropriately sequencing behavior across different tasks, we can learn all of the tasks without any episodic resets required at all. This allows us to effectively learn a "network" of reset behaviors, each of which is easier than learning a complete reset controller, but which together can execute and learn more complex behavior.

The main contribution of this work is to propose a learning system that can learn dexterous manipulation behaviors without the need for episodic resets. We do so by leveraging the paradigm of multi-task reinforcement learning to make the reset-free problem less challenging. The system accepts a diverse set of tasks that are to be learned, and then trains reset-free, leveraging progress in some tasks to provide resets for other tasks. To validate this algorithm for reset-free robotic learning, we perform both simulated and hardware experiments on a dexterous manipulation system with a four-fingered anthropomorphic robotic hand. To our knowledge, these results demonstrate the first instance of a combined hand-arm system learning dexterous in-hand manipulation with deep RL entirely in the real world with minimal human intervention during training, simultaneously acquiring both the in-hand manipulation skill and the skills needed to retry the task. We also show the ability of this system to learn other dexterous manipulation behaviors, like pipe insertion, via uninterrupted real-world training, as well as several tasks in simulation.

II. RELATED WORK

RL algorithms have been applied to a variety of robotics problems in simulation [25], [1], [26], [27], and have also seen application to real-world problems, such as locomotion [28], [29], [30], grasping [6], [31], [32], [33], manipulation of articulated objects [34], [35], [13], [36], [37], and even dexterous manipulation [38], [10]. Several prior works [39] have shown how to acquire dexterous manipulation behaviors with optimization [40], [41], [42], [43], [44], reinforcement learning in simulation [1], [45], [46], and even in the real world [47], [9], [48], [7], [15], [12], [10], [49], [50], [51]. These techniques have leaned on highly instrumented setups to provide episodic resets and rewards. For instance, prior work uses a scripted second arm [7] or separate servo motors [12] to perform resets. Contrary to these, our work focuses on removing the need for explicit environment resets by leveraging multi-task learning.

Our work is certainly not the first to consider the problem of reset-free RL [52]. [14] proposes a scheme that interleaves attempts at the task with episodes from an explicitly learned reset controller, trained to reach the initial state. Building on this work, [8] shows how to learn simple dexterous manipulation tasks without instrumentation using a perturbation controller that explores for novelty instead of a reset controller. [53], [54] demonstrate learning of multi-stage tasks by progressively sequencing a chain of forward and backward controllers. Perhaps the most closely related work to ours algorithmically is the framework proposed in [28], where the agent learns locomotion by leveraging multi-task behavior. However, this work studies tasks with cyclic dependencies specifically tailored towards the locomotion domain. Our work shows that having a variety of tasks and learning them all together via multi-task RL can allow solutions to challenging reset-free problems in dexterous manipulation domains.

Our work builds on the framework of multi-task learning [19], [20], [21], [22], [23], [24], [55], [56], which has been leveraged to learn collections of behaviors, to improve generalization as well as sample efficiency, and has even been applied to real-world robotics problems [57], [58], [59]. In this work, we take a different view on multi-task RL, as a means to solve reset-free learning problems.

III. PRELIMINARIES

We build on the framework of Markov decision processes for reinforcement learning; we refer the reader to [60] for a more detailed overview. RL algorithms consider the problem of learning a policy \pi(a|s) such that the expected sum of rewards R(s_t, a_t) obtained under such a policy is maximized when starting from an initial state distribution \mu_0 and dynamics \mathcal{P}(s_{t+1}|s_t, a_t). This objective is given by:

J(\pi) = \mathbb{E}_{s_0 \sim \mu_0,\; a_t \sim \pi(a_t|s_t),\; s_{t+1} \sim \mathcal{P}(s_{t+1}|s_t, a_t)} \left[ \sum_{t=0}^{T} \gamma^t R(s_t, a_t) \right]    (1)

While many algorithms exist to optimize this objective, in this work we build on the framework of actor-critic algorithms [61]. Although we build on the actor-critic framework, we emphasize that our framework can be used effectively with many standard reinforcement learning algorithms with minimal modifications.

As we note in the following section, we address the reset-free RL problem via multi-task RL. Multi-task RL attempts to learn multiple tasks simultaneously. Under this setting, each of K tasks involves a separate reward function R_i, a different initial state distribution \mu_0^i, and a potentially different optimal policy \pi_i. Given a distribution over the tasks p(i), the multi-task problem can then be described as:

J(\pi_0, \ldots, \pi_{K-1}) = \mathbb{E}_{i \sim p(i)} \left[ \mathbb{E}_{s_0 \sim \mu_0^i,\; a_t \sim \pi_i(a_t|s_t),\; s_{t+1} \sim \mathcal{P}(s_{t+1}|s_t, a_t)} \left[ \sum_{t} \gamma^t R_i(s_t, a_t) \right] \right]    (2)
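To make the multi-task objective in Eq. (2) concrete, the following is a minimal sketch of how it could be estimated by Monte Carlo rollouts: sample a task index i ~ p(i), roll out that task's policy from its initial state distribution, and average the discounted returns. The `task_envs` and `policies` interfaces are illustrative assumptions for this sketch, not an API defined by the paper.

```python
import numpy as np

def estimate_multitask_return(task_envs, policies, task_probs,
                              gamma=0.99, horizon=200, num_samples=50):
    """Monte Carlo estimate of J(pi_0, ..., pi_{K-1}) from Eq. (2)."""
    K = len(task_envs)
    returns = []
    for _ in range(num_samples):
        i = np.random.choice(K, p=task_probs)   # sample a task index i ~ p(i)
        s = task_envs[i].reset()                # s_0 ~ mu_0^i (evaluation-time reset)
        total, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policies[i].sample(s)           # a_t ~ pi_i(a_t | s_t)
            s, r, done = task_envs[i].step(a)   # s_{t+1} ~ P(. | s_t, a_t), r = R_i(s_t, a_t)
            total += discount * r
            discount *= gamma
            if done:
                break
        returns.append(total)
    return float(np.mean(returns))
```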
In the following section, we will discuss how viewing reset-free learning through the lens of multi-task learning can naturally address the challenges in reset-free RL.

IV. LEARNING DEXTEROUS MANIPULATION BEHAVIORS RESET-FREE VIA MULTI-TASK RL

One of the main advantages of dexterous robots is their ability to perform a wide range of different tasks. Indeed, we might imagine that a real-world dexterous robotic system deployed in a realistic environment, such as a home or office, would likely need to perform a repertoire of different behaviors, rather than just a single skill. While this may at first seem like it would only make the problem of learning without resets more difficult, the key observation we make in this work is that the multi-task setting can actually facilitate reset-free learning without manually provided instrumentation. When a large number of diverse tasks are being learned simultaneously, some tasks can naturally serve as resets for other tasks during learning. Learning each of the tasks individually without resets is made easier by appropriately learning and sequencing together other tasks in the right order. By doing so, we can replace the simple forward/reset dichotomy with a more natural "network" of multiple tasks that can perform complex reset behaviors between each other.

Let us ground this intuition in a concrete example. Given a dexterous table-top manipulation task, shown in Fig 6 and Fig 2, our reset-free RL procedure might look like this: let us say the robot starts with the object in the palm and is trying to learn how to manipulate it in-hand so that it is oriented in a particular direction (in-hand reorient). While doing so, it may end up dropping the object. When learning with resets, a person would need to pick up the object and place it back in the hand to continue training. However, since we would like the robot to learn without such manual interventions, the robot itself needs to retrieve the object and resume practicing. To do so, the robot must first re-center the object so that it is suitable for grasping, and then actually lift it and flip it up so it's in the palm to resume practicing. In case any of the intermediate tasks (say lifting the object) fails, the recenter task can be deployed to attempt picking up again, practicing these tasks themselves in the process. Appropriately sequencing the execution and learning of different tasks allows for the autonomous practicing of in-hand manipulation behavior, without requiring any human or instrumented resets.

A. Algorithm Description

In this work, we directly leverage this insight to build a dexterous robotic system that learns in the absence of resets. We assume that we are provided with K different tasks that need to be learned together. These tasks each represent some distinct capability of the agent. As described above, in the dexterous manipulation domain the tasks might involve re-centering, picking up the object, and reorienting an object in-hand. Each of these K different tasks is provided with its own reward function R_i(s_t, a_t), and at test time is evaluated against its distinct initial state distribution \mu_0^i.

Our proposed learning system, which we call Multi-Task learning for Reset-Free RL (MTRF), attempts to jointly learn K different policies \pi_i, one for each of the defined tasks, by leveraging off-policy RL and continuous data collection in the environment. The system is initialized by randomly sampling a task and a state s_0 sampled from that task's initial state distribution. The robot collects a continuous stream of data without any subsequent resets in the environment by sequencing the K policies according to a meta-controller (referred to as a "task-graph") G(s) : S → {0, 1, . . . , K − 1}. Given the current state of the environment and the learning process, the task-graph makes a decision once every T time steps on which of the tasks should be executed and trained for the next T time steps. This task-graph decides in what order the tasks should be learned and which of the policies should be used for data collection. The learning proceeds by iteratively collecting data with a policy \pi_i chosen by the task-graph for T time steps, after which the collected data is saved to a task-specific replay buffer B_i and the task-graph is queried again for which task to execute next, and the whole process repeats.

We assume that, for each task, successful outcomes for tasks that lead into that task according to the task graph (i.e., all incoming edges) will result in valid initial states for that task. This assumption is reasonable: intuitively, it states that an edge from task A to task B implies that successful outcomes of task A are valid initial states for task B. This means that, if task B is triggered after task A, it will learn to succeed from these valid initial states under \mu_0^B. While this does not always guarantee that the downstream controller for task B will see all of the initial states from \mu_0^B, since the upstream controller is not explicitly optimizing for coverage, in practice we find that this still performs very well. However, we expect that it would also be straightforward to introduce coverage into this method by utilizing state marginal matching methods [62]. We leave this for future work.

The individual policies can continue to be trained by leveraging the data collected in their individual replay buffers B_i via off-policy RL. As individual tasks become more and more successful, they can start to serve as effective resets for other tasks, forming a natural curriculum. The proposed framework is general and capable of learning a diverse collection of tasks reset-free when provided with a task graph that leverages the diversity of tasks. This leaves open the question of how to actually define the task-graph G to effectively sequence tasks. In this work, we assume that a task-graph defining the various tasks and the associated transitions is provided to the agent by the algorithm designer. In practice, providing such a graph is simple for a human user, although it could in principle be learned from experiential data. We leave this extension for future work. Interestingly, many other reset-free learning methods [14], [53], [54] can be seen as special cases of the framework we have described above. In our experiments, we incorporate one of the tasks as a "perturbation" task. While prior work considered doing this with a single forward controller [8], we show that this type of perturbation can generally be applied by simply viewing it as another task. We incorporate this perturbation task in our instantiation of the algorithm, but we do not show it in the task graph figures for simplicity.
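To make the procedure above concrete, here is a minimal sketch of the reset-free data-collection and training loop, together with an example hand-specified task graph for the in-hand manipulation family. Everything here is an assumption made for illustration: `env`, `policies`, `critics`, `buffers`, `reward_fns`, `sac_update`, and the state predicates (`object_lifted`, `object_off_center`, `object_in_palm`) are hypothetical placeholders, not the authors' implementation.

```python
RECENTER, PICKUP, FLIPUP, REORIENT = range(4)

def task_graph(state):
    """Example hand-specified task graph G(s) for the in-hand manipulation family."""
    if not state["object_lifted"]:
        # Object is on the table: recenter it if it is out of reach, otherwise pick it up.
        return RECENTER if state["object_off_center"] else PICKUP
    # Object is in hand: flip it into the palm first, then practice in-hand reorientation.
    return FLIPUP if not state["object_in_palm"] else REORIENT

def train_reset_free(env, policies, critics, buffers, reward_fns,
                     T=200, total_steps=1_000_000):
    s = env.get_state()                          # no environment resets after this point
    for _ in range(0, total_steps, T):
        i = task_graph(s)                        # query the task graph once every T steps
        for _ in range(T):
            a = policies[i].sample(s)
            s_next = env.step(a)                 # continuous, uninterrupted data collection
            buffers[i].add(s, a, reward_fns[i](s, a), s_next)  # task-specific reward R_i
            sac_update(policies[i], critics[i], buffers[i])    # off-policy (e.g., SAC) update
            s = s_next
```

Each task keeps its own policy, critic, and replay buffer; only the task selected by the graph collects data and trains during a given T-step window, which mirrors the description above.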
Fig. 2: Depiction of some steps of reset-free training for the in-hand manipulation task family on hardware. Reset-free training uses the task graph to choose which policy to execute and train at every step. For executions that are not successful (e.g., the pickup in step 1), other tasks (recenter in step 2) serve to provide a reset so that pickup can be attempted again. Once pickup is successful, the next task (flip up) can be attempted. If the flip-up policy is successful, then the in-hand reorientation task can be attempted, and if this drops the object then the re-centering task is activated to continue training.

B. Practical Instantiation

To instantiate the algorithmic idea described above as a deep reinforcement learning framework that is capable of solving dexterous manipulation tasks without resets, we can build on the framework of actor-critic algorithms. We learn separate policies \pi_i for each of the K provided tasks, with separate critics Q_i and replay buffers B_i for each of the tasks. Each of the policies \pi_i is a deep neural network Gaussian policy with parameters \theta_i, which is trained using a standard actor-critic algorithm, such as soft actor-critic [63], using data sampled from its own replay buffer B_i. The task graph G is represented as a user-provided state machine, as shown in Fig 6, and is queried every T steps to determine which task policy \pi_i to execute and update next. Training proceeds by starting execution from a particular state s_0 in the environment, querying the task-graph G to determine which policy i = G(s_0) to execute, and then collecting T time steps of data using the policy \pi_i, transitioning the environment to a new state s_T (Fig 6). The task-graph is then queried again and the process is repeated until all the tasks are learned.

Algorithm 1 MTRF
1: Given: K tasks with rewards R_i(s_t, a_t), along with a task graph mapping states to a task index G(s) : S → {0, 1, . . . , K − 1}
2: Let \hat{i} represent the task index associated with the forward task that is being learned.
3: Initialize \pi_i, Q_i, B_i for all i ∈ {0, 1, . . . , K − 1}
4: Initialize the environment in task \hat{i} with initial state s_{\hat{i}} ∼ \mu_{\hat{i}}(s_{\hat{i}})
5: for iteration n = 1, 2, . . . do
6:     Obtain the current task i to execute by querying the task graph at the current environment state, i = G(s_curr)
7:     for iteration j = 1, 2, . . . , T do
8:         Execute \pi_i in the environment, receiving task-specific rewards R_i and storing data in the buffer B_i
9:         Train the current task's policy and value functions \pi_i, Q_i by sampling a batch from the replay buffer containing this task's experience, B_i, according to SAC [63]
10:    end for
11: end for

V. TASK AND SYSTEM SETUP

To study MTRF in the context of challenging robotic tasks, such as dexterous manipulation, we designed an anthropomorphic manipulation platform in both simulation and hardware. Our system (Fig 5) consists of a 22-DoF anthropomorphic hand-arm system. We use a self-designed and manufactured four-fingered, 16-DoF robot hand called the D'Hand, mounted on a 6-DoF Sawyer robotic arm to allow it to operate in an extended workspace in a table-top setting. We built this hardware to be particularly amenable to our problem setting due to its robustness and ease of long-term operation. The D'Hand can operate for upwards of 100 hours in contact-rich tasks without any breakages, whereas previous hand-based systems are much more fragile. Given the modular nature of the hand, even if a particular joint malfunctions, it is quick to repair and continue training. In our experimental evaluation, we use two different sets of dexterous manipulation tasks in simulation and two different sets of tasks in the real world. Details can be found in Appendix A and at https://sites.google.com/view/mtrf

A. Simulation Domains

Fig. 3: Tasks and transitions for lightbulb insertion in simulation. The goal is to recenter a lightbulb, lift it, flip it over, and then insert it into a lamp.

Lightbulb insertion tasks. The first family of tasks involves inserting a lightbulb into a lamp in simulation with the dexterous hand-arm system. The tasks consist of centering the object on the table, pickup, in-hand reorientation, and insertion into the lamp. The multi-task transition task graph is shown in Fig 3. These tasks all involve coordinated finger
and arm motion and require precise movement to insert the from dropping the object during the in-hand reorientation, lightbulb. which ordinarily would require some sort of reset mechanism Basketball tasks. The second family of tasks involves dunk- (as seen in prior work [7]). However, MTRF enables the robot ing a basketball into a hoop. This consists of repositioning to utilize such “failures” as an opportunity to practice the the ball, picking it up, positioning the hand over the basket, tabletop re-centering, pickup, and flip-up tasks, which serve and dunking the ball. This task has a natural cyclic nature, to “reset” the object into a pose where the reorientation can and allows tasks to reset each other as shown in Fig 4, while be attempted again.1 The configuration of the 22-DoF hand- requiring fine-grained behavior to manipulate the ball midair. arm system mirrors that in simulation. The object is tracked using a motion capture system. Our policy directly controls each joint of the hand and the position of the end-effector. The system is set up to allow for extended uninterrupted operation, allowing for over 60 hours of training without any human intervention. We show how our proposed technique allows the robot to learn this task in the following section. Fig. 4: Tasks and transitions for basketball domain in simulation. The goal here is to reposition a basketball object, pick it up and then dunk it in a hoop. B. Hardware Tasks Fig. 6: Tasks and transitions for the in-hand manipulation domain on hardware. The goal here is to rotate a 3 pronged valve object to a particular orientation in the palm of the hand, picking it up if it falls down to continue practicing. Fig. 5: Real-world hand-arm manipulation platform. The system Pipe insertion: For the second task on hardware, we set up comprises a 16 DoF hand mounted on a 6 DoF Sawyer arm. The goal of the task is to perform in-hand reorientation, as illustrated in a pipe insertion task, where the goal is to pick up a cylindrical Fig 10 or pipe insertion as shown in Fig 11 pipe object and insert it into a hose attachment on the wall, as shown in Fig 7. This task not only requires mastering the We also evaluate MTRF on the real-world hand-arm robotic contacts required for a successful pickup, but also accurate and system, training a set of tasks in the real world, without fine-grained arm motion to insert the pipe into the attachment any simulation or special instrumentation. We considered 2 in the wall. The task graph corresponding to this domain is different task families - in hand manipulation of a 3 pronged shown in Fig 7. In this domain, the agent learns to pickup the valve object, as well as pipe insertion of a cylindrical pipe object and then insert it into the attachment in the wall. If the into a hose attachment mounted on the wall. We describe object is dropped, it is then re-centered and picked up again each of these setups in detail below: to allow for another attempt at insertion. 2 As in the previous In-Hand Manipulation: For the first task on hardware, we domain, our policy directly controls each joint of the hand use a variant of the in-hand reorienting task, where the goal is and the position of the end-effector. The system is set up to to pick up an object and reorient it in the palm into a desired allow for extended uninterrupted operation, allowing for over configuration, as shown in Fig 6. 
This task not only requires mastering the contacts required for a successful pickup, but 1 For the pickup task, the position of the arm’s end-effector is scripted also fine-grained finger movements to reorient the object in and only D’Hand controls are learned to reduce training time. 2 For the pickup task, the position of the arm’s end-effector is scripted and the palm, while at the same time balancing it so as to avoid only D’Hand controls are learned to reduce training time. For the insertion dropping. The task graph corresponding to this domain is task, the fingers are frozen since it is largely involving accurate motion of shown in Fig 6. A frequent challenge in this domain stems the arm.
30 hours of training without any human intervention. we roll out its final policy starting from states randomly sampled from the distribution induced by all the tasks that can transition to the task under evaluation, and report performance in terms of their success in solving the task. Additional details, videos of all tasks and hyperparameters can be found at [https://sites.google.com/view/mtrf]. B. Reset-Free Learning Comparisons in Simulation We present results for reset-free learning, using our algorithm and prior methods, in Fig 8, corresponding to each of the tasks in simulation in Section V. We see that MTRF is able to successfully learn all of the tasks jointly, as evidenced by Fig 8. We measure evaluation performance after training by loading saved policies and running the policy corresponding to the “forward” task for each of the task families (i.e. lightbulb insertion and basketball dunking). This indicates that we can solve all the tasks, and as a result can learn reset free more effectively than prior algorithms. In comparison, we see that the prior algorithms for off-policy RL – the reset controller [14] and perturbation Fig. 7: Tasks and transitions for pipe insertion domain on hardware. controller [8] – are not able to learn the more complex of The goal here is to reposition a cylindrical pipe object, pick it up the tasks as effectively as our method. While these methods and then insert it into a hose attachment on the wall. are able to make some amount of progress on tasks that are VI. E XPERIMENTAL E VALUATION constructed to be very easy such as the pincer task family shown in Appendix B, we see that they struggle to scale well We focus our experiment on the following questions: to the more challenging tasks (Fig 8). Only MTRF is able 1) Are existing off-policy RL algorithms effective when to learn these tasks, while the other methods never actually deployed under reset-free settings to solve dexterous reach the later tasks in the task graph. manipulation problems? To understand how the tasks are being sequenced during the 2) Does simultaneously learning a collection of tasks learning process, we show task transitions experienced during under the proposed multi-task formulation with MTRF training for the basketball task in Fig 9. We observe that early alleviate the need for resets when solving dexterous in training the transitions are mostly between the recenter manipulation tasks? and perturbation tasks. As MTRF improves, the transitions 3) Does learning multiple tasks simultaneously allow for add in pickup and then basketball dunking, cycling between reset-free learning of more complex tasks than previous re-centering, pickup and basketball placement in the hoop. reset free algorithms? 4) Does MTRF enable real-world reinforcement learning C. Learning Real-World Dexterous Manipulation Skills without resets or human interventions? Next, we evaluate the performance of MTRF on the real- world robotic system described in Section V, studying the A. Baselines and Prior Methods dexterous manipulation tasks described in Section V-B. We compare MTRF (Section IV) to three prior baseline a) In-Hand Manipulation: Let us start by considering algorithms. Our first comparison is to a state-of-the-art the in-hand manipulation tasks shown in Fig 6. This task off-policy RL algorithm, soft actor-critic [63] (labeled as is challenging because it requires delicate handling of SAC). 
The actor is executed continuously and reset-free in finger-object contacts during training, the object is easy the environment, and the experienced data is stored in a to drop during the flip-up and in-hand manipulation, and replay pool. This algorithm is representative of efficient the reorientation requires a coordinated finger gait to off-policy RL algorithms. We next compare to a version rotate the object. In fact, most prior work that aims to of a reset controller [14] (labeled as Reset Controller), learn similar in-hand manipulation behaviors either utilizes which trains a forward controller to perform the task and simulation or employs a hand-designed reset procedure, a reset controller to reset the state back to the initial potentially involving human interventions [7], [12], [10]. state. Lastly, we compare with the perturbation controller To the best of our knowledge, our work is the first to technique [8] introduced in prior work, which alternates show a real-world robotic system learning such a task between learning and executing a forward task-directed policy entirely in the real world and without any manually and a perturbation controller trained purely with novelty provided or hand-designed reset mechanism. We visualize a bonuses [64] (labeled as Perturbation Controller). For all sequential execution of the tasks (after training) in Fig 10, the experiments we used the same RL algorithm, soft actor- and encourage the reader to watch a video of this task, critic [63], with default hyperparameters. To evaluate a task, as well as the training process, on the project website:
Fig. 8: Comparison of MTRF with baseline methods in simulation when run without resets. In comparison to the prior reset-free RL methods, MTRF is able to learn the tasks more quickly and with higher average success rates, even in cases where none of the prior methods can master the full task set. MTRF is able to solve all of the tasks without requiring any explicit resets. reorientation. For lifting and flipping over, success is defined as lifting the object to a particular height above the table, and for reorient success is defined by the difference between the current object orientation and the target orientation of the object. As shown in Fig 12, MTRF is able to autonomously learn all tasks in the task graph in 60 hours, and achieves an 70% success rate for the in-hand reorient task. This experiment illustrates how MTRF can enable a complex real-world robotic system to learn an in-hand manipulation behavior while at the same time autonomously retrying the task during a lengthy unattended training run, without any simulation, special instrumentation, or manual interventions. This experiment suggests that, when MTRF is provided with an appropriate set of tasks, learning of complex manipulation skills can be carried out entirely autonomously in the real world, even for highly complex robotic manipulators such as Fig. 9: Visualization of task frequency in the basketball task Family. While initially recentering and pickup are common, as these get multi-fingered hands. better they are able to provide resets for other tasks. b) Pipe Insertion: We also considered the second task variant which involves manipulating a cylindrical pipe to insert it into a hose attachment on the wall as [https://sites.google.com/view/mtrf]. Over shown in Fig 7. This task is challenging because it requires the course of training, the robot must first learn to recenter accurate grasping and repositioning of the object during the object on the table, then learn to pick it up (which training in order to accurately insert it into the hose requires learning an appropriate grasp and delicate control of attachment, requiring coordination of both the arm and the fingers to maintain grip), then learn to flip up the object the hand. Training this task without resets requires a so that it rests in the palm, and finally learn to perform the combination of repositioning, lifting, insertion and removal orientation. Dropping the object at any point in this process to continually keep training and improving. We visualize a requires going back to the beginning of the sequence, and sequential execution of the tasks (after training) in Fig 11, initially most of the training time is spent on re-centering, and encourage the reader to watch a video of this task, which provides resets for the pickup. The entire training as well as the training process, on the project website: process takes about 60 hours of real time, learning all of [https://sites.google.com/view/mtrf]. Over the tasks simultaneously. Although this time requirement the course of training, the robot must first learn to recenter is considerable, it is entirely autonomous, making this the object on the table, then learn to pick it up (which approach scalable even without any simulation or manual requires learning an appropriate grasp), then learn to actually instrumentation. The user only needs to position the objects move the arm accurately to insert the pipe to the attachment, for training, and switch on the robot. 
and finally learn to remove it to continue training and For a quantitative evaluation, we plot the success rate of sub- practicing. Initially most of the training time is spent on tasks including re-centering, lifting, flipping over, and in-hand re-centering, which provides resets for the pickup, which
Fig. 10: Film strip illustrating partial training trajectory of hardware system for in-hand manipulation of the valve object. This shows various behaviors encountered during the training - picking up the object, flipping it over in the hand and then in-hand manipulation to get it to a particular orientation. As seen here, MTRF is able to successfully learn how to perform in-hand manipulation without any human intervention. Fig. 11: Film strip illustrating partial training trajectory of hardware system for pipe insertion. This shows various behaviors encountered during the training - repositioning the object, picking it up, and then inserting it into the wall attachment. As seen here, MTRF is able to successfully learn how to do pipe insertion without any human intervention. then provides resets for the insertion and removal. The entire Karol Hausman, Corey Lynch for helpful discussions and training process takes about 25 hours of real time, learning comments.This research was supported by the Office of Naval all of the tasks simultaneously. Research, the National Science Foundation, and Berkeley DeepDrive. R EFERENCES [1] A. Rajeswaran, V. Kumar, A. Gupta, J. Schulman, E. Todorov, and S. Levine, “Learning complex dexterous manipulation with deep reinforcement learning and demonstrations,” CoRR, vol. abs/1709.10087, 2017. [Online]. Available: http://arxiv.org/abs/1709. 10087 [2] X. B. Peng, P. Abbeel, S. Levine, and M. van de Panne, “Deepmimic: Example-guided deep reinforcement learning of physics-based character skills,” ACM Trans. Graph., vol. 37, no. 4, pp. 143:1–143:14, Jul. 2018. [Online]. Available: http://doi.acm.org/10.1145/3197517.3201311 [3] S. Levine, N. Wagener, and P. Abbeel, “Learning contact-rich manipu- Fig. 12: Success rate of various tasks on dexterous manipulation lation skills with guided policy search,” in Robotics and Automation task families on hardware. Left: In-hand manipulation We can (ICRA), 2015 IEEE International Conference on. IEEE, 2015. see that all of the tasks are able to successfully learn with more [4] L. Pinto and A. Gupta, “Supersizing self-supervision: Learning to than 70% success rate. Right: Pipe insertion We can see that the grasp from 50k tries and 700 robot hours,” in 2016 IEEE international final pipe insertion task is able to successfully learn with more than conference on robotics and automation (ICRA). IEEE, 2016, pp. 60% success rate. 3406–3413. [5] O. M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, VII. D ISCUSSION J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray et al., “Learning dexterous in-hand manipulation,” The International Journal of Robotics In this work, we introduced a technique for learning Research, vol. 39, no. 1, pp. 3–20, 2020. dexterous manipulation behaviors reset-free, without the need [6] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke et al., “Qt- for any human intervention during training. This was done by opt: Scalable deep reinforcement learning for vision-based robotic leveraging the multi-task RL setting for reset-free learning. manipulation,” arXiv preprint arXiv:1806.10293, 2018. When learning multiple tasks simultaneously, different tasks [7] A. Nagabandi, K. Konolige, S. Levine, and V. Kumar, “Deep dynamics models for learning dexterous manipulation,” in Conference on Robot serve to reset each other and assist in uninterrupted learning. Learning, 2020, pp. 1101–1112. 
This algorithm allows a dexterous manipulation system to [8] H. Zhu, J. Yu, A. Gupta, D. Shah, K. Hartikainen, A. Singh, V. Kumar, learn manipulation behaviors uninterrupted, and also learn and S. Levine, “The ingredients of real-world robotic reinforcement learning,” arXiv preprint arXiv:2004.12570, 2020. behavior that allows it to continue practicing. Our experiments [9] K. Ploeger, M. Lutter, and J. Peters, “High acceleration reinforcement show that this approach can enable a real-world hand-arm learning for real-world juggling with binary rewards,” 2020. robotic system to learn an in-hand reorientation task, including [10] V. Kumar, E. Todorov, and S. Levine, “Optimal control with learned pickup and repositioning, in about 60 hours of training, as local models: Application to dexterous manipulation,” in 2016 IEEE International Conference on Robotics and Automation, ICRA 2016, well as a pipe insertion task in around 25 hours of training Stockholm, Sweden, May 16-21, 2016, D. Kragic, A. Bicchi, and without human intervention or special instrumentation. A. D. Luca, Eds. IEEE, 2016, pp. 378–383. [Online]. Available: https://doi.org/10.1109/ICRA.2016.7487156 VIII. ACKNOWLEDGEMENTS [11] A. Ghadirzadeh, A. Maki, D. Kragic, and M. Björkman, “Deep predictive policy training using reinforcement learning,” CoRR, vol. The authors would like to thank Anusha Nagabandi, Greg abs/1703.00727, 2017. [Online]. Available: http://arxiv.org/abs/1703. Kahn, Archit Sharma, Aviral Kumar, Benjamin Eysenbach, 00727
[12] H. Zhu, A. Gupta, A. Rajeswaran, S. Levine, and V. Kumar, “Dexterous [28] S. Ha, P. Xu, Z. Tan, S. Levine, and J. Tan, “Learning to walk in the manipulation with deep reinforcement learning: Efficient, general, real world with minimal human effort,” 2020. and low-cost,” in 2019 International Conference on Robotics and [29] X. B. Peng, E. Coumans, T. Zhang, T.-W. Lee, J. Tan, and S. Levine, Automation (ICRA). IEEE, 2019, pp. 3651–3657. “Learning agile robotic locomotion skills by imitating animals,” 2020. [13] Y. Chebotar, M. Kalakrishnan, A. Yahya, A. Li, S. Schaal, and S. Levine, [30] R. Calandra, A. Seyfarth, J. Peters, and M. P. Deisenroth, “Bayesian “Path integral guided policy search,” CoRR, vol. abs/1610.00529, 2016. optimization for learning gaits under uncertainty - an experimental [Online]. Available: http://arxiv.org/abs/1610.00529 comparison on a dynamic bipedal walker,” Ann. Math. Artif. [14] B. Eysenbach, S. Gu, J. Ibarz, and S. Levine, “Leave no trace: Learning Intell., vol. 76, no. 1-2, pp. 5–23, 2016. [Online]. Available: to reset for safe and autonomous reinforcement learning,” arXiv preprint https://doi.org/10.1007/s10472-015-9463-9 arXiv:1711.06782, 2017. [31] S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen, “Learning [15] M. Ahn, H. Zhu, K. Hartikainen, H. Ponte, A. Gupta, S. Levine, hand-eye coordination for robotic grasping with deep learning and V. Kumar, “Robel: Robotics benchmarks for learning with low- and large-scale data collection,” CoRR, vol. abs/1603.02199, 2016. cost robots,” in Conference on Robot Learning. PMLR, 2020, pp. [Online]. Available: http://arxiv.org/abs/1603.02199 1300–1313. [32] T. Baier-Löwenstein and J. Zhang, “Learning to grasp everyday [16] P. Agrawal, A. Nair, P. Abbeel, J. Malik, and S. Levine, objects using reinforcement-learning with automatic value cut-off,” in “Learning to poke by poking: Experiential learning of intuitive 2007 IEEE/RSJ International Conference on Intelligent Robots and physics,” CoRR, vol. abs/1606.07419, 2016. [Online]. Available: Systems, October 29 - November 2, 2007, Sheraton Hotel and Marina, http://arxiv.org/abs/1606.07419 San Diego, California, USA. IEEE, 2007, pp. 1551–1556. [Online]. [17] S. Sukhbaatar, Z. Lin, I. Kostrikov, G. Synnaeve, A. Szlam, and Available: https://doi.org/10.1109/IROS.2007.4399053 R. Fergus, “Intrinsic motivation and automatic curricula via asymmetric [33] B. Wu, I. Akinola, J. Varley, and P. Allen, “Mat: Multi-fingered adaptive self-play,” arXiv preprint arXiv:1703.05407, 2017. tactile grasping via deep reinforcement learning,” 2019. [18] C. Richter and N. Roy, “Safe visual navigation via deep learning [34] B. Nemec, L. Zlajpah, and A. Ude, “Door opening by joining and novelty detection,” in Robotics: Science and Systems XIII, reinforcement learning and intelligent control,” in 18th International Massachusetts Institute of Technology, Cambridge, Massachusetts, Conference on Advanced Robotics, ICAR 2017, Hong Kong, China, USA, July 12-16, 2017, N. M. Amato, S. S. Srinivasa, N. Ayanian, July 10-12, 2017. IEEE, 2017, pp. 222–228. [Online]. Available: and S. Kuindersma, Eds., 2017. [Online]. Available: http://www. https://doi.org/10.1109/ICAR.2017.8023522 roboticsproceedings.org/rss13/p64.html [35] Y. Urakami, A. Hodgkinson, C. Carlin, R. Leu, L. Rigazio, and [19] Y. W. Teh, V. Bapst, W. M. Czarnecki, J. Quan, J. Kirkpatrick, P. Abbeel, “Doorgym: A scalable door opening environment and R. Hadsell, N. Heess, and R. Pascanu, “Distral: Robust multitask baseline agent,” CoRR, vol. 
abs/1908.01887, 2019. [Online]. Available: reinforcement learning,” in Advances in Neural Information http://arxiv.org/abs/1908.01887 Processing Systems 30: Annual Conference on Neural Information [36] S. Gu, E. Holly, T. P. Lillicrap, and S. Levine, “Deep reinforcement Processing Systems 2017, 4-9 December 2017, Long Beach, CA, learning for robotic manipulation,” CoRR, vol. abs/1610.00633, 2016. USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, [Online]. Available: http://arxiv.org/abs/1610.00633 R. Fergus, S. V. N. Vishwanathan, and R. Garnett, Eds., 2017, [37] A. Nair, V. Pong, M. Dalal, S. Bahl, S. Lin, and S. Levine, pp. 4496–4506. [Online]. Available: http://papers.nips.cc/paper/ “Visual reinforcement learning with imagined goals,” in Advances 7036-distral-robust-multitask-reinforcement-learning in Neural Information Processing Systems 31: Annual Conference [20] A. A. Rusu, S. G. Colmenarejo, Ç. Gülçehre, G. Desjardins, on Neural Information Processing Systems 2018, NeurIPS 2018, J. Kirkpatrick, R. Pascanu, V. Mnih, K. Kavukcuoglu, and R. Hadsell, 3-8 December 2018, Montréal, Canada, S. Bengio, H. M. Wallach, “Policy distillation,” in 4th International Conference on Learning H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, 2018, pp. 9209–9220. [Online]. Available: http://papers.nips.cc/paper/ Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2016. 8132-visual-reinforcement-learning-with-imagined-goals [Online]. Available: http://arxiv.org/abs/1511.06295 [38] H. van Hoof, T. Hermans, G. Neumann, and J. Peters, “Learning [21] E. Parisotto, L. J. Ba, and R. Salakhutdinov, “Actor-mimic: Deep robot in-hand manipulation with tactile features,” in Humanoid Robots multitask and transfer reinforcement learning,” in 4th International (Humanoids). IEEE, 2015. Conference on Learning Representations, ICLR 2016, San Juan, [39] A. M. Okamura, N. Smaby, and M. R. Cutkosky, “An overview of dex- Puerto Rico, May 2-4, 2016, Conference Track Proceedings, terous manipulation,” in Robotics and Automation, 2000. Proceedings. Y. Bengio and Y. LeCun, Eds., 2016. [Online]. Available: ICRA’00. IEEE International Conference on, vol. 1. IEEE, 2000, pp. http://arxiv.org/abs/1511.06342 255–262. [22] T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn, [40] N. Furukawa, A. Namiki, S. Taku, and M. Ishikawa, “Dynamic “Gradient surgery for multi-task learning,” CoRR, vol. abs/2001.06782, regrasping using a high-speed multifingered hand and a high-speed 2020. [Online]. Available: https://arxiv.org/abs/2001.06782 vision system,” in Proceedings 2006 IEEE International Conference [23] O. Sener and V. Koltun, “Multi-task learning as multi-objective on Robotics and Automation, 2006. ICRA 2006. IEEE, 2006, pp. optimization,” CoRR, vol. abs/1810.04650, 2018. [Online]. Available: 181–187. http://arxiv.org/abs/1810.04650 [41] Y. Bai and C. K. Liu, “Dexterous manipulation using both palm and [24] T. Yu, D. Quillen, Z. He, R. Julian, K. Hausman, C. Finn, fingers,” in ICRA 2014. IEEE, 2014. and S. Levine, “Meta-world: A benchmark and evaluation for [42] I. Mordatch, Z. Popović, and E. Todorov, “Contact-invariant op- multi-task and meta reinforcement learning,” in 3rd Annual Conference timization for hand manipulation,” in Proceedings of the ACM on Robot Learning, CoRL 2019, Osaka, Japan, October 30 - SIGGRAPH/Eurographics symposium on computer animation. Euro- November 1, 2019, Proceedings, ser. 
Proceedings of Machine graphics Association, 2012. Learning Research, L. P. Kaelbling, D. Kragic, and K. Sugiura, [43] V. Kumar, Y. Tassa, T. Erez, and E. Todorov, “Real-time behaviour Eds., vol. 100. PMLR, 2019, pp. 1094–1100. [Online]. Available: synthesis for dynamic hand-manipulation,” in 2014 IEEE International http://proceedings.mlr.press/v100/yu20a.html Conference on Robotics and Automation, ICRA 2014, Hong Kong, [25] S. James, Z. Ma, D. R. Arrojo, and A. J. Davison, “Rlbench: The China, May 31 - June 7, 2014, 2014, pp. 6808–6815. [Online]. robot learning benchmark and learning environment,” 2019. Available: https://doi.org/10.1109/ICRA.2014.6907864 [26] N. Heess, D. TB, S. Sriram, J. Lemmon, J. Merel, G. Wayne, [44] K. Yamane, J. J. Kuffner, and J. K. Hodgins, “Synthesizing animations Y. Tassa, T. Erez, Z. Wang, S. M. A. Eslami, M. A. Riedmiller, of human manipulation tasks,” in ACM SIGGRAPH 2004 Papers. CRC and D. Silver, “Emergence of locomotion behaviours in rich press, 2004, pp. 532–539. environments,” CoRR, vol. abs/1707.02286, 2017. [Online]. Available: [45] P. Mandikal and K. Grauman, “Dexterous robotic grasping with object- http://arxiv.org/abs/1707.02286 centric visual affordances,” 2020. [27] OpenAI, M. Andrychowicz, B. Baker, M. Chociej, R. Józefowicz, [46] V. Kumar, A. Gupta, E. Todorov, and S. Levine, “Learning dexterous B. McGrew, J. W. Pachocki, J. Pachocki, A. Petron, M. Plappert, manipulation policies from experience and imitation,” arXiv preprint G. Powell, A. Ray, J. Schneider, S. Sidor, J. Tobin, P. Welinder, arXiv:1611.05095, 2016. L. Weng, and W. Zaremba, “Learning dexterous in-hand manipulation,” [47] OpenAI, “Learning dexterous in-hand manipulation,” CoRR, vol. CoRR, vol. abs/1808.00177, 2018. [Online]. Available: http: abs/1808.00177, 2018. [Online]. Available: http://arxiv.org/abs/1808. //arxiv.org/abs/1808.00177 00177
[48] H. van Hoof, T. Hermans, G. Neumann, and J. Peters, “Learning A PPENDIX A. R EWARD F UNCTIONS AND A DDITIONAL robot in-hand manipulation with tactile features,” in 15th IEEE-RAS TASK D ETAILS International Conference on Humanoid Robots, Humanoids 2015, Seoul, South Korea, November 3-5, 2015. IEEE, 2015, pp. 121–127. [Online]. A. In-Hand Manipulation Tasks on Hardware Available: https://doi.org/10.1109/HUMANOIDS.2015.7363524 [49] A. Gupta, C. Eppner, S. Levine, and P. Abbeel, “Learning The rewards for this family of tasks are defined as follows. dexterous manipulation for a soft robotic hand from human θx represent the Sawyer end-effector’s wrist euler angle, demonstrations,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2016, Daejeon, South Korea, x, y, z represent the object’s 3D position, and θ̂z represents October 9-14, 2016, 2016, pp. 3786–3793. [Online]. Available: the object’s z euler angle. xgoal and others represent the https://doi.org/10.1109/IROS.2016.7759557 task’s goal position or angle. The threshold for determining [50] C. Choi, W. Schwarting, J. DelPreto, and D. Rus, “Learning object grasping for soft robot hands,” IEEE Robotics and Automation Letters, whether the object has been lifted is subjected to the real vol. 3, no. 3, pp. 2370–2377, 2018. world arena size. We set it to 0.1m in our experiment. [51] V. Kumar, A. Gupta, E. Todorov, and S. Levine, “Learning dexterous manipulation policies from experience and imitation,” arXiv preprint arXiv:1611.05095, 2016. x xhand [52] W. Montgomery, A. Ajay, C. Finn, P. Abbeel, and S. Levine, x x Rrecenter = −3|| − goal || − || y − yhand || “Reset-free guided policy search: Efficient deep reinforcement learning y ygoal with stochastic initial states,” CoRR, vol. abs/1610.01112, 2016. z zhand [Online]. Available: http://arxiv.org/abs/1610.01112 [53] W. Han, S. Levine, and P. Abbeel, “Learning compound multi-step controllers under unknown dynamics,” in 2015 IEEE/RSJ International Rlif t = −|z − zgoal | Conference on Intelligent Robots and Systems (IROS). IEEE, 2015, pp. 6435–6442. [54] L. Smith, N. Dhawan, M. Zhang, P. Abbeel, and S. Levine, “Avid: Learning multi-stage tasks via pixel-level translation of human videos,” n o arXiv preprint arXiv:1912.04443, 2019. Rf lipup = − 5|θx − θgoal x | − 50(1 z < threshold )+ [55] S. Ruder, “An overview of multi-task learning in deep neural n o networks,” CoRR, vol. abs/1706.05098, 2017. [Online]. Available: 10(1 |θx − θgoal x | < 0.15 AND z > threshold ) http://arxiv.org/abs/1706.05098 [56] R. Yang, H. Xu, Y. Wu, and X. Wang, “Multi-task reinforcement learning with soft modularization,” CoRR, vol. abs/2003.13661, 2020. [Online]. Available: https://arxiv.org/abs/2003.13661 [57] M. A. Riedmiller, R. Hafner, T. Lampe, M. Neunert, J. Degrave, Rreorient = −|θ̂z − θ̂goal z | T. V. de Wiele, V. Mnih, N. Heess, and J. T. Springenberg, “Learning by playing - solving sparse reward tasks from scratch,” A rollout is considered a success (as reported in the figures) CoRR, vol. abs/1802.10567, 2018. [Online]. Available: http: if it reaches a state where the valve is in-hand and flipped //arxiv.org/abs/1802.10567 [58] M. Wulfmeier, A. Abdolmaleki, R. Hafner, J. T. Springenberg, facing up: M. Neunert, T. Hertweck, T. Lampe, N. Y. Siegel, N. Heess, and M. A. Riedmiller, “Regularized hierarchical policies for compositional z > threshold AND |θx − θgoal x | < 0.15 transfer in robotics,” CoRR, vol. abs/1906.11228, 2019. [Online]. Available: http://arxiv.org/abs/1906.11228 B. 
Pipe Insertion Tasks on Hardware [59] M. P. Deisenroth, P. Englert, J. Peters, and D. Fox, “Multi-task policy search for robotics,” in 2014 IEEE International Conference on The rewards for this family of tasks are defined as follows. Robotics and Automation, ICRA 2014, Hong Kong, China, May 31 x, y, z represent the object’s 3D position, and q represent the - June 7, 2014. IEEE, 2014, pp. 3876–3881. [Online]. Available: https://doi.org/10.1109/ICRA.2014.6907421 joint positions of the D’Hand. xgoal and others represent the [60] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction, task’s goal position or angle. The threshold for determining 2018. whether the object has been lifted is subjected to the real [61] V. R. Konda and J. N. Tsitsiklis, “Actor-critic algorithms,” in Advances in Neural Information Processing Systems 12, [NIPS world arena size. We set it to 0.1m in our experiment. To Conference, Denver, Colorado, USA, November 29 - December reduce collision in real world experiments, we have two tasks 4, 1999], S. A. Solla, T. K. Leen, and K. Müller, Eds. for insertion. One approaches the peg and the other attempts The MIT Press, 1999, pp. 1008–1014. [Online]. Available: http://papers.nips.cc/paper/1786-actor-critic-algorithms insertion. [62] L. Lee, B. Eysenbach, E. Parisotto, E. P. Xing, S. Levine, and R. Salakhutdinov, “Efficient exploration via state marginal matching,” CoRR, vol. abs/1906.05274, 2019. [Online]. Available: x xhand x xgoal http://arxiv.org/abs/1906.05274 Rrecenter = −3|| − || − || y − yhand || [63] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, y ygoal z zhand V. Kumar, H. Zhu, A. Gupta, P. Abbeel et al., “Soft actor-critic algorithms and applications,” arXiv preprint arXiv:1812.05905, 2018. [64] Y. Burda, H. Edwards, A. J. Storkey, and O. Klimov, “Exploration by random network distillation,” in 7th International Conference Rlif t = −2|z − zgoal | − 2|q − qgoal | on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. [Online]. Available: https://openreview.net/forum?id=H1lJJnR5Ym n o RInsert1 = −d1 + 10(1 d1 < 0.1 ) n o RInsert2 = −d2 + 10(1 d2 < 0.1 )
where

d_1 = \left\| (x, y, z) - (x_{goal1}, y_{goal1}, z_{goal1}) \right\|, \quad d_2 = \left\| (x, y, z) - (x_{goal2}, y_{goal2}, z_{goal2}) \right\|

R_{remove} = -\left\| (x, y, z) - (x_{arena\_center}, y_{arena\_center}, z_{arena\_center}) \right\|

A rollout is considered a success (as reported in the figures) if it reaches a state where the pipe is in-hand and inserted:

z > \text{threshold} \ \text{AND} \ d_2 < 0.05

C. Lightbulb Insertion Tasks in Simulation

The rewards for this family of tasks include R_{recenter}, R_{pickup}, R_{flipup} defined in the previous section, as well as the following bulb insertion reward:

R_{bulb} = -\left\| (x, y) - (x_{goal}, y_{goal}) \right\| - 2|z - z_{goal}| + \mathbb{1}\{\left\| (x, y) - (x_{goal}, y_{goal}) \right\| < 0.1\} + 10 \cdot \mathbb{1}\{\left\| (x, y) - (x_{goal}, y_{goal}) \right\| < 0.1 \ \text{AND} \ |z - z_{goal}| < 0.1\} - \mathbb{1}\{z < \text{threshold}\}

A rollout is considered a success (as reported in the figures) if it reaches a state where the bulb is positioned very close to the goal position in the lamp:

\left\| (x, y) - (x_{goal}, y_{goal}) \right\| < 0.1 \ \text{AND} \ |z - z_{goal}| < 0.1

D. Basketball Tasks in Simulation

The rewards for this family of tasks include R_{recenter}, R_{pickup} defined in the previous section, as well as the following basket dunking reward:

R_{basket} = -\left\| (x, y, z) - (x_{goal}, y_{goal}, z_{goal}) \right\| + 20 \cdot \mathbb{1}\{\left\| (x, y, z) - (x_{goal}, y_{goal}, z_{goal}) \right\| < 0.2\} + 50 \cdot \mathbb{1}\{\left\| (x, y, z) - (x_{goal}, y_{goal}, z_{goal}) \right\| < 0.1\} - \mathbb{1}\{z < \text{threshold}\}

A rollout is considered a success (as reported in the figures) if it reaches a state where the ball is positioned very close to the goal position above the basket:

\left\| (x, y) - (x_{goal}, y_{goal}) \right\| < 0.1 \ \text{AND} \ |z - z_{goal}| < 0.15

APPENDIX B. ADDITIONAL DOMAINS

In addition to the test domains described in Section V-A, we also tested our method in simulation on simpler tasks, such as a 2D "pincer" and a simplified lifting task on the Sawyer and D'Hand setup. The pincer task is described in Figure 13, which also shows the performance of our method as well as baseline comparisons.

Fig. 13: Pincer domain - object grasping, filling the drawer with the object, pulling open the drawer. These tasks naturally form a cycle - once an object is picked up, it can be filled in the drawer, following which the drawer can be pulled open and grasping and filling can be practiced again. (Plots: "Pincer Fill Task Success Rate Comparisons," comparing Ours, Reset Controller, Perturbation, and SAC, and "All Tasks Success Rate for Pincer Family," showing Fill, Pick, and Pull success; both report success rate against training iterations.)

APPENDIX C. HYPERPARAMETER DETAILS

TABLE I: Hyperparameters used across all domains (SAC).
Learning Rate: 3 × 10^-4
Discount Factor γ: 0.99
Policy Type: Gaussian
Policy Hidden Sizes: (512, 512)
RL Batch Size: 1024
Reward Scaling: 1
Replay Buffer Size: 500,000
Q Hidden Sizes: (512, 512)
Q Hidden Activation: ReLU
Q Weight Decay: 0
Q Learning Rate: 3 × 10^-4
Target Network τ: 5 × 10^-3
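As an illustration of how the reward specifications above translate into code, the sketch below implements the basket dunking reward and its success check. `obj_xyz`, `goal_xyz`, and `lift_threshold` are assumed inputs, and the constants mirror the equations in this appendix; this is a sketch for illustration, not the authors' released code.

```python
import numpy as np

def basketball_reward(obj_xyz, goal_xyz, lift_threshold=0.1):
    dist = np.linalg.norm(np.asarray(obj_xyz) - np.asarray(goal_xyz))
    reward = -dist
    reward += 20.0 * (dist < 0.2)                  # shaping bonus near the hoop
    reward += 50.0 * (dist < 0.1)                  # large bonus very close to the hoop
    reward -= 1.0 * (obj_xyz[2] < lift_threshold)  # penalty if the ball is not lifted
    return reward

def basketball_success(obj_xyz, goal_xyz):
    xy_dist = np.linalg.norm(np.asarray(obj_xyz[:2]) - np.asarray(goal_xyz[:2]))
    return xy_dist < 0.1 and abs(obj_xyz[2] - goal_xyz[2]) < 0.15
```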