Continual Learning from Demonstration of Robotic Skills
Sayantan Auddy¹*, Jakob Hollenstein¹, Matteo Saveriano¹,³, Antonio Rodríguez-Sánchez¹, Justus Piater¹,²
¹ Department of Computer Science, University of Innsbruck, Technikerstrasse 21a, 6020 Innsbruck, Austria. {name.surname}@uibk.ac.at
² Digital Science Center (DiSC), University of Innsbruck, Austria.
³ Department of Industrial Engineering, University of Trento, Italy.
* Corresponding author.
arXiv:2202.06843v2 [cs.RO] 15 Feb 2022
Abstract— Methods for teaching motion skills to robots focus on training for a single skill at a time. Robots capable of learning from demonstration can considerably benefit from the added ability to learn new movements without forgetting past knowledge. To this end, we propose an approach for continual learning from demonstration using hypernetworks and neural ordinary differential equation solvers. We empirically demonstrate the effectiveness of our approach in remembering long sequences of trajectory learning tasks without the need to store any data from past demonstrations. Our results show that hypernetworks outperform other state-of-the-art regularization-based continual learning approaches for learning from demonstration. In our experiments, we use the popular LASA trajectory benchmark and a new dataset of kinesthetic demonstrations that we introduce in this paper, called the HelloWorld dataset. We evaluate our approach using both trajectory error metrics and continual learning metrics, and we propose two new continual learning metrics. Our code, along with the newly collected dataset, is available at https://github.com/sayantanauddy/clfd.
Index Terms— Continual learning, learning from demonstration, hypernetwork, neural ordinary differential equation solver
Fig. 1. A robot, continually trained using learning from demonstration to write single letters, can reproduce all the trajectories that it has learned in the past with a single network and without having access to training data from past tasks. Video is available at https://youtu.be/cTfVfYyyeXk.
I. INTRODUCTION
Robots deployed in unstructured real-world environments will face new tasks and challenges over time, requiring capabilities that cannot be fully anticipated at the beginning. These robots need to learn continually, which implies that they should be able to acquire new capabilities without forgetting the previously learned ones. Furthermore, a continual learning robot should be able to do this without the need to store and retrain on the training data of all the previously learned skills. Continual learning can be effective in expanding a robot's repertoire of skills and in increasing the ease of use for non-expert human users. However, apart from a few approaches for robotics [1], [2], current continual learning research mostly focuses on vision-based tasks such as incrementally learning to classify new image categories [3]–[5].
Continually acquiring perceptive skills is important for a robot that interacts with its environment, but equally important is the ability to incrementally learn new movement skills. Learning from demonstration [6] is a popular and tangible way to impart motion skills to robots, for instance via kinesthetic teaching, where a human user teaches new skills by guiding the robot. A recent trend in learning from demonstration is to encode observations into a vector field [7]–[12]. These methods, like many other works in the field, focus on learning a single motion skill. To naively learn multiple motion skills, one would need to train a different model for each skill, or jointly train on the demonstrations for all skills.
In this paper, we propose an approach for continual learning from demonstration in which a robot learns individual motions sequentially without retraining on past demonstrations. The learned skills are incorporated into a single unified model. After learning many types of motion, our robot can reproduce all the trajectories it has learned in the past (Fig. 1). To the best of our knowledge, this is the first continual learning approach for learning from kinesthetic demonstrations.
More specifically, we show that a single hypernetwork [5] that generates the parameters of a Neural Ordinary Differential Equation (NODE) solver [13] remembers a long sequence of motion skills as well as a setup that learns each task with a separate NODE. The hypernetwork grows by a negligible amount for each new task, making it suitable for potential deployment on resource-constrained, non-networked robotic platforms. We also demonstrate the effectiveness of chunked hypernetworks [5], which are even smaller in size than the NODEs they generate. Our results show how using the time index as an additional, direct input to a NODE increases its prediction accuracy for complex trajectories. We evaluate our approach on the popular LASA trajectory learning benchmark [8]. We also introduce a new dataset, named HelloWorld, which consists of two-dimensional demonstrations collected with a Franka Emika Panda robot. It serves as an additional benchmark to evaluate our approach, both quantitatively and qualitatively on a real robot. Finally, we propose two new easily computable metrics which, together with existing ones [14], gauge continual learning performance.
To summarize, our contributions in this paper are threefold:
• We propose an approach for learning from demonstration with hypernetworks and NODEs for continually learning new tasks without reusing training data of previous tasks.
• We release a new dataset containing 7 tasks collected with a real robot using kinesthetic teaching.
• We propose two new continual learning metrics.
II. RELATED WORK
A. Continual Learning
Popular strategies for continual learning include replaying data from past tasks or regularizing trainable parameters to avoid catastrophic forgetting [15]. Replay-based methods cache samples of real data from past tasks [16], or use generative models to create pseudo-samples of past data [3], which are interleaved with the current task's data during training. Regularization-based methods [4], [17] add a regularization term to the learning objective to minimize changes to parameters that are important for solving previous tasks.
Relatively few existing approaches address continual learning for robotics. Gao et al. [1] present an approach for continual imitation learning that relies on deep generative replay [3] and action-conditioned video prediction to generate state and action trajectories of past tasks. This pseudo-data is interleaved with demonstrations of the current task to train a policy network that controls the robot's actions. The authors note that generating high-quality video frames can be problematic for a long sequence of tasks. Xie and Finn's [18] approach for lifelong robotic reinforcement learning seeks to improve forward transfer while learning the current task by pre-training on the entire experience collected from all previous tasks. The problem of catastrophic forgetting is not considered.
Our approach is similar to that of Huang et al. [2], who also utilize hypernetworks for continually training a robot. In their work, a task-conditioned hypernetwork generates the parameters of the dynamics model for reinforcement learning tasks such as opening doors or pushing blocks. In contrast, we use hypernetworks to generate the parameters of a trajectory-learning NODE in a learning-from-demonstration setup. We follow a supervised approach and do not need to rely on robot simulators. Compared to [2], we evaluate on much longer sequences of tasks and also investigate the effectiveness of chunked hypernetworks [5]. In addition, we qualitatively evaluate our approach on a physical robot.
B. Trajectory Learning from Demonstration
Learning from demonstration enables users without expertise in robotics to train robots [6]. Approaches in the field can be categorized into two groups: probabilistic approaches [7]–[9] use generative models to fit a distribution to the training data, whereas non-probabilistic approaches [10]–[12] exploit function approximators like neural networks to fit the training data. In both groups, training data can be used to learn a static mapping (time input → desired position) or a dynamic mapping (input position → desired velocity). Although learning from demonstration is a mature research field, most methods assume that different tasks are encoded in different representations, i.e., one has to fit a new model for each task the robot has to execute. In this paper, we take the continual learning perspective on learning from demonstration and propose an approach capable of continually learning new tasks without accessing the training data from past tasks.
III. BACKGROUND
In this paper, we utilize Neural Ordinary Differential Equation (NODE) solvers [13] for learning trajectories, and different state-of-the-art continual learning approaches [4], [5], [17] to alleviate catastrophic forgetting [15].
A. Trajectory Learning
Neural Ordinary Differential Equation solver: Consider a set of N observed trajectories $\mathcal{D} = \{y^{(0)}_{0:T-1}, \ldots, y^{(N-1)}_{0:T-1}\}$, where each trajectory $y^{(i)}_{0:T-1}$ is a sequence of T observations $y^{(i)}_t \in \mathbb{R}^d$. Each observation $y^{(i)}_t$ is a perturbation of an unknown true state $x^{(i)}_t$ generated by an unknown underlying vector field $f_{true}$ [21]:
$$x_t = x_0 + \int_0^t f_{true}(x_\tau)\, d\tau, \tag{1}$$
where $x_0$ is the true starting state of the trajectory. The goal of a NODE solver [13] is to learn a neural network $f_\theta$, parameterized by $\theta$, that approximates the true underlying dynamics of the observed system such that $f_\theta \approx f_{true}$. As we do not have access to $f_{true}$ but only to the noisy observed trajectories, we compute the loss $\mathcal{L}$ from the difference between the forward-simulated states $\hat{y}_t$ of the NODE and the observations $y_t$:
$$\mathcal{L} = \frac{1}{2}\sum_t \lVert y_t - \hat{y}_t \rVert_2^2, \quad \text{where } \hat{y}_t = \hat{y}_0 + \int_0^t f_\theta(\hat{y}_\tau)\, d\tau. \tag{2}$$
B. Continual Learning with Regularization
Synaptic Intelligence: Synaptic Intelligence (SI) [17] is a regularization-based continual learning approach. Each neural network parameter is assigned an importance measure based on its contribution to the change in the loss. The loss for the mth task is defined as
$$\tilde{\mathcal{L}}_m = \mathcal{L}_m + c \sum_k \Omega^m_k \left(\theta_k - \theta^*_k\right)^2, \tag{3}$$
where c is the regularization constant that trades off learning a new task against remembering previously learned tasks, $\theta^*_k$ denotes the value of the kth parameter before starting to learn the mth task, and $\theta_k$ is the current value of the kth parameter. The per-parameter regularization strength $\Omega^m_k$ [17] is given by
$$\Omega^m_k = \sum_{l<m} \frac{\omega^l_k}{\left(\Delta^l_k\right)^2 + \xi}, \tag{4}$$
where $\omega^l_k$ is the contribution of the kth parameter to the change in the loss while training on task l, $\Delta^l_k$ is the total change of the kth parameter during task l, and $\xi$ is a damping constant.
Memory Aware Synapses: Memory Aware Synapses (MAS) and all the chunk embedding vectors are combined in a batch [4] is also a regularization-based continual learning approach. and fed into the hypernetwork to produce the target network The loss for the mth task for MAS has the same form as SI parameters for a task in one forward pass [5]. (3). MAS differs from SI in the way Ωm k is computed: the IV. M ETHODS importance of a trainable parameter depends on the gradient of the squared L2 norm of the network’s output: In our experiments, we employ two variants of NODEs, which are enhanced with different continual learning methods N N 1 X 1 X ∂L22 (fθ (xn )) to enable the NODEs to learn continually (Fig. 2). Ωm k = ||gk (xn )|| = . (5) N n=1 N n=1 ∂θk A. NODE Variants The above summation is performed over N input data points. Along with a basic NODE fθ (ŷt ) (Sec. III-A), we use another variant where the NODE neural network is a function Hypernetworks: A hypernetwork [5] is a meta-model that of both state and time, fθ (ŷt , t). This explicit time input generates the parameters of a target network that solves the results in the NODE learning a time-evolving vector field. We task we are interested in. It uses a trainable task embedding show empirically that this improves the accuracy of predicted vector as an input to generate the network parameters trajectories, especially for those containing loops. We refer for a task. Though the parameters h of the hypernetwork to this time-dependent NODE as NODET , and to the time- fh are regularized, the parameters θm+1 produced by a independent one as NODEI . hypernetwork for the (m+1)th task can be arbitrarily far away B. Continual Learning NODE Models in parameter space from the parameters θm produced for the Single NODE per task (SG): A simple way to learn M previous mth task. 
Intuitively, this gives a hypernetwork more tasks is to use a dedicated, newly-initialized NODE to learn a freedom to find good solutions for both the mth and (m+1)th task and to freeze it afterwards. At the end we get M NODEs, tasks than other regularization-based approaches [17] [4]. from which we can pick one at prediction time to reproduce A two-step optimization process is used for training a the desired trajectory (Fig. 2(a)). In this setting, which acts as hypernetwork [5]. First, a candidate change ∆h for the an upper-performance baseline, catastrophic forgetting [15] hypernetwork parameters is computed which minimizes the is eliminated because the parameters of a NODE trained on task-specific loss Lm for the (current) mth task w.r.t. θm : a task are not affected when a new NODE is trained on the Lm = Lm (θm , ym ) where θm = fh (em , h) (6) next task. However, this also means that we end up with M times the number of parameters of a single NODE. Here em is the task embedding vector and ym is the data for Finetuning (FT): A single NODE is sequentially finetuned on the mth task. Next, ∆h is considered to be fixed and the actual M tasks. To tell the NODE which task it should reproduce, i.e. change for the hypernetwork parameters h is learned [5] by to make it task-conditioned, we use an additional input in the minimizing the regularized loss L̃m w.r.t. θm = fh (em , h): form of a trainable vector known as a task embedding vector. L̃m = Lm θm , ym This is similar to the approach followed for hypernetworks. m−1 After the mth task is learned, the trained task embedding β X 2 vector em for that task is saved. To reproduce the trajectory + fh (el , h∗ ) − fh (el , h + ∆h) (7) m−1 for the mth task, we pick the corresponding task embedding l=0 vector and use it as the additional network input. 
The Here h∗ denotes the hypernetwork parameters before NODE parameters are finetuned to minimize the loss on the learning the mth task, and β is a hyperparameter that controls current task without any mechanism for avoiding catastrophic the regularization strength. To calculate the second part of (7), forgetting [15]. In this setting (Fig. 2(b)), we would expect the stored task embedding vectors {e0 , e1 , . . . , el , . . . , em−1 } the NODE to only remember the latest task and so this acts for all tasks before the mth task are used. In each learning as a lower-performance baseline. step, the current task embedding vector em is also updated to minimize the task-specific loss Lm [5]. Note that the Synaptic Intelligence (SI): To learn M tasks, the NODE parameters of the target network θm for the mth task are parameters are regularized with the SI loss L̃m (3). The task- simply the hypernetwork outputs and are not directly trainable. specific part (Lm ) of L̃m corresponds to the NODE loss (2). Chunked hypernetworks [5] produce the parameters of Similar to FT, we make the SI NODE task-conditioned using the target network in segments known as chunks. A regular a trainable task embedding vector, as shown in Fig. 2(b). hypernetwork has a very high-dimensional output, but a Memory Aware Synapses (MAS): For MAS [4], we also chunked hypernetwork’s output is of a much smaller di- follow the architecture in Fig. 2(b). The NODE parameters mension, leading to a lower hypernetwork parameter size. A are learned using (3) and we use (5) to compute Ωm k . As chunked hypernetwork requires additional inputs in the form before, we apply (2) as the task-specific loss Lm . The MAS of trainable chunk embedding vectors. While each task has its NODE is also made task-conditioned with a trainable task dedicated task embedding vector, chunk embedding vectors embedding vector. 
are shared across tasks and are regularized in the same way Hypernetworks (HN): We use a hypernetwork to generate as the hypernetwork parameters. The task embedding vector the parameters of a NODE. We first compute the candidate
(a) (b) (c) (d) Time steps Start state Task Emb. Time steps Start state Task Emb. Time steps Start state Task Emb. { Chunk Embs. } Time steps Start state Hypernetwork Hypernetwork Integrator Integrator Integrator Integrator NODE NODE NODE NODE State Trajectory State Trajectory State Trajectory State Trajectory Finetuning (FT) {Chunks} Synaptic Intelligence (SI) Single NODE/task (SG) Memory Aware Synapses (MAS) Hypernetwork (HN) Chunked Hypernetwork (CHN) Fig. 2. Continual learning models used in our experiments. Non-regularized trainable parameters are saved after each task is learned (shown with ). Regularized trainable parameters are protected from catastrophic forgetting while learning a sequence of tasks (shown with ). Other inputs and outputs are not trainable (shown with ). Given a start state and time steps, a NODE generates the state trajectory for the time steps. (a) SG: A single NODE learns only a single task. This forms the upper-performance baseline. (b) Architecture for Finetuning (FT), Synaptic Intelligence (SI) and Memory Aware Synapses (MAS). NODE parameters are regularized for MAS and SI, and finetuned for FT. (c) Hypernetworks (HN) produce all the NODE parameters using a task embedding vector. (d) Chunked Hypernetworks (CHN) use chunk embedding vectors together with a task embedding vector to produce the NODE parameters in segments called chunks. HN and CHN (highlighted in purple) are our proposed solutions for continual learning from demonstration. change ∆h for the hypernetwork parameters by minimizing D HW = {D0:6 } consists of 7 tasks, each containing 8 slightly the NODE loss (2). This acts as our task-specific loss Lm : varying demonstrations of a letter. Each demonstration is a sequence of 1000 2-D points. After training on all the 1X Lm = Lm (θm , ym ) = k ytm − ŷtm k22 (8) tasks, the objective is to make the robot write the words 2 t “hello world”. 
Our motivation for using this dataset is to Z t test our approach on complicated trajectories (with loops) where θm m m m = fh (e , h) and ŷt = ŷ0 + fθm (ŷτm ) dτ and to show that it also works on kinesthetically recorded 0 demonstrations using a real robot. This dataset is available Symbols have the same meaning as in equations (2) and at https://github.com/sayantanauddy/clfd. (6). We use (7) for training the hypernetwork in the second optimization step. The structure of HN is shown in Fig. 2(c). B. Metrics Chunked Hypernetworks (CHN): As shown in Fig. 2(d), Trajectory Metrics: We report the Swept Area error [19], we use a chunked hypernetwork to generate the parameters Frechet distance [9], and Dynamic Time Warping (DTW) of a NODE. For this, equations (8) and (7) are employed as error [9] , which measure how close the predicted trajectories the loss functions in the 2-step optimization process. are to the ground-truth demonstrations. We treat SG, FT, SI and MAS as comparison baselines, Continual Learning Metrics: We report Accuracy (ACC), and propose HN and CHN (highlighted in Fig. 2) as solutions Remembering (REM), and Model Size Efficiency (MS) [14]. for continual learning from demonstration. ACC is a measure of the average accuracy for the current and past tasks. REM measures how well past tasks are V. E XPERIMENTS remembered. MS measures how much the size of a model A. Datasets grows compared to its size after learning the first task. LASA Dataset: LASA [8] is a widely-used benchmark Additionally, we introduce two new easy-to-compute con- for evaluating motion generation algorithms. It contains 30 tinual learning metrics: Time Efficiency (TE) and Final Model patterns, each with 7 similar demonstrations. We refer to each Size (FS). TE measures the increase in training duration with pattern Dm as a task. Of the 30 tasks, we use the first 26 tasks: the number of tasks, relative to the training time for the first D LASA = {D0:25 }. 
We omit the last 4 tasks, each of which task. TE only needs the training times to be logged, and it contains 2 or 3 dissimilar patterns merged together. Each reflects the extra effort needed in the training loop (e.g. due demonstration of a task is a sequence of 1000 2-D points. to extra regularization steps) with an increase in the number We arrange the 26 tasks alphabetically and train sequentially of tasks. For M tasks, TE is defined as on each one without accessing the data of past tasks. ( M −1 ) T0 X 1 HelloWorld Dataset: We further evaluate our approach on TE = min 1, , (9) M i=0 Ti a dataset of demonstrations we collected using the Franka Emika Panda robot. The x and y coordinates of the robot’s where Ti is the time required by the model to learn task i. end-effector were recorded while a human user guided it FS is a measure of the absolute parameter size, which kinesthetically to write the 7 lower-case letters h,e,l,o,w,r,d contrasts with MS which only measures the parameter growth one at a time on a horizontal surface. The HelloWorld dataset relative to the size after learning the first task. A model which
has a large number of parameters for the first task and adds per task, same as SI, MAS and FT. Thus, CHN and especially a relatively small number of parameters for subsequent tasks HN perform similar to the upper baseline SG, while their will achieve a high score for MS, but will fare worse in terms parameter size is close to the lower baseline FT. Fig. 5 shows of FS if models of other compared methods have a smaller examples of trajectories predicted by SG, CHN, and HN for absolute size. FS is defined as a selection of tasks after learning the last task. After training on all 26 tasks, we compute the errors of the FS = 1 − Memnorm (θM −1 ) (10) trajectories predicted for tasks 0 to 25. We plot the overall M −1 Memnorm (θ ) is the parameter size after learning M tasks, normalized by the size of the largest compared model among (Frechet error) (Swept Area error) SG, FT, SI ,MAS, HN and CHN. With these 5 metrics, we compute the overall P continual learning metrics P proposed in 3 log10 [14]: CLscore = c∈C c and CLstability = 1 − c∈C stdev(c), where C = {ACC, REM, MS, TE, FS}. All the continual 2 learning metrics lie in the range 0 (worst) to 1 (best). C. Hyperparameters 2 log10 All models are trained for 15 × 103 and 40 × 103 iterations 1 per task for D LASA and D HW respectively. In all experiments, we use fully connected networks and the Adam optimizer with 0 a learning rate of 10−4 . The NODEs for SG, FT, SI, MAS, and 5 (DTW error) CHN have 3 hidden layers with 1000 units each. For HN, the log10 4 target NODE has 3 hidden layers with 100 units each to keep its parameter size comparable to the other models. The smooth 3 ELU activation [22] is used in all NODEs. Task embedding 0 5 10 15 20 25 vectors have a dimension of 256 wherever they are used. Task ID For CHN, we use 256-dimensional chunk embedding vectors SG FT SI MAS CHN HN and 8192-dimensional output chunks. The hypernetworks in HN and CHN have 3 hidden layers with 200 ReLU units Fig. 3. 
Trajectory errors for the LASA dataset (lower is better). The x-axis each. For regularization, we use: SI [17]: c = 0.3, ξ = 0.3, shows the current task. After learning a task (using NODET ), all current and previous tasks are evaluated. Plots for SG and HN overlap with each MAS [4]: c = 0.1, HN and CHN [5]: β = 5 × 10−3 . These other. Lines show medians and shaded regions denote the lower and upper are based on values used in the aforementioned papers. In quartiles of the errors over 5 independent seeds. Sec. V-D we show that our proposed methods (HN and CHN) are robust to changes in regularization hyperparameters. Parameters (×106) 50 SG MAS 40 FT CHN D. Results 30 SI HN LASA Dataset: We train each model on the 26 tasks of 20 D LASA sequentially. Fig. 3 shows the median errors of the 10 predictions for tasks D0 –Dm after training on task Dm (using 0 0 5 10 15 20 25 NODET ) for m = 0, 1, . . . 25, e.g. the value for task 7 denotes Task ID the evaluation errors for all trajectories from tasks 0 to 7 after training on task 7. SG’s performance does not deteriorate with Fig. 4. Growth of parameter size with new tasks for the LASA dataset (using NODET ). SG has a high rate of growth since it uses a separate increasing tasks, leading to a nearly horizontal line (overlaps network for each task. All other models grow by only 256 parameters for with HN). A drastic increase in the error is observed for FT each new task. Plots for CHN and FT overlap with each other. as more tasks are learned, since FT optimizes its parameters only for the current task. After the first task, the errors for SI Task IDs 1 5 9 13 17 21 25 and MAS also increase steeply. Among the continual learning SG models, CHN and HN perform the best (red and green lines in Fig. 3). CHN’s forgetting increases with the number of tasks, as shown by the upward slope in its error plot. HN CHN does not suffer much from catastrophic forgetting, and its error plot overlaps with that of the upper baseline SG. 
Although HN’s performance after learning 26 tasks is very HN similar to that of SG, its parameter size is 4.3×106 compared to SG’s combined size of 52.2 × 106 parameters, as shown Vector field Initial value Demonstration Prediction in Fig. 4. CHN’s final parameter size is 1.9 × 106 . Also, the parameter count for SG grows by 2.1 × 106 per task, whereas Fig. 5. Example of trajectories predicted by SG, CHN and HN using CHN and HN grow at a much smaller rate of 256 parameters NODET for a selection of LASA tasks after learning the last task.
log10(DTW error) Since there is no preexisting procedure for this, we follow log10(DTW error) 5 5 the following steps. We set a threshold on the DTW error, 4 4 such that predictions with an error less than the threshold are considered accurate. As each task has multiple ground truth 3 3 demonstrations, we first compute the DTW error between 2 2 all pairs of demonstrations for each task. We then find the SG FT SI MAS CHN HN SG FT SI MAS CHN HN Method Method maximum value from this list and multiply it by 3, to allow (a) NODET (b) NODEI some room for error such that a predicted trajectory with the same general shape as its demonstration is considered Fig. 6. DTW errors (lower is better) of trajectories predicted for all past accurate. Doing so, we arrive at a DTW threshold value of tasks together after learning the last task of the LASA dataset. Results are obtained using 5 independent seeds. 2191 for D LASA , and use it to evaluate the metrics in Tab. I. For both NODE variants, HN significantly outperforms all the compared models in terms of CLscore . For NODET , METHOD ACC REM MS TE FS CLscore CLstability HN performs close to the upper baseline SG in terms of SG 0.8742 1.0000 0.1482 0.8679 0.0000 0.5781 0.5832 both ACC and REM. The additional regularization needed FT 0.0594 0.1569 0.9986 0.9579 0.9565 0.6259 0.5759 for training hypernetworks leads to a comparatively lower SI 0.0427 0.3714 0.9997 1.0000 0.7830 0.6394 0.6236 MAS 0.0179 0.8716 0.9996 0.8312 0.8264 0.7094 0.6486 score for the time efficiency metric TE for CHN and HN. CHN 0.4766 0.7943 0.9983 0.5270 0.9636 0.7520 0.7838 A very high parameter growth rate for SG results in poor HN 0.8840 0.9710 0.9993 0.5327 0.9173 0.8609 0.8311 scores for MS and FS. The extra time input in NODET also (a) NODET leads to better overall performance for SG and HN. 
METHOD ACC REM MS TE FS CLscore CLstability Robustness to Hyperparameter Changes: To test the sen- SG 0.8107 1.0000 0.1482 0.8614 0.0000 0.5641 0.5925 sitivity of the methods to changes in the regularization hyper- FT 0.0590 0.2040 0.9986 0.8961 0.9565 0.6228 0.5949 parameters, we create sets of 5 hyperparameters each for SI, SI 0.0452 0.3785 0.9997 0.9371 0.7830 0.6287 0.6368 MAS 0.0273 0.8215 0.9996 0.8769 0.8264 0.7103 0.6525 MAS, HN and CHN by drawing independently and uniformly CHN 0.5385 0.8409 0.9983 0.5130 0.9636 0.7709 0.7930 from the following ranges: (SI) c ∈ [0.1, 0.5], ξ ∈ [0.1, 0.5], HN 0.7595 0.9864 0.9993 0.6011 0.9176 0.8528 0.8480 (MAS) c ∈ [0.1, 0.5], (CHN) β ∈ [10−3 , 10−2 ], (HN) (b) NODEI β ∈ [10−3 , 10−2 ] resulting in 20 different configurations. We then repeat the LASA experiment with NODET for all TABLE I these configurations. In terms of CLscore we observe that all C ONTINUAL LEARNING METRICS FOR THE LASA DATASET ( MEDIAN configurations of HN outperform all configurations of CHN, OVER 5 SEEDS ). VALUES RANGE FROM 0 ( WORST ) TO 1 ( BEST ). which in turn are better than all configurations of MAS, followed by SI. This trend is reflected in the medians and inter-quartile ranges (IQR) of the overall continual learning errors for all tasks in Fig. 6 and the errors for a selection of metrics CLscore and CLstability for each method (over its 5 7 tasks in Fig. 7. The similarity in the performance of HN with the upper baseline SG can be seen in both cases. Fig. 7 CLscore CLstability shows that except MAS, all other models can remember the METHOD Median IQR Median IQR last task (task 25) but SG and HN remember the other tasks HN 0.8578 0.0011 0.8324 0.0050 as well. CHN performs worse than HN but much better than CHN 0.7939 0.0098 0.8126 0.0022 FT, SI, and MAS. 
Note that the trajectory metrics are plotted MAS 0.7104 0.0019 0.6562 0.0062 in the log10 scale to accommodate the high errors for FT, SI, SI 0.6047 0.0065 0.6403 0.0011 and MAS in the same plot as SG, HN and CHN. TABLE II To compute the continual learning metrics [14], each ROBUSTNESS TO CHANGES IN REGULARIZATION HYPERPARAMETERS FOR predicted trajectory needs to be marked as accurate or THE LASA DATASET (5 CONFIGURATIONS FOR EACH METHOD ). inaccurate based on its difference from the ground truth. SG FT SI MAS CHN HN log10(DTW error) 5 4 3 2 1 5 9 13 17 21 25 Task ID Fig. 7. DTW errors (lower is better) of the trajectories predicted for a selection of 7 out of 26 past tasks (shown individually) by the models using NODET after being trained on the last task of the LASA dataset. Results are obtained using 5 independent seeds.
configurations) shown in Tab. II. It can be seen that HN and performance does not deteriorate even after learning all tasks. CHN perform better than the other methods and the variability Fig. 9 shows examples of trajectories predicted by SG, in terms of IQR is very small, thereby showing that they are CHN and HN for past tasks after being trained sequentially robust to changes in the regularization hyperaparameter β. on all D HW tasks. All models exhibit superior performance HelloWorld Dataset: For D HW , which comprises 7 tasks, when using the additional time input in NODET (Fig. 9(a)), we perform the same experiments as D LASA . Fig. 8 shows without which even SG is unable to learn trajectories with the errors in the predicted trajectories for all past and current loops. This can be seen from the errors for the letters e, r and d tasks, as new tasks are learned. The median errors for CHN in Fig. 9(b). This is also evident in Fig. 10 which shows and HN stay nearly unchanged and are similar to the upper the errors in the predictions for all past tasks together after baseline SG. As before, FT, SI, and MAS exhibit severe all the tasks have been learned. Apart from FT, all methods catastrophic forgetting. Due to fewer tasks in D HW , CHN’s have higher median errors when using NODEI (Fig. 10b) compared to NODET (Fig. 10a). The prediction errors for each past task after learning all tasks are shown individually in Fig. 11. It can be seen that all the methods remember the (Frechet error) (Swept Area error) 3 last task, but only SG, CHN, and HN remember earlier tasks. 2 log10 Using the same threshold computation approach we fol- 1 lowed for D LASA , we compute a DTW threshold value of 1821 for D HW . With this, we compute the continual learning 2 metrics shown in Tab. III. The advantage of using NODEs with a time input is clear from the higher values of ACC for 1 NODET compared to NODEI for all the methods. 
Overall, CHN shows the best performance: it achieves the highest CLscore and CLstability in Tab. III, on account of its small size (the highest FS) and an ACC score comparable to those of HN and SG. HN and CHN also achieve much higher REM scores than FT, SI, and MAS.

[Fig. 8 plot: log10(DTW error), log10(Fréchet error), and log10(Swept Area error) versus the current task (h, e, l, o, w, r, d) for SG, FT, SI, MAS, CHN, and HN.]
Fig. 8. Trajectory errors for the HelloWorld dataset (lower is better). The x-axis shows the current task. After learning a task (using NODE^T), all current and previous tasks are evaluated. Lines show medians and shaded regions denote the lower and upper quartiles of the errors over 5 independent seeds. Plots for SG, HN, and CHN are close to each other.

[Fig. 9 plot: trajectories for the tasks h, e, l, o, w, r, d predicted by SG, CHN, and HN; legend: vector field, initial value, demonstration, prediction.]
Fig. 9. Example of trajectories predicted by SG, CHN, and HN for all HelloWorld tasks after being trained on the last task.

[Fig. 10 plot: log10(DTW error) per method (SG, FT, SI, MAS, CHN, HN); panels (a) NODE^T and (b) NODE^I.]
Fig. 10. DTW errors (lower is better) of trajectories predicted for all past tasks together after learning the last task of the HelloWorld dataset. Results are obtained using 5 independent seeds.

TABLE III
CONTINUAL LEARNING METRICS FOR THE HELLOWORLD DATASET (MEDIAN OVER 5 SEEDS). VALUES RANGE FROM 0 (WORST) TO 1 (BEST).

(a) NODE^T
METHOD  ACC     REM     MS      TE      FS      CLscore  CLstability
SG      1.0000  1.0000  0.3704  0.9431  0.0000  0.6627   0.5924
FT      0.2500  0.0774  0.9997  0.9551  0.8388  0.6242   0.6164
SI      0.2500  0.0357  0.9999  0.9519  0.1945  0.4864   0.5939
MAS     0.3839  0.2024  0.9999  0.8622  0.3556  0.5608   0.6884
CHN     0.9420  0.9702  0.9996  0.7791  0.8652  0.9112   0.9202
HN      0.9688  0.9821  0.9998  0.7603  0.6930  0.8808   0.8720

(b) NODE^I
METHOD  ACC     REM     MS      TE      FS      CLscore  CLstability
SG      0.7277  1.0000  0.3704  0.9364  0.0000  0.6069   0.6253
FT      0.1741  0.2262  0.9997  0.9276  0.8388  0.6333   0.6423
SI      0.1964  0.3214  0.9999  0.9363  0.1945  0.5297   0.6386
MAS     0.2277  0.4464  0.9999  0.8577  0.3556  0.5774   0.7014
CHN     0.7009  0.9643  0.9996  0.7642  0.8652  0.8588   0.8861
HN      0.7634  1.0000  0.9998  0.7455  0.6943  0.8406   0.8680

[Fig. 11 plot: log10(DTW error) per task (h, e, l, o, w, r, d) for SG, FT, SI, MAS, CHN, and HN.]
Fig. 11. DTW errors (lower is better) of the trajectories predicted for the 7 past tasks (shown individually) by the models using NODE^T after being trained on the last task of the HelloWorld dataset. Results are obtained using 5 independent seeds.

Finally, we qualitatively evaluate how the trajectories predicted by HN can be reproduced with a physical robot. For this, we use the same Franka Emika Panda robot that was used for recording the demonstrations for D_HW. The HN model trained on the 7 tasks of D_HW is queried to produce the letters h, e, l, l, o, w, o, r, l, d by using the appropriate task embedding vectors in sequence. The trajectory of each letter is scaled and translated by a constant amount and provided to the robot, which then follows this path with its end-effector. The z-coordinate and orientation of the end-effector are fixed. Fig. 1 shows the letters written by the robot. A video of the robot performing the HelloWorld tasks is available at https://youtu.be/cTfVfYyyeXk.

VI. CONCLUSION

In this paper, we presented the first work on continual learning from kinesthetic demonstrations. We showed the effectiveness of hypernetworks, which continually consolidate the knowledge from a sequence of learned tasks into a single network without retraining on any past tasks. Our results also show that the relatively small chunked hypernetworks perform on par with regular hypernetworks for a limited number of tasks, but start forgetting as the number of tasks increases. In the future, we will investigate how the remembering capacity of chunked hypernetworks can be improved. Other aspects of future work will include handling trajectories of more than two dimensions, either in the robot's task space or joint space, and replacing NODEs with more advanced trajectory learning approaches [9] [10] with stability guarantees.

REFERENCES

[1] C. Gao, H. Gao, S. Guo, T. Zhang, and F. Chen, "CRIL: Continual robot imitation learning via generative and prediction model," in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2021, pp. 6747–5754.
[2] Y. Huang, K. Xie, H. Bharadhwaj, and F. Shkurti, "Continual model-based reinforcement learning with hypernetworks," in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 799–805.
[3] H. Shin, J. K. Lee, J. Kim, and J. Kim, "Continual learning with deep generative replay," in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 2994–3003.
[4] R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, and T. Tuytelaars, "Memory aware synapses: Learning what (not) to forget," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 139–154.
[5] J. von Oswald, C. Henning, J. Sacramento, and B. F. Grewe, "Continual learning with hypernetworks," in International Conference on Learning Representations (ICLR), 2019.
[6] A. Billard, S. Calinon, and R. Dillmann, "Learning from humans," Springer Handbook of Robotics, 2nd Ed., 2016.
[7] M. Hersch, F. Guenter, S. Calinon, and A. Billard, "Dynamical system modulation for robot learning via kinesthetic demonstrations," IEEE Transactions on Robotics, vol. 24, no. 6, pp. 1463–1467, 2008.
[8] S. M. Khansari-Zadeh and A. Billard, "Learning stable nonlinear dynamical systems with Gaussian mixture models," IEEE Transactions on Robotics, vol. 27, no. 5, pp. 943–957, 2011.
[9] J. Urain, M. Ginesi, D. Tateo, and J. Peters, "ImitationFlow: Learning deep stable stochastic dynamic systems by normalizing flows," in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 5231–5237.
[10] J. Z. Kolter and G. Manek, "Learning stable deep dynamics models," Advances in Neural Information Processing Systems, vol. 32, pp. 11128–11136, 2019.
[11] A. J. Ijspeert, J. Nakanishi, and S. Schaal, "Movement imitation with nonlinear dynamical systems in humanoid robots," in International Conference on Robotics and Automation (ICRA), 2002, pp. 1398–1403.
[12] M. Saveriano, F. J. Abu-Dakka, A. Kramberger, and L. Peternel, "Dynamic movement primitives in robotics: A tutorial survey," arXiv preprint arXiv:2102.03861, 2021.
[13] R. T. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud, "Neural ordinary differential equations," in Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018, pp. 6572–6583.
[14] N. Díaz-Rodríguez, V. Lomonaco, D. Filliat, and D. Maltoni, "Don't forget, there is more than forgetting: new metrics for continual learning," arXiv preprint arXiv:1810.13166, 2018.
[15] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter, "Continual lifelong learning with neural networks: A review," Neural Networks, vol. 113, pp. 54–71, 2019.
[16] S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert, "iCaRL: Incremental classifier and representation learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2001–2010.
[17] F. Zenke, B. Poole, and S. Ganguli, "Continual learning through synaptic intelligence," in International Conference on Machine Learning. PMLR, 2017, pp. 3987–3995.
[18] A. Xie and C. Finn, "Lifelong robotic reinforcement learning by retaining experiences," arXiv preprint arXiv:2109.09180, 2021.
[19] S. M. Khansari-Zadeh and A. Billard, "Learning control Lyapunov function to ensure stability of dynamical system-based robot reaching motions," Robotics and Autonomous Systems, vol. 62, no. 6, pp. 752–765, 2014.
[20] M. Saveriano, "An energy-based approach to ensure the stability of learned dynamical systems," in IEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 4407–4413.
[21] M. Heinonen, C. Yildiz, H. Mannerström, J. Intosalmi, and H. Lähdesmäki, "Learning unknown ODE models with Gaussian processes," in International Conference on Machine Learning. PMLR, 2018, pp. 1959–1968.
[22] D. Clevert, T. Unterthiner, and S. Hochreiter, "Fast and accurate deep network learning by exponential linear units (ELUs)," in 4th International Conference on Learning Representations, ICLR, 2016.