Continual Learning from Demonstration of Robotic Skills - arXiv

Continual Learning from Demonstration of Robotic Skills
                                         Sayantan Auddy1∗ Jakob Hollenstein1 Matteo Saveriano1,3 Antonio Rodríguez-Sánchez1 Justus Piater1,2

                                            Abstract— Methods for teaching motion skills to robots focus
                                         on training for a single skill at a time. Robots capable of
                                         learning from demonstration can considerably benefit from
                                         the added ability to learn new movements without forgetting
                                         past knowledge. To this end, we propose an approach for
                                         continual learning from demonstration using hypernetworks
arXiv:2202.06843v2 [cs.RO] 15 Feb 2022

                                         and neural ordinary differential equation solvers. We empir-
                                         ically demonstrate the effectiveness of our approach in re-
                                         membering long sequences of trajectory learning tasks without
                                         the need to store any data from past demonstrations. Our
                                         results show that hypernetworks outperform other state-of-
                                         the-art regularization-based continual learning approaches for
                                                                                                                    Fig. 1. A robot, continually trained using learning from demonstration to
                                         learning from demonstration. In our experiments, we use the                write single letters, can reproduce all the trajectories that it has learned in the
                                         popular LASA trajectory benchmark, and a new dataset of                    past with a single network and without having access to training data from
                                         kinesthetic demonstrations that we introduce in this paper                 past tasks. Video is available at https://youtu.be/cTfVfYyyeXk.
                                         called the HelloWorld dataset. We evaluate our approach using
                                         both trajectory error metrics and continual learning metrics,
                                         and we propose two new continual learning metrics. Our
                                         code, along with the newly collected dataset, is available at              a single motion skill. To naively learn multiple motion skills,
                                         https://github.com/sayantanauddy/clfd.                                     one would need to train a different model for each skill, or
                                            Index Terms— Continual learning, learning from demonstra-               jointly train on the demonstrations for all skills.
                                         tion, hypernetwork, neural ordinary differential equation solver              In this paper, we propose an approach for continual learning
                                                                                                                    from demonstration in which a robot learns individual motions
                                                                 I. INTRODUCTION                                    sequentially without retraining on past demonstrations. The
                                                                                                                    learned skills are incorporated into a single unified model.
                                            Robots deployed in unstructured real-world environments                 After learning many types of motion, our robot can reproduce
                                         will face new tasks and challenges over time, requiring                    all the trajectories it has learned in the past (Fig. 1). To the
                                         capabilities that cannot be fully anticipated at the beginning.            best of our knowledge, this is the first continual learning
                                         These robots need to learn continually, which implies that they            approach for learning from kinesthetic demonstrations.
                                         should be able to acquire new capabilities without forgetting
                                                                                                                       More specifically, we show that a single Hypernetwork [5],
                                         the previously learned ones. Furthermore, a continual learning
                                                                                                                    that generates the parameters of a Neural Ordinary Differential
                                         robot should be able to do this without the need to store and
                                                                                                                    Equation (NODE) solver [13], remembers a long sequence
                                         retrain on the training data of all the previously learned skills.
                                                                                                                    of motion skills as well as when learning each task with a
                                            Continual learning can be effective in expanding a robot’s
                                                                                                                    separate NODE. The hypernetwork grows by a negligible
                                         repertoire of skills and in increasing the ease of use for non-
                                                                                                                    amount for each new task, making it suitable for potential
                                         expert human users. However, apart from a few approaches
                                                                                                                    deployment on resource-constrained, non-networked robotic
                                         for robotics [1], [2], the current continual learning research
                                                                                                                    platforms. We also demonstrate the effectiveness of chunked
                                         mostly focuses on vision-based tasks such as incrementally
                                                                                                                    hypernetworks [5] which are even smaller in size than the
                                         learning classification of new image categories [3]–[5].
                                                                                                                    NODEs they generate. Our results show how using the time
                                            Continually acquiring perceptive skills is important for a
                                                                                                                    index as an additional, direct input to a NODE increases its
                                         robot that interacts with its environment, but equally important
                                                                                                                    prediction accuracy for complex trajectories. We evaluate our
                                         is the ability to incrementally learn new movement skills.
                                                                                                                    approach on the popular LASA trajectory learning benchmark
                                         Learning from demonstration [6] is a popular and tangible way
                                                                                                                    [8]. We also introduce a new dataset, named HelloWorld,
                                         to impart motion skills to robots, for instance via kinesthetic
                                                                                                                    which consists of two-dimensional demonstrations collected
                                         teaching, where a human user teaches new skills by guiding
                                                                                                                    with a Franka Emika Panda robot. It serves as an additional
                                         the robot. A recent trend in learning from demonstration is
                                                                                                                    benchmark to evaluate our approach, both quantitatively
                                         to encode observations into a vector field [7]–[12]. These
                                                                                                                    and qualitatively on a real robot. Finally, we propose two
                                         methods, like many other works in the field, focus on learning
                                                                                                                    new easily-computable metrics which, together with existing
   1 Department of Computer Science, University of Innsbruck,              ones [14], gauge continual learning performance.
                                         Technikerstrasse 21a, 6020 Innsbruck, Austria. {name.surname}@uibk.ac.at      To summarize, our contributions in this paper are threefold:
                                            2 Digital Science Center (DiSC), University of Innsbruck, Austria.
                                            3 Department of Industrial Engineering, University of Trento, Italy.       • We propose an approach for learning from demonstration
                                            ∗ Corresponding author.                                                       with hypernetworks and NODEs for continually learning
new tasks without reusing training data of previous tasks.   learning from demonstration is a mature research field, most
  •   We release a new dataset containing 7 tasks collected        methods assume that different tasks are encoded in different
      with a real robot using kinesthetic teaching.                representations, i.e., one has to fit a new model for each
  •   We propose two new continual learning metrics.               task the robot has to execute. In this paper, we take the
                                                                   continual learning perspective on learning by demonstration
                     II. RELATED WORK                              and propose an approach capable of continuously learning
A. Continual Learning                                              new tasks without accessing the training data from past tasks.
   Popular strategies for continual learning include replaying                          III. BACKGROUND
data from past tasks or regularizing trainable parameters
to avoid catastrophic forgetting [15]. Replay-based methods           In this paper, we utilize Neural Ordinary Differential
cache samples of real data from past tasks [16], or use            Equation (NODE) solvers [13] for learning trajectories and
generative models to create pseudo-samples of past data            different state-of-the-art continual learning approaches [4],
[3], which are interleaved with the current task’s data            [5], [17] to alleviate catastrophic forgetting [15].
during training. Regularization-based methods [4], [17] add        A. Trajectory Learning
a regularization term to the learning objective to minimize
                                                                   Neural Ordinary Differential Equation solver: Consider a
changes to parameters important for solving previous tasks.        set of N observed trajectories $D = \{y_{0:T-1}^{(0)}, \ldots, y_{0:T-1}^{(N-1)}\}$,
   Relatively few existing approaches address continual learning   where each trajectory $y_{0:T-1}^{(i)}$ is a sequence of T observations
for robotics. Gao et al. [1] present an approach for               $y_t^{(i)} \in \mathbb{R}^d$. Each observation $y_t^{(i)}$ is a perturbation of an
continual imitation learning that relies on deep generative        unknown true state $x_t^{(i)}$ generated by an unknown underlying
replay [3] and action-conditioned video prediction to generate     vector field $f_{\text{true}}$ [21]:
state and action trajectories of past tasks. This pseudo-data is
interleaved with demonstrations of the current task to train a     $$x_t = x_0 + \int_0^t f_{\text{true}}(x_\tau)\, d\tau, \qquad (1)$$
policy network that controls the robot’s actions. The authors
note that the generation of high-quality video frames can be
problematic for a long sequence of tasks.                          where x0 is the true starting state of the trajectory. The goal
   Xie and Finn’s [18] approach for lifelong robotic rein-         of a Neural Ordinary Differential Equation (NODE) solver
forcement learning seeks to improve the forward transfer           [13] is to learn a neural network fθ parameterized by θ that
performance while learning a new current task by pre-training      approximates the true underlying dynamics of the observed
on the entire experience collected from all previous tasks.        system such that fθ ≈ ftrue . As we do not have access to
The problem of catastrophic forgetting is not considered.          ftrue but only to the noisy observed trajectories, we compute
   Our approach is similar to Huang et al. [2], who also           the loss L based on the difference of the forward simulated
utilize hypernetworks for continually training a robot. In         states of the NODE ŷt and the observations yt :
their work, a task-conditioned hypernetwork generates the
parameters of the dynamics model for reinforcement learning        $$L = \frac{1}{2}\sum_t \| y_t - \hat{y}_t \|_2^2 \quad \text{where } \hat{y}_t = \hat{y}_0 + \int_0^t f_\theta(\hat{y}_\tau)\, d\tau \qquad (2)$$
tasks such as opening doors or pushing blocks. In contrast, we
use hypernetworks for generating parameters for a trajectory
learning NODE in a setup for learning from demonstration.          B. Continual Learning with Regularization
We follow a supervised approach and do not need to rely            Synaptic Intelligence: Synaptic Intelligence (SI) [17] is a
on robot simulators. Compared to [2], we evaluate on much          regularization-based continual learning approach. Each neural
longer sequences of tasks and also investigate the effectiveness   network parameter is assigned an importance measure based
of chunked hypernetworks [5]. In addition, we qualitatively        on its contribution to the change in the loss. The loss for the
evaluate our approach on a physical robot.                         mth task is defined as:
B. Trajectory Learning from Demonstration                          $$\tilde{L}_m = L_m + c \sum_k \Omega_k^m (\theta_k^* - \theta_k)^2, \qquad (3)$$
   Learning from demonstration enables users without expertise
in robotics to train robots [6]. Approaches in the field           where c is the regularization constant which trades off between
can be categorized into two groups: Probabilistic approaches       learning a new task and remembering previously learned tasks,
[7]–[9] use generative models to fit a distribution from           $\theta_k^*$ denotes the value of the kth parameter before starting to
the training data; Non-probabilistic approaches [10]–[12]          learn the mth task, and $\theta_k$ is the current value of the kth
exploit function approximators like neural networks to fit         parameter. The per-parameter regularization strength $\Omega_k^m$ [17]
the training data. In both groups, training data can be used       is given by
to learn a static mapping (time input → desired position)                                     X        ωkl
or a dynamic mapping (input position → desired velocity).                             Ωmk =          l
                                                                                                               ,               (4)
                                                                                              l
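The quadratic penalty in (3) can be illustrated with a minimal sketch. It assumes the importance values $\Omega_k^m$ are already available as an array; how SI or MAS computes them is not reproduced here. The function name `regularized_loss` and all numbers are hypothetical and chosen only for illustration.

```python
import numpy as np

def regularized_loss(task_loss, theta, theta_star, omega, c):
    """Eq. (3): task loss plus an importance-weighted quadratic penalty.

    theta      : current parameter values (theta_k)
    theta_star : parameter values before learning the current task (theta_k^*)
    omega      : per-parameter importance (Omega_k^m), assumed to be given
    c          : regularization constant
    """
    theta = np.asarray(theta, dtype=float)
    theta_star = np.asarray(theta_star, dtype=float)
    omega = np.asarray(omega, dtype=float)
    penalty = float(np.sum(omega * (theta_star - theta) ** 2))
    return task_loss + c * penalty

# Hypothetical values: the first parameter is important, the last is not.
theta_star = np.array([1.0, -2.0, 0.5])
theta = np.array([1.5, -2.0, 0.0])
omega = np.array([4.0, 1.0, 0.0])
loss = regularized_loss(0.2, theta, theta_star, omega, c=0.1)
```

In this example only the first parameter contributes to the penalty: the second has not moved, and the third has zero importance, so it is free to change.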
Memory Aware Synapses: Memory Aware Synapses (MAS)                         and all the chunk embedding vectors are combined in a batch
[4] is also a regularization-based continual learning approach.            and fed into the hypernetwork to produce the target network
The loss for the mth task for MAS has the same form as SI                  parameters for a task in one forward pass [5].
(3). MAS differs from SI in the way $\Omega_k^m$ is computed: the                                     IV. METHODS
importance of a trainable parameter depends on the gradient
of the squared L2 norm of the network’s output:                               In our experiments, we employ two variants of NODEs,
                                                                           which are enhanced with different continual learning methods
                                                                           to enable the NODEs to learn continually (Fig. 2).
$$\Omega_k^m = \frac{1}{N}\sum_{n=1}^{N} \| g_k(x_n) \| = \frac{1}{N}\sum_{n=1}^{N} \left\| \frac{\partial L_2^2(f_\theta(x_n))}{\partial \theta_k} \right\| . \qquad (5)$$
                                                                           A. NODE Variants
The above summation is performed over N input data points.                    Along with a basic NODE fθ (ŷt ) (Sec. III-A), we use
                                                                           another variant where the NODE neural network is a function
Hypernetworks: A hypernetwork [5] is a meta-model that                     of both state and time, fθ (ŷt , t). This explicit time input
generates the parameters of a target network that solves the               results in the NODE learning a time-evolving vector field. We
task we are interested in. It uses a trainable task embedding              show empirically that this improves the accuracy of predicted
vector as an input to generate the network parameters                      trajectories, especially for those containing loops. We refer
for a task. Though the parameters h of the hypernetwork                    to this time-dependent NODE as NODET , and to the time-
fh are regularized, the parameters θm+1 produced by a                      independent one as NODEI .
hypernetwork for the (m+1)th task can be arbitrarily far away              B. Continual Learning NODE Models
in parameter space from the parameters θm produced for the                 Single NODE per task (SG): A simple way to learn M
previous mth task. Intuitively, this gives a hypernetwork more             tasks is to use a dedicated, newly-initialized NODE to learn a
freedom to find good solutions for both the mth and (m+1)th                task and to freeze it afterwards. At the end we get M NODEs,
tasks than other regularization-based approaches [4], [17].                from which we can pick one at prediction time to reproduce
   A two-step optimization process is used for training a                  the desired trajectory (Fig. 2(a)). In this setting, which acts as
hypernetwork [5]. First, a candidate change ∆h for the                     an upper-performance baseline, catastrophic forgetting [15]
hypernetwork parameters is computed which minimizes the                    is eliminated because the parameters of a NODE trained on
task-specific loss Lm for the (current) mth task w.r.t. θm :               a task are not affected when a new NODE is trained on the
         Lm = Lm (θm , ym ) where θm = fh (em , h)                  (6)    next task. However, this also means that we end up with M
                                                                           times the number of parameters of a single NODE.
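As an illustration of the NODE loss (2) and of the two variants NODE^I and NODE^T, the following sketch forward-simulates a small vector-field network and computes the squared-error loss. Several parts are assumptions made for this example and are not taken from the paper: the explicit Euler integrator, the two-layer tanh network, the use of the first observation as the start state, and the names `mlp_vector_field`, `init_params`, `simulate`, and `node_loss`. No training is performed.

```python
import numpy as np

def mlp_vector_field(params, y, t=None):
    """f_theta(y) for NODE^I, or f_theta(y, t) for NODE^T when t is given."""
    inp = y if t is None else np.append(y, t)
    W1, b1, W2, b2 = params
    return W2 @ np.tanh(W1 @ inp + b1) + b2

def init_params(rng, in_dim, hidden, out_dim):
    """Small random weights for a two-layer network (assumed architecture)."""
    return (0.1 * rng.standard_normal((hidden, in_dim)), np.zeros(hidden),
            0.1 * rng.standard_normal((out_dim, hidden)), np.zeros(out_dim))

def simulate(params, y0, ts, time_input=False):
    """Approximate y_hat_t = y_hat_0 + int_0^t f_theta dtau with Euler steps."""
    ys = [np.asarray(y0, dtype=float)]
    for t_prev, t_next in zip(ts[:-1], ts[1:]):
        t_arg = t_prev if time_input else None
        step = (t_next - t_prev) * mlp_vector_field(params, ys[-1], t_arg)
        ys.append(ys[-1] + step)
    return np.stack(ys)

def node_loss(params, y_obs, ts, time_input=False):
    """Eq. (2): half the summed squared error between observations and rollout."""
    y_hat = simulate(params, y_obs[0], ts, time_input)
    return 0.5 * float(np.sum((y_obs - y_hat) ** 2))

rng = np.random.default_rng(0)
ts = np.linspace(0.0, 1.0, 50)
# Hypothetical 2-D demonstration containing a loop (a circle).
y_obs = np.stack([np.cos(2 * np.pi * ts), np.sin(2 * np.pi * ts)], axis=1)

params_i = init_params(rng, in_dim=2, hidden=16, out_dim=2)  # NODE^I: state only
params_t = init_params(rng, in_dim=3, hidden=16, out_dim=2)  # NODE^T: state and time
loss_i = node_loss(params_i, y_obs, ts)
loss_t = node_loss(params_t, y_obs, ts, time_input=True)
traj = simulate(params_t, y_obs[0], ts, time_input=True)
```

The only structural difference between the two variants in this sketch is the extra time entry appended to the network input, which lets the vector field change over the course of a trajectory.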
Here em is the task embedding vector and ym is the data for                Finetuning (FT): A single NODE is sequentially finetuned on
the mth task. Next, ∆h is considered to be fixed and the actual            M tasks. To tell the NODE which task it should reproduce, i.e.
change for the hypernetwork parameters h is learned [5] by                 to make it task-conditioned, we use an additional input in the
minimizing the regularized loss L̃m w.r.t. θm = fh (em , h):               form of a trainable vector known as a task embedding vector.
$$\tilde{L}_m = L_m(\theta_m, y_m) + \frac{\beta}{m-1} \sum_{l=0}^{m-1} \left\| f_h(e_l, h^*) - f_h(e_l, h + \Delta h) \right\|^2 \qquad (7)$$
                                                                           This is similar to the approach followed for hypernetworks.
                                                                           After the mth task is learned, the trained task embedding
                                                                           vector em for that task is saved. To reproduce the trajectory
                                                                           for the mth task, we pick the corresponding task embedding
                                                                           vector and use it as the additional network input. The
   Here h∗ denotes the hypernetwork parameters before                      NODE parameters are finetuned to minimize the loss on the
learning the mth task, and β is a hyperparameter that controls             current task without any mechanism for avoiding catastrophic
the regularization strength. To calculate the second part of (7),          forgetting [15]. In this setting (Fig. 2(b)), we would expect
the stored task embedding vectors {e0 , e1 , . . . , el , . . . , em−1 }   the NODE to only remember the latest task and so this acts
for all tasks before the mth task are used. In each learning               as a lower-performance baseline.
step, the current task embedding vector em is also updated
to minimize the task-specific loss Lm [5]. Note that the                   Synaptic Intelligence (SI): To learn M tasks, the NODE
parameters of the target network θm for the mth task are                   parameters are regularized with the SI loss L̃m (3). The task-
simply the hypernetwork outputs and are not directly trainable.            specific part (Lm ) of L̃m corresponds to the NODE loss (2).
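The role of the second term of (7) can be sketched with a toy example. It assumes a linear hypernetwork and averages the penalty over the number of stored task embeddings; both choices are simplifications made for this illustration, and the names `hypernet`, `output_regularizer`, and `hypernet_regularized_loss` are hypothetical. The candidate change ∆h is passed in as given rather than computed from the task loss.

```python
import numpy as np

def hypernet(e, h):
    """Toy linear hypernetwork f_h(e, h): task embedding -> target parameters."""
    W, b = h
    return W @ e + b

def output_regularizer(stored_embeddings, h_star, h_candidate):
    """Mean squared change of the generated parameters for previous tasks."""
    diffs = [hypernet(e, h_star) - hypernet(e, h_candidate)
             for e in stored_embeddings]
    return sum(float(d @ d) for d in diffs) / len(diffs)

def hypernet_regularized_loss(task_loss, stored_embeddings, h_star, h, delta_h, beta):
    """Task loss plus beta times the output regularizer evaluated at h + delta_h."""
    h_candidate = tuple(p + dp for p, dp in zip(h, delta_h))
    return task_loss + beta * output_regularizer(stored_embeddings, h_star, h_candidate)

rng = np.random.default_rng(0)
# h_star: hypernetwork parameters before learning the current task.
h_star = (rng.standard_normal((6, 3)), rng.standard_normal(6))
# Stored embeddings of two previously learned tasks (hypothetical values).
stored = [rng.standard_normal(3) for _ in range(2)]
zero_step = (np.zeros((6, 3)), np.zeros(6))

# No change to the hypernetwork: the regularizer vanishes.
unchanged = hypernet_regularized_loss(0.5, stored, h_star, h_star, zero_step, beta=0.01)
# Shifted hypernetwork weights: outputs for old tasks move and are penalized.
shifted_h = (h_star[0] + 0.1, h_star[1])
changed = hypernet_regularized_loss(0.5, stored, h_star, shifted_h, zero_step, beta=0.01)
```

The penalty acts on the outputs of the hypernetwork for past task embeddings, not on the hypernetwork parameters themselves, so no data from past tasks is needed to evaluate it.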
   Chunked hypernetworks [5] produce the parameters of                     Similar to FT, we make the SI NODE task-conditioned using
the target network in segments known as chunks. A regular                  a trainable task embedding vector, as shown in Fig. 2(b).
hypernetwork has a very high-dimensional output, but a                     Memory Aware Synapses (MAS): For MAS [4], we also
chunked hypernetwork’s output is of a much smaller dimension,              follow the architecture in Fig. 2(b). The NODE parameters
leading to a lower hypernetwork parameter size. A                          are learned using (3) and we use (5) to compute $\Omega_k^m$. As
chunked hypernetwork requires additional inputs in the form                before, we apply (2) as the task-specific loss Lm . The MAS
of trainable chunk embedding vectors. While each task has its              NODE is also made task-conditioned with a trainable task
dedicated task embedding vector, chunk embedding vectors                   embedding vector.
are shared across tasks and are regularized in the same way                Hypernetworks (HN): We use a hypernetwork to generate
as the hypernetwork parameters. The task embedding vector                  the parameters of a NODE. We first compute the candidate
[Figure 2: four panels. (a) Single NODE/task (SG); (b) Finetuning (FT), Synaptic Intelligence (SI), Memory Aware Synapses (MAS); (c) Hypernetwork (HN); (d) Chunked Hypernetwork (CHN). In every panel, time steps and a start state are fed to an integrator around a NODE, which outputs a state trajectory. Panels (b)–(d) add a task embedding as input; in (c) and (d) it is fed to a hypernetwork, and (d) additionally uses chunk embeddings to produce the parameters in chunks.]

Fig. 2. Continual learning models used in our experiments. Non-regularized trainable parameters are saved after each task is learned.
Regularized trainable parameters are protected from catastrophic forgetting while learning a sequence of tasks. Other inputs and outputs are
not trainable. Given a start state and time steps, a NODE generates the state trajectory for the time steps. (a) SG: A single NODE learns
only a single task. This forms the upper-performance baseline. (b) Architecture for Finetuning (FT), Synaptic Intelligence (SI) and Memory Aware Synapses
(MAS). NODE parameters are regularized for MAS and SI, and finetuned for FT. (c) Hypernetworks (HN) produce all the NODE parameters using a
task embedding vector. (d) Chunked Hypernetworks (CHN) use chunk embedding vectors together with a task embedding vector to produce the NODE
parameters in segments called chunks. HN and CHN (highlighted in purple) are our proposed solutions for continual learning from demonstration.

change ∆h for the hypernetwork parameters by minimizing                           $D^{HW} = \{D_{0:6}\}$ consists of 7 tasks, each containing 8 slightly
the NODE loss (2). This acts as our task-specific loss Lm :                       varying demonstrations of a letter. Each demonstration is
                                                                                  a sequence of 1000 2-D points. After training on all the
$$L_m = L_m(\theta_m, y_m) = \frac{1}{2}\sum_t \| y_t^m - \hat{y}_t^m \|_2^2 \qquad (8)$$
                                                                                  tasks, the objective is to make the robot write the words
                                                                                  “hello world”. Our motivation for using this dataset is to
$$\text{where } \theta_m = f_h(e_m, h) \text{ and } \hat{y}_t^m = \hat{y}_0^m + \int_0^t f_{\theta_m}(\hat{y}_\tau^m)\, d\tau$$
                                                                                  test our approach on complicated trajectories (with loops)
                                                                                  and to show that it also works on kinesthetically recorded
                                                                                  demonstrations using a real robot. This dataset is available
Symbols have the same meaning as in equations (2) and
                                                                                  at https://github.com/sayantanauddy/clfd.
(6). We use (7) for training the hypernetwork in the second
optimization step. The structure of HN is shown in Fig. 2(c).                     B. Metrics
Chunked Hypernetworks (CHN): As shown in Fig. 2(d),                               Trajectory Metrics: We report the Swept Area error [19],
we use a chunked hypernetwork to generate the parameters                          Fréchet distance [9], and Dynamic Time Warping (DTW)
of a NODE. For this, equations (8) and (7) are employed as                        error [9], which measure how close the predicted trajectories
the loss functions in the 2-step optimization process.                            are to the ground-truth demonstrations.
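Two of the trajectory metrics named above can be sketched from their textbook definitions: the DTW error as a dynamic program over pairwise Euclidean distances, and the discrete Fréchet distance. These are minimal reference implementations written for illustration; the evaluation code used in the experiments may differ in normalization or in the exact variant, and the function names are hypothetical.

```python
import numpy as np

def pairwise_dist(p, q):
    """Euclidean distances between all points of trajectories p (n, d) and q (m, d)."""
    return np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)

def dtw_distance(p, q):
    """Dynamic Time Warping: minimum summed distance over monotone alignments."""
    d = pairwise_dist(p, q)
    n, m = d.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = d[i - 1, j - 1] + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return float(acc[n, m])

def discrete_frechet(p, q):
    """Discrete Frechet distance: minimum over alignments of the maximum distance."""
    d = pairwise_dist(p, q)
    n, m = d.shape
    ca = np.empty((n, m))
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                ca[i, j] = d[0, 0]
            elif i == 0:
                ca[i, j] = max(ca[0, j - 1], d[0, j])
            elif j == 0:
                ca[i, j] = max(ca[i - 1, 0], d[i, 0])
            else:
                ca[i, j] = max(min(ca[i - 1, j], ca[i, j - 1], ca[i - 1, j - 1]), d[i, j])
    return float(ca[-1, -1])

# Hypothetical example: a straight-line demonstration and a prediction
# that is offset by 0.1 along the second axis.
demo = np.stack([np.linspace(0.0, 1.0, 5), np.zeros(5)], axis=1)
pred = demo + np.array([0.0, 0.1])
dtw = dtw_distance(demo, pred)
frechet = discrete_frechet(demo, pred)
```

DTW accumulates the error along the whole alignment, whereas the Fréchet distance reports only the worst deviation, so the two metrics respond differently to a brief but large deviation from the demonstration.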
   We treat SG, FT, SI and MAS as comparison baselines,                           Continual Learning Metrics: We report Accuracy (ACC),
and propose HN and CHN (highlighted in Fig. 2) as solutions                       Remembering (REM), and Model Size Efficiency (MS) [14].
for continual learning from demonstration.                                        ACC is a measure of the average accuracy for the current
                                                                                  and past tasks. REM measures how well past tasks are
                           V. E XPERIMENTS
                                                                                  remembered. MS measures how much the size of a model
A. Datasets                                                                       grows compared to its size after learning the first task.
LASA Dataset: LASA [8] is a widely-used benchmark                                    Additionally, we introduce two new easy-to-compute con-
for evaluating motion generation algorithms. It contains 30                       tinual learning metrics: Time Efficiency (TE) and Final Model
patterns, each with 7 similar demonstrations. We refer to each                    Size (FS). TE measures the increase in training duration with
pattern Dm as a task. Of the 30 tasks, we use the first 26 tasks:                 the number of tasks, relative to the training time for the first
D LASA = {D0:25 }. We omit the last 4 tasks, each of which                        task. TE only needs the training times to be logged, and it
contains 2 or 3 dissimilar patterns merged together. Each                         reflects the extra effort needed in the training loop (e.g. due
demonstration of a task is a sequence of 1000 2-D points.                         to extra regularization steps) with an increase in the number
We arrange the 26 tasks alphabetically and train sequentially                     of tasks. For M tasks, TE is defined as
on each one without accessing the data of past tasks.                                                         (       M −1
                                                                                                                               )
                                                                                                                  T0 X 1
HelloWorld Dataset: We further evaluate our approach on                                            TE = min 1,                   ,             (9)
                                                                                                                  M i=0 Ti
a dataset of demonstrations we collected using the Franka
Emika Panda robot. The x and y coordinates of the robot’s                         where Ti is the time required by the model to learn task i.
end-effector were recorded while a human user guided it                              FS is a measure of the absolute parameter size, which
kinesthetically to write the 7 lower-case letters h,e,l,o,w,r,d                   contrasts with MS which only measures the parameter growth
one at a time on a horizontal surface. The HelloWorld dataset                     relative to the size after learning the first task. A model which
has a large number of parameters for the first task and adds a relatively small number of parameters for subsequent tasks will achieve a high score for MS, but will fare worse in terms of FS if models of other compared methods have a smaller absolute size. FS is defined as

$$\mathrm{FS} = 1 - \mathrm{Mem}_{\mathrm{norm}}(\theta^{M-1}) \qquad (10)$$

per task, same as SI, MAS and FT. Thus, CHN and especially HN perform similar to the upper baseline SG, while their parameter size is close to the lower baseline FT. Fig. 5 shows examples of trajectories predicted by SG, CHN, and HN for a selection of tasks after learning the last task.
   After training on all 26 tasks, we compute the errors of the trajectories predicted for tasks 0 to 25. We plot the overall
$\mathrm{Mem}_{\mathrm{norm}}(\theta^{M-1})$ is the parameter size after learning $M$ tasks, normalized by the size of the largest compared model among SG, FT, SI, MAS, HN and CHN. With these 5 metrics, we compute the overall continual learning metrics proposed in [14]: $\mathrm{CL}_{\mathrm{score}} = \sum_{c \in C} c$ and $\mathrm{CL}_{\mathrm{stability}} = 1 - \sum_{c \in C} \mathrm{stdev}(c)$, where $C = \{\mathrm{ACC}, \mathrm{REM}, \mathrm{MS}, \mathrm{TE}, \mathrm{FS}\}$. All the continual learning metrics lie in the range 0 (worst) to 1 (best).
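As a minimal sketch of how TE (9), FS (10) and the overall score could be computed from logged quantities, consider the Python snippet below. The function names and the training times are hypothetical and are not taken from the released code. Averaging the five metrics in `cl_score` is an assumption: it keeps the score in the range 0 to 1 and reproduces the CLscore reported for HN in Table I(a) to four decimals.

```python
def time_efficiency(train_times):
    """TE as in Eq. (9): min(1, (T_0 / M) * sum_i 1 / T_i)."""
    m = len(train_times)
    return min(1.0, train_times[0] / m * sum(1.0 / t for t in train_times))


def final_model_size(final_params, largest_final_params):
    """FS as in Eq. (10): 1 minus the normalized final parameter count."""
    return 1.0 - final_params / largest_final_params


def cl_score(metrics):
    """Assumption: the overall score is the average of the metric values."""
    values = list(metrics.values())
    return sum(values) / len(values)


# Hypothetical training times (seconds) for 4 sequential tasks.
te = time_efficiency([100.0, 125.0, 200.0, 250.0])

# Final parameter sizes quoted in the results: HN 4.3e6, SG (largest) 52.2e6.
fs_hn = final_model_size(4.3e6, 52.2e6)

# HN row of Table I(a).
hn_metrics = {"ACC": 0.8840, "REM": 0.9710, "MS": 0.9993, "TE": 0.5327, "FS": 0.9173}
score_hn = cl_score(hn_metrics)

print(f"TE={te:.4f}  FS(HN)={fs_hn:.4f}  CLscore(HN)={score_hn:.4f}")
```

A model whose training time stays constant across tasks obtains TE = 1, and the largest compared model obtains FS = 0 by construction.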
[Figure 3: log10 of the Swept Area, Frechet and DTW errors (y-axes) against Task ID (x-axis) for SG, FT, SI, MAS, CHN and HN.]
Fig. 3. Trajectory errors for the LASA dataset (lower is better). The x-axis shows the current task. After learning a task (using NODE$^T$), all current and previous tasks are evaluated. Plots for SG and HN overlap with each other. Lines show medians and shaded regions denote the lower and upper quartiles of the errors over 5 independent seeds.

C. Hyperparameters
   All models are trained for $15 \times 10^3$ and $40 \times 10^3$ iterations per task for $\mathcal{D}_{\mathrm{LASA}}$ and $\mathcal{D}_{\mathrm{HW}}$ respectively. In all experiments, we use fully connected networks and the Adam optimizer with a learning rate of $10^{-4}$. The NODEs for SG, FT, SI, MAS, and CHN have 3 hidden layers with 1000 units each. For HN, the target NODE has 3 hidden layers with 100 units each to keep its parameter size comparable to the other models. The smooth ELU activation [22] is used in all NODEs. Task embedding vectors have a dimension of 256 wherever they are used. For CHN, we use 256-dimensional chunk embedding vectors and 8192-dimensional output chunks. The hypernetworks in HN and CHN have 3 hidden layers with 200 ReLU units each. For regularization, we use: SI [17]: $c = 0.3$, $\xi = 0.3$; MAS [4]: $c = 0.1$; HN and CHN [5]: $\beta = 5 \times 10^{-3}$. These are based on values used in the aforementioned papers. In Sec. V-D we show that our proposed methods (HN and CHN) are robust to changes in regularization hyperparameters.
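The layer sizes above can be turned into rough parameter counts. The sketch below is a back-of-the-envelope check, not the authors' code. It assumes that a NODE$^T$ takes the 2-D position plus time as input (3 inputs) and outputs a 2-D velocity, and that the task-conditioned FT model receives the 256-dimensional task embedding concatenated to that input. Both are assumptions about details not spelled out here, and the helper name `mlp_param_count` is hypothetical.

```python
def mlp_param_count(layer_sizes):
    """Number of weights and biases in a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


NUM_TASKS = 26        # LASA tasks used in the experiments
EMBEDDING_DIM = 256   # task embedding dimension

# Assumption: NODE^T maps (x, y, t) to a 2-D velocity.
node_t_params = mlp_param_count([3, 1000, 1000, 1000, 2])

# SG keeps one separate NODE per task.
sg_total = NUM_TASKS * node_t_params

# Assumption: FT uses one NODE whose input is (x, y, t) plus the task
# embedding, and stores one embedding vector per task.
ft_total = (mlp_param_count([3 + EMBEDDING_DIM, 1000, 1000, 1000, 2])
            + NUM_TASKS * EMBEDDING_DIM)

print(node_t_params, sg_total, ft_total)
print(round(1 - ft_total / sg_total, 4))  # FS of FT relative to SG
```

Under these assumptions the SG total is about $52.2 \times 10^6$, in line with the combined SG size quoted in the results, and the FS value implied for FT is close to the one listed for FT in Table I.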
[Figure 4: parameter count ($\times 10^6$, y-axis) against Task ID (x-axis) for SG, FT, SI, MAS, CHN and HN.]
Fig. 4. Growth of parameter size with new tasks for the LASA dataset (using NODE$^T$). SG has a high rate of growth since it uses a separate network for each task. All other models grow by only 256 parameters for each new task. Plots for CHN and FT overlap with each other.

D. Results
LASA Dataset: We train each model on the 26 tasks of $\mathcal{D}_{\mathrm{LASA}}$ sequentially. Fig. 3 shows the median errors of the predictions for tasks $\mathcal{D}_0$–$\mathcal{D}_m$ after training on task $\mathcal{D}_m$ (using NODE$^T$) for $m = 0, 1, \ldots, 25$, e.g. the value for task 7 denotes the evaluation errors for all trajectories from tasks 0 to 7 after training on task 7. SG’s performance does not deteriorate with increasing tasks, leading to a nearly horizontal line (overlaps with HN). A drastic increase in the error is observed for FT as more tasks are learned, since FT optimizes its parameters only for the current task. After the first task, the errors for SI and MAS also increase steeply. Among the continual learning models, CHN and HN perform the best (red and green lines in Fig. 3). CHN’s forgetting increases with the number of tasks, as shown by the upward slope in its error plot. HN does not suffer much from catastrophic forgetting, and its error plot overlaps with that of the upper baseline SG.
   Although HN’s performance after learning 26 tasks is very similar to that of SG, its parameter size is $4.3 \times 10^6$ compared to SG’s combined size of $52.2 \times 10^6$ parameters, as shown in Fig. 4. CHN’s final parameter size is $1.9 \times 10^6$. Also, the parameter count for SG grows by $2.1 \times 10^6$ per task, whereas CHN and HN grow at a much smaller rate of 256 parameters

[Figure 5: predicted trajectories for Task IDs 1, 5, 9, 13, 17, 21 and 25; rows SG, CHN and HN; legend: Vector field, Initial value, Demonstration, Prediction.]
Fig. 5. Example of trajectories predicted by SG, CHN and HN using NODE$^T$ for a selection of LASA tasks after learning the last task.
[Figure 6: log10(DTW error) per method (SG, FT, SI, MAS, CHN, HN); panels (a) NODE$^T$ and (b) NODE$^I$.]
Fig. 6. DTW errors (lower is better) of trajectories predicted for all past tasks together after learning the last task of the LASA dataset. Results are obtained using 5 independent seeds.

Since there is no preexisting procedure for this, we proceed as follows. We set a threshold on the DTW error, such that predictions with an error less than the threshold are considered accurate. As each task has multiple ground truth demonstrations, we first compute the DTW error between all pairs of demonstrations for each task. We then find the maximum value from this list and multiply it by 3, to allow some room for error such that a predicted trajectory with the same general shape as its demonstration is considered accurate. Doing so, we arrive at a DTW threshold value of 2191 for $\mathcal{D}_{\mathrm{LASA}}$, and use it to evaluate the metrics in Tab. I.
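The threshold procedure described above can be sketched in a few lines of Python. This is an illustrative sketch only: it uses a textbook dynamic-programming DTW with Euclidean point costs, which may differ from the DTW error implementation of [9], and the toy demonstrations and function names are hypothetical.

```python
import math
from itertools import combinations


def dtw_distance(a, b):
    """Textbook DTW between two point sequences (Euclidean point cost)."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)          # row for "no points of a consumed"
    for p in a:
        curr = [inf] * (len(b) + 1)
        for j, q in enumerate(b, start=1):
            cost = math.dist(p, q)
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[len(b)]


def dtw_threshold(tasks, factor=3.0):
    """Largest DTW error between any two demonstrations of the same task,
    multiplied by `factor` (3 in the text)."""
    pairwise = [dtw_distance(d1, d2)
                for demos in tasks
                for d1, d2 in combinations(demos, 2)]
    return factor * max(pairwise)


# Toy data: 2 tasks with 2 demonstrations each (hypothetical values).
toy_tasks = [
    [[(0, 0), (1, 0), (2, 0)], [(0, 1), (1, 1), (2, 1)]],
    [[(0, 0), (0, 2)], [(0, 0), (0, 1), (0, 2)]],
]
threshold = dtw_threshold(toy_tasks)

# A prediction counts as accurate if its DTW error is below the threshold.
prediction = [(0, 0.5), (1, 0.5), (2, 0.5)]
is_accurate = dtw_distance(prediction, toy_tasks[0][0]) < threshold
print(threshold, is_accurate)
```

Taking the maximum over all tasks yields a single threshold per dataset, which matches the single values reported for each dataset in the text.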
   For both NODE variants, HN significantly outperforms all the compared models in terms of CLscore. For NODE$^T$, HN performs close to the upper baseline SG in terms of both ACC and REM. The additional regularization needed for training hypernetworks leads to a comparatively lower score for the time efficiency metric TE for CHN and HN. A very high parameter growth rate for SG results in poor scores for MS and FS. The extra time input in NODE$^T$ also leads to better overall performance for SG and HN.

(a) NODE$^T$
METHOD   ACC      REM      MS       TE       FS       CLscore  CLstability
SG       0.8742   1.0000   0.1482   0.8679   0.0000   0.5781   0.5832
FT       0.0594   0.1569   0.9986   0.9579   0.9565   0.6259   0.5759
SI       0.0427   0.3714   0.9997   1.0000   0.7830   0.6394   0.6236
MAS      0.0179   0.8716   0.9996   0.8312   0.8264   0.7094   0.6486
CHN      0.4766   0.7943   0.9983   0.5270   0.9636   0.7520   0.7838
HN       0.8840   0.9710   0.9993   0.5327   0.9173   0.8609   0.8311

(b) NODE$^I$
METHOD   ACC      REM      MS       TE       FS       CLscore  CLstability
SG       0.8107   1.0000   0.1482   0.8614   0.0000   0.5641   0.5925
FT       0.0590   0.2040   0.9986   0.8961   0.9565   0.6228   0.5949
SI       0.0452   0.3785   0.9997   0.9371   0.7830   0.6287   0.6368
MAS      0.0273   0.8215   0.9996   0.8769   0.8264   0.7103   0.6525
CHN      0.5385   0.8409   0.9983   0.5130   0.9636   0.7709   0.7930
HN       0.7595   0.9864   0.9993   0.6011   0.9176   0.8528   0.8480

Robustness to Hyperparameter Changes: To test the sensitivity of the methods to changes in the regularization hyperparameters, we create sets of 5 hyperparameters each for SI, MAS, HN and CHN by drawing independently and uniformly from the following ranges: (SI) $c \in [0.1, 0.5]$, $\xi \in [0.1, 0.5]$, (MAS) $c \in [0.1, 0.5]$, (CHN) $\beta \in [10^{-3}, 10^{-2}]$, (HN) $\beta \in [10^{-3}, 10^{-2}]$, resulting in 20 different configurations.
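The sampling step can be illustrated with the short Python sketch below. The ranges are the ones listed above. The dictionary layout, the function name `sample_configs` and the seed are hypothetical choices made for illustration, and the draw is uniform on a linear scale, as the text states.

```python
import random

# Ranges of the regularization hyperparameters, as listed in the text.
RANGES = {
    "SI": {"c": (0.1, 0.5), "xi": (0.1, 0.5)},
    "MAS": {"c": (0.1, 0.5)},
    "CHN": {"beta": (1e-3, 1e-2)},
    "HN": {"beta": (1e-3, 1e-2)},
}


def sample_configs(ranges, n_per_method=5, seed=0):
    """Draw `n_per_method` configurations per method, each hyperparameter
    sampled independently and uniformly from its range."""
    rng = random.Random(seed)
    return {
        method: [
            {name: rng.uniform(low, high) for name, (low, high) in params.items()}
            for _ in range(n_per_method)
        ]
        for method, params in ranges.items()
    }


configs = sample_configs(RANGES)
total = sum(len(method_configs) for method_configs in configs.values())
print(total)
```

With 5 draws for each of the 4 methods this gives the 20 configurations mentioned above. Fixing the seed only makes the illustration repeatable.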
TABLE I: Continual learning metrics for the LASA dataset (median over 5 seeds). Values range from 0 (worst) to 1 (best).

We then repeat the LASA experiment with NODE$^T$ for all these configurations. In terms of CLscore we observe that all configurations of HN outperform all configurations of CHN, which in turn are better than all configurations of MAS, followed by SI. This trend is reflected in the medians and inter-quartile ranges (IQR) of the overall continual learning
metrics CLscore and CLstability for each method (over its 5 configurations) shown in Tab. II. It can be seen that HN and CHN perform better than the other methods and the variability in terms of IQR is very small, thereby showing that they are robust to changes in the regularization hyperparameter $\beta$.

              CLscore              CLstability
METHOD   Median    IQR        Median    IQR
HN       0.8578    0.0011     0.8324    0.0050
CHN      0.7939    0.0098     0.8126    0.0022
MAS      0.7104    0.0019     0.6562    0.0062
SI       0.6047    0.0065     0.6403    0.0011
TABLE II: Robustness to changes in regularization hyperparameters for the LASA dataset (5 configurations for each method).

errors for all tasks in Fig. 6 and the errors for a selection of 7 tasks in Fig. 7. The similarity in the performance of HN with the upper baseline SG can be seen in both cases. Fig. 7 shows that all models except MAS can remember the last task (task 25), but SG and HN remember the other tasks as well. CHN performs worse than HN but much better than FT, SI, and MAS. Note that the trajectory metrics are plotted in the log10 scale to accommodate the high errors for FT, SI, and MAS in the same plot as SG, HN and CHN.
   To compute the continual learning metrics [14], each predicted trajectory needs to be marked as accurate or inaccurate based on its difference from the ground truth.

[Figure 7: log10(DTW error) for Task IDs 1, 5, 9, 13, 17, 21 and 25, shown for SG, FT, SI, MAS, CHN and HN.]
Fig. 7. DTW errors (lower is better) of the trajectories predicted for a selection of 7 out of 26 past tasks (shown individually) by the models using NODE$^T$ after being trained on the last task of the LASA dataset. Results are obtained using 5 independent seeds.

HelloWorld Dataset: For $\mathcal{D}_{\mathrm{HW}}$, which comprises 7 tasks, we perform the same experiments as for $\mathcal{D}_{\mathrm{LASA}}$. Fig. 8 shows the errors in the predicted trajectories for all past and current tasks, as new tasks are learned. The median errors for CHN and HN stay nearly unchanged and are similar to the upper baseline SG. As before, FT, SI, and MAS exhibit severe catastrophic forgetting. Due to fewer tasks in $\mathcal{D}_{\mathrm{HW}}$, CHN’s performance does not deteriorate even after learning all tasks.
   Fig. 9 shows examples of trajectories predicted by SG, CHN and HN for past tasks after being trained sequentially on all $\mathcal{D}_{\mathrm{HW}}$ tasks. All models exhibit superior performance when using the additional time input in NODE$^T$ (Fig. 9(a)), without which even SG is unable to learn trajectories with loops. This can be seen from the errors for the letters e, r and d in Fig. 9(b). This is also evident in Fig. 10 which shows the errors in the predictions for all past tasks together after all the tasks have been learned. Apart from FT, all methods have higher median errors when using NODE$^I$ (Fig. 10b) compared to NODE$^T$ (Fig. 10a). The prediction errors for each past task after learning all tasks are shown individually in Fig. 11. It can be seen that all the methods remember the last task, but only SG, CHN, and HN remember earlier tasks.
                                     2
                      log10

                                                                                                                            Using the same threshold computation approach we fol-
                                     1
                                                                                                                         lowed for D LASA , we compute a DTW threshold value of
                                                                                                                         1821 for D HW . With this, we compute the continual learning
                                     2                                                                                   metrics shown in Tab. III. The advantage of using NODEs
                                                                                                                         with a time input is clear from the higher values of ACC for
                                     1                                                                                   NODET compared to NODEI for all the methods. Overall,
     log10

                                                                                                                         CHN shows the best performance on account of its small
                                     0
                                                                                                                         size and also because its ACC score is comparable to HN
                                                                                                                         and SG. HN and CHN also achieve much higher scores for
                                     4
(DTW error)
   log10

                                     3
                                                                                                                          log10(DTW error)

                                                                                                                                                                                     log10(DTW error)
                                     2                                                                                                       6                                                          6
                                                                                                                                             5                                                          5
                                          h         e               l         o           w             r           d                        4                                                          4
                                                                             Task                                                            3                                                          3
                                         SG         FT                  SI        MAS           CHN             HN                           2                                                          2
                                                                                                                                             1                                                          1
                                                                                                                                                 SG   FT     SI   MAS CHN HN                                SG    FT    SI   MAS CHN HN
Fig. 8. Trajectory errors for the HelloWorld dataset (lower is better). The                                                                                 Method                                                     Method
x-axis shows the current task. After learning a task (using NODET ), all
current and previous tasks are evaluated. Lines show medians and shaded                                                                               (a) NODET                                                   (b) NODEI
regions denote the lower and upper quartiles of the errors over 5 independent
seeds. Plots for SG, HN and CHN are close to each other.                                                                 Fig. 10. DTW errors (lower is better) of trajectories predicted for all past
                                                                                                                         tasks together after learning the last task of the HelloWorld dataset. Results
                                          h         e           l            o           w          r           d        are obtained using 5 independent seeds.
                          SG

                                                                                                                                  METHOD                   ACC    REM       MS        TE                     FS        CLscore   CLstability
                                                                                                                                  SG                   1.0000     1.0000   0.3704   0.9431                  0.0000     0.6627      0.5924
                          CHN

                                                                                                                                  FT                   0.2500     0.0774   0.9997   0.9551                  0.8388     0.6242      0.6164
                                                                                                                                  SI                   0.2500     0.0357   0.9999   0.9519                  0.1945     0.4864      0.5939
                                                                                                                                  MAS                  0.3839     0.2024   0.9999   0.8622                  0.3556     0.5608      0.6884
                                                                                                                                  CHN                  0.9420     0.9702   0.9996   0.7791                  0.8652     0.9112      0.9202
                          HN

                                                                                                                                  HN                   0.9688     0.9821   0.9998   0.7603                  0.6930     0.8808      0.8720

                                                                                                                                                                            (a) NODET
METHOD   ACC      REM      MS       TE       FS       CLscore  CLstability
SG       0.7277   1.0000   0.3704   0.9364   0.0000   0.6069   0.6253
FT       0.1741   0.2262   0.9997   0.9276   0.8388   0.6333   0.6423
SI       0.1964   0.3214   0.9999   0.9363   0.1945   0.5297   0.6386
MAS      0.2277   0.4464   0.9999   0.8577   0.3556   0.5774   0.7014
CHN      0.7009   0.9643   0.9996   0.7642   0.8652   0.8588   0.8861
HN       0.7634   1.0000   0.9998   0.7455   0.6943   0.8406   0.8680

(b) NODEI

TABLE III. Continual learning metrics for the HelloWorld dataset (median over 5 seeds). Values range from 0 (worst) to 1 (best).

[Figure: predicted trajectories, rows SG, CHN, and HN. Legend: vector field, initial value, demonstration, prediction.]
Fig. 9. Example of trajectories predicted by SG, CHN, and HN for all
HelloWorld tasks after being trained on the last task.
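
The CLscore and CLstability columns of Table III can be checked against the other five columns. The sketch below assumes, following the equal-weight form of the metrics in [14], that CLscore is the mean of ACC, REM, MS, TE, and FS, and that CLstability is one minus their population standard deviation. These definitions are not stated in this section and are assumptions here; the function names are illustrative.

```python
from statistics import fmean, pstdev


def cl_score(metrics):
    """Assumed definition: unweighted mean of the five metrics."""
    return fmean(metrics)


def cl_stability(metrics):
    """Assumed definition: one minus the population standard deviation."""
    return 1.0 - pstdev(metrics)


# ACC, REM, MS, TE, FS copied from the SG and HN rows of Table III(b).
rows = {
    "SG": [0.7277, 1.0000, 0.3704, 0.9364, 0.0000],
    "HN": [0.7634, 1.0000, 0.9998, 0.7455, 0.6943],
}

for method, metrics in rows.items():
    print(method, round(cl_score(metrics), 4), round(cl_stability(metrics), 4))
```

Under these assumptions, the computed values for both rows agree with the tabulated CLscore and CLstability to within rounding, which suggests that the two summary columns are derived from the five preceding ones.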
[Plot: log10(DTW error) on the vertical axis (ticks 1 to 6) for each task h, e, l, o, w, r, d on the horizontal axis; methods: SG, FT, SI, MAS, CHN, HN.]

Fig. 11. DTW errors (lower is better) of the trajectories predicted for the 7 past tasks (shown individually) by the models using NODET after being
trained on the last task of the HelloWorld dataset. Results are obtained using 5 independent seeds.
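
Fig. 11 reports errors computed with dynamic time warping (DTW). As a minimal sketch of the standard DTW recurrence, assuming a Euclidean local cost and no path-length normalization (this section does not specify which DTW variant or normalization is used for the reported errors), the distance between a demonstrated and a predicted 2-D trajectory could be computed as follows. The function name `dtw_distance` and the toy trajectories are illustrative and are not taken from the paper.

```python
import math


def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW with Euclidean point-to-point cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a advances, b repeats its point
                cost[i][j - 1],      # b advances, a repeats its point
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Toy 2-D trajectories (hypothetical, for illustration only).
demo = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
warped = [(0.0, 0.0), (1.0, 0.0), (1.0, 0.0), (2.0, 0.0)]  # same path, different timing
shifted = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]             # same timing, offset path

print(dtw_distance(demo, warped))
print(dtw_distance(demo, shifted))
```

Because DTW aligns points before accumulating their distances, a prediction that traces the same path at a different speed incurs no error, whereas a spatially offset prediction does. This is why DTW suits comparing predicted and demonstrated trajectories of unequal length.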

REM than FT, SI and MAS.
   Finally, we qualitatively evaluate how the trajectories
predicted by HN can be reproduced with a physical robot.
For this, we use the same Franka Emika Panda robot that
was used for recording the demonstrations for $\mathcal{D}_{HW}$. The HN
model trained on the 7 tasks of $\mathcal{D}_{HW}$ is queried to produce
the letters h, e, l, l, o, w, o, r, l, d by using the appropriate task
embedding vectors in sequence. The trajectory of each letter
is scaled and translated by a constant amount and provided to
the robot, which then follows this path with its end-effector.
The z-coordinate and orientation of the end-effector are fixed.
Fig. 1 shows the letters written by the robot. A video of
the robot performing the HelloWorld tasks is available at
https://youtu.be/cTfVfYyyeXk.

                           VI. CONCLUSION

   In this paper, we presented the first work on continual
learning from kinesthetic demonstrations. We showed the
effectiveness of hypernetworks, which continually consolidate
the knowledge from a sequence of learned tasks into a single
network without retraining on any past tasks. Our results also
show that the relatively small chunked hypernetworks perform
on par with regular hypernetworks for a limited number of
tasks, but start forgetting as the number of tasks increases. In
the future, we will investigate how the remembering capacity
of chunked hypernetworks can be improved. Other aspects
of future work will include handling trajectories of more
than two dimensions, either in the robot’s task space or joint
space, and replacing NODEs with more advanced trajectory
learning approaches [9], [10] with stability guarantees.

                             REFERENCES

 [1] C. Gao, H. Gao, S. Guo, T. Zhang, and F. Chen, “CRIL: Continual
     robot imitation learning via generative and prediction model,” in 2021
     IEEE/RSJ International Conference on Intelligent Robots and Systems
     (IROS), 2021, pp. 6747–6754.
 [2] Y. Huang, K. Xie, H. Bharadhwaj, and F. Shkurti, “Continual model-
     based reinforcement learning with hypernetworks,” in 2021 IEEE
     International Conference on Robotics and Automation (ICRA). IEEE,
     2021, pp. 799–805.
 [3] H. Shin, J. K. Lee, J. Kim, and J. Kim, “Continual learning with deep
     generative replay,” in Proceedings of the 31st International Conference
     on Neural Information Processing Systems, 2017, pp. 2994–3003.
 [4] R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, and T. Tuytelaars,
     “Memory aware synapses: Learning what (not) to forget,” in Proceedings
     of the European Conference on Computer Vision (ECCV), 2018, pp.
     139–154.
 [5] J. von Oswald, C. Henning, J. Sacramento, and B. F. Grewe, “Continual
     learning with hypernetworks,” in International Conference on Learning
     Representations (ICLR), 2019.
 [6] A. Billard, S. Calinon, and R. Dillmann, “Learning from humans,”
     Springer Handbook of Robotics, 2nd Ed., 2016.
 [7] M. Hersch, F. Guenter, S. Calinon, and A. Billard, “Dynamical system
     modulation for robot learning via kinesthetic demonstrations,” IEEE
     Transactions on Robotics, vol. 24, no. 6, pp. 1463–1467, 2008.
 [8] S. M. Khansari-Zadeh and A. Billard, “Learning stable nonlinear
     dynamical systems with Gaussian mixture models,” IEEE Transactions
     on Robotics, vol. 27, no. 5, pp. 943–957, 2011.
 [9] J. Urain, M. Ginesi, D. Tateo, and J. Peters, “ImitationFlow: Learning
     deep stable stochastic dynamic systems by normalizing flows,” in 2020
     IEEE/RSJ International Conference on Intelligent Robots and Systems
     (IROS). IEEE, 2020, pp. 5231–5237.
[10] J. Z. Kolter and G. Manek, “Learning stable deep dynamics models,”
     Advances in Neural Information Processing Systems, vol. 32, pp. 11 128–
     11 136, 2019.
[11] A. J. Ijspeert, J. Nakanishi, and S. Schaal, “Movement imitation with
     nonlinear dynamical systems in humanoid robots,” in International
     Conference on Robotics and Automation (ICRA), 2002, pp. 1398–1403.
[12] M. Saveriano, F. J. Abu-Dakka, A. Kramberger, and L. Peternel,
     “Dynamic movement primitives in robotics: A tutorial survey,” arXiv
     preprint arXiv:2102.03861, 2021.
[13] R. T. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud, “Neural
     ordinary differential equations,” in Proceedings of the 32nd International
     Conference on Neural Information Processing Systems, 2018,
     pp. 6572–6583.
[14] N. Díaz-Rodríguez, V. Lomonaco, D. Filliat, and D. Maltoni, “Don’t
     forget, there is more than forgetting: new metrics for continual learning,”
     arXiv preprint arXiv:1810.13166, 2018.
[15] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter, “Continual
     lifelong learning with neural networks: A review,” Neural Networks,
     vol. 113, pp. 54–71, 2019.
[16] S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert, “iCaRL:
     Incremental classifier and representation learning,” in Proceedings of
     the IEEE Conference on Computer Vision and Pattern Recognition,
     2017, pp. 2001–2010.
[17] F. Zenke, B. Poole, and S. Ganguli, “Continual learning through synaptic
     intelligence,” in International Conference on Machine Learning.
     PMLR, 2017, pp. 3987–3995.
[18] A. Xie and C. Finn, “Lifelong robotic reinforcement learning by
     retaining experiences,” arXiv preprint arXiv:2109.09180, 2021.
[19] S. M. Khansari-Zadeh and A. Billard, “Learning control Lyapunov
     function to ensure stability of dynamical system-based robot reaching
     motions,” Robotics and Autonomous Systems, vol. 62, no. 6, pp. 752–
     765, 2014.
[20] M. Saveriano, “An energy-based approach to ensure the stability of
     learned dynamical systems,” in IEEE International Conference on
     Robotics and Automation (ICRA), 2020, pp. 4407–4413.
[21] M. Heinonen, C. Yildiz, H. Mannerström, J. Intosalmi, and
     H. Lähdesmäki, “Learning unknown ODE models with Gaussian
     processes,” in International Conference on Machine Learning. PMLR,
     2018, pp. 1959–1968.
[22] D. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep
     network learning by exponential linear units (ELUs),” in 4th International
     Conference on Learning Representations, ICLR, 2016.