Unsupervised Continual Learning via Self-Adaptive Deep Clustering Approach


Mahardhika Pratama¹*, Andri Ashfahani¹, Edwin Lughofer²
¹SCSE, Nanyang Technological University, Singapore
²DKBMS, Johannes Kepler University, Linz, Austria
mpratama@ntu.edu.sg, andriash001@e.ntu.edu.sg, edwin.lughofer@jku.at
*Contact Author
arXiv:2106.14563v1 [cs.LG] 28 Jun 2021

Abstract

Unsupervised continual learning remains a relatively uncharted territory in the existing literature because the vast majority of existing works call for unlimited access to ground truth, incurring expensive labelling cost. Another issue lies in the problem of task boundaries and task IDs, which must be known for model updates or model predictions, hindering feasibility for real-time deployment. Knowledge Retention in Self-Adaptive Deep Continual Learner (KIERA) is proposed in this paper. KIERA is developed from the notion of a flexible deep clustering approach possessing an elastic network structure to cope with changing environments in a timely manner. The centroid-based experience replay is put forward to overcome the catastrophic forgetting problem. KIERA does not exploit any labelled samples for model updates while featuring a task-agnostic merit. The advantage of KIERA has been numerically validated on popular continual learning problems where it shows highly competitive performance compared to state-of-the-art approaches. Our implementation is available at https://github.com/ContinualAL/KIERA.

1 Introduction

Continual learning is a machine learning field studying a learning model which can handle a sequence of tasks T_1, T_2, T_3, ..., T_K [Parisi et al., 2018] where K labels the number of tasks. Unlike the conventional learning problem with the i.i.d. assumption, every task T_k suffers from non-stationary conditions where there exist drifting data distributions P(X, Y)_k ≠ P(X, Y)_{k+1}, the emergence of new classes, or a combination of both conditions. The goal of general continual learning is to build a model f(.) capable of identifying and adapting to changes without suffering from the catastrophic forgetting problem. This is done on the fly in the absence of samples from previous tasks T_{k-1}.

There has been growing research interest in the continual learning domain in which the main goal is to resolve the issue of catastrophic forgetting, thereby actualizing the knowledge retention property of a model. These works can be divided into three categories: the memory-based approach, the regularization-based approach and the structure-based approach. The memory-based approach is designed to handle the continual learning problem with the use of an external memory storing past samples. Past samples are replayed along with new samples of the current task such that the catastrophic forgetting problem can be addressed. The regularization-based approach is put forward with the use of an additional regularization term in the cost function aiming at achieving tradeoff points between the old task and the new task. The structure-based approach is developed under the roof of dynamic structure. It adds new network components to cope with new tasks while isolating old parameters from new tasks to avoid the catastrophic forgetting problem.

The area of continual learning still deserves in-depth study because the vast majority of existing approaches are fully-supervised algorithms, limiting their applications under the scarcity of labelled samples. Existing approaches also rely on a very strong assumption of task boundaries and task IDs. This assumption is impractical because the point of change is often unknown and the presence of task IDs for inference implies extra domain knowledge.

An unsupervised continual learning algorithm, namely Knowledge Retention in Self-Adaptive Deep Continual Learner (KIERA), is proposed in this paper. KIERA does not utilize any labelled samples for model updates. That is, labelled samples are only offered to establish clusters-to-classes associations required to perform classification tasks. KIERA is constructed from the idea of a self-evolving deep clustering network making use of a different-depth network structure. Every hidden layer generates its own set of clusters and produces its own local outputs. This strategy enables an independent self-evolving clustering mechanism to be performed in different levels of the deep embedding space. KIERA features an elastic network structure in which its hidden nodes, layers and clusters are self-generated from data streams with respect to varying data distributions.

The parameter learning mechanism of KIERA is devised to achieve a clustering-friendly latent space via simultaneous feature learning and clustering. It puts forward a reconstruction loss and a clustering loss minimized simultaneously in the framework of a stacked autoencoder (SAE) to avoid trivial solutions.
The clustering loss is formed as the K-Means loss [Yang et al., 2017], inducing a clustering-friendly latent space where data samples are forced to be adjacent to the winning cluster, i.e., the cluster closest to a data sample. The centroid-based experience replay strategy is put forward to address the catastrophic interference problem. That is, the sample selection mechanism is carried out with respect to focal-points, i.e., data samples triggering the addition of clusters. The selective sampling mechanism is integrated to assure a bounded replay buffer. The advantage of KIERA is confirmed with a rigorous numerical study on popular problems where it outperforms prominent algorithms.

This paper presents four major contributions: 1) KIERA is proposed to handle unsupervised continual learning problems; 2) the flexible deep clustering approach is put forward in which hidden layers, nodes and clusters are self-evolved on the fly; 3) the different-depth network structure is designed, making possible an independent clustering module carried out in different levels of the deep embedding space; 4) the centroid-based experience replay method is put forward to address the catastrophic interference problem.

2 Related Works

Memory-based Approach: iCaRL [Rebuffi et al., 2017] exemplifies the memory-based approach in continual learning using an exemplar set for each class. The classification decision is calculated from the similarity degree of a new sample to the exemplar set. Another example of the memory-based approach is Gradient Episodic Memory (GEM) [Lopez-Paz and Ranzato, 2017] where past samples are stored to calculate the forgetting case. A forgetting case can be examined from the angle between the gradient vector and the proposed update. This work is extended in [Chaudhry et al., 2019] and called Averaged GEM (AGEM). Its contribution lies in the modification of the loss function to expedite the model updates. Deep Generative Replay (DGR) [Shin et al., 2017] does not utilize an external memory to address the catastrophic forgetting problem but rather makes use of a generative adversarial network (GAN) to create representations of old tasks. Catastrophic forgetting is addressed by generating pseudo samples for the experience replay mechanism. Our work is framed under this category because it can be executed without the presence of task IDs or task boundaries.

Regularization-based Approach: Elastic Weight Consolidation (EWC) [Kirkpatrick et al., 2016] is a prominent regularization-based approach using an L2-like regularization constraining the movement of important network parameters. It constructs a Fisher Information Matrix (FIM) to signify the importance of network parameters. Synaptic Intelligence (SI) [Zenke et al., 2017] offers an alternative approach where it utilizes accumulated gradients to quantify the importance of network parameters instead of the FIM, which incurs prohibitive computational burdens. Memory Aware Synapses (MAS) [Aljundi et al., 2018] is an EWC-like approach with modification of the parameter importance matrix using an unsupervised and online criterion. EWC has been extended in [Schwarz et al., 2018], named onlineEWC, where a Laplace approximation is put forward to construct the parameter importance matrix. Learning Without Forgetting (LWF) is proposed in [Li and Hoiem, 2016] where it formulates a joint optimization problem between the current loss function and the knowledge distillation loss [Hinton et al., 2015]. A neuron-based regularization notion is presented in [Paik et al., 2020] where the regularization is performed by adjusting the learning rates of stochastic gradient descent. In [Mao et al., 2021], the inter-task synaptic mapping is proposed. Notwithstanding that the regularization-based approach is computationally efficient, this approach requires the task boundaries and task IDs to be known.

Structure-based Approach: catastrophic forgetting is overcome in progressive neural networks (PNNs) [Rusu et al., 2016] by introduction of a new column for every new task while freezing old network parameters. This approach is, however, not scalable for large-scale problems since the structural complexity grows linearly with the number of tasks. This drawback is addressed in [Lee et al., 2018] with the use of a selective retraining approach. Learn-to-grow is proposed in [Li et al., 2019] where it utilizes neural architecture search to find the best network structure for a given task. The structure-based approach imposes expensive complexities.
3 Problem Formulation

The continual learning problem aims to build a predictive model f(.) handling streaming tasks T_1, T_2, T_3, ..., T_K where K denotes the number of tasks. Unlike conventional problems assuming the i.i.d. condition, each task T_k is influenced by non-stationary conditions. There exist two types of changes in continual learning: the first is known as the problem of changing data distributions while the second is understood as the problem of changing target classes. The problem of changing data distributions, or the concept drift problem, is defined as a change of the joint probability distribution P(X, Y)_k ≠ P(X, Y)_{k+1}. The problem of changing target classes refers to different target classes in each task. Suppose that L_k and L_k' stand for the label sets of the k-th task and the k'-th task; then L_k ∩ L_k' = ∅ for all k ≠ k'. This problem is also known as the incremental class problem. Each task normally consists of paired data samples T_k = {x_n, y_n}_{n=1}^{N_k} where N_k denotes the size of the k-th task, x_n ∈ ℜ^{u×u} is the n-th input image and y_n ∈ {l_1, l_2, ..., l_m} is the target vector formed as a one-hot vector. This assumption, however, does not apply in unsupervised continual learning dealing with the scarcity of labelled samples. Access to true class labels is only provided for the initial batch of each task to associate the cluster centroids with classes, B_0^k = {x_n, y_n}_{n=1}^{N_0^k}, while the remainder of data samples arrive in the absence of labelled samples, B_j^k = {x_n}_{n=1}^{N_j^k}. Note that N_0^k + N_1^k + ... + N_J^k = N_k.
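To make this batch protocol concrete, the following is a minimal sketch (not taken from the released implementation) of how one task could be split into a small labelled batch B_0^k, used only for the clusters-to-classes association, and label-free batches B_1^k, ..., B_J^k. The helper name make_batches and the toy data are illustrative.

```python
import numpy as np

def make_batches(x, y, n_labelled, batch_size, rng):
    """Split one task T_k into an initial labelled batch B_0^k and
    unlabelled batches B_1^k, ..., B_J^k (labels are discarded)."""
    order = rng.permutation(len(x))
    x, y = x[order], y[order]
    labelled = (x[:n_labelled], y[:n_labelled])      # B_0^k: only for cluster-to-class association
    unlabelled = [x[i:i + batch_size]                # B_j^k: arrive without labels
                  for i in range(n_labelled, len(x), batch_size)]
    return labelled, unlabelled

# Toy stream of K = 2 tasks with u x u = 28 x 28 images.
rng = np.random.default_rng(0)
tasks = [(rng.random((500, 28, 28)), rng.integers(0, 10, size=500)) for _ in range(2)]
stream = [make_batches(x, y, n_labelled=100, batch_size=64, rng=rng) for x, y in tasks]
```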
4 Learning Policy of KIERA

4.1 Network Structure
KIERA is structured as a SAE comprising two components: a feature extraction layer and a fully-connected layer. The feature extraction layer is built upon a convolutional layer or a nonlinear layer. It maps an input image x into a latent input vector Z ∈ ℜ^{u'}.

The hidden-layer growing mechanism relies on a drift detection method monitoring the dynamic of latent input features Z extracted by the feature extraction layer. Nevertheless, the characteristics of the latent input features are insensitive to changing data distributions. We apply the drift detection method to the last two consecutive data batches A = [Z_{J-1}^k, Z_J^k] here. The drift detection mechanism first finds a cutting point cut signifying the increase of population means, Â + ε_A ≤ B̂ + ε_B, where Â, B̂ are the statistics of the two data matrices A ∈ ℜ^{2N_j^k × u'} and B.
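Much of the drift detection procedure is not recoverable from the extracted text. Purely as an illustration of the quoted condition Â + ε_A ≤ B̂ + ε_B, the sketch below searches for a cutting point using a Hoeffding-style bound; the form of the bound, the use of per-sample means as the monitored statistic, and all names are assumptions rather than the paper's exact procedure.

```python
import numpy as np

def hoeffding_bound(n, alpha, value_range=1.0):
    # Hoeffding error bound for the mean of n values at significance level alpha (assumed form).
    return value_range * np.sqrt(np.log(1.0 / alpha) / (2.0 * n))

def find_cut_point(z, alpha=0.001):
    """Return the first index 'cut' where the prefix statistic A_hat plus its bound
    drops to (or below) the overall statistic B_hat plus its bound, else None."""
    scores = z.reshape(len(z), -1).mean(axis=1)   # one scalar statistic per latent sample
    b_hat = scores.mean()
    eps_b = hoeffding_bound(len(scores), alpha)
    for cut in range(10, len(scores) - 10):       # ignore very small prefixes/suffixes
        a_hat = scores[:cut].mean()
        eps_a = hoeffding_bound(cut, alpha)
        if a_hat + eps_a <= b_hat + eps_b:
            return cut                            # candidate switching point in A = [Z_{J-1}^k, Z_J^k]
    return None
```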
A new cluster is grown when the following condition is satisfied:

    min_{s=1,...,Cls_l} D(C_{s,l}, h_l) > µ_{D_{s,l}} + k_3 σ_{D_{s,l}}    (7)

where µ_{D_{s,l}}, σ_{D_{s,l}} are the mean and standard deviation of the distance measure D(C_{s,l}, h_l) while k_3 = 2 exp(−||h_l − C_{s,l}||) + 2. (7) is perceived as the SPC method with a dynamic confidence degree assuring that a cluster is added if it is far from the coverage of existing clusters. (7) hints at the presence of a new concept unseen in previous observations. Hence, a data sample h_l can be regarded as a focal point. A new cluster is integrated by setting the current sample as the centroid of a new cluster, C_{(s+1),l} = h_l, while its cardinality is set as Car_{(s+1),l} = 1. Note that the clustering process occurs in different levels of the deep embedding space.
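A minimal sketch of the growing rule (7) for a single layer is given below. It assumes a Euclidean distance for D and simple running statistics for µ_D and σ_D, since these details are not fully specified in the extracted text; class and attribute names are illustrative.

```python
import numpy as np

class LayerClusters:
    """Self-evolving clusters of one hidden layer, illustrating condition (7)."""
    def __init__(self):
        self.centroids, self.cards = [], []              # C_{s,l} and Car_{s,l}
        self.d_mean, self.d_m2, self.d_n = 0.0, 0.0, 0   # running stats of D(C_{s,l}, h_l)

    def _track_distance(self, d):
        # Welford update of the running mean/variance of observed distances.
        self.d_n += 1
        delta = d - self.d_mean
        self.d_mean += delta / self.d_n
        self.d_m2 += delta * (d - self.d_mean)

    def observe(self, h):
        if not self.centroids:                           # the first sample seeds the first cluster
            self.centroids.append(h.copy()); self.cards.append(1)
            return 0
        dists = [float(np.linalg.norm(h - c)) for c in self.centroids]
        win = int(np.argmin(dists))
        for d in dists:
            self._track_distance(d)
        sigma = np.sqrt(self.d_m2 / max(self.d_n - 1, 1))
        k3 = 2.0 * np.exp(-dists[win]) + 2.0             # dynamic confidence degree
        if dists[win] > self.d_mean + k3 * sigma:        # condition (7): h is a focal point
            self.centroids.append(h.copy()); self.cards.append(1)
            return len(self.centroids) - 1
        self.cards[win] += 1                             # otherwise only the winner's cardinality grows
        return win
```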
4.3 Parameter Learning Mechanism
Network Parameters: the parameter learning strategy of the network parameters performs simultaneous feature learning and clustering in which the main goal is to establish clustering-friendly latent spaces [Yang et al., 2017]. The loss function comprises two terms, a reconstruction loss and a clustering loss, written as follows:

    L_all = L(x, x̂) + Σ_{l=1}^{L_sae} ( L(h_l, ĥ_l) + (λ/2) ||h_l − C_{win,l}||² )    (8)

where the first term is labelled L1, the per-layer sum is labelled L2, and λ is a tradeoff constant controlling the influence of each loss function. (8) can be solved using the stochastic gradient descent optimizer where L1 is executed in the end-to-end fashion while L2 is carried out per layer, i.e., in a greedy layer-wise fashion. C_{win,l} stands for the centroid of the winning cluster of the l-th layer, i.e., the closest cluster to a latent sample, win → min_{s=1,...,Cls_l} ||h_l − C_{s,l}||. The first and second terms L(x, x̂), L(h_l, ĥ_l) are formed as mean squared error (MSE) loss functions where L(x, x̂) assures data samples are mapped back to their original representations while L(h_l, ĥ_l) extracts meaningful latent features in every hidden layer.

Cluster Parameters: the winning cluster is adjusted with an intensity that diminishes with its cardinality, thereby expecting convergence. Only the winning cluster is adjusted here to avoid the issue of cluster overlapping. The alternate optimization strategy is implemented in KIERA where the cluster parameters are fixed while adjusting the network parameters and vice versa.
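A minimal PyTorch-style sketch of (8) restricted to a single hidden layer follows; it assumes the released code is PyTorch-based and treats the centroids as constants during the network update, matching the alternate optimization described above. All names are illustrative.

```python
import torch
import torch.nn.functional as F

def kiera_layer_loss(x, x_hat, h, h_hat, centroids, lam=0.01):
    """L(x, x_hat) + L(h, h_hat) + (lam/2) * ||h - C_win||^2 for one layer, cf. (8)."""
    recon_input = F.mse_loss(x_hat, x)                 # L(x, x_hat): end-to-end reconstruction (L1)
    recon_latent = F.mse_loss(h_hat, h)                # L(h, h_hat): per-layer reconstruction (part of L2)
    dists = torch.cdist(h, centroids)                  # (batch, Cls_l) distances to every centroid
    win = dists.argmin(dim=1)                          # winning cluster per latent sample
    c_win = centroids[win].detach()                    # clusters fixed while updating the network
    kmeans_term = 0.5 * lam * ((h - c_win) ** 2).sum(dim=1).mean()
    return recon_input + recon_latent + kmeans_term
```

The detach call reflects the alternating scheme: gradients flow into the network parameters only, while the cluster parameters are updated in a separate step.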
4.4 Centroid-based Experience Replay
KIERA adopts the centroid-based experience replay strategy to address the catastrophic forgetting problem. This mechanism stores focal-points of previous tasks T_1, T_2, ..., T_{k-1} into an episodic memory interleaved along with samples of the current task T_k to overcome the catastrophic interference issue. Note that unlabelled images are retained in the episodic memory. The sample selection mechanism is driven by the cluster growing strategy in (7). That is, an image is considered as a focal-point x* and stored in the episodic memory Mem^k = Mem^{k-1} ∪ x* provided that (7) is observed. The sample selection strategy is necessary to control the size of the memory as well as to substantiate the efficacy of the experience replay mechanism, making sure that important images, i.e., focal points, are conserved. Focal points represent the variety of concepts seen thus far.

The size of the memory grows with the number of tasks, making the experience replay mechanism intractable for a large problem. On the other hand, the network parameters and the cluster parameters are adjusted with recent samples, thereby opening the possibility of the catastrophic forgetting issue. A selective sampling strategy is therefore implemented in the centroid-based experience replay method. The goal of the selective sampling strategy is to identify the most forgotten focal points in the episodic memory for the sake of experience replay while ignoring other focal-points, thereby expediting the model's updates. The most forgotten samples are those focal points which do not receive sufficient coverage from existing clusters. That is, the cluster posterior probabilities P(C_{s,l}|x*) = exp(−||C_{s,l} − h_l*||) of the most forgotten samples are below a midpoint mid determined as:

    mid = (1/N_em) Σ_{n=1}^{N_em} max_{s=1,...,Cls_l} P(C_{s,l}|x_n*)    (10)

where N_em denotes the size of the episodic memory. The midpoint indicates the average level of coverage of all focal-points in the episodic memory. A focal-point is included in the replay buffer, B* = B* ∪ x*, to be replayed along with the current concept if it is not sufficiently covered, P(C_{s,l}|x*) < mid. Finally, the centroid-based experience replay strategy is executed by interleaving images of the current data batch B^k and the replay buffer B* for the training process, B^k ∪ B*. The most forgotten focal points are evaluated by checking the current state of the network parameters and cluster parameters as portrayed by the cluster posterior probability.
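The selection rule (10) reduces to a comparison of each focal point's best cluster posterior with the memory-wide average. A minimal sketch is given below under the assumption that latent codes of the stored focal points are available; the encode step and the names are placeholders.

```python
import numpy as np

def select_forgotten(memory_latent, centroids):
    """Indices of the most forgotten focal points according to rule (10).
    memory_latent: (N_em, d) latent codes of the stored focal points;
    centroids: (Cls_l, d) current cluster centroids of the same layer."""
    dists = np.linalg.norm(memory_latent[:, None, :] - centroids[None, :, :], axis=2)
    posteriors = np.exp(-dists)                  # P(C_{s,l} | x*) = exp(-||C_{s,l} - h*||)
    coverage = posteriors.max(axis=1)            # best coverage of each focal point
    mid = coverage.mean()                        # midpoint (10): average coverage over the memory
    return np.where(coverage < mid)[0]           # insufficiently covered -> put into replay buffer B*

# Usage (illustrative): interleave the current batch with the selected focal points.
# replay = memory_images[select_forgotten(encode(memory_images), clusters.centroids)]
# train_batch = np.concatenate([current_batch, replay])
```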
5 Proof of Concepts

5.1 Datasets
The performance of KIERA is numerically validated using four popular continual learning problems: Permuted MNIST (PMNIST), Rotated MNIST (RMNIST), Split MNIST (SMNIST) and Split CIFAR10 (SCIFAR10). PMNIST is constructed by applying four random permutations to the original MNIST problem, thus leading to four tasks, K = 4. RMNIST applies rotations with random angles to the original MNIST problem, [0, 30], [31, 60], [61, 90], [91, 120], thus creating four tasks, K = 4, in total. These two problems characterize the concept drift problem in each task. SMNIST presents the incremental class problem of five tasks where each task carries two mutually exclusive classes (0/1, 2/3, 4/5, 6/7, 8/9). As with the SMNIST problem, SCIFAR10 also features the incremental class problem of five tasks where each task possesses two non-overlapping target classes.
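For illustration, the following sketches one way of generating the four task sequences from MNIST/CIFAR-10 arrays of shape (N, H, W). Whether the released code applies one rotation angle per task or per image, and its exact preprocessing, is not stated here, so these choices are assumptions.

```python
import numpy as np
from scipy.ndimage import rotate

def pmnist_tasks(x, k=4, seed=0):
    """Permuted MNIST: each task applies one fixed random pixel permutation."""
    rng = np.random.default_rng(seed)
    flat = x.reshape(len(x), -1)
    return [flat[:, rng.permutation(flat.shape[1])].reshape(x.shape) for _ in range(k)]

def rmnist_tasks(x, ranges=((0, 30), (31, 60), (61, 90), (91, 120)), seed=0):
    """Rotated MNIST: one random angle per task drawn from the task's range (assumed)."""
    rng = np.random.default_rng(seed)
    return [rotate(x, rng.uniform(lo, hi), axes=(1, 2), reshape=False, order=1)
            for lo, hi in ranges]

def split_tasks(x, y, pairs=((0, 1), (2, 3), (4, 5), (6, 7), (8, 9))):
    """Split MNIST / Split CIFAR10: five tasks with mutually exclusive class pairs."""
    return [(x[np.isin(y, pair)], y[np.isin(y, pair)]) for pair in pairs]
```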
5.2 Implementation Notes of KIERA
KIERA applies an iterative training strategy for the initial training process of each task and when a new layer is added. This mechanism utilizes N_init^k unlabelled samples iterated across E epochs, where N_init^k and E are selected as 1000 and 50 respectively. The training process runs completely in the single-pass training mode afterward. B_0^k = {x_i, y_i}_{i=1}^{N_0^k} labelled samples are revealed for each task to associate a cluster with a target class, in which 100 labelled samples are offered for every class. Hence, N_0^k is 200 for the SMNIST and SCIFAR10 problems while N_0^k is 1000 for RMNIST and PMNIST.
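As a sketch of how the labelled batch B_0^k could establish the clusters-to-classes association while all other updates stay unsupervised, the snippet below assigns each cluster the majority class of the labelled samples falling into it and classifies via the nearest centroid. This illustrates the idea only and is not the released implementation.

```python
import numpy as np

def associate_clusters(latent_labelled, labels, centroids, n_classes):
    """Majority vote of the labelled latent samples that fall into each cluster."""
    nearest = np.linalg.norm(latent_labelled[:, None] - centroids[None], axis=2).argmin(axis=1)
    votes = np.zeros((len(centroids), n_classes))
    for c, y in zip(nearest, labels):
        votes[c, y] += 1
    return votes.argmax(axis=1)                  # class attached to every cluster centroid

def predict(latent, centroids, cluster_to_class):
    """Classify an unlabelled sample by the class of its nearest cluster."""
    nearest = np.linalg.norm(latent[:, None] - centroids[None], axis=2).argmin(axis=1)
    return cluster_to_class[nearest]
```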
                                                                     of a model after learning a new task. BWT and FWT ranges
5.3 Network Structure
The feature extraction layer of KIERA is realized as a convolutional neural network (CNN) where the encoder part utilizes two convolutional layers with 16 and 4 filters respectively and a max-pooling layer in between. The decoder part applies transposed convolutional layers with 4 and 16 filters respectively. For the PMNIST problem, the feature extraction layer is formed as a multilayer perceptron (MLP) network with two hidden layers where the numbers of nodes are selected as [1000, 500]. Note that the random permutation of PMNIST requires a model to consider all image pixels, as done in an MLP. The fully-connected layer is initialized as a single hidden layer, L_sae = 1, with 96 hidden nodes, R_1 = 96. The ReLU activation function is applied in the hidden nodes and the sigmoid function is implemented in the decoder output to produce normalized reconstructed images.
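A PyTorch sketch of the described encoder-decoder for 28×28 single-channel inputs follows; kernel sizes, strides and the upsampling choice are not specified in the text and are assumed here.

```python
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Encoder with 16 -> 4 filters and max-pooling in between, mirrored by a
    transposed-convolution decoder (kernel/stride choices are assumptions)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                              # 28x28 -> 14x14
            nn.Conv2d(16, 4, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(4, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2),                  # undo the pooling
            nn.ConvTranspose2d(16, 1, kernel_size=3, padding=1), nn.Sigmoid(),
        )
        # The fully-connected SAE part (initially L_sae = 1 with R_1 = 96 ReLU nodes)
        # would operate on the flattened output of the encoder.

    def forward(self, x):
        z = self.encoder(x)
        return z, self.decoder(z)
```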
5.4 Hyper-parameters
The hyper-parameters of KIERA are fixed across the four problems to demonstrate its non ad-hoc characteristic. The learning rate, momentum coefficient and weight decay strength of the SGD method are selected as 0.01, 0.95 and 5 × 10^{-5} while the significance levels of the drift detector are set as α = 0.001, α_d = 0.001, α_w = 0.005. The trade-off constant is chosen as λ = 0.01.

5.5 Baseline Algorithms
KIERA is compared against Deep Clustering Networks (DCN) [Yang et al., 2017], AE+KMeans and STAM [Smith et al., 2019]. DCN adopts simultaneous feature learning and clustering where the cost function is akin to KIERA's (8). Nevertheless, it adopts a static network structure. AE+KMeans performs the feature learning first using the reconstruction loss while the KMeans clustering process is carried out afterward. STAM [Smith et al., 2019] relies on an irregular feature extraction layer based on the concept of patches while having a self-clustering mechanism as with KIERA. DCN and AE+KMeans are fitted with the Learning Without Forgetting (LWF) method [Li and Hoiem, 2016] and the Synaptic Intelligence (SI) method [Zenke et al., 2017] to overcome the catastrophic forgetting problem. DCN and AE+KMeans make use of the same network structure as KIERA to ensure a fair comparison. The regularization strength of LWF is set as β = 5 while it is allocated as 0.2 for the first task and 0.8 for the remaining tasks in the SI method.

The hyper-parameters of STAM are selected as per their original values but hand-tuned if its performance is surprisingly poor. The baseline algorithms are executed in the same computational environments using their published codes. The performance of the consolidated algorithms is examined using four evaluation metrics: prequential accuracy (Preq Acc), task accuracy (Task Acc), backward transfer (BWT) and forward transfer (FWT). Preq Acc measures the classification performance on the current task while Task Acc evaluates the classification performance on all tasks after completing the learning process. BWT and FWT are put forward in [Lopez-Paz and Ranzato, 2017] where FWT indicates knowledge transfer across tasks while BWT reveals knowledge retention of a model after learning a new task. BWT and FWT range in (−∞, +∞) with a high positive value being the best. All consolidated algorithms are run five times and the averages are reported in Table 1.
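For completeness, a sketch of how BWT and FWT are computed from the task-accuracy matrix R in [Lopez-Paz and Ranzato, 2017], where R[i, j] is the test accuracy on task j after training up to task i and b[j] is the accuracy of a randomly initialized model on task j; whether KIERA's evaluation deviates from this is not stated here.

```python
import numpy as np

def backward_transfer(R):
    """BWT = mean over i < T of (R[T-1, i] - R[i, i])."""
    T = R.shape[0]
    return float(np.mean([R[T - 1, i] - R[i, i] for i in range(T - 1)]))

def forward_transfer(R, b):
    """FWT = mean over i >= 1 of (R[i-1, i] - b[i])."""
    T = R.shape[0]
    return float(np.mean([R[i - 1, i] - b[i] for i in range(1, T)]))
```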
Table 1: Numerical Results of Consolidated Algorithms.

Datasets   Methods      BWT           FWT           Task Acc. (%)    Preq. Acc. (%)
RMNIST     KIERA        -7 ± 1.53     44 ± 1.08     76.84 ± 0.53     79.91 ± 0.47
           STAM         0.9 ± 0.35    30 ± 0.37×    77.58 ± 0.43     74.71 ± 0.29×
           DCN+LwF      -15 ± 6.54×   16 ± 5.50×    39.47 ± 10.13×   52.79 ± 13.54×
           DCN+SI       -13 ± 6.00×   17 ± 6.92×    44.62 ± 12.46×   55.66 ± 14.83×
           AE+KM+LwF    -18 ± 2.32×   18 ± 1.79×    45.31 ± 1.63×    60.15 ± 1.54×
           AE+KM+SI     -9 ± 2.72×    16 ± 2.15×    49.07 ± 0.73×    51.19 ± 1.39×
PMNIST     KIERA        -22 ± 3.22    2 ± 0.77      59.90 ± 2.48     74.59 ± 0.5
           STAM         0.3 ± 0.09    1 ± 0.44×     47.97 ± 0.59×    55.37 ± 0.26×
           DCN+LwF      -30 ± 1.70×   3 ± 1.42      35.53 ± 0.78×    56.50 ± 0.54×
           DCN+SI       -43 ± 3.23×   1 ± 1.01×     33.09 ± 2.14×    64.87 ± 0.31×
           AE+KM+LwF    -28 ± 1.71×   3 ± 1.85      35.53 ± 1.02×    56.27 ± 0.55×
           AE+KM+SI     -35 ± 2.35×   1 ± 1.97×     36.06 ± 1.23×    61.50 ± 0.36×
SMNIST     KIERA        -9 ± 1.15     15 ± 3.56     84.29 ± 0.92     91.06 ± 0.71
           STAM         -2 ± 0.13     0             92.18 ± 0.32     91.98 ± 0.32
           DCN+LwF      -7 ± 3.98     18 ± 1.17     52.42 ± 5.02×    53.46 ± 2.25×
           DCN+SI       -4 ± 1.45     22 ± 2.89     58.82 ± 1.18×    57.00 ± 0.34×
           AE+KM+LwF    -5 ± 1.11     18 ± 1.03     55.12 ± 0.97×    54.69 ± 0.58×
           AE+KM+SI     -3 ± 1.88     22 ± 1.73     58.84 ± 0.71×    56.58 ± 0.28×
SCIFAR10   KIERA        -15 ± 5.92    15 ± 1.30     25.64 ± 1.85     37.09 ± 2.83
           STAM         -18 ± 2.46    0×            20.60 ± 0.66×    35.43 ± 1.22
           DCN+LwF      -15 ± 1.32    6 ± 1.67×     14.87 ± 1.66×    23.98 ± 1.17×
           DCN+SI       -16 ± 1.68    6 ± 1.90×     17.15 ± 1.14×    25.51 ± 0.50×
           AE+KM+LwF    -12 ± 2.00    8 ± 0.63×     22.12 ± 0.33×    28.77 ± 0.84×
           AE+KM+SI     -13 ± 1.36    9 ± 1.34×     21.91 ± 0.68×    28.95 ± 1.11×

×: Indicates that the numerical results of the respective baseline and KIERA are significantly different.

Table 2: The Final State of KIERA.

Datasets    NoN            NoL      NoC (K)       NoM (K)
RMNIST      101 ± 1        1 ± 0    3.2 ± 0.02    1.2 ± 0.05
PMNIST      162 ± 12       3 ± 1    5.4 ± 0.64    0.8 ± 0.15
SMNIST      95 ± 3         1 ± 0    2.8 ± 0.13    1 ± 0.1
SCIFAR10    (1.5 ± 2.8)K   1 ± 0    4.2 ± 0.44    2.3 ± 0.16

NoN: total number of hidden nodes; NoL: total number of hidden layers; NoC: total number of hidden clusters; NoM: number of samples in episodic memory.

[Figure 1: The Comparison of Average Collected and Replayed Memories on SMNIST Problem. The plot shows the average number of samples per k-th batch for the collected memories and the replayed memories.]
5.6 Numerical Results
Referring to Table 1, KIERA delivers the highest Preq Acc and FWT in the Rotated MNIST problem with a statistically significant margin while being on par with STAM in the context of Task Acc. KIERA outperforms other algorithms in the PMNIST problem, in which it obtains the highest Task Acc and Preq Acc with substantial differences to its competitors. Although the FWT of DCN+LWF is higher than KIERA's, its Task Acc and Preq Acc are poor. A similar finding is observed in the SCIFAR10 problem, where KIERA outperforms other algorithms in Preq Acc, Task Acc and FWT with statistically significant differences. KIERA does not perform well only on the SMNIST problem, but it is still much better than DCN and AE+KM.

Despite its competitive performance, STAM takes advantage of a non-parametric feature extraction layer making it robust against the issue of catastrophic forgetting. This module, however, hinders its execution under GPU environments, thus slowing down its execution time. The advantage of the clustering-based approach is seen in the FWT aspect. Although STAM adopts the clustering approach, its network parameters are trained in the absence of the clustering loss. The self-evolving network structure plays a vital role here, where the Task Acc and Preq Acc of KIERA and STAM outperform DCN and AE+KMeans, which have fixed structures, in all cases with noticeable margins. The centroid-based experience replay mechanism performs well compared to SI and LWF.

Table 2 reports the network structures of KIERA and the episodic memory. The structural learning of KIERA generates a compact and bounded network structure where the numbers of hidden nodes, hidden layers and hidden clusters are much smaller than the number of data points. The hidden-layer evolution is seen in the PMNIST problem where additional layers are inserted. The centroid-based experience replay excludes the N_init^k unlabelled samples of the pre-training phase of each task. As a result, the focal-points of the episodic memory are fewer than the number of clusters.

Fig. 1 exhibits the evolution of the episodic memory Mem and the replay buffer B*. It is shown that the episodic memory Mem grows at a much faster rate than the replay buffer B*. This trend becomes obvious as the number of tasks increases. This finding is reasonable because each task possesses concept changes inducing the addition of new clusters and thus focal-points of the episodic memory. The selective sampling is capable of finding the most forgotten samples, thus leading to a bounded size of the replay buffer. The number of forgotten samples is high at the beginning of each task but reduces as the centroid-based experience replay is executed.

6 Conclusion
This paper presents an unsupervised continual learning approach termed KIERA built upon the flexible clustering principle. KIERA features a self-organizing network structure adapting quickly to concept changes. The centroid-based experience replay is proposed to address the catastrophic forgetting problem. Our numerical study with four popular continual learning problems confirms the efficacy of KIERA in attaining high Preq Acc, high Task Acc and high FWT. KIERA delivers higher BWT than those using SI and LWF. The advantage of the structural learning mechanism is also demonstrated in our numerical study, where it produces significantly better performance than static network structures. KIERA is capable of learning and predicting without the presence of task IDs and task boundaries. Our memory analysis demonstrates the effectiveness of the centroid-based experience replay in which the size of the replay buffer is bounded. Our future study will address the issue of multiple streams.
References
[Aljundi et al., 2018] Rahaf Aljundi, F. Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and T. Tuytelaars. Memory aware synapses: Learning what (not) to forget. ArXiv, abs/1711.09601, 2018.
[Chaudhry et al., 2019] Arslan Chaudhry, Marc'Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with A-GEM. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.
[Gama, 2010] Joao Gama. Knowledge Discovery from Data Streams. Chapman & Hall/CRC, 1st edition, 2010.
[Hinton et al., 2015] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[Kirkpatrick et al., 2016] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. 2016.
[Lee et al., 2018] Jeongtae Lee, Jaehong Yoon, E. Yang, and Sung Ju Hwang. Lifelong learning with dynamically expandable networks. ArXiv, abs/1708.01547, 2018.
[Li and Hoiem, 2016] Zhizhong Li and Derek Hoiem. Learning without forgetting. 2016.
[Li et al., 2019] Xilai Li, Yingbo Zhou, Tianfu Wu, Richard Socher, and Caiming Xiong. Learn to grow: A continual structure learning framework for overcoming catastrophic forgetting. In Proceedings of Machine Learning Research, volume 97, pages 3925-3934, Long Beach, California, USA, 09-15 Jun 2019. PMLR.
[Lopez-Paz and Ranzato, 2017] David Lopez-Paz and Marc'Aurelio Ranzato. Gradient episodic memory for continual learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 6470-6479, Red Hook, NY, USA, 2017. Curran Associates Inc.
[Mao et al., 2021] Fubing Mao, Weiwei Weng, Mahardhika Pratama, and Edward Yapp Kien Yee. Continual learning via inter-task synaptic mapping. Knowledge-Based Systems, 222:106947, 2021.
[Paik et al., 2020] Inyoung Paik, Sangjun Oh, Taeyeong Kwak, and Injung Kim. Overcoming catastrophic forgetting by neuron-level plasticity control. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, New York, NY, USA, February 7-12, 2020, pages 5339-5346. AAAI Press, 2020.
[Parisi et al., 2018] German I. Parisi, Ronald Kemker, Jose L. Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. 2018.
[Pratama et al., 2019] Mahardhika Pratama, Choiru Za'in, Andri Ashfahani, Yew Soon Ong, and Weiping Ding. Automatic construction of multi-layer perceptron network from streaming examples. 2019.
[Rebuffi et al., 2017] Sylvestre-Alvise Rebuffi, A. Kolesnikov, Georg Sperl, and Christoph H. Lampert. iCaRL: Incremental classifier and representation learning. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5533-5542, 2017.
[Rusu et al., 2016] Andrei A. Rusu, Neil C. Rabinowitz, G. Desjardins, Hubert Soyer, James Kirkpatrick, K. Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. ArXiv, abs/1606.04671, 2016.
[Schwarz et al., 2018] Jonathan Schwarz, Jelena Luketina, Wojciech M. Czarnecki, Agnieszka Grabska-Barwinska, Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. Progress & compress: A scalable framework for continual learning. 2018.
[Shin et al., 2017] Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. 2017.
[Smith et al., 2019] James Smith, Seth Baer, Zsolt Kira, and Constantine Dovrolis. Unsupervised continual learning and self-taught associative memory hierarchies. In 2019 International Conference on Learning Representations Workshops, 2019.
[Yang et al., 2017] Bo Yang, Xiao Fu, Nicholas D. Sidiropoulos, and Mingyi Hong. Towards k-means-friendly spaces: Simultaneous deep learning and clustering. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3861-3870. PMLR, 06-11 Aug 2017.
[Zenke et al., 2017] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. 2017.