Unsupervised Continual Learning via Self-Adaptive Deep Clustering Approach


Mahardhika Pratama¹*, Andri Ashfahani¹, Edwin Lughofer²
¹SCSE, Nanyang Technological University, Singapore
²DKBMS, Johannes Kepler University, Linz, Austria
mpratama@ntu.edu.sg, andriash001@e.ntu.edu.sg, edwin.lughofer@jku.at
*Contact Author
arXiv:2106.14563v1 [cs.LG] 28 Jun 2021

Abstract

Unsupervised continual learning remains a relatively uncharted territory in the existing literature because the vast majority of existing works call for unlimited access to ground truth, incurring expensive labelling cost. Another issue lies in the problem of task boundaries and task IDs, which must be known for model updates or model predictions, hindering feasibility for real-time deployment. Knowledge Retention in Self-Adaptive Deep Continual Learner (KIERA) is proposed in this paper. KIERA is developed from the notion of a flexible deep clustering approach possessing an elastic network structure to cope with changing environments in a timely manner. The centroid-based experience replay is put forward to overcome the catastrophic forgetting problem. KIERA does not exploit any labelled samples for model updates while featuring a task-agnostic merit. The advantage of KIERA has been numerically validated on popular continual learning problems where it shows highly competitive performance compared to state-of-the-art approaches. Our implementation is available at https://github.com/ContinualAL/KIERA.

1 Introduction

Continual learning is a machine learning field studying a learning model which can handle a sequence of tasks T_1, T_2, T_3, ..., T_K [Parisi et al., 2018] where K labels the number of tasks. Unlike the conventional learning problem with the i.i.d. assumption, every task T_k suffers from non-stationary conditions where there exist drifting data distributions P(X, Y)_k ≠ P(X, Y)_{k+1}, the emergence of new classes, or a combination of both conditions. The goal of general continual learning is to build a model f(.) capable of identifying and adapting to changes without suffering from the catastrophic forgetting problem. This is done on the fly in the absence of samples from previous tasks T_{k-1}.

There has been growing research interest in the continual learning domain in which the main goal is to resolve the issue of catastrophic forgetting, thereby actualizing the knowledge retention property of a model. These works can be divided into three categories: the memory-based approach, the regularization-based approach and the structure-based approach. The memory-based approach is designed to handle the continual learning problem with the use of an external memory storing past samples. Past samples are replayed along with new samples of the current task such that the catastrophic forgetting problem can be addressed. The regularization-based approach is put forward with the use of an additional regularization term in the cost function aiming at achieving tradeoff points between the old task and the new task. The structure-based approach is developed under the roof of dynamic structure. It adds new network components to cope with new tasks while isolating old parameters from new tasks to avoid the catastrophic forgetting problem.

The area of continual learning still deserves in-depth study because the vast majority of existing approaches are fully-supervised algorithms, limiting their applications under the scarcity of labelled samples. Existing approaches also rely on a very strong assumption of task boundaries and task IDs. This assumption is impractical because the point of change is often unknown and the presence of task IDs for inference implies extra domain knowledge.

An unsupervised continual learning algorithm, namely Knowledge Retention in Self-Adaptive Deep Continual Learner (KIERA), is proposed in this paper. KIERA does not utilize any labelled samples for model updates. That is, labelled samples are only offered to establish clusters-to-classes associations required to perform classification tasks. KIERA is constructed from the idea of a self-evolving deep clustering network making use of a different-depth network structure. Every hidden layer generates its own set of clusters and produces its own local outputs. This strategy enables an independent self-evolving clustering mechanism to be performed in different levels of the deep embedding space. KIERA features an elastic network structure in which its hidden nodes, layers and clusters are self-generated from data streams with respect to varying data distributions.

The parameter learning mechanism of KIERA is devised to achieve a clustering-friendly latent space via simultaneous feature learning and clustering. It puts forward a reconstruction loss and a clustering loss minimized simultaneously in the framework of a stacked autoencoder (SAE) to avoid trivial solutions.
The clustering loss is formed as the K-Means loss [Yang et al., 2017], inducing a clustering-friendly latent space where data samples are forced to be adjacent to the winning cluster, i.e., the cluster closest to a data sample. The centroid-based experience replay strategy is put forward to address the catastrophic interference problem. That is, the sample selection mechanism is carried out with respect to focal-points, i.e., data samples triggering the addition of clusters. The selective sampling mechanism is integrated to assure a bounded replay buffer. The advantage of KIERA is confirmed with a rigorous numerical study on popular problems where it outperforms prominent algorithms.

This paper presents four major contributions: 1) KIERA is proposed to handle unsupervised continual learning problems; 2) the flexible deep clustering approach is put forward in which hidden layers, nodes and clusters are self-evolved on the fly; 3) the different-depth network structure is designed, making possible an independent clustering module carried out in different levels of the deep embedding space; 4) the centroid-based experience replay method is put forward to address the catastrophic interference problem.

2 Related Works

Memory-based Approach: iCaRL [Rebuffi et al., 2017] exemplifies the memory-based approach in continual learning using an exemplar set for each class. The classification decision is calculated from the similarity degree of a new sample to the exemplar set. Another example of the memory-based approach is Gradient Episodic Memory (GEM) [Lopez-Paz and Ranzato, 2017] where past samples are stored to calculate the forgetting case. A forgetting case can be examined from the angle between the gradient vector and the proposed update. This work is extended in [Chaudhry et al., 2019] and called Averaged GEM (AGEM). Its contribution lies in the modification of the loss function to expedite the model updates. Deep Generative Replay (DGR) [Shin et al., 2017] does not utilize an external memory to address the catastrophic forgetting problem but rather makes use of a generative adversarial network (GAN) to create representations of old tasks. Catastrophic forgetting is addressed by generating pseudo samples for the experience replay mechanism. Our work is framed under this category because it can be executed without the presence of task IDs or task boundaries.

Regularization-based Approach: Elastic Weight Consolidation (EWC) [Kirkpatrick et al., 2016] is a prominent regularization-based approach using an L2-like regularization constraining the movement of important network parameters. It constructs a Fisher Information Matrix (FIM) to signify the importance of network parameters. Synaptic Intelligence (SI) [Zenke et al., 2017] offers an alternative approach where it utilizes accumulated gradients to quantify the importance of network parameters instead of the FIM, which incurs prohibitive computational burdens. Memory Aware Synapses (MAS) [Aljundi et al., 2018] is an EWC-like approach with modification of the parameter importance matrix using an unsupervised and online criterion. EWC has been extended in [Schwarz et al., 2018], named onlineEWC, where a Laplace approximation is put forward to construct the parameter importance matrix. Learning Without Forgetting (LWF) is proposed in [Li and Hoiem, 2016] where it formulates a joint optimization problem between the current loss function and the knowledge distillation loss [Hinton et al., 2015]. A neuron-based regularization notion is presented in [Paik et al., 2020] where the regularization is performed by adjusting the learning rates of stochastic gradient descent. In [Mao et al., 2021], the inter-task synaptic mapping is proposed. Notwithstanding that the regularization-based approach is computationally efficient, this approach requires the task boundaries and task IDs to be known.

Structure-based Approach: catastrophic forgetting is overcome in progressive neural networks (PNNs) [Rusu et al., 2016] by introduction of a new column for every new task while freezing old network parameters. This approach is, however, not scalable for large-scale problems since the structural complexity grows linearly with the number of tasks. This drawback is addressed in [Lee et al., 2018] with the use of a selective retraining approach. Learn-to-grow is proposed in [Li et al., 2019] where it utilizes neural architecture search to find the best network structure for a given task. The structure-based approach imposes expensive complexities.
3 Problem Formulation

The continual learning problem aims to build a predictive model f(.) handling streaming tasks T_1, T_2, T_3, ..., T_K where K denotes the number of tasks. Unlike conventional problems assuming the i.i.d. condition, each task T_k is influenced by non-stationary conditions. There exist two types of changes in continual learning: the first is known as the problem of changing data distributions while the second is understood as the problem of changing target classes. The problem of changing data distributions, or the concept drift problem, is defined as a change of the joint probability distribution P(X, Y)_k ≠ P(X, Y)_{k+1}. The problem of changing target classes refers to different target classes in each task. Suppose that L_k and L_k' stand for the label sets of the k-th task and the k'-th task; then L_k ∩ L_k' = ∅ for all k ≠ k'. This problem is also known as the incremental class problem. Each task normally consists of paired data samples T_k = {x_n, y_n}_{n=1}^{N_k} where N_k denotes the size of the k-th task, x_n ∈ ℜ^{u×u} is the n-th input image and y_n ∈ {l_1, l_2, ..., l_m} is the target vector formed as a one-hot vector. This assumption, however, does not apply in unsupervised continual learning dealing with the scarcity of labelled samples. Access to true class labels is only provided for the initial batch of each task to associate the cluster centroids with classes, B_0^k = {x_n, y_n}_{n=1}^{N_0^k}, while the remainder of data samples arrive in the absence of labelled samples, B_j^k = {x_n}_{n=1}^{N_j^k}. Note that N_0^k + N_1^k + ... + N_J^k = N_k.
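To make this batch protocol concrete, the following is a minimal sketch (not taken from the released implementation) of how one task could be split into a small labelled batch B_0^k, used only for the clusters-to-classes association, and label-free batches B_1^k, ..., B_J^k. The helper name make_batches and the toy data are illustrative.

```python
import numpy as np

def make_batches(x, y, n_labelled, batch_size, rng):
    """Split one task T_k into an initial labelled batch B_0^k and
    unlabelled batches B_1^k, ..., B_J^k (labels are discarded)."""
    order = rng.permutation(len(x))
    x, y = x[order], y[order]
    labelled = (x[:n_labelled], y[:n_labelled])      # B_0^k: only for cluster-to-class association
    unlabelled = [x[i:i + batch_size]                # B_j^k: arrive without labels
                  for i in range(n_labelled, len(x), batch_size)]
    return labelled, unlabelled

# Toy stream of K = 2 tasks with u x u = 28 x 28 images.
rng = np.random.default_rng(0)
tasks = [(rng.random((500, 28, 28)), rng.integers(0, 10, size=500)) for _ in range(2)]
stream = [make_batches(x, y, n_labelled=100, batch_size=64, rng=rng) for x, y in tasks]
```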
4 Learning Policy of KIERA

4.1 Network Structure
KIERA is structured as a SAE comprising two components: a feature extraction layer and a fully-connected layer. The feature extraction layer is built upon a convolutional layer or a nonlinear layer. It maps an input image x into a latent input vector Z ∈ ℜ^{u'}.

The hidden-layer growing mechanism relies on a drift detection method monitoring the dynamic of latent input features Z extracted by the feature extraction layer. Nevertheless, the characteristics of the latent input features are insensitive to changing data distributions. We apply the drift detection method to the last two consecutive data batches A = [Z_{J-1}^k, Z_J^k] here. The drift detection mechanism first finds a cutting point cut signifying the increase of population means, Â + ε_A ≤ B̂ + ε_B, where Â, B̂ are the statistics of the two data matrices A ∈ ℜ^{2N_j^k × u'} and B.
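Much of the drift detection procedure is not recoverable from the extracted text. Purely as an illustration of the quoted condition Â + ε_A ≤ B̂ + ε_B, the sketch below searches for a cutting point using a Hoeffding-style bound; the form of the bound, the use of per-sample means as the monitored statistic, and all names are assumptions rather than the paper's exact procedure.

```python
import numpy as np

def hoeffding_bound(n, alpha, value_range=1.0):
    # Hoeffding error bound for the mean of n values at significance level alpha (assumed form).
    return value_range * np.sqrt(np.log(1.0 / alpha) / (2.0 * n))

def find_cut_point(z, alpha=0.001):
    """Return the first index 'cut' where the prefix statistic A_hat plus its bound
    drops to (or below) the overall statistic B_hat plus its bound, else None."""
    scores = z.reshape(len(z), -1).mean(axis=1)   # one scalar statistic per latent sample
    b_hat = scores.mean()
    eps_b = hoeffding_bound(len(scores), alpha)
    for cut in range(10, len(scores) - 10):       # ignore very small prefixes/suffixes
        a_hat = scores[:cut].mean()
        eps_a = hoeffding_bound(cut, alpha)
        if a_hat + eps_a <= b_hat + eps_b:
            return cut                            # candidate switching point in A = [Z_{J-1}^k, Z_J^k]
    return None
```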
A new cluster is grown when the following condition is satisfied:

    min_{s=1,...,Cls_l} D(C_{s,l}, h_l) > µ_{D_{s,l}} + k_3 σ_{D_{s,l}}    (7)

where µ_{D_{s,l}}, σ_{D_{s,l}} are the mean and standard deviation of the distance measure D(C_{s,l}, h_l) while k_3 = 2 exp(−||h_l − C_{s,l}||) + 2. (7) is perceived as the SPC method with a dynamic confidence degree assuring that a cluster is added if it is far from the coverage of existing clusters. (7) hints at the presence of a new concept unseen in previous observations. Hence, a data sample h_l can be regarded as a focal point. A new cluster is integrated by setting the current sample as the centroid of a new cluster, C_{(s+1),l} = h_l, while its cardinality is set as Car_{(s+1),l} = 1. Note that the clustering process occurs in different levels of the deep embedding space.
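A minimal sketch of the growing rule (7) for a single layer is given below. It assumes a Euclidean distance for D and simple running statistics for µ_D and σ_D, since these details are not fully specified in the extracted text; class and attribute names are illustrative.

```python
import numpy as np

class LayerClusters:
    """Self-evolving clusters of one hidden layer, illustrating condition (7)."""
    def __init__(self):
        self.centroids, self.cards = [], []              # C_{s,l} and Car_{s,l}
        self.d_mean, self.d_m2, self.d_n = 0.0, 0.0, 0   # running stats of D(C_{s,l}, h_l)

    def _track_distance(self, d):
        # Welford update of the running mean/variance of observed distances.
        self.d_n += 1
        delta = d - self.d_mean
        self.d_mean += delta / self.d_n
        self.d_m2 += delta * (d - self.d_mean)

    def observe(self, h):
        if not self.centroids:                           # the first sample seeds the first cluster
            self.centroids.append(h.copy()); self.cards.append(1)
            return 0
        dists = [float(np.linalg.norm(h - c)) for c in self.centroids]
        win = int(np.argmin(dists))
        for d in dists:
            self._track_distance(d)
        sigma = np.sqrt(self.d_m2 / max(self.d_n - 1, 1))
        k3 = 2.0 * np.exp(-dists[win]) + 2.0             # dynamic confidence degree
        if dists[win] > self.d_mean + k3 * sigma:        # condition (7): h is a focal point
            self.centroids.append(h.copy()); self.cards.append(1)
            return len(self.centroids) - 1
        self.cards[win] += 1                             # otherwise only the winner's cardinality grows
        return win
```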
4.3 Parameter Learning Mechanism
Network Parameters: the parameter learning strategy of the network parameters performs simultaneous feature learning and clustering in which the main goal is to establish clustering-friendly latent spaces [Yang et al., 2017]. The loss function comprises two terms, a reconstruction loss and a clustering loss, written as follows:

    L_all = L(x, x̂) + Σ_{l=1}^{L_sae} ( L(h_l, ĥ_l) + (λ/2) ||h_l − C_{win,l}||² )    (8)

where the first term is labelled L1, the per-layer sum is labelled L2, and λ is a tradeoff constant controlling the influence of each loss function. (8) can be solved using the stochastic gradient descent optimizer where L1 is executed in the end-to-end fashion while L2 is carried out per layer, i.e., in a greedy layer-wise fashion. C_{win,l} stands for the centroid of the winning cluster of the l-th layer, i.e., the closest cluster to a latent sample, win → min_{s=1,...,Cls_l} ||h_l − C_{s,l}||. The first and second terms L(x, x̂), L(h_l, ĥ_l) are formed as mean squared error (MSE) loss functions where L(x, x̂) assures data samples are mapped back to their original representations while L(h_l, ĥ_l) extracts meaningful latent features in every hidden layer.

Cluster Parameters: the winning cluster is adjusted with an intensity that diminishes with its cardinality, thereby expecting convergence. Only the winning cluster is adjusted here to avoid the issue of cluster overlapping. The alternate optimization strategy is implemented in KIERA where the cluster parameters are fixed while adjusting the network parameters and vice versa.
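A minimal PyTorch-style sketch of (8) restricted to a single hidden layer follows; it assumes the released code is PyTorch-based and treats the centroids as constants during the network update, matching the alternate optimization described above. All names are illustrative.

```python
import torch
import torch.nn.functional as F

def kiera_layer_loss(x, x_hat, h, h_hat, centroids, lam=0.01):
    """L(x, x_hat) + L(h, h_hat) + (lam/2) * ||h - C_win||^2 for one layer, cf. (8)."""
    recon_input = F.mse_loss(x_hat, x)                 # L(x, x_hat): end-to-end reconstruction (L1)
    recon_latent = F.mse_loss(h_hat, h)                # L(h, h_hat): per-layer reconstruction (part of L2)
    dists = torch.cdist(h, centroids)                  # (batch, Cls_l) distances to every centroid
    win = dists.argmin(dim=1)                          # winning cluster per latent sample
    c_win = centroids[win].detach()                    # clusters fixed while updating the network
    kmeans_term = 0.5 * lam * ((h - c_win) ** 2).sum(dim=1).mean()
    return recon_input + recon_latent + kmeans_term
```

The detach call reflects the alternating scheme: gradients flow into the network parameters only, while the cluster parameters are updated in a separate step.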
4.4 Centroid-based Experience Replay
KIERA adopts the centroid-based experience replay strategy to address the catastrophic forgetting problem. This mechanism stores focal-points of previous tasks T_1, T_2, ..., T_{k-1} into an episodic memory interleaved along with samples of the current task T_k to overcome the catastrophic interference issue. Note that unlabelled images are retained in the episodic memory. The sample selection mechanism is driven by the cluster growing strategy in (7). That is, an image is considered as a focal-point x* and stored in the episodic memory Mem^k = Mem^{k-1} ∪ x* provided that (7) is observed. The sample selection strategy is necessary to control the size of the memory as well as to substantiate the efficacy of the experience replay mechanism, making sure that important images, i.e., focal points, are conserved. Focal points represent the variety of concepts seen thus far.

The size of the memory grows with the number of tasks, making the experience replay mechanism intractable for a large problem. On the other hand, the network parameters and the cluster parameters are adjusted with recent samples, thereby opening the possibility of the catastrophic forgetting issue. A selective sampling strategy is therefore implemented in the centroid-based experience replay method. The goal of the selective sampling strategy is to identify the most forgotten focal points in the episodic memory for the sake of experience replay while ignoring other focal-points, thereby expediting the model's updates. The most forgotten samples are those focal points which do not receive sufficient coverage from existing clusters. That is, the cluster posterior probabilities P(C_{s,l}|x*) = exp(−||C_{s,l} − h_l*||) of the most forgotten samples are below a midpoint mid determined as:

    mid = (1/N_em) Σ_{n=1}^{N_em} max_{s=1,...,Cls_l} P(C_{s,l}|x_n*)    (10)

where N_em denotes the size of the episodic memory. The midpoint indicates the average level of coverage of all focal-points in the episodic memory. A focal-point is included in the replay buffer, B* = B* ∪ x*, to be replayed along with the current concept if it is not sufficiently covered, P(C_{s,l}|x*) < mid. Finally, the centroid-based experience replay strategy is executed by interleaving images of the current data batch B^k and the replay buffer B* for the training process, B^k ∪ B*. The most forgotten focal points are evaluated by checking the current state of the network parameters and cluster parameters as portrayed by the cluster posterior probability.
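The selection rule (10) reduces to a comparison of each focal point's best cluster posterior with the memory-wide average. A minimal sketch is given below under the assumption that latent codes of the stored focal points are available; the encode step and the names are placeholders.

```python
import numpy as np

def select_forgotten(memory_latent, centroids):
    """Indices of the most forgotten focal points according to rule (10).
    memory_latent: (N_em, d) latent codes of the stored focal points;
    centroids: (Cls_l, d) current cluster centroids of the same layer."""
    dists = np.linalg.norm(memory_latent[:, None, :] - centroids[None, :, :], axis=2)
    posteriors = np.exp(-dists)                  # P(C_{s,l} | x*) = exp(-||C_{s,l} - h*||)
    coverage = posteriors.max(axis=1)            # best coverage of each focal point
    mid = coverage.mean()                        # midpoint (10): average coverage over the memory
    return np.where(coverage < mid)[0]           # insufficiently covered -> put into replay buffer B*

# Usage (illustrative): interleave the current batch with the selected focal points.
# replay = memory_images[select_forgotten(encode(memory_images), clusters.centroids)]
# train_batch = np.concatenate([current_batch, replay])
```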
5 Proof of Concepts

5.1 Datasets
The performance of KIERA is numerically validated using four popular continual learning problems: Permuted MNIST (PMNIST), Rotated MNIST (RMNIST), Split MNIST (SMNIST) and Split CIFAR10 (SCIFAR10). PMNIST is constructed by applying four random permutations to the original MNIST problem, thus leading to four tasks, K = 4. RMNIST applies rotations with random angles to the original MNIST problem, [0, 30], [31, 60], [61, 90], [91, 120], thus creating four tasks, K = 4, in total. These two problems characterize the concept drift problem in each task. SMNIST presents the incremental class problem of five tasks where each task carries two mutually exclusive classes (0/1, 2/3, 4/5, 6/7, 8/9). As with the SMNIST problem, SCIFAR10 also features the incremental class problem of five tasks where each task possesses two non-overlapping target classes.
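For illustration, the following sketches one way of generating the four task sequences from MNIST/CIFAR-10 arrays of shape (N, H, W). Whether the released code applies one rotation angle per task or per image, and its exact preprocessing, is not stated here, so these choices are assumptions.

```python
import numpy as np
from scipy.ndimage import rotate

def pmnist_tasks(x, k=4, seed=0):
    """Permuted MNIST: each task applies one fixed random pixel permutation."""
    rng = np.random.default_rng(seed)
    flat = x.reshape(len(x), -1)
    return [flat[:, rng.permutation(flat.shape[1])].reshape(x.shape) for _ in range(k)]

def rmnist_tasks(x, ranges=((0, 30), (31, 60), (61, 90), (91, 120)), seed=0):
    """Rotated MNIST: one random angle per task drawn from the task's range (assumed)."""
    rng = np.random.default_rng(seed)
    return [rotate(x, rng.uniform(lo, hi), axes=(1, 2), reshape=False, order=1)
            for lo, hi in ranges]

def split_tasks(x, y, pairs=((0, 1), (2, 3), (4, 5), (6, 7), (8, 9))):
    """Split MNIST / Split CIFAR10: five tasks with mutually exclusive class pairs."""
    return [(x[np.isin(y, pair)], y[np.isin(y, pair)]) for pair in pairs]
```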
5.2 Implementation Notes of KIERA
KIERA applies an iterative training strategy for the initial training process of each task and when a new layer is added. This mechanism utilizes N_init^k unlabelled samples iterated across E epochs, where N_init^k and E are selected as 1000 and 50 respectively. The training process runs completely in the single-pass training mode afterward. B_0^k = {x_i, y_i}_{i=1}^{N_0^k} labelled samples are revealed for each task to associate a cluster with a target class, in which 100 labelled samples are offered for every class. Hence, N_0^k is 200 for the SMNIST and SCIFAR10 problems while N_0^k is 1000 for RMNIST and PMNIST.
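As a sketch of how the labelled batch B_0^k could establish the clusters-to-classes association while all other updates stay unsupervised, the snippet below assigns each cluster the majority class of the labelled samples falling into it and classifies via the nearest centroid. This illustrates the idea only and is not the released implementation.

```python
import numpy as np

def associate_clusters(latent_labelled, labels, centroids, n_classes):
    """Majority vote of the labelled latent samples that fall into each cluster."""
    nearest = np.linalg.norm(latent_labelled[:, None] - centroids[None], axis=2).argmin(axis=1)
    votes = np.zeros((len(centroids), n_classes))
    for c, y in zip(nearest, labels):
        votes[c, y] += 1
    return votes.argmax(axis=1)                  # class attached to every cluster centroid

def predict(latent, centroids, cluster_to_class):
    """Classify an unlabelled sample by the class of its nearest cluster."""
    nearest = np.linalg.norm(latent[:, None] - centroids[None], axis=2).argmin(axis=1)
    return cluster_to_class[nearest]
```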
                                                                     of a model after learning a new task. BWT and FWT ranges
5.3 Network Structure
The feature extraction layer of KIERA is realized as a convolutional neural network (CNN) where the encoder part utilizes two convolutional layers with 16 and 4 filters respectively and a max-pooling layer in between. The decoder part applies transposed convolutional layers with 4 and 16 filters respectively. For the PMNIST problem, the feature extraction layer is formed as a multilayer perceptron (MLP) network with two hidden layers where the numbers of nodes are selected as [1000, 500]. Note that the random permutation of PMNIST requires a model to consider all image pixels, as done in an MLP. The fully-connected layer is initialized as a single hidden layer, L_sae = 1, with 96 hidden nodes, R_1 = 96. The ReLU activation function is applied in the hidden nodes and the sigmoid function is implemented in the decoder output to produce normalized reconstructed images.
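A PyTorch sketch of the described encoder-decoder for 28×28 single-channel inputs follows; kernel sizes, strides and the upsampling choice are not specified in the text and are assumed here.

```python
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Encoder with 16 -> 4 filters and max-pooling in between, mirrored by a
    transposed-convolution decoder (kernel/stride choices are assumptions)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                              # 28x28 -> 14x14
            nn.Conv2d(16, 4, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(4, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2),                  # undo the pooling
            nn.ConvTranspose2d(16, 1, kernel_size=3, padding=1), nn.Sigmoid(),
        )
        # The fully-connected SAE part (initially L_sae = 1 with R_1 = 96 ReLU nodes)
        # would operate on the flattened output of the encoder.

    def forward(self, x):
        z = self.encoder(x)
        return z, self.decoder(z)
```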
5.4 Hyper-parameters
The hyper-parameters of KIERA are fixed across the four problems to demonstrate its non ad-hoc characteristic. The learning rate, momentum coefficient and weight decay strength of the SGD method are selected as 0.01, 0.95 and 5 × 10^{-5} while the significance levels of the drift detector are set as α = 0.001, α_d = 0.001, α_w = 0.005. The trade-off constant is chosen as λ = 0.01.

5.5 Baseline Algorithms
KIERA is compared against Deep Clustering Networks (DCN) [Yang et al., 2017], AE+KMeans and STAM [Smith et al., 2019]. DCN adopts simultaneous feature learning and clustering where the cost function is akin to KIERA's (8). Nevertheless, it adopts a static network structure. AE+KMeans performs the feature learning first using the reconstruction loss while the KMeans clustering process is carried out afterward. STAM [Smith et al., 2019] relies on an irregular feature extraction layer based on the concept of patches while having a self-clustering mechanism as with KIERA. DCN and AE+KMeans are fitted with the Learning Without Forgetting (LWF) method [Li and Hoiem, 2016] and the Synaptic Intelligence (SI) method [Zenke et al., 2017] to overcome the catastrophic forgetting problem. DCN and AE+KMeans make use of the same network structure as KIERA to ensure a fair comparison. The regularization strength of LWF is set as β = 5 while it is allocated as 0.2 for the first task and 0.8 for the remaining tasks in the SI method.

The hyper-parameters of STAM are selected as per their original values but hand-tuned if its performance is surprisingly poor. The baseline algorithms are executed in the same computational environments using their published codes. The performance of the consolidated algorithms is examined using four evaluation metrics: prequential accuracy (Preq Acc), task accuracy (Task Acc), backward transfer (BWT) and forward transfer (FWT). Preq Acc measures the classification performance on the current task while Task Acc evaluates the classification performance on all tasks after completing the learning process. BWT and FWT are put forward in [Lopez-Paz and Ranzato, 2017] where FWT indicates knowledge transfer across tasks while BWT reveals knowledge retention of a model after learning a new task. BWT and FWT range in (−∞, +∞) with a high positive value being the best. All consolidated algorithms are run five times and the averages are reported in Table 1.
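For completeness, a sketch of how BWT and FWT are computed from the task-accuracy matrix R in [Lopez-Paz and Ranzato, 2017], where R[i, j] is the test accuracy on task j after training up to task i and b[j] is the accuracy of a randomly initialized model on task j; whether KIERA's evaluation deviates from this is not stated here.

```python
import numpy as np

def backward_transfer(R):
    """BWT = mean over i < T of (R[T-1, i] - R[i, i])."""
    T = R.shape[0]
    return float(np.mean([R[T - 1, i] - R[i, i] for i in range(T - 1)]))

def forward_transfer(R, b):
    """FWT = mean over i >= 1 of (R[i-1, i] - b[i])."""
    T = R.shape[0]
    return float(np.mean([R[i - 1, i] - b[i] for i in range(1, T)]))
```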
Table 1: Numerical Results of Consolidated Algorithms.

Datasets   Methods      BWT           FWT           Task Acc. (%)    Preq. Acc. (%)
RMNIST     KIERA        -7 ± 1.53     44 ± 1.08     76.84 ± 0.53     79.91 ± 0.47
           STAM         0.9 ± 0.35    30 ± 0.37×    77.58 ± 0.43     74.71 ± 0.29×
           DCN+LwF      -15 ± 6.54×   16 ± 5.50×    39.47 ± 10.13×   52.79 ± 13.54×
           DCN+SI       -13 ± 6.00×   17 ± 6.92×    44.62 ± 12.46×   55.66 ± 14.83×
           AE+KM+LwF    -18 ± 2.32×   18 ± 1.79×    45.31 ± 1.63×    60.15 ± 1.54×
           AE+KM+SI     -9 ± 2.72×    16 ± 2.15×    49.07 ± 0.73×    51.19 ± 1.39×
PMNIST     KIERA        -22 ± 3.22    2 ± 0.77      59.90 ± 2.48     74.59 ± 0.5
           STAM         0.3 ± 0.09    1 ± 0.44×     47.97 ± 0.59×    55.37 ± 0.26×
           DCN+LwF      -30 ± 1.70×   3 ± 1.42      35.53 ± 0.78×    56.50 ± 0.54×
           DCN+SI       -43 ± 3.23×   1 ± 1.01×     33.09 ± 2.14×    64.87 ± 0.31×
           AE+KM+LwF    -28 ± 1.71×   3 ± 1.85      35.53 ± 1.02×    56.27 ± 0.55×
           AE+KM+SI     -35 ± 2.35×   1 ± 1.97×     36.06 ± 1.23×    61.50 ± 0.36×
SMNIST     KIERA        -9 ± 1.15     15 ± 3.56     84.29 ± 0.92     91.06 ± 0.71
           STAM         -2 ± 0.13     0             92.18 ± 0.32     91.98 ± 0.32
           DCN+LwF      -7 ± 3.98     18 ± 1.17     52.42 ± 5.02×    53.46 ± 2.25×
           DCN+SI       -4 ± 1.45     22 ± 2.89     58.82 ± 1.18×    57.00 ± 0.34×
           AE+KM+LwF    -5 ± 1.11     18 ± 1.03     55.12 ± 0.97×    54.69 ± 0.58×
           AE+KM+SI     -3 ± 1.88     22 ± 1.73     58.84 ± 0.71×    56.58 ± 0.28×
SCIFAR10   KIERA        -15 ± 5.92    15 ± 1.30     25.64 ± 1.85     37.09 ± 2.83
           STAM         -18 ± 2.46    0×            20.60 ± 0.66×    35.43 ± 1.22
           DCN+LwF      -15 ± 1.32    6 ± 1.67×     14.87 ± 1.66×    23.98 ± 1.17×
           DCN+SI       -16 ± 1.68    6 ± 1.90×     17.15 ± 1.14×    25.51 ± 0.50×
           AE+KM+LwF    -12 ± 2.00    8 ± 0.63×     22.12 ± 0.33×    28.77 ± 0.84×
           AE+KM+SI     -13 ± 1.36    9 ± 1.34×     21.91 ± 0.68×    28.95 ± 1.11×

×: Indicates that the numerical results of the respective baseline and KIERA are significantly different.

Table 2: The Final State of KIERA.

Datasets    NoN            NoL      NoC (K)       NoM (K)
RMNIST      101 ± 1        1 ± 0    3.2 ± 0.02    1.2 ± 0.05
PMNIST      162 ± 12       3 ± 1    5.4 ± 0.64    0.8 ± 0.15
SMNIST      95 ± 3         1 ± 0    2.8 ± 0.13    1 ± 0.1
SCIFAR10    (1.5 ± 2.8)K   1 ± 0    4.2 ± 0.44    2.3 ± 0.16

NoN: total number of hidden nodes; NoL: total number of hidden layers; NoC: total number of hidden clusters; NoM: number of samples in episodic memory.

[Figure 1: The Comparison of Average Collected and Replayed Memories on SMNIST Problem. The plot shows the average number of samples per k-th batch for the collected memories and the replayed memories.]
5.6 Numerical Results
Referring to Table 1, KIERA delivers the highest Preq Acc and FWT in the Rotated MNIST problem with a statistically significant margin while being on par with STAM in the context of Task Acc. KIERA outperforms other algorithms in the PMNIST problem, in which it obtains the highest Task Acc and Preq Acc with substantial differences to its competitors. Although the FWT of DCN+LWF is higher than KIERA's, its Task Acc and Preq Acc are poor. A similar finding is observed in the SCIFAR10 problem, where KIERA outperforms other algorithms in Preq Acc, Task Acc and FWT with statistically significant differences. KIERA does not perform well only on the SMNIST problem, but it is still much better than DCN and AE+KM.

Despite its competitive performance, STAM takes advantage of a non-parametric feature extraction layer making it robust against the issue of catastrophic forgetting. This module, however, hinders its execution under GPU environments, thus slowing down its execution time. The advantage of the clustering-based approach is seen in the FWT aspect. Although STAM adopts the clustering approach, its network parameters are trained in the absence of the clustering loss. The self-evolving network structure plays a vital role here, where the Task Acc and Preq Acc of KIERA and STAM outperform DCN and AE+KMeans, which have fixed structures, in all cases with noticeable margins. The centroid-based experience replay mechanism performs well compared to SI and LWF.

Table 2 reports the network structures of KIERA and the episodic memory. The structural learning of KIERA generates a compact and bounded network structure where the numbers of hidden nodes, hidden layers and hidden clusters are much smaller than the number of data points. The hidden-layer evolution is seen in the PMNIST problem where additional layers are inserted. The centroid-based experience replay excludes the N_init^k unlabelled samples of the pre-training phase of each task. As a result, the focal-points of the episodic memory are fewer than the number of clusters.

Fig. 1 exhibits the evolution of the episodic memory Mem and the replay buffer B*. It is shown that the episodic memory Mem grows at a much faster rate than the replay buffer B*. This trend becomes obvious as the number of tasks increases. This finding is reasonable because each task possesses concept changes inducing the addition of new clusters and thus focal-points of the episodic memory. The selective sampling is capable of finding the most forgotten samples, thus leading to a bounded size of the replay buffer. The number of forgotten samples is high at the beginning of each task but reduces as the centroid-based experience replay is executed.

6 Conclusion
This paper presents an unsupervised continual learning approach termed KIERA built upon the flexible clustering principle. KIERA features a self-organizing network structure adapting quickly to concept changes. The centroid-based experience replay is proposed to address the catastrophic forgetting problem. Our numerical study with four popular continual learning problems confirms the efficacy of KIERA in attaining high Preq Acc, high Task Acc and high FWT. KIERA delivers higher BWT than those using SI and LWF. The advantage of the structural learning mechanism is also demonstrated in our numerical study, where it produces significantly better performance than static network structures. KIERA is capable of learning and predicting without the presence of task IDs and task boundaries. Our memory analysis demonstrates the effectiveness of the centroid-based experience replay in which the size of the replay buffer is bounded. Our future study will address the issue of multiple streams.
References
[Aljundi et al., 2018] Rahaf Aljundi, F. Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and T. Tuytelaars. Memory aware synapses: Learning what (not) to forget. ArXiv, abs/1711.09601, 2018.
[Chaudhry et al., 2019] Arslan Chaudhry, Marc'Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with A-GEM. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.
[Gama, 2010] Joao Gama. Knowledge Discovery from Data Streams. Chapman & Hall/CRC, 1st edition, 2010.
[Hinton et al., 2015] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[Kirkpatrick et al., 2016] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. 2016.
[Lee et al., 2018] Jeongtae Lee, Jaehong Yoon, E. Yang, and Sung Ju Hwang. Lifelong learning with dynamically expandable networks. ArXiv, abs/1708.01547, 2018.
[Li and Hoiem, 2016] Zhizhong Li and Derek Hoiem. Learning without forgetting. 2016.
[Li et al., 2019] Xilai Li, Yingbo Zhou, Tianfu Wu, Richard Socher, and Caiming Xiong. Learn to grow: A continual structure learning framework for overcoming catastrophic forgetting. In Proceedings of Machine Learning Research, volume 97, pages 3925-3934, Long Beach, California, USA, 09-15 Jun 2019. PMLR.
[Lopez-Paz and Ranzato, 2017] David Lopez-Paz and Marc'Aurelio Ranzato. Gradient episodic memory for continual learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 6470-6479, Red Hook, NY, USA, 2017. Curran Associates Inc.
[Mao et al., 2021] Fubing Mao, Weiwei Weng, Mahardhika Pratama, and Edward Yapp Kien Yee. Continual learning via inter-task synaptic mapping. Knowledge-Based Systems, 222:106947, 2021.
[Paik et al., 2020] Inyoung Paik, Sangjun Oh, Taeyeong Kwak, and Injung Kim. Overcoming catastrophic forgetting by neuron-level plasticity control. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, New York, NY, USA, February 7-12, 2020, pages 5339-5346. AAAI Press, 2020.
[Parisi et al., 2018] German I. Parisi, Ronald Kemker, Jose L. Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. 2018.
[Pratama et al., 2019] Mahardhika Pratama, Choiru Za'in, Andri Ashfahani, Yew Soon Ong, and Weiping Ding. Automatic construction of multi-layer perceptron network from streaming examples. 2019.
[Rebuffi et al., 2017] Sylvestre-Alvise Rebuffi, A. Kolesnikov, Georg Sperl, and Christoph H. Lampert. iCaRL: Incremental classifier and representation learning. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5533-5542, 2017.
[Rusu et al., 2016] Andrei A. Rusu, Neil C. Rabinowitz, G. Desjardins, Hubert Soyer, James Kirkpatrick, K. Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. ArXiv, abs/1606.04671, 2016.
[Schwarz et al., 2018] Jonathan Schwarz, Jelena Luketina, Wojciech M. Czarnecki, Agnieszka Grabska-Barwinska, Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. Progress & compress: A scalable framework for continual learning. 2018.
[Shin et al., 2017] Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. 2017.
[Smith et al., 2019] James Smith, Seth Baer, Zsolt Kira, and Constantine Dovrolis. Unsupervised continual learning and self-taught associative memory hierarchies. In 2019 International Conference on Learning Representations Workshops, 2019.
[Yang et al., 2017] Bo Yang, Xiao Fu, Nicholas D. Sidiropoulos, and Mingyi Hong. Towards k-means-friendly spaces: Simultaneous deep learning and clustering. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3861-3870. PMLR, 06-11 Aug 2017.
[Zenke et al., 2017] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. 2017.