Self-Supervised Training Enhances Online Continual Learning
Jhair Gallardo¹, Tyler L. Hayes¹, Christopher Kanan¹,²,³
¹Rochester Institute of Technology, ²Paige, ³Cornell Tech
{gg4099, tlh6792, kanan}@rit.edu
arXiv:2103.14010v1 [cs.CV] 25 Mar 2021

Abstract

In continual learning, a system must incrementally learn from a non-stationary data stream without catastrophic forgetting. Recently, multiple methods have been devised for incrementally learning classes on large-scale image classification tasks, such as ImageNet. State-of-the-art continual learning methods use an initial supervised pre-training phase, in which the first 10%-50% of the classes in a dataset are used to learn representations in an offline manner before continual learning of new classes begins. We hypothesize that self-supervised pre-training could yield features that generalize better than supervised learning, especially when the number of samples used for pre-training is small. We test this hypothesis using the self-supervised MoCo-V2 and SwAV algorithms. On ImageNet, we find that both outperform supervised pre-training considerably for online continual learning, and the gains are larger when fewer samples are available. Our findings are consistent across three continual learning algorithms. Our best system achieves a 14.95% relative increase in top-1 accuracy on class incremental ImageNet over the prior state of the art for online continual learning.

Figure 1: In this study, we found that self-supervised pre-training was more effective than supervised pre-training on ImageNet when less data was used. This figure shows the percentage increase in final top-1 accuracy on ImageNet of two self-supervised methods, MoCo-V2 [16] and SwAV [9], compared to supervised pre-training when paired with REMIND, as a function of the number of ImageNet classes used for pre-training.

1. Introduction

Conventional convolutional neural networks (CNNs) are trained offline and then evaluated. When new training data is acquired, the CNN is re-trained from scratch. This can be wasteful for both storage and compute. Often, only a small number of new samples need to be learned, and re-training from scratch requires storing the entire dataset. Ideally, the CNN would be updated on only the new samples, which is known as continual learning. The longstanding challenge has been that catastrophic forgetting [47] occurs in conventional CNNs when updating on only new samples, especially when the data stream is non-stationary (e.g., when classes are learned incrementally) [34, 6, 17]. Recently, continual learning methods have been shown to scale to incremental class learning on full-resolution images in the 1,000 category ImageNet dataset with only a small gap between them and offline systems [28, 29, 55, 10, 71, 32, 68]. All of these methods use an initial supervised pre-training period on the first 10-50% of the classes. After pre-training, continual learning begins on the remaining classes.

Performance of these systems depends on the amount of data used for supervised pre-training [29]. We hypothesized that self-supervised pre-training would be more effective, especially when less data is used. Because supervised learning only requires features that discriminate among the classes in the pre-training set, it may not produce optimal representations for unseen categories. In contrast, self-supervised learning promotes the acquisition of category-agnostic feature representations. Using these features could help close the generalization gap between CNNs trained offline and those trained in a continual manner. Here, we test this hypothesis using two self-supervised learning methods.
This paper makes the following contributions:

• We compare the discriminative power of supervised features to self-supervised features learned with Momentum Contrast v2 (MoCo-V2) [16] and Swapping Assignments between multiple Views (SwAV) [9] as a function of the amount of training data used during pre-training. In offline linear evaluation experiments, we find that self-supervised features generalize better to ImageNet categories omitted from pre-training, with the gap being larger when less data is used.

• We further study the ability of supervised and self-supervised features on datasets that were not used for pre-training.

• We study the effectiveness of self-supervised features in three systems for online continual learning of additional categories in ImageNet. Systems must learn in a single pass through the dataset without looping over large batches. Across these algorithms, we found self-supervised features worked better when fewer categories were used for pre-training compared to supervised pre-training (see Fig. 1).

• We achieve the best known result for online continual learning of ImageNet, where each image is seen ordered by category, with a relative increase of 14.95% over the previous state of the art [28].

Figure 2: During pre-training (left), the features of a CNN are initialized offline on a (possibly labeled) dataset, A. Before online continual learning (right), these initialized weights are copied into the continual learner's CNN, where a portion of the network will remain frozen and the remainder of the network will be plastic and updated continually. For online continual learning, the learner receives a single labeled sample (X_j, y_j) at each time-step t, for which it makes an update and subsequent prediction. The continual learner first learns all samples from the pre-train dataset, A, before learning samples from dataset B.

2. Problem Formulation

We study the common continual learning paradigm in which pre-training precedes continual learning [28, 29, 55, 10, 71, 32, 33, 34, 68, 4]. Formally, given a pre-training dataset {(X_i, y_i)} ∈ A, with M images X_i and their corresponding labels y_i ∈ Y, a set of parameters θ is learned for a CNN using A in an offline manner, i.e., the learner can shuffle the data to simulate independent and identically distributed data and loop over it as many times as it desires. The main variable of interest in this study is the comparison of learning θ with supervised versus self-supervised approaches.

Continual learning begins after pre-training by first initializing the parameters of the CNN, ψ, which will be continually updated, to ψ ← θ. The continual learning dataset {(X_j, y_j)} ∈ B, with images X_j and their corresponding labels y_j ∈ K, is then incrementally presented to the learner, and the learner cannot loop over B. Here, we study the most general form of the problem, online continual learning, where examples are presented one-by-one to the learner and cannot be revisited without using auxiliary storage, which is kept at a fixed capacity. In the incremental class learning paradigm, which causes severe catastrophic forgetting in conventional approaches, examples are presented in order of their category, and the categories during pre-training and continual learning are disjoint, i.e., Y ∩ K = ∅. We show a depiction of our training protocol for pre-training and online continual learning in Fig. 2.

We focus on online continual learning because these systems are more general and existing implementations can learn from data presented in any order [28, 29, 27], while most incremental batch learning implementations are bespoke to incremental class learning and cannot be used for other orderings without significant algorithmic changes. The alternative formulation is incremental batch learning, where the system queues up examples in memory until it acquires a batch, which is typically about 100,000 examples in papers using ImageNet (100 classes) [55, 10, 68]. Systems then loop over the batch and then purge it from memory¹, and the next batch is acquired. The online formulation is more general, i.e., the batch size is one sample; it more closely matches real-world applications, systems train 5-130× faster [28, 29], and recent work has shown it works nearly as well as batch learning [28].

¹With the exception of those samples cached in an auxiliary storage buffer, if one is used.
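To make this protocol concrete, the sketch below illustrates the two phases: an offline pre-training pass over A, followed by a single online pass over A and then B, one labeled sample at a time. The learner interface (fit_offline, update_online, predict) is hypothetical and not tied to any specific method in this paper.

```python
# Minimal sketch of the pre-train-then-stream protocol in Fig. 2.
# The learner object and its methods are assumptions for illustration.

def run_protocol(learner, pretrain_set_A, continual_set_B):
    # Phase 1: offline pre-training. The learner may shuffle and loop over A
    # as many times as it desires (supervised or self-supervised).
    learner.fit_offline(pretrain_set_A)

    # Phase 2: online continual learning. A single pass, one labeled sample
    # at a time, first over A and then over B; samples cannot be revisited
    # except through a fixed-capacity auxiliary buffer inside the learner.
    for x_j, y_j in list(pretrain_set_A) + list(continual_set_B):
        learner.update_online(x_j, y_j)   # exactly one update per time-step
        _ = learner.predict(x_j)          # prediction made at each time-step
    return learner
```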
3. Related Work

3.1. Continual Learning

Deep neural networks suffer from catastrophic forgetting [47] when they are incrementally updated on non-stationary data streams. Catastrophic forgetting occurs when past representations are overwritten with new ones, which causes a drop in performance on past data. There are three main approaches to mitigating catastrophic forgetting [53]: 1) increasing model capacity to include new representations that will not interfere with old knowledge [67, 60, 58, 3, 57, 54, 46, 45, 65], 2) regularizing continual parameter updates such that parameters do not deviate too much from their past values [2, 12, 13, 18, 35, 40, 44, 56, 61, 69], and 3) replay (or rehearsal) models that cache previous data in an auxiliary memory buffer and mix it with new data to fine-tune the network [55, 10, 32, 68, 28, 21, 5, 33, 64]. While network expansion and regularization methods have been popular, they do not easily scale to large datasets containing millions of images or thousands of categories (see [6] for a review). Moreover, many of these methods require additional information about when a task changes during continual learning or which task a particular sample came from², or they perform poorly [34, 28, 29]. Our experiments focus on the use of a single classification head during online continual learning, where information about the task is unknown to models during both training and evaluation. We believe this setting is more readily deployable, as task information could be difficult to obtain in real-time.

Recently, methods that use replay have demonstrated success for large-scale continual learning of the ImageNet dataset [55, 10, 32, 68, 28, 21, 5, 64]. All of these models follow a similar training paradigm where they are first pre-trained on a subset of 100 [55, 10, 68, 28] or 500 [32, 21, 64] ImageNet classes in an offline setting, where they can shuffle and loop over the data as much as necessary. After the pre-training phase, these methods are updated on the remaining ImageNet classes and evaluated on seen classes at various steps throughout training. Although the algorithms differ fundamentally, they all perform pre-training using a supervised cross-entropy loss and a fixed pre-train number of classes. To the best of our knowledge, only [29] has investigated the impact of the size of the supervised pre-training set on a continual learner's performance. However, we think this is an important aspect that should be investigated further as smaller pre-training sizes require less compute and less data. Since self-supervised learning is advantageous in producing generalizable feature representations for downstream tasks [30] and does not require samples to be labeled, we hypothesize that self-supervised pre-training will yield more transferable features for online continual learning. Here, we study the use of a self-supervised pre-training phase and the impact of different pre-training dataset sizes on continual learning performance.

²This paradigm is commonly referred to as the multi-head setting in the continual learning literature.

3.2. Self-Supervised Learning

Neural networks require large amounts of data for training. Recently, self-supervised learning techniques have gained widespread popularity because they rival supervised learning techniques [9] without requiring labeled data, which can be expensive and difficult to obtain. Further, features learned with self-supervised learning methods have been shown to generalize to several downstream tasks such as object detection and image recognition [49], even when their features are fixed [9]. Motivated by these findings, we believe self-supervised learning could facilitate data efficient generalization in the online continual learning paradigm, so we study it here.

Specifically, self-supervised learning methods use pretext tasks to learn visual features, where the network provides its own supervision during training. This eliminates the need for annotated labels, which can be difficult and expensive to acquire. Different pretext tasks have been proposed for visual representation learning, including the prediction of image colorization [72], image rotation [24], and several others [50, 20, 19, 8]. More recent works use contrastive learning [26, 51], where the model learns which datapoints are similar or different based on feature similarity, and they have even surpassed supervised networks on many downstream tasks [30, 14, 16, 15, 38, 25, 9].
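As a concrete illustration of the instance-discrimination objective these contrastive methods build on, the sketch below shows a generic InfoNCE-style loss in PyTorch, where a query embedding must be matched to its positive key rather than to a set of negative keys (e.g., a memory queue). This is a generic formulation for exposition, not the exact loss implementation of any specific method discussed here.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query, positive_key, negative_keys, temperature=0.07):
    """Generic InfoNCE-style contrastive loss.

    query:         (N, D) embeddings of one augmented view per image.
    positive_key:  (N, D) embeddings of the matching augmented views.
    negative_keys: (K, D) embeddings treated as negatives (e.g., a memory queue).
    """
    query = F.normalize(query, dim=1)
    positive_key = F.normalize(positive_key, dim=1)
    negative_keys = F.normalize(negative_keys, dim=1)

    l_pos = torch.sum(query * positive_key, dim=1, keepdim=True)  # (N, 1) positive logits
    l_neg = query @ negative_keys.t()                             # (N, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature

    # The positive sits at index 0, so every query's target class is 0.
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)
```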
MoCo [30] compares instance features using two different deep models, where one network is updated with a running mean of the other network's weights. It then stores a set of image representations in a dynamic buffer to be used as negative examples. Representations learned by MoCo achieve competitive results against supervised features during fine-tuning on several downstream tasks. A Simple framework for Contrastive Learning of visual Representations (SimCLR) [14] follows the same instance discrimination scheme as MoCo, but only one deep model is used to obtain image representations. It uses a large batch size, which provides enough negative samples without the need for a buffer. SimCLR also composes several data augmentation techniques and performs a non-linear transformation of feature representations before computing the contrastive loss. These additional components of SimCLR allow it to outperform supervised networks on semi-supervised tasks where only a small number of labels are available.

MoCo-V2 [16] incorporates some of SimCLR's design improvements into MoCo to outperform previous methods without the need for large batch sizes. These include the addition of a projection head and applying more data augmentation. SimCLR-V2 [15] explores larger models and uses Selective Kernel Networks [39] to achieve improved performance over SimCLR. SwAV [9] follows a different contrastive approach, where it performs a cluster assignment prediction instead of comparing features of instances directly. This allows SwAV to outperform methods like MoCo-V2 and SimCLR.
SwAV is the first self-supervised method to surpass ImageNet supervised features by performing a linear evaluation on datasets like Places-205 [74], VOC-07 [22], and iNaturalist-2018 [66]. This means that SwAV features generalize better in downstream tasks, even when those features are fixed.

Self-supervised features are commonly tested by performing a linear evaluation on top of a frozen feature extractor, or by fine-tuning the complete network on downstream tasks. In both cases, it is expected that the complete dataset is available for self-supervised pre-training. [49] assumes that the entire dataset is not available to pre-train the system. By fine-tuning the features on different downstream tasks, they showed that self-supervision surpasses supervision when less labeled data is available during pre-training. Our work differs from [49] in three ways: 1) we test newer self-supervised methods like MoCo-V2 [16] and SwAV [9], 2) our downstream task is online continual learning for image classification, which naturally assumes that only a small portion of the data is available for pre-training, and 3) we do not use controlled synthetic datasets and instead focus on high-resolution natural image datasets.

4. Algorithms

In our study, we compare online continual learning systems that use either self-supervised or supervised pre-training as a function of the size of the pre-training dataset. For all experiments, we use ResNet-18 as the CNN. For continual learning with the full-resolution ImageNet dataset, ResNet-18 has been adopted as the universal standard CNN architecture by the community [55, 68, 28, 32, 21, 28, 10, 5, 4]. Next, we give details for the pre-training approaches and then we describe the online continual learning algorithms.

4.1. Pre-Training Approaches

We describe the three pre-training algorithms we study. The self-supervised methods were chosen based on their strong performance for offline linear evaluation in their papers and because they were practical to train due to their relatively low computational costs, since they do not require a large batch size, unlike some self-supervised algorithms. All pre-trained models were trained on a computer with 4 TitanX GPUs (2015 edition) with 128 GB of system RAM.

Supervised: Supervised pre-training for continual learning is our baseline, as it is the standard approach used [29, 28, 55, 10, 32, 68], where the CNN is trained using cross-entropy with the labels on the pre-training data. We followed the same protocol as the state-of-the-art REMIND system for pre-training [28]. Systems were trained for 40 epochs with a minibatch size of 256, an initial learning rate of 0.1 with a decrease of 10× every 15 epochs, momentum of 0.9, and weight decay of 1e-4. Standard random crop and horizontal flips were used for augmentation.
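For illustration, the supervised baseline's schedule could be set up in PyTorch roughly as follows. This is a sketch under the hyperparameters stated above, not the authors' exact code; pretrain_loader and num_pretrain_classes are assumed to be defined elsewhere.

```python
import torch
from torchvision.models import resnet18

# Supervised pre-training sketch: ResNet-18 trained with cross-entropy for 40 epochs,
# SGD with lr 0.1 (decayed 10x every 15 epochs), momentum 0.9, weight decay 1e-4.
model = resnet18(num_classes=num_pretrain_classes)          # assumed pre-train class count
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.1)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(40):
    for images, labels in pretrain_loader:                   # minibatch size 256, crop + flip
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```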
MoCo-V2: The original self-supervised MoCo [30] architecture builds a dynamic dictionary such that keys in the dictionary relate to training images. It then trains an encoder network such that query images should be similar to their closest key in the dictionary and further from dissimilar keys using a contrastive loss. MoCo-V2 [16] makes two additional improvements over MoCo: it uses an additional projection layer and blur augmentation. We chose MoCo-V2 due to its success in transferring representations to downstream tasks and its superior performance to existing methods, while requiring less memory and compute. Following [16], we train MoCo-V2 models for 800 epochs using a learning rate of 0.015 and minibatch size of 128.

SwAV: The self-supervised SwAV [9] method learns to assign clusters to different augmentations or "views" of the same image. Unlike standard contrastive learning techniques (e.g., MoCo), SwAV does not require direct pairwise comparisons of features for its swapped prediction contrastive loss. We chose SwAV since it was the first self-supervised method to outperform supervised features on downstream tasks such as object detection and recognition. Following [9], SwAV models are trained for 400 epochs with an initial learning rate of 0.6, final learning rate of 0.0006, epsilon of 0.03, and minibatch size of 64. We set the number of prototypes to 10 |Y| in a queue of length 384.

4.2. Online Continual Learning Models

We evaluate three online continual learning algorithms that each use pre-trained features. They were chosen because they have been shown to get strong or state-of-the-art results on incremental class learning for ImageNet. For experiments with class-incremental learning on ImageNet, all continual learning methods first visit the pre-training set with online updates before observing the set that was not used for pre-training. We provide details below.

SLDA [29]: Deep Streaming Linear Discriminant Analysis (SLDA) keeps the pre-trained CNN features fixed. It solely learns the output layer of the network. It was shown to be extremely effective compared to earlier methods that do update the CNN, despite not using any auxiliary memory. It learns extremely quickly. SLDA stores a set of running mean vectors for each class and a shared covariance matrix, both initialized during pre-training, and only these parameters are updated during online training. A new input is classified by assigning it the label of the closest Gaussian in feature space. Following [29], we use a plastic covariance matrix and shrinkage of 1e-4.
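The sketch below illustrates an SLDA-style classifier over fixed CNN embeddings: one running mean per class, a single shared covariance matrix updated in a streaming fashion, and prediction by the closest class Gaussian with shrinkage. This is a simplified illustration; the exact update and scoring rules of Deep SLDA [29] differ in their details.

```python
import numpy as np

class StreamingLDA:
    """Simplified SLDA-style classifier over fixed CNN embeddings (illustrative only)."""

    def __init__(self, feat_dim, num_classes, shrinkage=1e-4):
        self.means = np.zeros((num_classes, feat_dim))
        self.counts = np.zeros(num_classes)
        self.cov = np.zeros((feat_dim, feat_dim))
        self.total = 0
        self.shrinkage = shrinkage

    def fit_sample(self, x, y):
        # Streaming (Welford-style) update of the class mean and shared covariance.
        self.total += 1
        self.counts[y] += 1
        delta = x - self.means[y]
        self.means[y] += delta / self.counts[y]
        self.cov += (np.outer(delta, x - self.means[y]) - self.cov) / self.total

    def predict(self, x):
        d = self.cov.shape[0]
        # Shrink the covariance toward the identity so it is always invertible.
        precision = np.linalg.inv((1 - self.shrinkage) * self.cov
                                  + self.shrinkage * np.eye(d))
        diffs = x - self.means                                  # (num_classes, feat_dim)
        # Label of the closest class Gaussian under the shared covariance.
        dists = np.einsum('cd,de,ce->c', diffs, precision, diffs)
        return int(np.argmin(dists))
```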
Online Softmax with Replay: Offline softmax classifiers are often trained with self-supervised features, so we also created an online softmax classifier for continual learning by using replay to mitigate catastrophic forgetting. Like SLDA, it does not learn features in the CNN after pre-training. When a new example is to be learned, it randomly samples a buffer of CNN embeddings to mix in 50 additional samples, and this set of 51 samples is used to update the weights using gradient descent and a learning rate of 0.1. In our experiments, this buffer contains a maximum of 735K feature vectors, each with 512 dimensions (1.5 GB). When the buffer reaches maximum capacity, we randomly discard an example from the most represented class, as sketched below.
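A minimal sketch of this replay buffer and its class-balancing eviction policy follows. The class name and interface are hypothetical; only the capacity behavior, the 50-sample replay draw, and the eviction rule come from the description above.

```python
import random
from collections import defaultdict

class ReplayBuffer:
    """Fixed-capacity store of (embedding, label) pairs for Online Softmax with Replay."""

    def __init__(self, capacity=735_000):
        self.capacity = capacity
        self.per_class = defaultdict(list)   # label -> list of stored 512-d embeddings
        self.size = 0

    def add(self, embedding, label):
        if self.size >= self.capacity:
            # Evict a random example from the most represented class.
            biggest = max(self.per_class, key=lambda c: len(self.per_class[c]))
            self.per_class[biggest].pop(random.randrange(len(self.per_class[biggest])))
            self.size -= 1
        self.per_class[label].append(embedding)
        self.size += 1

    def sample(self, k=50):
        # Draw k stored examples uniformly at random to mix with the new sample.
        pool = [(e, c) for c, embs in self.per_class.items() for e in embs]
        return random.sample(pool, min(k, len(pool)))
```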
REMIND [28]: Unlike SLDA and Online Softmax with Replay, REMIND does not keep the CNN features fixed after pre-training. Instead, it keeps the lower layers of the CNN fixed, but allows the weights in the upper layers to change. To mitigate forgetting, REMIND replays mid-level CNN features that are compressed using Product Quantization (PQ) and stored in a buffer. After initializing the CNN features during pre-training, we use this same pre-training data to train the PQ model and store the associated compressed representations of this data in the memory buffer. We run REMIND with the recommended settings from [28], i.e., starting learning rate of 0.1, 32 codebooks each of size 256, 50 randomly selected replay samples, a buffer size of 959,665 (equal to 1.5 GB), mix-up data augmentation, and extracting mid-level features such that two convolutional layers and the final classification layer remain plastic during online learning.

5. Experimental Setup

For continual learning on many classes, ImageNet ILSVRC-2012 is the standard benchmark for assessing the ability to scale [59]. It has 1.2 million images from 1,000 classes, with about 1,000 images per class used for training. Most existing papers have used the first 100 classes (10%) for supervised pre-training [55, 10, 68, 28] or the first 500 (50%) classes [32, 21, 64]. An exception is [29], which studied performance as a function of the size of the supervised pre-training set and found that performance was highly dependent on the amount of data. While large amounts of labeled data are available for natural RGB images, there are other domains where collecting large amounts of labeled data for pre-training may be infeasible.

We split ImageNet into pre-training sets for feature learning of various sizes: 10, 15, 25, 50, 75, and 100 classes. In our experiments on ImageNet, after learning features on a pre-train set in an offline manner, the pre-train set and the remainder of the dataset are combined and then examples are fed one-by-one into the learner. Given our limited computational resources, we were not able to train for additional pre-training dataset sizes.

In addition to doing continual learning on ImageNet itself, we also study how well the pre-trained features learned on ImageNet from various sizes of pre-train sets generalize to another dataset from an entirely different domain: scene classification. To do this we use the Places-365 dataset [73], which has 1.8 million images from 365 classes. We exclusively use it for offline linear evaluation and continual learning, i.e., we do not perform pre-training on this dataset.

All of our continual learning experiments use the class incremental learning setting, where the data is ordered by class, but all images are shuffled within each class. We report top-1 accuracy on the validation sets. For ImageNet, we ran the validation set through the model after the first pre-training batch was observed and then after multiples of 100 classes. For Places-365, validation was computed after all 365 classes were incrementally learned.

6. Results

6.1. Offline Linear Evaluation Results

Before examining the efficacy of the different pre-training methods for online continual learning, we first compare the efficacy of the self-supervised and supervised features as a function of the amount of data for pre-training. To do this, we use the standard offline linear evaluation done for self-supervised learning [30, 9, 14]. This enables us to measure how generalizable the pre-trained features are when less data is used, without being concerned about catastrophic forgetting during continual learning. While this setup is standard in self-supervised learning, existing work has focused on this comparison only when doing pre-training with all 1,000 ImageNet categories, rather than on subsets that have varying numbers of categories. Although [49] evaluates the effectiveness of self-supervised training for downstream tasks using different dataset sizes, our study differs in three ways: 1) we evaluate the effectiveness of the more recent MoCo-V2 and SwAV techniques, which were not evaluated in [49], 2) we study the ability of self-supervised features to generalize to large-scale natural image datasets containing millions of images, whereas [49] only evaluates on synthetic datasets, and 3) we are the first to evaluate self-supervised pre-training for online continual learning downstream tasks.

Following others, our offline linear evaluation to compare the quality of the pre-trained features was done using a softmax classifier. The classifier was trained for 100 epochs with a minibatch size of 256. For supervised and SwAV features, we used a learning rate of 0.1, which we decay by a factor of 10 at 60 and 80 epochs, and L2 weight decay of 1e-5. These settings did not work well for MoCo-V2, so we used the settings for linear evaluation from [30] instead, which were a learning rate of 30 that we decay by a factor of 10 at epochs 60 and 80, with no weight decay.
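The linear evaluation itself can be sketched as follows: the pre-trained backbone is frozen and only a linear softmax classifier is trained on its 512-dimensional embeddings. This is an illustrative PyTorch sketch using the supervised/SwAV hyperparameters above (the MoCo-V2 variant would use a learning rate of 30 and no weight decay); backbone and train_loader are assumed to be provided.

```python
import torch
from torch import nn

def linear_evaluation(backbone, train_loader, num_classes, epochs=100):
    # Freeze the pre-trained feature extractor; only the linear head is learned.
    for p in backbone.parameters():
        p.requires_grad = False
    backbone.eval()

    classifier = nn.Linear(512, num_classes)        # 512-d ResNet-18 embeddings
    optimizer = torch.optim.SGD(classifier.parameters(), lr=0.1, weight_decay=1e-5)
    # Decay the learning rate by 10x at epochs 60 and 80.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60, 80], gamma=0.1)
    criterion = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in train_loader:
            with torch.no_grad():
                feats = backbone(images).flatten(1)
            optimizer.zero_grad()
            loss = criterion(classifier(feats), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()
    return classifier
```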
We first compared how effective the pre-training approaches were when directly compared on only the same classes used for pre-training, before assessing their efficacy on unseen classes (Table 1). We had expected supervised features to outperform self-supervised features in this experiment, but surprisingly SwAV outperformed supervised features when 50 or fewer classes were used, and it rivaled or exceeded their performance when the number of classes was 75 or 100. MoCo-V2 only outperformed supervised features when using 15 or fewer classes.

Table 1: Top-1 accuracies for offline linear evaluations with different features and numbers of pre-train classes. We evaluate only on the pre-train ImageNet classes.

Features     10     15     25     50     75     100
Supervised   71.20  82.67  86.64  83.16  81.20  80.32
MoCo-V2      82.20  84.00  84.40  80.80  78.96  77.76
SwAV         91.20  90.27  89.44  85.48  81.04  80.82

We next assessed the effectiveness of the features to generalize to unseen categories by training the softmax classifier on all 1,000 classes. Results are shown in Table 2. We found that SwAV features outperform supervised and MoCo-V2 features across all pre-train sizes. The gap in performance between self-supervised and supervised pre-training is larger when using fewer pre-train classes. For example, SwAV outperforms supervised features by 3.24% when using 100 classes during pre-training; however, when we decrease the number of pre-training classes to 10, SwAV outperforms supervised features by 20.67%. MoCo-V2 also outperforms supervised features, with a gap of 1.61% for 100 pre-training classes, and 11.34% for 10. This shows that self-supervised features generalize better in linearly classifying data outside the set they were trained on as compared to supervised representations, especially when the pre-train size is small.

Table 2: Top-1 accuracies for offline linear evaluations with different features and numbers of pre-train classes. We evaluate on all 1,000 classes in ImageNet.

Features     10     15     25     50     75     100
Supervised   10.42  18.23  26.76  34.32  38.29  41.58
MoCo-V2      21.76  27.50  30.61  36.45  40.15  43.19
SwAV         31.09  33.62  35.81  39.54  43.43  44.82

We next examine the impact of the classes chosen for pre-training. To do this, we randomly chose 6 different sets of 10 pre-train classes from ImageNet to learn features, and then trained an offline softmax classifier using these features on all 1,000 ImageNet classes. We calculated the mean top-1 accuracy and standard deviation across the 6 runs and obtained: 12.65%±1.26% for supervised, 22.78%±0.88% for MoCo-V2, and 31.88%±1.40% for SwAV. With only 10 classes, SwAV and MoCo-V2 always outperformed supervised features. The standard deviation was relatively low across systems even with 10 classes, so we did not study larger pre-training sets.

Lastly, we evaluate how well the ImageNet pre-trained features transferred to Places-365. We trained the softmax classifier for 28 epochs with a minibatch size of 256, reducing the learning rate by 10× at epochs 10 and 18. We used the same learning rates and L2 weight decays from the ImageNet linear evaluation. Results are in Table 3. Similar to our results on ImageNet, self-supervised features outperform supervised features when pre-trained on ImageNet and evaluated on the Places-365 dataset. SwAV features outperform MoCo-V2 features for all pre-train sizes, where more benefit is seen when the network was pre-trained on fewer classes. These results demonstrate the versatility of self-supervised features in generalizing to datasets beyond what they were trained on. This is especially useful when generalizing to datasets that have very few classes and/or samples, as pre-training directly on a subset of the dataset would be difficult.

Table 3: Top-1 accuracies for offline linear evaluations with different features and numbers of pre-train classes. We evaluate on all 365 classes in Places-365.

Features     10     15     25     50     75     100
Supervised   15.07  22.83  28.46  31.03  32.63  33.78
MoCo-V2      24.07  29.45  31.01  32.58  34.54  35.93
SwAV         32.71  33.88  35.10  35.16  36.75  37.41

6.2. Online Continual Learning Results

The results of training Deep SLDA, Online Softmax, and REMIND using different pre-training methods and sizes are shown in Table 4, and the associated learning curves are in Fig. 3. Deep SLDA results show that SwAV outperforms supervised features when the pre-train size is less than or equal to 75, while MoCo-V2 features surpass supervised features when the pre-train size is less than or equal to 25. Even when supervised pre-training outperforms self-supervised features, SwAV and MoCo-V2 still show competitive performance, with gaps no greater than 4%. Results for Online Softmax show that SwAV surpasses supervised features for all pre-train sizes. MoCo-V2 outperforms supervised features for sizes less than 25 classes, but it shows competitive performance for other sizes as well. REMIND shows the most benefit from self-supervised pre-training, where SwAV and MoCo-V2 consistently outperform supervised features for all pre-train sizes tested. Fig. 1 shows the relative percentage increase of self-supervision over supervision for REMIND, with a maximum relative improvement of 89.14% for SwAV and 60.74% for MoCo-V2 when only 10 classes are used for pre-training. Similar relative percentage plots for Online Softmax and Deep SLDA are in the supplemental materials.
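The relative increases plotted in Fig. 1 are simple percentage gains over the supervised baseline. As a worked example, they can be reproduced from the REMIND rows of Table 4:

```python
# Final top-1 accuracy (%) of REMIND from Table 4, for 10 and 100 pre-train classes.
remind_top1 = {
    "Supervised": {10: 19.79, 100: 45.28},
    "MoCo-V2":    {10: 31.81, 100: 49.25},
    "SwAV":       {10: 37.43, 100: 52.05},
}

def relative_increase(ssl_acc, supervised_acc):
    return 100.0 * (ssl_acc - supervised_acc) / supervised_acc

print(relative_increase(remind_top1["SwAV"][10],    remind_top1["Supervised"][10]))   # ~89.14
print(relative_increase(remind_top1["MoCo-V2"][10], remind_top1["Supervised"][10]))   # ~60.74
print(relative_increase(remind_top1["SwAV"][100],   remind_top1["Supervised"][100]))  # ~14.95
```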
Table 4: Final top-1 accuracy values achieved by each continual learning method on ImageNet using various features and numbers of pre-train classes.

Method         10     15     25     50     75     100
Deep SLDA
Supervised     9.53   14.67  20.70  26.54  29.53  31.99
MoCo-V2        19.37  20.66  21.64  24.43  26.26  28.31
SwAV           22.33  24.01  25.22  28.50  30.89  31.77
Online Softmax
Supervised     9.87   15.92  23.58  30.16  33.05  35.14
MoCo-V2        23.09  25.67  25.66  28.79  31.14  33.81
SwAV           18.65  21.40  27.42  35.79  39.48  41.31
REMIND
Supervised     19.79  26.17  33.48  39.38  43.40  45.28
MoCo-V2        31.81  34.39  38.43  44.50  47.36  49.25
SwAV           37.43  40.09  43.05  48.31  51.01  52.05

In Table 4, it is worth noting that REMIND using SwAV pre-training on only 10 classes surpasses all of the results obtained by Deep SLDA. Also, pre-training REMIND with SwAV on 50 classes outperforms supervised pre-training on 100 classes, showing that SwAV can surpass supervised features with only half of the data used during pre-training. This is desirable because using less data during pre-training results in faster initialization for online continual learning models. Further, labels are not needed for self-supervised methods, which can be difficult and expensive to obtain.

Fig. 3 shows learning curves for REMIND using supervised, MoCo-V2, and SwAV features for various pre-training set sizes. These curves show the top-1 accuracy every time a multiple of 100 classes has been seen by the model, including the performance right after pre-training. We see that adding more classes during pre-training consistently improves REMIND's performance for all features used. Furthermore, using SwAV or MoCo-V2 consistently improves performance over supervised pre-training, which can be seen by the vertical shift of the learning curves across different features. Similar learning curves for Online Softmax and Deep SLDA are in the supplemental materials.

Figure 3: Learning curves for each pre-train size with REMIND using (a) supervised, (b) MoCo-V2, and (c) SwAV features.

For a fair comparison with the original REMIND paper, we compute the average top-5 accuracy of REMIND using SwAV features pre-trained on 100 classes of ImageNet and evaluated on seen classes at increments of 100 classes. This variant of REMIND with SwAV features achieves an average top-5 accuracy of 83.55%, which is a 4.87% absolute percentage improvement over the supervised results reported in [28]. This sets a new state-of-the-art result for online continual learning of ImageNet. Moreover, REMIND using SwAV features pre-trained on 100 classes of ImageNet achieves a 14.95% relative increase in final top-1 accuracy over supervised features (see Fig. 1).

6.2.1 Domain Transfer with Deep SLDA

So far, we have studied the effectiveness of supervised and self-supervised pre-training to generalize to unseen classes from the same dataset for online continual learning. A natural next step is to study how well self-supervised pre-training of features on one dataset generalizes to continual learning on another dataset, e.g., pre-train on ImageNet and continually learn on Places-365. This set of experiments is similar to standard transfer learning and domain adaptation setups [42, 43, 52, 70], and could prove especially useful when performing continual learning on small datasets that require rich and generalizable feature representations.

To evaluate our hypothesis about feature transfer to a different dataset, we trained Deep SLDA on Places-365 using various pre-trained ImageNet features. We chose Deep SLDA since it is extremely fast to train. We start online learning with the first sample in Places-365 and learn one class at a time from mean vectors initialized as zeros and a covariance matrix initialized as ones, as in [29]. Final top-1 accuracy results are in Table 5. Overall, we found that SwAV features generalized the best, independent of the number of pre-training classes. Although the performance gap between SwAV features with 10 and 100 pre-training classes is small (3.05%), SwAV outperforms supervised features trained with 10 pre-train classes by 12.76%. These results demonstrate that self-supervised features can improve online learning performance, even when the pre-training and continual learning datasets differ.

Table 5: Final top-1 accuracy values achieved by Deep SLDA when pre-trained on ImageNet using various features and numbers of pre-train classes and continually updated and evaluated on the Places-365 dataset.

Features     10     15     25     50     75     100
Supervised   13.77  19.59  23.42  25.28  26.04  27.03
MoCo-V2      23.36  23.72  24.05  24.17  25.11  26.00
SwAV         26.53  27.20  28.22  28.48  29.38  29.58
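As a hypothetical usage of the StreamingLDA sketch given earlier in Section 4.2, this transfer experiment amounts to streaming Places-365 embeddings from the frozen ImageNet-pre-trained backbone through a fresh SLDA head; the feature stream below is assumed.

```python
# Illustrative only: stream Places-365 embeddings (from a frozen ImageNet backbone)
# through a fresh SLDA head, one class at a time. Note that [29] initializes the
# covariance entries to ones here, whereas the earlier sketch starts from zeros.
slda = StreamingLDA(feat_dim=512, num_classes=365)
for embedding, label in places365_embedding_stream:   # assumed (feature, label) iterator
    slda.fit_sample(embedding, label)
```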
7. Discussion

We replaced supervised pre-training with self-supervised techniques for online continual learning on ImageNet. Our results show that self-supervised methods require fewer pre-training classes to achieve high performance on image classification compared to supervised pre-training. This behaviour is seen both during offline linear evaluation and during online continual learning. Using SwAV pre-trained features, REMIND achieves a new state-of-the-art performance for online continual learning on ImageNet.

We followed others in using the ResNet-18 CNN architecture [55, 29, 28]; however, wider and deeper networks have been shown to improve performance, especially for self-supervised learning methods [14, 15, 25]. It would be interesting to explore these alternate architectures for future online continual learners initialized using self-supervised methods. Additionally, we studied online learning methods because they are more versatile. However, we think it is likely that self-supervised initialized features would also benefit incremental batch learning models [55, 68].

In addition to features trained purely in the supervised or self-supervised paradigms, future studies could investigate semi-supervised pre-training. Semi-supervised learning would be advantageous over supervised learning because it requires fewer labels, but could still learn features discriminative to specific classes, while generalizing similarly to self-supervised techniques. Furthermore, it would be interesting to develop and evaluate self-supervised or semi-supervised online updates for continual learning methods. These could be used to continually update the plastic layers of REMIND, which we hypothesize would improve performance and generalization further.

Similar to how self-supervised features have been evaluated for downstream tasks in offline settings [30], it would also be interesting to evaluate how well self-supervised features perform for additional downstream online continual learning tasks, e.g., continual object detection [1, 62], continual semantic segmentation [11, 48], or continual learning for robotics applications [23, 37]. We demonstrated that self-supervised features initialized on one image recognition dataset (ImageNet) generalize to another recognition dataset (Places-365), and we hypothesize that self-supervised features would also generalize to additional downstream continual learning tasks.

A natural downstream setting that our work facilitates is open world learning [7] or automatic class discovery [63, 75], where an agent must identify samples outside of its training distribution as unknown and then learn them. We showed that self-supervised features generalize to both unseen classes and unseen datasets in online settings, so extending our work to open world learning would only require the addition of a component to identify if samples are unknown or out-of-distribution [41, 31, 36]. Open world learning is more realistic than online continual learning since an agent cannot expect an indication about whether or not a sample is from its known training distribution.

8. Conclusion

Continually updating neural networks on non-stationary data streams is a longstanding challenge in artificial intelligence. While much progress has been made to develop continual learners that scale to large-scale datasets like ImageNet, these approaches require an initial offline supervised training phase on the first 10-50% of ImageNet data. In this paper, we demonstrated the discriminative power of using self-supervised CNN features produced by the MoCo-V2 and SwAV models to generalize to unseen classes/datasets in online continual learning settings. We demonstrated that self-supervised features consistently outperformed supervised features, especially when only a small subset of the dataset was available for training. Further, we set a new state-of-the-art for online continual learning by pairing the recent REMIND architecture with features pre-trained with SwAV for a 14.95% top-1 relative performance gain. Self-supervised learning is desirable as it reduces overfitting, promotes generalization, and does not require labels, which can be costly to obtain. In the future, it would be interesting to perform online updates with self-supervised learning techniques in addition to self-supervised pre-training.
Acknowledgements. This work was supported in part [13] Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, by the DARPA/SRI Lifelong Learning Machines program and Mohamed Elhoseiny. Efficient lifelong learning with A- [HR0011-18-C-0051], AFOSR grant [FA9550-18-1-0121], GEM. In ICLR, 2019. 3 and NSF award #1909696. The views and conclusions con- [14] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Ge- tained herein are those of the authors and should not be offrey Hinton. A simple framework for contrastive learning interpreted as representing the official policies or endorse- of visual representations. arXiv preprint arXiv:2002.05709, 2020. 3, 5, 8 ments of any sponsor. We thank fellow lab member Manoj [15] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Acharya for his comments and useful discussions. Norouzi, and Geoffrey Hinton. Big self-supervised mod- els are strong semi-supervised learners. arXiv preprint References arXiv:2006.10029, 2020. 3, 8 [1] Manoj Acharya, Tyler L Hayes, and Christopher Kanan. [16] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Rodeo: Replay for online object detection. In BMVC, 2020. Improved baselines with momentum contrastive learning. 8 arXiv preprint arXiv:2003.04297, 2020. 1, 2, 3, 4 [17] Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah [2] Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Parisot, Xu Jia, Ales Leonardis, Gregory Slabaugh, and Marcus Rohrbach, and Tinne Tuytelaars. Memory aware Tinne Tuytelaars. Continual learning: A comparative study synapses: Learning what (not) to forget. In ECCV, pages on how to defy forgetting in classification tasks. arXiv 139–154, 2018. 3 preprint arXiv:1909.08383, 2(6), 2019. 1 [3] Rahaf Aljundi, Punarjay Chakravarty, and Tinne Tuytelaars. [18] Prithviraj Dhar, Rajat Vikram Singh, Kuan-Chuan Peng, Expert gate: Lifelong learning with a network of experts. Ziyan Wu, and Rama Chellappa. Learning without memoriz- In Proceedings of the IEEE Conference on Computer Vision ing. In Proceedings of the IEEE/CVF Conference on Com- and Pattern Recognition, pages 3366–3375, 2017. 3 puter Vision and Pattern Recognition, pages 5138–5146, [4] Eden Belouadah and Adrian Popescu. Deesil: Deep-shallow 2019. 3 incremental learning. In Proceedings of the European Con- [19] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsuper- ference on Computer Vision (ECCV) Workshops, pages 0–0, vised visual representation learning by context prediction. In 2018. 2, 4 Proceedings of the IEEE international conference on com- [5] Eden Belouadah and Adrian Popescu. Il2m: Class incre- puter vision, pages 1422–1430, 2015. 3 mental learning with dual memory. In Proceedings of the [20] Alexey Dosovitskiy, Jost Tobias Springenberg, Martin Ried- IEEE/CVF International Conference on Computer Vision, miller, and Thomas Brox. Discriminative unsupervised fea- pages 583–592, 2019. 3, 4 ture learning with convolutional neural networks. Citeseer, [6] Eden Belouadah, Adrian Popescu, and Ioannis Kanellos. 2014. 3 A comprehensive study of class incremental learning algo- [21] Arthur Douillard, Matthieu Cord, Charles Ollion, Thomas rithms for visual tasks. Neural Networks, 2020. 1, 3 Robert, and Eduardo Valle. Podnet: Pooled outputs dis- [7] Abhijit Bendale and Terrance Boult. Towards open world tillation for small-tasks incremental learning. In Com- recognition. In CVPR, pages 1893–1902, 2015. 
8 puter vision-ECCV 2020-16th European conference, Glas- [8] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and gow, UK, August 23-28, 2020, Proceedings, Part XX, volume Matthijs Douze. Deep clustering for unsupervised learning 12365, pages 86–102. Springer, 2020. 3, 4, 5 of visual features. In Proceedings of the European Confer- [22] Mark Everingham, Luc Van Gool, Christopher KI Williams, ence on Computer Vision (ECCV), pages 132–149, 2018. 3 John Winn, and Andrew Zisserman. The pascal visual object [9] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Pi- classes (voc) challenge. International journal of computer otr Bojanowski, and Armand Joulin. Unsupervised learning vision, 88(2):303–338, 2010. 4 of visual features by contrasting cluster assignments. arXiv [23] Fan Feng, Rosa HM Chan, Xuesong Shi, Yimin Zhang, and preprint arXiv:2006.09882, 2020. 1, 2, 3, 4, 5 Qi She. Challenges in task incremental learning for assistive [10] Francisco M Castro, Manuel J Marı́n-Jiménez, Nicolás Guil, robotics. IEEE Access, 2019. 8 Cordelia Schmid, and Karteek Alahari. End-to-end incre- [24] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Un- mental learning. In Proceedings of the European conference supervised representation learning by predicting image rota- on computer vision (ECCV), pages 233–248, 2018. 1, 2, 3, tions. arXiv preprint arXiv:1803.07728, 2018. 3 4, 5 [25] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin [11] Fabio Cermelli, Massimiliano Mancini, Samuel Rota Bulo, Tallec, Pierre H Richemond, Elena Buchatskaya, Carl Do- Elisa Ricci, and Barbara Caputo. Modeling the background ersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Moham- for incremental learning in semantic segmentation. In CVPR, mad Gheshlaghi Azar, et al. Bootstrap your own latent: A pages 9233–9242, 2020. 8 new approach to self-supervised learning. arXiv preprint [12] Arslan Chaudhry, Puneet K Dokania, Thalaiyasingam Ajan- arXiv:2006.07733, 2020. 3, 8 than, and Philip HS Torr. Riemannian walk for incremen- [26] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensional- tal learning: Understanding forgetting and intransigence. In ity reduction by learning an invariant mapping. In 2006 IEEE ECCV, pages 532–547, 2018. 3 Computer Society Conference on Computer Vision and Pat-
tern Recognition (CVPR’06), volume 2, pages 1735–1742. [40] Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE, 2006. 3 IEEE transactions on pattern analysis and machine intelli- [27] Tyler L Hayes, Nathan D Cahill, and Christopher Kanan. gence, 40(12):2935–2947, 2017. 3 Memory efficient experience replay for streaming learning. [41] Shiyu Liang, Yixuan Li, and R. Srikant. Enhancing the re- In ICRA, pages 9769–9776, 2019. 2 liability of out-of-distribution image detection in neural net- [28] Tyler L Hayes, Kushal Kafle, Robik Shrestha, Manoj works. In Proccedings of the International Conference on Acharya, and Christopher Kanan. Remind your neural net- Learning Representations (ICLR), 2018. 8 work to prevent catastrophic forgetting. In ECCV, 2020. 1, [42] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jor- 2, 3, 4, 5, 7 dan. Learning transferable features with deep adaptation net- [29] Tyler L Hayes and Christopher Kanan. Lifelong machine works. In International conference on machine learning, learning with deep streaming linear discriminant analysis. In pages 97–105. PMLR, 2015. 7 CVPRW, 2020. 1, 2, 3, 4, 5, 7 [43] Mingsheng Long, Jianmin Wang, Guiguang Ding, Jiaguang [30] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Sun, and Philip S Yu. Transfer feature learning with joint Girshick. Momentum contrast for unsupervised visual rep- distribution adaptation. In Proceedings of the IEEE inter- resentation learning. In Proceedings of the IEEE/CVF Con- national conference on computer vision, pages 2200–2207, ference on Computer Vision and Pattern Recognition, pages 2013. 7 9729–9738, 2020. 3, 4, 5, 6, 8 [44] David Lopez-Paz and Marc’Aurelio Ranzato. Gradient [31] Dan Hendrycks and Kevin Gimpel. A baseline for detecting episodic memory for continual learning. In NeurIPS, pages misclassified and out-of-distribution examples in neural net- 6467–6476, 2017. 3 works. In Proccedings of the International Conference on [45] Arun Mallya, Dillon Davis, and Svetlana Lazebnik. Piggy- Learning Representations (ICLR), 2017. 8 back: Adapting a single network to multiple tasks by learn- [32] Saihui Hou, Xinyu Pan, Chen Change Loy, Zilei Wang, and ing to mask weights. In Proceedings of the European Con- Dahua Lin. Learning a unified classifier incrementally via ference on Computer Vision (ECCV), pages 67–82, 2018. 3 rebalancing. In Proceedings of the IEEE/CVF Conference on [46] Arun Mallya and Svetlana Lazebnik. Packnet: Adding mul- Computer Vision and Pattern Recognition, pages 831–839, tiple tasks to a single network by iterative pruning. In Pro- 2019. 1, 2, 3, 4, 5 ceedings of the IEEE Conference on Computer Vision and [33] Ronald Kemker and Christopher Kanan. Fearnet: Brain- Pattern Recognition, pages 7765–7773, 2018. 3 inspired model for incremental learning. arXiv preprint [47] Michael McCloskey and Neal J Cohen. Catastrophic inter- arXiv:1711.10563, 2017. 2, 3 ference in connectionist networks: The sequential learning [34] Ronald Kemker, Marc McClure, Angelina Abitino, Tyler problem. In Psychology of learning and motivation, vol- Hayes, and Christopher Kanan. Measuring catas- ume 24, pages 109–165. Elsevier, 1989. 1, 2 trophic forgetting in neural networks. arXiv preprint [48] Umberto Michieli and Pietro Zanuttigh. Incremental learn- arXiv:1708.02072, 2017. 1, 2, 3 ing techniques for semantic segmentation. In ICCV-W, pages [35] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel 0–0, 2019. 
8 Veness, Guillaume Desjardins, Andrei A Rusu, Kieran [49] Alejandro Newell and Jia Deng. How useful is self- Milan, John Quan, Tiago Ramalho, Agnieszka Grabska- supervised pretraining for visual tasks? In Proceedings of Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Ku- the IEEE/CVF Conference on Computer Vision and Pattern maran, and Raia Hadsell. Overcoming catastrophic forget- Recognition, pages 7345–7354, 2020. 3, 4, 5 ting in neural networks. PNAS, 2017. 3 [36] Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A [50] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of simple unified framework for detecting out-of-distribution visual representations by solving jigsaw puzzles. In Euro- samples and adversarial attacks. In Proceedings of the Ad- pean conference on computer vision, pages 69–84. Springer, vances in Neural Information Processing Systems (NeurIPS), 2016. 3 2018. 8 [51] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Repre- [37] Timothée Lesort, Vincenzo Lomonaco, Andrei Stoian, Da- sentation learning with contrastive predictive coding. arXiv vide Maltoni, David Filliat, and Natalia Dı́az-Rodrı́guez. preprint arXiv:1807.03748, 2018. 3 Continual learning for robotics: Definition, framework, [52] Sinno Jialin Pan, Ivor W Tsang, James T Kwok, and Qiang learning strategies, opportunities and challenges. Informa- Yang. Domain adaptation via transfer component analy- tion Fusion, 58:52–68, 2020. 8 sis. IEEE Transactions on Neural Networks, 22(2):199–210, [38] Junnan Li, Pan Zhou, Caiming Xiong, Richard Socher, and 2010. 7 Steven CH Hoi. Prototypical contrastive learning of unsu- [53] German I Parisi, Ronald Kemker, Jose L Part, Christopher pervised representations. arXiv preprint arXiv:2005.04966, Kanan, and Stefan Wermter. Continual lifelong learning with 2020. 3 neural networks: A review. Neural Networks, 2019. 3 [39] Xiang Li, Wenhai Wang, Xiaolin Hu, and Jian Yang. Selec- [54] Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. tive kernel networks. In Proceedings of the IEEE/CVF Con- Efficient parametrization of multi-domain deep neural net- ference on Computer Vision and Pattern Recognition, pages works. In Proceedings of the IEEE Conference on Computer 510–519, 2019. 3 Vision and Pattern Recognition, pages 8119–8127, 2018. 3
[55] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg [69] Friedemann Zenke, Ben Poole, and Surya Ganguli. Contin- Sperl, and Christoph H Lampert. icarl: Incremental classifier ual learning through synaptic intelligence. In ICML, pages and representation learning. In Proceedings of the IEEE con- 3987–3995, 2017. 3 ference on Computer Vision and Pattern Recognition, pages [70] Jing Zhang, Wanqing Li, and Philip Ogunbona. Joint geo- 2001–2010, 2017. 1, 2, 3, 4, 5, 7, 8 metrical and statistical alignment for visual domain adapta- [56] Hippolyt Ritter, Aleksandar Botev, and David Barber. On- tion. In Proceedings of the IEEE conference on computer line structured laplace approximations for overcoming catas- vision and pattern recognition, pages 1859–1867, 2017. 7 trophic forgetting. In NeurIPS, pages 3738–3748, 2018. 3 [71] Junting Zhang, Jie Zhang, Shalini Ghosh, Dawei Li, Serafet- [57] Amir Rosenfeld and John K Tsotsos. Incremental learning tin Tasci, Larry Heck, Heming Zhang, and C-C Jay Kuo. through deep adaptation. IEEE transactions on pattern anal- Class-incremental learning via deep model consolidation. In ysis and machine intelligence, 42(3):651–663, 2018. 3 Proceedings of the IEEE/CVF Winter Conference on Appli- [58] Deboleena Roy, Priyadarshini Panda, and Kaushik Roy. cations of Computer Vision, pages 1131–1140, 2020. 1, 2 Tree-cnn: a hierarchical deep convolutional neural network [72] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful for incremental learning. Neural Networks, 121:148–160, image colorization. In European conference on computer 2020. 3 vision, pages 649–666. Springer, 2016. 3 [59] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, San- [73] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, jeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, and Antonio Torralba. Places: A 10 million image database Aditya Khosla, Michael Bernstein, et al. Imagenet large for scene recognition. IEEE Transactions on Pattern Analy- scale visual recognition challenge. International journal of sis and Machine Intelligence, 2017. 5 computer vision, 115(3):211–252, 2015. 5 [74] Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Tor- ralba, and Aude Oliva. Learning deep features for scene [60] Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, recognition using places database. 2014. 4 Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Raz- van Pascanu, and Raia Hadsell. Progressive neural networks. [75] Jun-Yan Zhu, Jiajun Wu, Yan Xu, Eric Chang, and Zhuowen arXiv preprint arXiv:1606.04671, 2016. 3 Tu. Unsupervised object class discovery via saliency-guided multiple class learning. TPAMI, 37(4):862–875, 2014. 8 [61] Joan Serra, Didac Suris, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. In ICML, pages 4555–4564, 2018. 3 [62] Konstantin Shmelkov, Cordelia Schmid, and Karteek Ala- hari. Incremental learning of object detectors without catas- trophic forgetting. In ICCV, pages 3400–3409, 2017. 8 [63] Josef Sivic, Bryan C Russell, Alexei A Efros, Andrew Zisser- man, and William T Freeman. Discovering object categories in image collections. In ICCV, 2005. 8 [64] Xiaoyu Tao, Xinyuan Chang, Xiaopeng Hong, Xing Wei, and Yihong Gong. Topology-preserving class-incremental learning. In ECCV, 2020. 3, 5 [65] Xiaoyu Tao, Xiaopeng Hong, Xinyuan Chang, Songlin Dong, Xing Wei, and Yihong Gong. Few-shot class- incremental learning. 
In Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 12183–12192, 2020. 3 [66] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The inaturalist species classification and de- tection dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8769–8778, 2018. 4 [67] Yu-Xiong Wang, Deva Ramanan, and Martial Hebert. Grow- ing a brain: Fine-tuning by increasing model capacity. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2471–2480, 2017. 3 [68] Yue Wu, Yinpeng Chen, Lijuan Wang, Yuancheng Ye, Zicheng Liu, Yandong Guo, and Yun Fu. Large scale in- cremental learning. In Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 374–382, 2019. 1, 2, 3, 4, 5, 8
Supplemental Material

S1. Additional Results

S1.1. Relative Performance Improvements

Fig. S1 shows the relative performance improvements exhibited by the Deep SLDA and Online Softmax methods when performing online continual learning on the ImageNet dataset using MoCo-V2 and SwAV features.

Deep SLDA (S1a) shows a maximum relative improvement of 134.31% for SwAV features and 103.25% for MoCo-V2 features when only 10 classes are used for pre-training. Deep SLDA follows the same trend as REMIND from Fig. 1, with SwAV outperforming MoCo-V2 for all pre-train sizes. MoCo-V2 shows negative relative improvement for 50, 75, and 100 pre-training classes, while SwAV shows a small negative relative improvement of -0.69% for 100 classes. These few cases of negative relative performance occur when there are a large number of pre-training classes, which is less desirable for pre-training as it requires more data.

Online Softmax (S1b) shows a maximum relative improvement of 88.96% for SwAV features and 133.94% for MoCo-V2 features when only 10 classes are used for pre-training. Surprisingly, MoCo-V2 outperforms SwAV for 10 and 15 pre-train classes, which differs from the results for REMIND and Deep SLDA. We see a small negative relative improvement with MoCo-V2 for 50, 75 and 100 pre-train classes, while SwAV shows positive relative improvement for all pre-train sizes. Overall, these results demonstrate that self-supervised features are superior to supervised features for continual learning when less data is used during pre-training.

S1.2. Learning Curves

Learning curves for online continual learning on ImageNet using Deep SLDA and Online Softmax with various features are in Fig. S2 and Fig. S3, respectively. These curves show the top-1 accuracy every time a multiple of 100 classes has been seen by the model, including the performance right after pre-training.

Similar to REMIND (Fig. 3), adding more classes during pre-training improves Deep SLDA's performance for all features used. The vertical shift of the curves upwards across features shows that MoCo-V2 and SwAV consistently outperform supervised features for 10, 15 and 25 pre-train classes. Even though supervised features show the best performance for 100 classes, MoCo-V2 and SwAV show competitive results.

Online Softmax also has better performance when using more pre-train classes across different features. Surprisingly, MoCo-V2 outperforms supervised and SwAV features for 10 and 15 pre-train classes. SwAV outperforms supervised features across all pre-train sizes. It is worth noting that when 100 pre-train classes are used, all three sets of features start around the same top-1 accuracy, but after the model has finished learning all 1,000 classes, SwAV achieves the highest performance. This behaviour means that SwAV features obtained during pre-training on 100 classes are more useful when continually learning all 1,000 classes.