Deep transfer learning for image classification: a survey

Jo Plested1* and Tom Gedeon2

arXiv:2205.09904v1 [cs.CV] 20 May 2022

1* School of Engineering and Information Technology, University of New South Wales, Northcott Drive, Campbell, 2612, ACT, Australia.
2 Optus Centre for Artificial Intelligence, Curtin University, Kent Street, Bentley, 6102, WA, Australia.

*Corresponding author(s). E-mail(s): j.plested@unsw.edu.au;
Contributing authors: tom.gedeon@curtin.edu.au;

Abstract
Deep neural networks such as convolutional neural networks (CNNs) and transformers have achieved many successes in image classification in recent years. It has been consistently demonstrated that best practice for image classification is when large deep models can be trained on abundant labelled data. However, there are many real world scenarios where the requirement for large amounts of training data to get the best performance cannot be met. In these scenarios transfer learning can help improve performance. To date there have been no surveys that comprehensively review deep transfer learning as it relates to image classification overall. However, several recent general surveys of deep transfer learning and ones that relate to particular specialised target image classification tasks have been published. We believe it is important for the future progress in the field that all current knowledge is collated and the overarching patterns analysed and discussed. In this survey we formally define deep transfer learning and the problem it attempts to solve in relation to image classification. We survey the current state of the field and identify where recent progress has been made. We show where the gaps in current knowledge are and make suggestions for how to progress the field to fill in these knowledge gaps. We present a new taxonomy of the applications of transfer learning for image classification. This taxonomy makes it easier to see overarching patterns of where transfer learning has been effective and where it has failed to fulfill its potential. This also allows us to suggest where the problems lie and how it could be used more effectively. We demonstrate that under this new taxonomy, many of the applications where transfer learning has been shown to be ineffective or even to hinder performance are to be expected when taking into account the source and target datasets and the techniques used. In many of these cases, the key problem is that methods and hyperparameter settings designed for large and very similar target datasets are used for smaller and much less similar target datasets. We identify alternative choices that could lead to better outcomes.

Keywords: Deep Transfer Learning, Image Classification, Convolutional Neural Networks, Deep Learning

1 Introduction

Deep neural network architectures such as convolutional neural networks (CNNs) and more recently transformers have achieved many successes in image classification [26, 58, 62, 73, 74]. It has been consistently demonstrated that these models perform best when there is abundant
labelled data available for the task and large models can be trained [50, 70, 87]. However, there are many real world scenarios where the requirement for large amounts of training data cannot be met. Some of these are:

1. Insufficient data because the data is very rare or there are issues with privacy etc. For example, new and rare disease diagnosis tasks in the medical domain have limited training data due to both the examples themselves being rare and privacy concerns.
2. It is prohibitively expensive to collect and/or label data. For example, labelling can only be done by highly qualified experts in the field.
3. The long tail distribution where a small number of objects/words/classes are very frequent and thus easy to model, while many more are rare and thus hard to model [6]. For example, most language generation problems.

There are several other reasons why we may want to learn from a small number of training examples:

• It is interesting from a cognitive science perspective to attempt to mimic the human ability to learn general concepts from a small number of examples.
• There may be restraints on compute resources that limit training a large model from random initialisation with large amounts of data. For example, environmental concerns [124].

In all these scenarios transfer learning can often greatly improve performance. In this paradigm the model is trained on a related dataset and task for which more data is available and the trained weights are used to initialise a model for the target task. In order for this process to improve rather than harm performance the datasets must be related closely enough and best practice methods must be used.

In this survey we review recent progress in deep transfer learning for image classification and highlight areas where knowledge is lacking and could be improved. With the exponentially increasing demand for the application of modern deep CNN models to a wider array of real world application areas, work in transfer learning has increased at a commensurable pace. It is important to regularly take stock and survey the current state of the field, where recent progress has been made and where the gaps in current knowledge are. We also make suggestions for how to progress the field to fill in these knowledge gaps. While there are many surveys in related domains and specific sub areas, to the best of our knowledge there are none that focus on deep transfer learning for image classification in general. We believe it is important for the future progress in the field that all the knowledge is collated together and the overarching patterns analysed and discussed.

We make the following contributions:

1. formally defining deep transfer learning and the problem it attempts to solve as it relates to image classification
2. performing a thorough review of recent progress in the field
3. presenting a taxonomy of source and target dataset relationships in transfer learning applications that helps highlight why transfer learning does not perform as expected in certain application areas
4. giving a detailed summary of source and target datasets commonly used in the area to provide an easy reference for the reader looking to understand relationships between where transfer learning has performed best and where results have been less consistent
5. summarizing current knowledge in the area as well as pointing out knowledge gaps and suggested directions for future research.

In Section 2 we review all surveys in the area, from general transfer learning to more closely related domains. In Section 3 we introduce the problem domain and formalise the difficulties with learning from small datasets that transfer learning attempts to solve. This section includes terminology and definitions that are used throughout this paper. Section 4 details the source and target datasets commonly used in deep learning for image classification. Section 5 provides a detailed analysis of all recent advances and improvements to transfer learning and specific application areas and highlights gaps in current knowledge. In Section 6 we give an overview of other problem domains that are closely related to deep transfer learning for image classification, including the similarities and differences in each. Finally, Section 7 summarises all current knowledge, gaps and problems and recommends directions for future work in the area.
2 Related work

Many reviews related to deep transfer learning have been published in the past decade and the pace has only increased in the last few years. However, they differ from ours in two main ways. The first group consists of more general reviews that provide a high level overview of transfer learning and attempt to include all machine learning sub-fields and all task sub-fields. Reviews in this group are covered in Section 2.1. The second group is more specific, with reviews providing a comprehensive breakdown of the progress on a particular narrow domain specific task. They are discussed in the relevant parts of Section 5.7. There are a few surveys that are more closely related to ours, with differences discussed in Section 2.2.

2.1 General transfer learning surveys

The most recent general transfer learning survey [46] is an extremely broad overview of most areas related to deep transfer learning, including those areas related to deep transfer learning for image classification outlined in Section 6. As it is a broad general survey there is no emphasis on how deep transfer learning applies to image classification, and thus the trends seen in this area are not covered.

A thorough theoretical analysis of general transfer learning techniques is given in [158]. Transfer learning techniques are split into data-based and model-based, then further divided into subcategories. Deep learning models are explicitly discussed as a sub-section of the model-based categorisation. The focus is on generative models such as auto-encoders and Generative Adversarial Networks (GANs), and several papers are reviewed. Neural networks are also mentioned briefly under the Parameter Control Strategy and Feature Transformation Strategy sections. However, the focus is on unsupervised pretraining strategies rather than best practice for transferring learning.

Zhang et al. [153] take the most similar approach to categorizing the transfer learning task space as ours. They divide transfer learning into 17 categories based on source and target dataset and label attributes. They then review approaches taken within each category. Since it is a general transfer learning survey with no focus on deep learning and image classification, the trends in this area are not covered.

Weiss et al. [141] divide general transfer learning into homogeneous, where the source and target dataset distributions are the same, and heterogeneous, where they are not, and give a thorough description of each. They review many different approaches in each category, but few of them are related to deep neural networks.

2.2 Closely related work

There are some recent review papers that, based on their title, seem to be more closely related. However, they are short summary papers containing limited details on the subject matter rather than full review papers.

A Survey on Deep Transfer Learning [127] defines deep transfer learning and separates it into four categories based on the subset of techniques used. The focus is more on showing a broad selection of methods rather than providing much detail or focusing particularly on deep transfer learning methods. Most major works in the area from the past decade are missing.

Deep Learning and Transfer Learning Approaches for Image Classification [56] focuses on defining CNNs along with some of the major architectures and results from the past decade. The paper includes a few brief paragraphs defining transfer learning, and some of the image classification results incorporate transfer learning, but no review of the topic is performed.

A Survey of Transfer Learning for Convolutional Neural Networks [106] is a short paper which briefly introduces the transfer learning task and settings, and introduces general categories of approaches and applications. It does not review any specific approaches or applications.

Transfer Learning for Visual Categorization: A Survey [115] is a full review paper, but is older, with no deep learning techniques included.

In Small Sample Learning in Big Data Era [119] deep transfer learning is a large part of the work, but not the focus. Some examples of deep learning applied to image classification domains are mentioned, but there is no discussion of methods for improving deep transfer learning as it relates to image classification.
3 Overview

3.1 Problem Definition

In this Section definitions used throughout the paper are introduced. Transfer learning can be categorised by both the task and the mode. We start by defining the model, then the task and finally how they interact together in this case.

Deep learning is a modern name for neural networks with more than one hidden layer. Neural networks are themselves a sub-area of machine learning. Mitchell [79] provides a succinct definition of machine learning:

Definition 1 "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

Neural networks are defined by Gurney [33] as:

Definition 2 "A neural network is an interconnected assembly of simple processing elements, units or nodes, called neurons, whose functionality is loosely based on the animal neuron. The processing ability of the network is stored in the inter unit connection strengths, or weights, obtained by a process of adaptation to, or learning from, a set of training patterns."

The neurons in a multilayer feed forward neural network of the type that we consider in this review have nonlinear activation functions [29] and are arranged in layers with weights W feeding forward from one layer to the next.

Generally, a neural network learns to improve its performance at task T from experience E, being the set of training patterns, via gradient descent and backpropagation. Backpropagation is an application of the chain rule applied to propagate derivatives from the final layers of the neural network to the hidden and input weights [110]. There are other, less frequently used ways to train neural networks, such as with genetic algorithms, that have been shown to be successful in particular applications. In this paper we assume training is done via backpropagation for generality.

While it has been proven that neural networks with one hidden layer are universal approximators [41], in practice, because the loss function is non-convex with respect to the weights, it is difficult to optimise. For this reason modern networks are often arranged in very deep networks and task specific architectures, like CNNs and transformers for images, to allow for easier training of parameters.

The hierarchical structure of these networks allows for ever more complex patterns to be learned. This is one of the things that has allowed deep learning to be successful at many different tasks in recent years, when compared to other machine learning algorithms. However, this only applies if there is enough data to train them. Figure 1 shows the increase in ImageNet 1K performance with the number of model parameters. Figure 2 shows that for large modern CNN models in general the performance on ImageNet 1K increases with the number of training examples in the source dataset. This suggests that large modern CNNs are likely overfitting when trained from random initialization on ImageNet 1K. Of course there are some outliers, as the increase in performance from additional source data also depends on how related the source data is to the target data. This is discussed further in Section 5.3.1. These two results combine to show the stated effect that deep learning performance scales with the size of the dataset and model.

As noted in Section 1 there are many real world scenarios where large amounts of data are unavailable or we are interested in training a model on a small amount of data for other reasons.

3.2 Learning from small vs large datasets

A thorough review of the problems of learning from a small number of training examples is given in [139].

3.2.1 Empirical Risk Minimization

We are interested in finding a function f that minimises the expected risk:

R_{TRUE}(f) = E[\ell(f(x), y)] = \int \ell(f(x), y) \, dp(x, y)

with

f^* = \arg\min_f R_{TRUE}(f)
Fig. 1 Increase in performance on ImageNet 1K due to model size, measured by number of parameters in millions

Fig. 2 Percentage increase in performance on ImageNet 1K due to increased source dataset size

R_{TRUE}(f) is the true risk if we have access to an infinite set of all possible data and labels, with f^* being the function that minimizes the true risk. In practical applications, however, the joint probability distribution P(x, y) = P(y|x)P(x) is unknown and the only available information is contained in the training set. For this reason the true risk is replaced by the empirical risk, which is the average of sample losses over the training set D:

R_n(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i),

leading to empirical risk minimisation [132].

Before we begin training our model we must choose a family of candidate functions F. In the case of CNNs this involves choosing the relevant hyperparameters that determine our model architecture, including the number of layers, the number and shape of filters in each convolutional layer, whether and where to include features like residual connections and normalization layers, and many more. This constrains our final function to the family of candidate functions defined by the free parameters that make up the given architecture. We are then attempting to find a function in F which minimises the empirical risk:

f_n = \arg\min_f R_n(f)
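To make the notation concrete, the sketch below is an illustrative example rather than code from any of the surveyed papers; the model, loss and optimiser settings are assumptions. It shows empirical risk minimisation in PyTorch: a mini-batch estimate of R_n(f) is minimised by gradient descent with backpropagation.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def minimise_empirical_risk(model: nn.Module, train_set, epochs: int = 10) -> nn.Module:
    """Approximate f_n = argmin_{f in F} R_n(f) for the family F defined by `model`.

    R_n(f) = (1/n) * sum_i l(f(x_i), y_i) is estimated on each mini-batch and
    minimised by gradient descent with backpropagation.
    """
    loader = DataLoader(train_set, batch_size=64, shuffle=True)
    loss_fn = nn.CrossEntropyLoss()      # per-sample loss l(f(x), y)
    optimiser = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

    for _ in range(epochs):
        for x, y in loader:
            optimiser.zero_grad()
            risk = loss_fn(model(x), y)  # mini-batch estimate of R_n(f)
            risk.backward()              # chain rule: propagate derivatives to all W
            optimiser.step()             # update the weights W
    return model
```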
Since the optimal function f* is unlikely to be in F we also define:

f^*_F = \arg\min_{f \in F} R_{TRUE}(f)

to be the function in F that minimises the true risk. We can then decompose the excess error that comes from choosing the function in F that minimizes R_n(f):

E[R(f_n) - R(f^*)] = E[R(f^*_F) - R(f^*)] + E[R(f_n) - R(f^*_F)] = \varepsilon_{app} + \varepsilon_{est}

The approximation error ε_app measures how closely functions in F can approximate the optimal solution f*. The estimation error ε_est measures the effect of minimizing the empirical risk R(f_n) instead of the expected risk R(f*) [9]. So finding a function that is as close as possible to f* can be broken down into:

1. choosing a class of models that is more likely to contain the optimal function
2. having a large and broad range of training examples in D to better approximate an infinite set of all possible data and labels.

3.2.2 Unreliable Empirical Risk Minimizer

In general, ε_est can be reduced by having a larger number of examples [139]. Thus, when there are sufficient and varied labelled training examples in D, the empirical risk R(f_n) can provide a good approximation to R(f^*_F), the risk of the optimal f in F. When n, the number of training examples in D, is small, the empirical risk R(f_n) may not be a good approximation of the expected risk R(f^*_F). In this case the empirical risk minimizer overfits.

To alleviate the problem of having an unreliable empirical risk minimizer when D_train is not sufficient, prior knowledge can be used. Prior knowledge can be used to augment the data in D_train, constrain the candidate functions F, or constrain the parameters of f via initialization or regularization [139]. Task specific deep neural network architectures such as CNNs and Recurrent Neural Networks (RNNs) are examples of constraining the candidate functions F through prior knowledge of what the optimal function form may be.

In this review we focus on transfer learning as a form of constraining the parameters of f to address the unreliable empirical risk minimizer problem. Section 6 discusses how deep transfer learning relates to other techniques that use prior knowledge to solve the small dataset problem.

3.3 Deep transfer learning

Deep transfer learning is transfer learning applied to deep neural networks. Pan and Yang [91] define transfer learning as:

Definition 3 "Given a source domain D_S and learning task T_S, a target domain D_T and learning task T_T, transfer learning aims to help improve the learning of the target predictive function f_T(.) in D_T using the knowledge in D_S and T_S, where D_S ≠ D_T, or T_S ≠ T_T."

For the purposes of this paper we define deep transfer learning as follows:

Definition 4 Given a source domain D_S and learning task T_S, a target domain D_T and learning task T_T, deep transfer learning aims to improve the performance of the target model M on the target task T_T by initialising it with weights W that are trained on source task T_S using source dataset D_S (pretraining), where D_S ≠ D_T, or T_S ≠ T_T.

Some or all of W are retained when the model is "transferred" to the target task T_T and dataset D_T. The model is used for prediction on T_T after fully training any reinitialised weights and with or without continuing training on the pretrained weights (fine-tuning). Figure 3 shows the pretraining and fine-tuning pipeline when applying transfer learning with a deep neural network.

Combining the discussion from Section 3.2.2 with Definition 4, using deep transfer learning techniques to pretrain weights W can be thought of as regularizing W. Initialising W with weights that have been well trained on a large source dataset rather than with very small random values results in a flatter loss surface and smaller gradients, which in turn results in more stable updates [67, 85]. In the classic transfer learning setting the source dataset is many orders of magnitude larger than the target dataset. One example is pretraining on ImageNet 1K with 1.3 million training images and transferring to medical imaging tasks which often only have 100s of labelled examples. So even with the same learning rate and number of epochs, the number of updates to the weights while training on the target dataset will be orders of magnitude less than for pretraining. This also prevents the model from creating large weights that are based on noise or idiosyncrasies in the small target dataset.
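A minimal sketch of the pretraining and fine-tuning pipeline in Figure 3 is given below, assuming PyTorch and torchvision's pretrained-weights API; the ResNet-50 backbone, the learning rate and the choice to fine-tune all transferred layers are illustrative assumptions rather than recommendations from the surveyed papers.

```python
import torch
import torch.nn as nn
from torchvision import models

def build_transfer_model(num_target_classes: int, finetune_all: bool = True):
    # Pretraining step: load weights W trained on the source task (ImageNet 1K).
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

    # Reinitialise the task-specific classification head for the target task T_T.
    model.fc = nn.Linear(model.fc.in_features, num_target_classes)

    if not finetune_all:
        # Alternative: freeze the transferred weights and use the network as a
        # fixed feature extractor, training only the new head.
        for name, param in model.named_parameters():
            param.requires_grad = name.startswith("fc.")

    # Fine-tuning typically uses a much smaller learning rate than pretraining,
    # so the transferred weights stay close to their pretrained values.
    optimiser = torch.optim.SGD(
        [p for p in model.parameters() if p.requires_grad],
        lr=0.005, momentum=0.9, weight_decay=1e-4,
    )
    return model, optimiser
```

Reinitialising only the final layer and fine-tuning the rest with a small learning rate corresponds to the standard setting discussed above; freezing the transferred layers instead turns the network into a fixed feature extractor.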
Fig. 3 Deep transfer learning

Advances in transfer learning can be categorized based on ways of constraining the parameters of W as follows:

1. Initialization. Answering questions like:
• how much pretraining should be done?
• is more source data or more closely related source data better?
• which pretrained parameters should be transferred vs reinitialized?
2. Parameter regularization. Regularizing weights, with the assumption that if the parameters are constrained to be close to a set point, they will be less likely to overfit.
3. Feature regularization. Regularizing the features for each training example that are produced by the weights. Based on the assumption that if the features stay close to those trained by the large source dataset the model will be less likely to overfit.

We describe progress and problems with deep transfer learning under these categories as well as based on the relationship between source and target dataset in Section 4. Then, in Section 5, we describe how deep transfer learning relates to other methods.

3.4 Negative Transfer

The stated goal of transfer learning as per Definition 3 is to improve the learning of the target predictive function f_T(.) in D_T using the knowledge in D_S and T_S. To achieve this goal the source dataset must be similar enough to the target dataset to ensure that the features learned in pretraining are relevant to the target task. If the source dataset is not well related to the target dataset the target model can be negatively impacted by pretraining. This is negative transfer [91, 108]. Wang et al. [138, 140] define the negative transfer gap (NTG) as follows:
Definition 5 "Let τ represent the test error on the target domain, θ a specific transfer learning algorithm under which the negative transfer gap is defined and ∅ is used to represent the case where the source domain data/information are not used by the target domain learner. Then, negative transfer happens when the error using the source data is larger than the error without using the source data: τ(θ(S, τ)) > τ(θ(∅, τ)), and the degree of negative transfer can be evaluated by the negative transfer gap"

NTG = τ(θ(S, τ)) − τ(θ(∅, τ))

From this definition we see that negative transfer occurs when the negative transfer gap is positive. Wang et al. elaborate on factors that affect negative transfer [138, 140]:

• Divergence between the source and target domains. Transfer learning makes the assumption that there is some similarity between the joint distributions in the source domain P_S(X, Y) and target domain P_T(X, Y). The higher the divergence between these values, the less information there is in the source domain that can be exploited to improve performance in the target domain. In the extreme case, if there is no similarity it is not possible for transfer learning to improve performance.
• Negative transfer is relative to the size and quality of the source and target datasets. For example, if labelled target data is abundant enough, a model trained on this data only may perform well. In this example, transfer learning methods are more likely to impair the target learning performance. Conversely, if there is no labelled target data, a bad transfer learning method would perform better than a random guess, which means negative transfer would not happen.
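As a rough sketch, the negative transfer gap could be estimated by training the same architecture twice on the target data, once from the pretrained source weights and once from random initialisation, and comparing test errors. The helper functions below are hypothetical placeholders, not an implementation from [138, 140].

```python
def negative_transfer_gap(train_fn, test_error_fn, target_train, target_test, source_weights):
    """NTG = tau(theta(S, tau)) - tau(theta(0, tau)); a positive NTG indicates negative transfer.

    `train_fn` and `test_error_fn` are assumed helpers: train_fn(init_weights, data)
    trains a target model, test_error_fn(model, data) returns its target-domain test error.
    """
    model_with_source = train_fn(init_weights=source_weights, data=target_train)
    model_without_source = train_fn(init_weights=None, data=target_train)  # random init

    return test_error_fn(model_with_source, target_test) - test_error_fn(
        model_without_source, target_test
    )
```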
With deep neural networks, once the weights have been pretrained to respond to particular features in a large source dataset, the weights will not change far from their pretrained values during fine-tuning [85]. This is particularly so if the target dataset is orders of magnitude smaller, as is often the case. This premise allows transfer learning to improve performance and also allows for negative transfer. If the weights transferred are pretrained to respond to unsuitable features then this training will not be fully reversed during the fine-tuning phase and the model could be more likely to overfit to these inappropriate features. Scenarios such as this usually lead to overfitting the idiosyncrasies of the target training set [51, 95]. A related scenario is explored in [51], where it is shown that alternative loss functions that improve how well the pretrained features fit the source dataset lead to a reduction in performance on the target dataset. The authors state that ".. there exists a trade-off between learning invariant features for the original task and features relevant for transfer tasks."

In image classification models, features learned through lower layers are more general, and those learned in higher layers are more task specific [149]. It is likely that if fewer layers are transferred, negative transfer should be less prevalent, with training all layers from random initialization being the extreme end of this. There has been limited work to test this, however it is shown to an extent in [1].

4 Datasets commonly used in transfer learning for image classification

4.1 Source

ImageNet 1K, 5K, 9K, 21K
ImageNet is an image database organized according to the WordNet hierarchy [18]. ImageNet 1K or ILSVRC2012 is a well known subset of ImageNet that is used for an annual challenge. ImageNet 1K consists of 1,000 common image classes with at least 1,000 total images in each class for a total of just over 1.3 million images in the training set. ImageNet 5K, 9K and 21K are larger subsets of the full ImageNet dataset containing the most common 5,000, 9,000 and 21,000 image classes respectively. All three ImageNet datasets have been used as both source and target datasets, depending on the type of experiments being performed. They are most commonly used as a source dataset because of their large sizes and general classes.

JFT dataset
JFT is an internal Google dataset for large-scale image classification, which comprises over 300 million high-resolution images [40]. Images are annotated with labels from a set of 18,291 categories. For example, 1,165 types of animals and 5,720
types of vehicles are labelled in the dataset [125]. Examples of general image classification There are 375M labels and on average each image datasets commonly used as target datasets are: has 1.26 labels. • CIFAR-10 and CIFAR-100 [57]: Each have a total of 50,000 training and 10,000 test images Instagram hashtag datasets of 32x32 colour images from 10 and 100 classes Mahajan et al. [70] collected a weakly labelled respectively. image dataset with a maximum size of 3.5 billion • PASCAL VOC 2007 [20]: Has 20 classes belong- labelled images from Instagram, being over 3,000 ing to the superordinate categories of person, times larger than the commonly used large source animal, vehicle, and indoor objects. It contains dataset ImageNet 1K. The hashtags were used 9,963 images with 24,640 annotated objects and as labels for training and evaluation. By varying a 50/50 train test split. The size of each image the selected hashtags and the number of images is roughly 501 × 375. to sample, a variety of datasets of different sizes • Caltech-101 [21]: has pictures of objects belong- and visual distributions were created. One of the ing to 101 categories. About 40 to 800 images datasets created contained 1,500 hashtags that per category, with most categories having closely matched the 1,000 ImagNet 1K classes. around 50 images. The size of each image is roughly 300 × 200 pixels. Places365 • Caltech-256 [31]. An extension of Caltech-101 Places365 (Places) [155] contains 365 categories of with 256 categories and a minimum of 80 images scenes collected by counting all the entries that per category. It includes a large clutter category corresponded to names of scenes, places and envi- for testing background rejection. ronments in WordNet English dictionary. They included any concrete noun which could reason- Fine-grained ably complete the phrase I am in a place, or let’s Fine-grained image classification datasets contain go to the place. There are two datasets: subordinate classes from one particular superordi- • Places365-standard has 1.8 million training nate class. examples are: examples total with a minimum of 3,068 images • Food-101 (Food) [8]: Contains 101 different per class. classes of food objects with 75,750 training • Places365-challenge has 8 million training examples and 25,250 test examples. examples. • Birdsnap (Birds) [7]: Contains 500 different Places365 is generally used as a source dataset species of birds, with 47,386 training examples when the target dataset is scene based such as and 2,443 test examples. SUN. • Stanford Cars (Cars) [55]: Contains 196 dif- ferent makes and models of cars with 8,144 Inaturalist training examples and 8,041 test examples. • FGVC Aircraft (Aircraft) [71]: Contains 100 dif- Inaturalist [131] consists of 859,000 images from ferent makes and models of aircraft with 6,667 over 5,000 different species of plants and animals. training examples and 3,333 test examples. Inaturalist is generally used as a source dataset • Oxford-IIIT Pets (Pets)[92]: Contains 37 differ- when the target dataset is when the target dataset ent breeds of cats and dogs with 3,680 training contains fine-grained plants or animal classes. examples and 3,369 test examples. • Oxford 102 Flowers (Flowers) [89]: Contains 102 4.2 Target different types of flowers with 2,040 training General examples and 6,149 test examples. 
• Caltech-uscd Birds 200 (CUB) [134]: Contains General image classification datasets contain a variety of classes with a mixture of superordinate 200 different species of birds with around 60 and subordinate classes from many different cat- training examples per class. • Stanford Dogs (Dogs) [49]: Contains 20,580 egories in WordNet [77]. ImageNet is a canonical example of a general image classification dataset. images of 120 breeds of dogs 9
Scenes 5 Deep transfer learning Scene datasets contain examples of different progress and areas for indoor and/or outdoor scene settings. Examples are: improvement • SUN397 (SUN) [145]: Contains 397 categories In the past decade, the successes of CNNs on of scenes. This dataset preceded Places-365 and image classification tasks have inspired many used the same techniques for data collection. researchers to apply them to an increasingly wide The scenes with at least 100 training examples range of domains. Model performance is strongly were included in the final dataset. affected by the relationship between the amount of • MIT 67 Indoor Scenes [98]: Contains 67 Indoor training data and the number of trainable param- categories, and a total of 15620 images. There eters in a model as shown in Figures 1 and 2. As a are at least 100 images per category. result there has been ever growing interest in using transfer learning to allow large CNN models to Others be trained in domains where there is only limited training data available or other constraints exist. There are a number of other datasets that have As deep learning gained popularity in 2012 less of an overarching theme and are less related to 2016 transferability of features and best prac- to the common source datasets. These are often tices for performing deep transfer learning was used in conjunction with deep transfer learning explored [2, 4, 42, 116, 149]. While there are some for image classification to show models and tech- recent works that have introduced improvement to niques are widely applicable. Examples of these transfer learning techniques and insights, there are are: many more that have focused on best practice for • Describable Textures (DTD) [14]: Consists of either general [52, 61, 70, 96] or specific [37, 100] 3,760 training examples of texture images with application domains rather than techniques. We 47 classes of texture adjectives. fully review both. • Daimler pedestrian classification [81]: Con- When reviewing the application of deep trans- tains 23,520 training images with two classes, fer learning for image classification we divide being contains pedestrians and does not contain applications into categories. We split tasks in two pedestrians. directions being small versus large target datasets • German road signs (GTSRB) [123]: Contains and closely versus loosely related source and tar- 39,209 training images of German road signs in get datasets. For example using ImagNet [18] as 43 classes. a source dataset to pretrain a model for classify- • Omniglot [59]: Contains over 1.2 million train- ing tumours on medical images is a loosely related ing examples of 1,623 different handwritten transfer and is likely to be a small target dataset characters from 50 writing systems. due to privacy and scarcity of disease. This cat- • SVHN digits in the wild (SVHN) [84]: Con- egory division aligns with the factors that affect tains 73,257 training examples of labelled digits negative transfer outlined in [138, 140]. cropped from Street View images. The distinction between target dataset sizes • UCF101 Dynamic Images (UCF101) [121]: Con- is useful as it has been shown that small target tains 9,537 static frames of 101 classes of actions datasets are much more sensitive to changes in cropped from action videos. transfer learning hyperparameters [95]. 
It has also • Visual Decathlon Challenge (Decathlon) [103]: been shown that standard transfer learning hyper- A challenge designed to simultaneously solve paramters do not perform as well when trans- 10 image classification problems being: Ima- ferring to a less related target task [34, 52, 96], geNet, CIFAR-100, Aircraft, Daimler pedes- with negative transfer being an extreme example trian, Describable textures, German traf- of this [138, 140], and that the similarity between fic signs, Omniglot, SVHN, UCF101, VGG- datasets should be considered when deciding on Flowers. All images resized to have a shorter hyperparameters [61, 96]. These distinctions go side of 72 pixels some way to explaining conflicting performance 10
of deep transfer learning methods in recent years [34, 61, 135, 159].

We start this section by describing general studies on deep transfer learning techniques, including recent advances. Then we review work in each of the application areas described by our split above. Section 7 summarizes current knowledge and makes final recommendations for future directions of research in the field.

5.1 General deep transfer learning for image classification

Early work on deep transfer learning showed that:

1. Deep transfer learning results in comparable or above state of the art performance in many different tasks, particularly when compared to shallow machine learning methods [4, 116].
2. More pretraining, both in terms of the number of training examples and the number of iterations, tends to result in better performance [2, 4, 42].
3. Fine-tuning the weights on the target task tends to result in better performance, particularly when the target dataset is larger and less similar to the source dataset [2, 4, 149].
4. Transferring more layers tends to result in better performance when the source and target dataset and task are closely matched, but fewer layers are better when they are less related [1, 2, 4, 13, 149].
5. Deeper networks result in better performance [4].

It should be noted that all the studies referenced above were completed prior to advances in residual networks [36] and other modern very deep CNNs. It has been argued that residual networks combined with fine-tuning make features more transferable [36]. As many of the above studies were carried out within a similar time period, some results have not been combined. For instance, most were done with AlexNet, a relatively shallow network, as a base, and many did not perform fine-tuning and/or simply used a deep neural network as a feature detector at whatever layer it was transferred. It has since been shown that when fine-tuning is used effectively, transferring less than the maximum number of layers can result in better performance. This applies even when the source and target datasets are highly related, particularly with smaller target datasets [1, 95, 96].

More recently it has been shown that the performance of models on ImageNet 1K correlates well with performance when the pretrained model is transferred to other tasks [52]. The authors additionally demonstrate that the increase in performance of deep transfer learning over random initialization is highly dependent on both the target dataset size and the relationship between the classes in the source and target datasets. This will be discussed more in the following sections.

5.2 Recent advances

Recent advances in the body of knowledge related to deep transfer learning for image classification can be divided into advances in techniques, and general insights on best practice. We describe advances in transfer learning techniques here and insights on best practice in Section 5.2.4. Recent advances in techniques are divided into regularization, hyperparameter based, matching the source domain to the target domain, and a few others that do not fit the previous categories. We discuss matching the source domain to the target domain under the relevant source versus task domains in Sections 5.3.1 and 5.6.1, and the rest below. In our reviews of recent work we attempt to present a balanced view of the evidence for the improvements offered by newer models compared to prior ones and the limitations of those improvements. However, in some of the more recent cases this is difficult as the original papers provide limited evidence and new work showing the limitations of the methods has not yet been done.

5.2.1 Regularization based technique advances

Most regularization based techniques aim to solve the problem of the unreliable empirical risk minimizer (Section 3.2.2) by restricting the model weights or the features produced by them so they cannot fit small idiosyncrasies in the data. They achieve this by adding a regularization term λ·Ω(.) to the loss function to make it:

\min_w L(w) = \frac{1}{n} \sum_{i=1}^{n} L(z(x_i, w), y_i) + \lambda \cdot \Omega(.)
with the first term \frac{1}{n} \sum_{i=1}^{n} L(z(x_i, w), y_i) being the empirical loss and the second term being the regularization term. The tuning parameter λ > 0 balances the trade-off between the two.

Weight regularization directly restricts how much the model weights can move.

Knowledge distillation or feature based regularization uses the distance between the feature maps output from one or more layers of the source and target networks to regularize the model:

\Omega(w, w_s) = \frac{1}{n} \sum_{j=1}^{N} \sum_{i=1}^{n} d(F_j(w_t, x_i), F_j(w_s, x_i))

where F_j(w_t, x_i) is the feature map output by the jth filter in the target network defined by weights w_t for input value x_i, and d(.) is a measure of dissimilarity between two feature maps.

The success of regularization based techniques for deep transfer learning relies heavily on the assumption that the source and target datasets are closely related. This is required to ensure that the optimal weights or features for the target dataset are not far from those trained on the source dataset.

There have been many new regularization based techniques introduced in the last three years. We review major new techniques in chronological order.

1. L2-SP [64, 65] is a form of weight regularization. The aim of transfer learning is to create models that are regularized by keeping features that are reasonably close to those trained on a source dataset for which overfitting is not as much of a problem. The authors argue that because of this, during the target dataset training phase the fine-tuned weights should be decayed towards the pretrained weights, not zero. Several regularizers that decay weights towards their starting point, denoted SP regularizers, were tested in the original papers. The L2-SP regularizer \Omega(w) = \frac{\alpha}{2} \|w - w^0\|_2^2, which is the L2 loss between the source weights and the current weights, is shown to significantly outperform the standard L2 loss on the four target datasets shown in the paper with a ResNet-101 model. The original paper showed results for transferring to four small target datasets that were very similar to the two source datasets used for pretraining. It has since been shown that the L2-SP regularizer can result in minimal improvement or even negative transfer when the source and target datasets are less related [12, 61, 96, 135]. More recent work has shown that in some cases using L2-SP regularization for lower layers and L2 regularization for higher layers can improve performance [96].
2. DELTA [63] is an example of knowledge distillation or feature map based regularization. It is based on the idea of re-using CNN channels that are not useful to the target task while not changing channels that are useful. Training on the target task is regularized by the attention weighted L2 loss between the final layer feature maps of the source and target models:

\Omega(w, w^0, x_i, y_i) = \sum_{j=1}^{N} W_j(w^0, x_i, y_i) \cdot \|FM_j(w, x_i) - FM_j(w^0, x_i)\|_2^2

where FM_j(w, x_i) is the output from the jth filter applied to the ith input. The attention weights W_j for each filter are calculated by removing the model's filters one by one (setting their output weights to 0) and calculating the increase in loss. Filters resulting in a high increase in loss are then set with a higher weight for regularization, encouraging them to stay similar to those trained on the source task. Others that are not as useful in the target task are less regularized and can change more. This regularization resulted in performance that was slightly better than the L2-SP regularization in most cases with ResNet-101 and Inception-v3 models, ImageNet 1K as the source dataset and a variety of target datasets. The original paper showed state of the art performance for DELTA on Caltech 256-30, however they used mostly the same datasets as the original L2-SP paper [64] and for the two additional datasets used they showed that L2-SP outperformed the baseline L2 regularization. It has since been shown that, like L2-SP, DELTA can also hinder performance when the source and target datasets are less similar [12, 45, 53].
3. Wan et al. [135] propose decomposing the transfer learning gradient update into the empirical loss and regularization loss gradient vectors. Then, when the angle between the two vectors is greater than 90 degrees, they further decompose the regularization loss gradient vector into the portion perpendicular to the empirical loss gradient and the remaining vector in the opposite direction of the empirical loss gradient. They remove the latter term, in the hope that not allowing the regularization term to move the weights in the opposite direction of the empirical loss term will stop negative transfer. They show that their proposal improves performance slightly with a ResNet-18 on four different datasets. However, their results are poor compared to state of the art as they do not test on modern very deep models. For this reason, it is difficult to judge how well their regularization method performs in general.
4. Batch spectral shrinkage (BSS) [12] introduces a loss penalty applied to smaller singular values of channelwise features in each batch update during fine-tuning so that untransferable spectral components are suppressed. They test this method using a ResNet-50 pretrained on ImageNet 1K and fine-tuned on a range of different target datasets. The results show that their method never hurts performance on the given datasets and often produces significant performance gains over L2, L2-SP and DELTA regularization for smaller target datasets. They also show that BSS can improve performance for less similar target datasets where L2-SP hinders performance.
5. Sample-based regularization [45] proposes regularization using the distance between feature maps of pairs of inputs in the same class, as well as weight regularization. The model was tested using a ResNet-50 and transferring from ImageNet 1K and Places365 to a number of different fine grained classification tasks. The authors report an improvement over L2-SP, DELTA and BSS in all tests. Their results reconfirm that BSS performs better than DELTA and L2-SP in most cases, and in some cases DELTA and L2-SP decrease performance compared to the standard L2 regularization baseline.
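The sketch below illustrates how two of the penalties above could be written as extra loss terms in PyTorch. It is a simplified reading of L2-SP [64, 65] and BSS [12], not the authors' released implementations; the penalty coefficients and the layer the batch features are taken from are assumptions.

```python
import torch

def l2_sp_penalty(model, pretrained_state, alpha=0.01):
    """L2-SP-style penalty: decay transferred weights towards their pretrained
    starting point w0 rather than towards zero, Omega(w) = alpha/2 * ||w - w0||^2."""
    penalty = 0.0
    for name, w in model.named_parameters():
        # Only regularize transferred layers; the reinitialised head (whose shape
        # differs from the source checkpoint) is skipped by the shape check.
        if name in pretrained_state and w.shape == pretrained_state[name].shape:
            penalty = penalty + (w - pretrained_state[name].to(w.device)).pow(2).sum()
    return 0.5 * alpha * penalty

def bss_penalty(batch_features, k=1, eta=0.001):
    """BSS-style penalty: suppress the k smallest singular values of the
    (batch_size x feature_dim) matrix of penultimate-layer features."""
    singular_values = torch.linalg.svdvals(batch_features)  # descending order
    return eta * singular_values[-k:].pow(2).sum()
```

In a fine-tuning loop these penalties would simply be added to the empirical loss before calling backward().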
5.2.2 Normalization based technique advances

Further to regularization based methods, there are several recent techniques that attempt to better align fine-tuning in the target domain with the source domain. This is achieved by making adjustments to the standard batch normalization or other forms of normalization that are used between layers in modern CNNs.

1. Sharing batch normalization hyperparameters across source and target domains has been shown to be more effective than having separate ones across many domain adaptation tasks [72, 138]. Wang et al. [138] introduce an additional batch normalization hyperparameter called domain adaptive α. This takes standard batch normalization with γ and β shared across source and target domains and scales them based on the transferability value of each channel, calculated using the mean and variance statistics prior to normalization. As far as we are aware these techniques have not been applied to the general supervised transfer learning case.
2. Stochastic normalization [53] samples batch normalization based on mini-batch statistics or based on moving statistics for each filter with probability hyperparameter p. At the start of fine-tuning on the target dataset the moving statistics are initialised with those calculated during pretraining in order to act as a regularizer. This is designed to overcome problems with small batch sizes resulting in noisy batch statistics, or the collapse in training associated with using moving statistics to normalize all feature maps [43, 44]. The authors' results show that their method improves over BSS, DELTA and L2-SP for low sampling versions of three standard target datasets and improves over all but BSS for larger versions of the same datasets. Their results again show that BSS performs better than DELTA and L2-SP in most cases and in many cases DELTA and L2-SP decrease performance compared to the standard L2 regularization baseline.

5.2.3 Other recent new techniques

Guo et al. [32] make two copies of their ResNet models pretrained on ImageNet 1K. One model is
used as a fixed feature selector with the pretrained Given a set of models with similar accuracy on layers frozen and the other model is fine-tuned. a source task, the best model for target tasks They reinitialize the final classification layer in can vary between target datasets [1]. both. A policy net trained with reinforcement • Choosing the best data for pretraining. In many learning is then used to create a mask to com- cases pretraining with smaller more closely bine layers from each model together in a unique related source datasets was found to produce way for each target example. They show that better results on target datasets than with their SpotTune model improves performance com- larger less closely related source datasets [16, pared to fine-tuning with an equivalent size single 17, 70, 76, 87, 97]. For best results the source model (double the size of the two individual mod- dataset should include the image domain of the els within the SpotTune architecture) and achieves target dataset [76]. For example ImageNet 1k close to or better than state of the art in most contains more classes of pets than Oxford Pets cases. MultiTune simplifies SpotTune by removing making them an ideal source and target dataset the policy network and concatenating the features combination. There are various measures of sim- from each model prior to the final classification ilarity used to define closely related that are layer rather than selecting layers. It also improves outlined in Section 5.3.1. on SpotTune by using two different non-binary • Finding the best hyperparameters for fine- fine-tuning hyperparameter settings [96] rather tuning. Several studies include extensive hyper- than one fine-tuned and one frozen model. The parameter searches over learning rate, learning results show that MultiTune improves or equals rate decay, weight decay, and momentum [52, accuracy compared to SpotTune in most cases 61, 64, 70, 96]. These studies show the rela- tested, with significantly less training time. tionship between the size of the target dataset Co-tuning for transfer learning [150] uses a and its similarity to the source dataset with probabilistic mapping of hard labels in the source fine-tuning hyperparameter settings. Optimal dataset to soft labels in the target dataset. This learning rate and momentum, are both shown mapping allows them to keep the final classifica- to be lower for more related source and tar- tion layer in a ResNet50 and train it using both get datasets [61, 96]. Also the number of layers the target data and soft labels from the source to reinitialise from random weights is strongly dataset. As with many other recent results, they related to the optimal learning rate [85, 96]. show that their algorithm improves on all others, • Whether a multi-step transfer process is better including BSS, DELTA and L2SP, but their results than a single step process. A multi-step pretrain- are significantly below state of the art for identi- ing process, where the intermediate dataset is cal model sizes, source and target dataset. They smaller and more closely related to the target do show the same ordering for the target datasets, dataset, often outperform a single step pre- using BSS improves on DELTA which improves on training process when originating from a very L2SP. different, large source dataset [28, 76, 86, 97]. 
Related to this, using a self-supervised learn- 5.2.4 Insights on best practice ing technique for pretraining on a more closely related source dataset can outperform using a Further to advances in techniques and models, supervised learning technique on a less closely there has been a large body of recent research that related dataset [159]. extends the early work on best practice for deep • Which type of regularization to use. L2-SP transfer learning for image classification described or other more recent transfer learning specific in Section 5.1. These studies give insights on regularization techniques like DELTA, BSS, the following decisions that need to be made stochastic normalization, etc, improve perfor- when performing deep transfer learning for image mance when the source and target dataset are classification: closely related, but often hinder it when they • Selecting the best model for the task. Models are less related [63, 64, 96, 135]. These regular- that perform better on ImageNet were found to ization techniques are discussed in more detail perform better on a range of target datasets in in Section 5.2.1. [52], however this effect eventually saturates [1]. 14
• Which loss function to use. Alternatives to making them easier to transfer to. Their final tar- the cross-entropy loss function are shown to get dataset, Flowers, is also known to be better produce representations with higher class sepa- suited to transfer to from their source datasets. ration that obtain higher accuracy on the source See section 5.6 for further discussion of which task, but are less useful for target tasks in [51]. target datasets are easier to transfer to. The results show a trade-off between learning We expect that best practice recommendations features that perform better on the source task developed for closely related datasets will not be and features relevant for the target task. applicable to less closely related target datasets as has been shown for many other methods and rec- In an attempt to generalize hyperparame- ommendations [12, 53, 61, 63, 64, 96, 135]. To test ters and protocols when pretraining with source this hypothesis we reran a selection of the exper- source datasets that are larger than ImageNet 1K, iments in BiT using Stanford Cars as the target Kolesnikov et al. created Big Transfer (BiT) [50]. dataset which is very different from the source They pretrain various sizes of ResNet on Ima- dataset ImageNet 21K and known to be more diffi- geNet 1K and 21K, and JFT and transfer them cult to transfer to [52, 96]. We first confirmed that to four small to medium closely related image we could reproduce their state of the art results classification target datasets as well as the COCO- for the datasets listed in the paper, then produced 2017 object detection dataset [66]. Based on these the results in Table 1 using Stanford Cars. These experiments they make a number of general claims results show that BiT produces far below state of about deep transfer learning when pretraining on the art results for this less related dataset. The very large datasets including: first column shows the results with all the recom- 1. Batch normalization (BN) [44] is detrimental mended hyperparameters from the paper. While to BiT, and Group Normalization [144] com- the performance can be improved with increases bined with Weight Standardization performs in learning rates and number of epochs before the well with large batches. learning rate is decayed, final results are still well 2. MixUp [151] is not useful for pretraining on below state of the art for a comparable model, large source datasets and is only useful dur- source and target dataset. The fine grained clas- ing fine-tuning for mid-sized target datasets sification task in Stanford Cars is known to be (20-500K training examples) less similar to the more general ImageNet and 3. Regularization (L2, L2-SP, dropout) does not JFT datasets. Because of this it is not surprising enhance performance in the fine-tuning phase, that recommendations developed for more closely even with very large models (the largest model related target datasets do not apply. used for experiments has 928 million parame- ters). Adjusting the training and learning rate 5.2.5 Insights on transferability decay time based on the size of the target Here we review works that give more general dataset, longer for larger datasets, provides insight as to what is happening with model sufficient regularization. 
weights, representations and the loss landscape The authors use general fine-tuning hyperparame- when transfer learning is performed as well as ters for learning rate scheduling, training time and measures of transferability of pretrained weights amount/usage of MixUp that are only adjusted to target tasks. based on the target dataset size, not for individual Several methods for analysing the feature target datasets. They achieve performance that space were used in [85]. They found that mod- is comparable to models with selectively tuned els trained from pretrained weights make similar hyperparameters for their model pretrained on mistakes on the target domain, have similar fea- ImageNet and state of the art, or close to in many tures and are surprisingly close in `2 distance in cases, for their model pretrained on the 300 times the parameter space. They are in the same basins larger source dataset JFT. However their target of the loss landscape. Models trained from random datasets, ImageNet, CIFAR 10 & 100, and Pets initialization do not live in the same basin, make are very closely related to their source datasets different mistakes, have different features and are farther away in `2 distance in the parameter space. 15
Table 1 Big transfer (BiT): General visual representation learning [50] extended results using BiT-M pretrained on ImageNet 21K. State of the art is the best known result for this model, source and target dataset. Default is the learning rate decay schedule specified by the paper for this size target dataset and x2 is two times the number of batches before decaying the learning rate compared to the default.

Dataset | Default lr 0.003 (default decay / x2 decay) | lr 0.01 (default / x2) | lr 0.03 (default / x2) | lr 0.1 (default / x2) | State of the art
Cars    | 86.20 / 86.15                               | 85.81 / 87.49          | 81.41 / 88.96          | 27.51 / 5.22          | 95.3 [87]

A flatter and easier to navigate loss landscape for pretrained models compared to their randomly initialized counterparts was also shown in [67]. They showed improved Lipschitzness and that this accelerates and stabilizes training substantially; in particular, the singular vectors of the weight gradient with large singular values are shrunk in the weight matrices. Thus, the magnitude of the gradient back-propagated through a pretrained layer is controlled, and pretrained weight matrices stabilize the magnitude of the gradient, especially in lower layers, leading to more stable training.

Several recent techniques have been proposed for measuring the transferability of pretrained weights:

1. H-score [5] is a measure of how well a pretrained model f is likely to perform on a new task with input space X and output space Y, based on the inter-class covariance cov(E_{P_{X|Y}}[f(X)|Y]) and the feature redundancy tr(cov(f(X))):

H(f) = tr(cov(f(X))^{-1} cov(E_{P_{X|Y}}[f(X)|Y]))

The H-score increases as interclass covariance increases and feature redundancy decreases. The authors show that H-score has a strong correlation with target task performance. They also show that it can be used to rank transferability and create minimum spanning trees of task transferability. The latter may be useful in guiding multi-step transfer learning for less related tasks as discussed in Section 5.2.4.
2. Transferability and negative conditional entropy (NCE) for transfer learning tasks where the source and target datasets are the same, but the tasks differ, are defined in [130]. The authors define transferability as the log-likelihood l_Y(w_Z, k_Y), where w_Z is the weights of the model backbone pretrained on the source task Z and k_Y is the weights of the classifier trained on the target task. They then define conditional cross-entropy (NCE) as another measure of transferability, defined as the empirical cross entropy of label ȳ from the target domain given a label z̄ from the source domain. To empirically demonstrate the effectiveness of the NCE measure, a ResNet-18 model as the backbone was paired with an SVM classifier. NCE was demonstrated to have strong correlation with accuracy on the target tasks for combinations of 437 source and target tasks.
3. LEEP [88] is another measure of transferability. Using the pretrained model, the joint distribution over labels in the source dataset and the target dataset labels is estimated to construct an empirical predictor. LEEP is the log expectation of the empirical predictor. LEEP is defined mathematically as:

T(\theta, D) = \frac{1}{n} \sum_{i=1}^{n} \log \left( \sum_{z \in Z} \hat{P}(y_i | z) \, \theta(x_i)_z \right)

where θ(x_i)_z is the probability of the source label z for target input data x_i predicted using the pretrained weights θ, and P̂(y_i|z) is the empirical conditional probability of target label y_i given source label z. LEEP is shown to have good theoretical properties and empirically it is demonstrated to have strong correlation with performance gain from pretraining weights on the source tasks. This is shown on source tasks ImageNet 1K and CIFAR-10 and 200 random target tasks taken from the closely related CIFAR-100 and less closely related FashionMNIST. The authors expand NCE to the case where the source and target datasets are different by creating dummy labels for the target data based on the source task using the pretrained model θ. They show that LEEP has a stronger correlation with performance gain than the expanded NCE measure and H-score.
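As an illustration, LEEP can be computed directly from the pretrained model's predicted source-label probabilities on the target training set. The sketch below is an assumed NumPy implementation of the formula above, not the authors' reference code; `source_probs` holds θ(x_i)_z and `target_labels` holds y_i.

```python
import numpy as np

def leep_score(source_probs: np.ndarray, target_labels: np.ndarray) -> float:
    """LEEP: T(theta, D) = (1/n) * sum_i log( sum_z P_hat(y_i | z) * theta(x_i)_z ).

    source_probs: (n, |Z|) predicted source-label probabilities theta(x_i)_z.
    target_labels: (n,) integer target labels y_i.
    """
    n, _ = source_probs.shape
    num_target = int(target_labels.max()) + 1

    # Empirical joint distribution P_hat(y, z) over target and source labels.
    joint = np.zeros((num_target, source_probs.shape[1]))
    for y, probs in zip(target_labels, source_probs):
        joint[y] += probs / n

    # Empirical conditional P_hat(y | z) = P_hat(y, z) / P_hat(z).
    conditional = joint / joint.sum(axis=0, keepdims=True)

    # Empirical predictor p(y | x_i) = sum_z P_hat(y | z) * theta(x_i)_z.
    predictor = source_probs @ conditional.T  # shape (n, num_target)
    return float(np.mean(np.log(predictor[np.arange(n), target_labels])))
```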