TinyTL: Reduce Memory, Not Parameters for Efficient On-Device Learning
Han Cai¹, Chuang Gan², Ligeng Zhu¹, Song Han¹
¹Massachusetts Institute of Technology, ²MIT-IBM Watson AI Lab
http://tinyml.mit.edu/
arXiv:2007.11622v4 [cs.CV] 8 Jan 2021

Abstract

On-device learning enables edge devices to continually adapt AI models to new data, which requires a small memory footprint to fit the tight memory constraints of edge devices. Existing work addresses this problem by reducing the number of trainable parameters. However, this does not directly translate into memory savings, since the major bottleneck is the activations, not the parameters. In this work, we present Tiny-Transfer-Learning (TinyTL) for memory-efficient on-device learning. TinyTL freezes the weights and learns only the bias modules, so there is no need to store the intermediate activations. To maintain the adaptation capacity, we introduce a new memory-efficient bias module, the lite residual module, which refines the feature extractor by learning small residual feature maps while adding only 3.8% memory overhead. Extensive experiments show that TinyTL significantly reduces memory usage (up to 6.5×) with little accuracy loss compared to fine-tuning the full network. Compared to fine-tuning only the last layer, TinyTL provides significant accuracy improvements (up to 34.1%) with little memory overhead. Furthermore, combined with feature extractor adaptation, TinyTL provides 7.3-12.9× memory saving without sacrificing accuracy compared to fine-tuning the full Inception-V3.

1 Introduction

Intelligent edge devices with rich sensors (e.g., billions of mobile phones and IoT devices)¹ have become ubiquitous in our daily lives. These devices continuously collect new and sensitive data through their sensors while being expected to provide high-quality, customized services without sacrificing privacy². This poses new challenges for efficient AI systems that must not only run inference but also continually fine-tune pre-trained models on newly collected data (i.e., on-device learning).

Though on-device learning can enable many appealing applications, it is an extremely challenging problem. First, edge devices are memory-constrained. For example, a Raspberry Pi 1 Model A only has 256MB of memory, which is sufficient for inference but far from sufficient for training (Figure 1, left), even with a lightweight neural network architecture (MobileNetV2 [1]). Furthermore, the memory is shared by various on-device applications (e.g., other deep learning models) and the operating system. A single application may only be allocated a small fraction of the total memory, which makes this challenge even more critical. Second, edge devices are energy-constrained. DRAM access consumes two orders of magnitude more energy than on-chip SRAM access. The large memory footprint of activations cannot fit into the limited on-chip SRAM and thus has to be stored in DRAM. For instance, the training memory of MobileNetV2 under batch size 16 is close to 1GB, which is by far larger than the SRAM size of an AMD EPYC CPU³ (Figure 1, left), not to mention lower-end edge platforms.

¹ https://www.statista.com/statistics/330695/number-of-smartphone-users-worldwide/
² https://ec.europa.eu/info/law/law-topic/data-protection_en
³ https://www.amd.com/en/products/cpu/amd-epyc-7302

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.
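To make the parameter-versus-activation gap concrete, the rough sketch below uses forward hooks to tally the activation sizes that a training step would need to store for a stand-in network (torchvision's MobileNetV2). It is only an estimate, not the paper's accounting: it counts leaf-module outputs at full precision and ignores framework overhead, gradients, and optimizer state.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

def activation_bytes(model: nn.Module, inputs: torch.Tensor) -> int:
    """Rough activation-memory estimate: sum of leaf-module output sizes in one forward pass."""
    total = 0
    hooks = []

    def hook(module, inp, out):
        nonlocal total
        if torch.is_tensor(out):
            total += out.numel() * out.element_size()

    for m in model.modules():
        if len(list(m.children())) == 0:          # leaf modules only, to avoid double counting
            hooks.append(m.register_forward_hook(hook))
    with torch.no_grad():
        model(inputs)
    for h in hooks:
        h.remove()
    return total

model = mobilenet_v2()                             # stand-in backbone for illustration
param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
act_bytes = activation_bytes(model, torch.randn(16, 3, 224, 224))   # batch size 16
print(f"parameters: {param_bytes / 1e6:.1f} MB, activations: {act_bytes / 1e6:.1f} MB")
```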
Figure 1: Left: The memory footprint required by training is much larger than that of inference. Right: Memory cost comparison between ResNet-50 and MobileNetV2-1.4 under batch size 16. Recent advances in efficient model design only reduce the size of the parameters, but the activation size, which is the main bottleneck for training, does not improve much.

If the training memory can fit in on-chip SRAM, it will drastically improve speed and energy efficiency. There are plenty of efficient inference techniques that reduce the number of trainable parameters and the computation FLOPs [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]; however, parameter-efficient or FLOPs-efficient techniques do not directly save training memory. It is the activations that bottleneck the training memory, not the parameters. For example, Figure 1 (right) compares ResNet-50 and MobileNetV2-1.4. In terms of parameter size, MobileNetV2-1.4 is 4.3× smaller than ResNet-50. However, its training activation size is almost the same as that of ResNet-50 (only 1.1× smaller), leading to little memory reduction. It is essential to reduce the size of the intermediate activations required by back-propagation, which is the key memory bottleneck for efficient on-device training.

In this paper, we propose Tiny-Transfer-Learning (TinyTL) to address these challenges. By analyzing the memory footprint during the backward pass, we notice that the intermediate activations (the main bottleneck) are only needed when updating the weights, not the biases (Eq. 2). Inspired by this finding, we propose to freeze the weights of the pre-trained feature extractor and only update the biases to reduce the memory footprint (Figure 2b). To compensate for the capacity loss, we introduce a memory-efficient bias module, called the lite residual module, which improves the model capacity by refining the intermediate feature maps of the feature extractor (Figure 2c). Meanwhile, we aggressively shrink the resolution and width of the lite residual module so that it has a small memory overhead (only 3.8%). Extensive experiments on 9 image classification datasets with the same pre-trained model (ProxylessNAS-Mobile [11]) demonstrate the effectiveness of TinyTL compared to previous transfer learning methods. Further, combined with a pre-trained once-for-all network [10], TinyTL can select a specialized sub-network as the feature extractor for each transfer dataset (i.e., feature extractor adaptation): given a more difficult dataset, a larger sub-network is selected, and vice versa. TinyTL achieves the same level of (or even higher) accuracy compared to fine-tuning the full Inception-V3 while reducing the training memory footprint by up to 12.9×. Our contributions can be summarized as follows:

• We propose TinyTL, a novel transfer learning method that reduces the training memory footprint by an order of magnitude for efficient on-device learning. We systematically analyze the memory cost of training and find that the bottleneck comes from updating the weights, not the biases (assuming ReLU activations).
• We also introduce the lite residual module, a memory-efficient bias module that improves the model capacity with little memory overhead.

• Extensive experiments on transfer learning tasks show that our method is highly memory-efficient and effective. It reduces the training memory footprint by up to 12.9× without sacrificing accuracy.

2 Related Work

Efficient Inference Techniques. Improving the inference efficiency of deep neural networks on resource-constrained edge devices has recently drawn extensive attention. Starting from [4, 5, 12, 13, 14],
one line of research focuses on compressing pre-trained neural networks, including i) network pruning that removes less-important units [4, 15] or channels [16, 17], and ii) network quantization that reduces the bitwidth of parameters [5, 18] or activations [19, 20]. However, these techniques cannot handle the training phase, as they rely on a well-trained model on the target task as the starting point. Another line of research focuses on lightweight neural architectures obtained by either manual design [1, 2, 3, 21, 22] or neural architecture search [6, 8, 11, 23]. These lightweight neural networks provide highly competitive accuracy [10, 24] while significantly improving inference efficiency. However, they do not solve the key bottleneck of training memory efficiency: the training memory is dominated by activations, not parameters (Figure 1). There are also non-deep-learning methods [25, 26, 27] designed for efficient inference on edge devices. These methods are suitable for handling simple tasks like MNIST, but for more complicated tasks we still need the representation capacity of deep neural networks.

Memory Footprint Reduction. Researchers have been seeking ways to reduce the training memory footprint. One typical approach is to re-compute discarded activations during the backward pass [28, 29]. This approach reduces memory usage at the cost of a large computation overhead, so it is not preferred for edge devices. Layer-wise training [30] can also reduce the memory footprint compared to end-to-end training; however, it cannot achieve the same level of accuracy. Another representative approach is activation pruning [31], which builds a dynamic sparse computation graph to prune activations during training. Similarly, [32] proposes to reduce the bitwidth of training activations by introducing new reduced-precision floating-point formats. Besides reducing the training memory cost, there are also techniques that focus on reducing the peak inference memory cost, such as RNNPool [33] and MemNet [34]. Our method is orthogonal to these techniques and can be combined with them to further reduce the memory footprint.

Transfer Learning. Neural networks pre-trained on large-scale datasets (e.g., ImageNet [35]) are widely used as fixed feature extractors for transfer learning, in which case only the last layer needs to be fine-tuned [36, 37, 38, 39]. This approach does not require storing the intermediate activations of the feature extractor and thus is memory-efficient. However, its capacity is limited, resulting in poor accuracy, especially on datasets [40, 41] whose distribution is far from ImageNet (e.g., only 45.9% Aircraft top1 accuracy achieved by Inception-V3 [42]). Alternatively, fine-tuning the full network can achieve better accuracy [43, 44], but it requires a vast memory footprint and hence is not friendly for training on edge devices. Recently, [45, 46] propose to only update the parameters of the batch normalization (BN) [47] layers, which greatly reduces the number of trainable parameters. Unfortunately, parameter efficiency does not translate into memory efficiency: this approach still requires a large amount of memory (e.g., 326MB under batch size 8) to store the input activations of the BN layers (Table 3). Additionally, its accuracy is still much worse than fine-tuning the full network (70.7% vs. 85.5%; Table 3). One can also partially fine-tune some layers, but how many layers to select remains ad hoc. This paper provides a systematic approach to saving memory without losing accuracy.
3 Tiny Transfer Learning

3.1 Understanding the Memory Footprint of Back-propagation

Without loss of generality, we consider a neural network M that consists of a sequence of layers:

M(·) = F_{w_n}(F_{w_{n-1}}(··· F_{w_2}(F_{w_1}(·)) ···)),   (1)

where w_i denotes the parameters of the i-th layer. Let a_i and a_{i+1} be the input and output activations of the i-th layer, respectively, and let L be the loss. In the backward pass, given ∂L/∂a_{i+1}, there are two goals for the i-th layer: computing ∂L/∂a_i and ∂L/∂w_i. Assuming the i-th layer is a linear layer whose forward process is a_{i+1} = a_i W + b, its backward process under batch size 1 is

∂L/∂a_i = (∂L/∂a_{i+1}) (∂a_{i+1}/∂a_i) = (∂L/∂a_{i+1}) W^T,   ∂L/∂W = a_i^T (∂L/∂a_{i+1}),   ∂L/∂b = ∂L/∂a_{i+1}.   (2)
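As a quick numerical sanity check of Eq. (2), the snippet below compares PyTorch's autograd gradients for a tiny linear layer against the closed-form expressions; the loss and tensor shapes are arbitrary and chosen only for illustration. It confirms that the bias gradient depends only on the incoming gradient, not on the stored input activation.

```python
import torch

# Tiny numerical check of Eq. (2) for a linear layer a_{i+1} = a_i W + b, batch size 1.
torch.manual_seed(0)
a_i = torch.randn(1, 4, requires_grad=True)
W = torch.randn(4, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

a_next = a_i @ W + b
loss = a_next.sum()                     # any scalar loss works for this check
loss.backward()

grad_out = torch.ones_like(a_next)      # dL/da_{i+1} for this particular loss
assert torch.allclose(a_i.grad, grad_out @ W.t())    # needs W, not a_i
assert torch.allclose(W.grad, a_i.t() @ grad_out)    # needs the stored activation a_i
assert torch.allclose(b.grad, grad_out.sum(dim=0))   # needs neither a_i nor W
```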
Figure 2: TinyTL overview (“C” denotes the width and “R” denotes the resolution). Conventional transfer learning relies on fine-tuning the weights to adapt the model (Fig. a), which requires a large amount of activation memory (in blue) for back-propagation. TinyTL reduces the memory usage by fixing the weights (Fig. b) while only fine-tuning the biases. Fig. (c) exploits lite residual learning to compensate for the capacity loss, using group convolution and avoiding the inverted bottleneck to achieve high arithmetic intensity and a small memory footprint. The skip connection remains unchanged (omitted for simplicity).

According to Eq. (2), the intermediate activations (i.e., {a_i}) that dominate the memory footprint are only required to compute the gradient of the weights (i.e., ∂L/∂W), not the biases. If we only update the biases, training memory can be greatly saved. This property also applies to convolution layers and normalization layers (e.g., batch normalization [47], group normalization [48], etc.), since they can be considered special types of linear layers.

Regarding non-linear activation layers (e.g., ReLU, sigmoid, h-swish), sigmoid and h-swish require storing a_i to compute ∂L/∂a_i (Table 1), hence they are not memory-efficient. Activation layers that build upon them, such as tanh, swish [49], etc., are consequently also not memory-efficient. In contrast, ReLU and other ReLU-styled activation layers (e.g., LeakyReLU [50]) only require storing a binary mask representing whether each value is smaller than 0, which is 32× smaller than storing a_i.

Table 1: Detailed forward and backward processes of non-linear activation layers. |a_i| denotes the number of elements of a_i. “◦” denotes the element-wise product. (1_{a_i ≥ 0})_j = 0 if (a_i)_j < 0 and (1_{a_i ≥ 0})_j = 1 otherwise. ReLU6(a_i) = min(6, max(0, a_i)).

Layer Type | Forward | Backward | Memory Cost
ReLU | a_{i+1} = max(0, a_i) | ∂L/∂a_i = ∂L/∂a_{i+1} ◦ 1_{a_i ≥ 0} | |a_i| bits
sigmoid | a_{i+1} = σ(a_i) = 1 / (1 + exp(−a_i)) | ∂L/∂a_i = ∂L/∂a_{i+1} ◦ σ(a_i) ◦ (1 − σ(a_i)) | 32|a_i| bits
h-swish [7] | a_{i+1} = a_i ◦ ReLU6(a_i + 3)/6 | ∂L/∂a_i = ∂L/∂a_{i+1} ◦ (ReLU6(a_i + 3)/6 + a_i ◦ 1_{−3 ≤ a_i ≤ 3}/6) | 32|a_i| bits
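To make the ReLU row of Table 1 concrete, below is an illustrative sketch of a ReLU implemented as a custom autograd function that keeps only a binary mask for the backward pass instead of the full-precision input. Note that PyTorch stores boolean tensors at one byte per element, so a production implementation would pack the mask into bits to realize the full 32× saving; this is a sketch, not the paper's released code.

```python
import torch

class MaskOnlyReLU(torch.autograd.Function):
    """ReLU whose backward pass uses only a stored binary mask (Table 1, first row)."""

    @staticmethod
    def forward(ctx, x):
        mask = x >= 0                      # one bit of information per element
        ctx.save_for_backward(mask)        # stored here as a bool tensor (1 byte/element in PyTorch)
        return x * mask                    # equivalent to max(0, x)

    @staticmethod
    def backward(ctx, grad_output):
        (mask,) = ctx.saved_tensors
        return grad_output * mask          # dL/da_i = dL/da_{i+1} ◦ 1_{a_i >= 0}

# Quick check against the expected gradient for a sum loss.
x = torch.randn(4, 8, requires_grad=True)
MaskOnlyReLU.apply(x).sum().backward()
assert torch.equal(x.grad, (x >= 0).float())
```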
3.2 Lite Residual Learning

Based on the memory footprint analysis, one possible way to reduce the memory cost is to freeze the weights of the pre-trained feature extractor while only updating the biases (Figure 2b). However, only updating biases provides limited adaptation capacity. Therefore, we introduce lite residual learning, which exploits a new class of generalized memory-efficient bias modules to refine the intermediate feature maps (Figure 2c).

Formally, a layer with frozen weights and learnable biases can be represented as:

a_{i+1} = F_W(a_i) + b.   (3)

To improve the model capacity while keeping a small memory footprint, we propose to add a lite residual module that generates a residual feature map to refine the output:

a_{i+1} = F_W(a_i) + b + F_{w_r}(a'_i = reduce(a_i)),   (4)

where a'_i = reduce(a_i) is the reduced activation. According to Eq. (2), learning these lite residual modules only requires storing the reduced activations {a'_i} rather than the full activations {a_i}.

Implementation (Figure 2c). We apply Eq. (4) to mobile inverted bottleneck blocks (MB-blocks) [1]. The key principle is to keep the activation small. Following this principle, we explore two design dimensions to reduce the activation size (a code sketch of the resulting module is given at the end of Section 3):

• Width. The widely used inverted bottleneck requires a huge number of channels (6×) to compensate for the small capacity of a depthwise convolution, which is parameter-efficient but highly activation-inefficient. Even worse, converting 1× channels to 6× channels and back requires two 1 × 1 projection layers, which doubles the total activation to 12×. Depthwise convolution also has very low arithmetic intensity (its OPs/Byte is less than 4% of a 1 × 1 convolution's OPs/Byte with 256 channels), and is thus highly memory-inefficient with little data reuse. To address these limitations, our lite residual module employs a group convolution, which has much higher arithmetic intensity than depthwise convolution, providing a good trade-off between FLOPs and memory. It also removes the 1 × 1 projection layer, reducing the total channel number by (6 × 2 + 1)/(1 + 1) = 6.5×.

• Resolution. The activation size grows quadratically with the resolution. Therefore, we shrink the resolution in the lite residual module by employing 2 × 2 average pooling to downsample the input feature map. The output of the lite residual module is then upsampled to match the size of the main branch's output feature map via bilinear upsampling. Combining the resolution and width optimizations, the activation of our lite residual module is roughly 2² × 6.5 = 26× smaller than that of the inverted bottleneck.

3.3 Discussions

Normalization Layers. As discussed in Section 3.1, TinyTL flexibly supports different normalization layers, including batch normalization (BN), group normalization (GN), layer normalization (LN), and so on. In particular, BN is the most widely used in vision tasks. However, BN requires a large batch size to accurately estimate its running statistics during training, which is not suitable for on-device learning, where we want a small training batch size to reduce the memory footprint. Moreover, data may arrive in a streaming fashion in on-device learning, which requires a training batch size of 1. In contrast to BN, GN can handle a small training batch size, because its statistics are computed independently for each input. In our experiments, GN with a small training batch size (e.g., 8) performs slightly worse than BN with a large training batch size (e.g., 256). However, as we target on-device learning, we choose GN in our models.

Feature Extractor Adaptation. TinyTL can be applied to different backbone neural networks, such as MobileNetV2 [1], ProxylessNASNets [11], EfficientNets [24], etc. However, since the weights of the feature extractor are frozen in TinyTL, we find that using the same backbone neural network for all transfer tasks is sub-optimal.
Therefore, we choose the backbone of TinyTL using a pre-trained once-for-all network [10] to adaptively select the specialized feature extractor that best fits the target transfer dataset. Specifically, a once-for-all network is a special kind of neural network from which many different sub-networks can be derived without retraining, by sparsely activating parts of the model according to an architecture configuration (i.e., depth, width, kernel size, resolution), while the weights are shared. This allows us to efficiently evaluate the effectiveness of a backbone neural network on the target transfer dataset without the expensive pre-training process. Further details of the feature extractor adaptation process are provided in Appendix A.
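For concreteness, here is a minimal PyTorch sketch of an MB-block wrapped with the bias module and lite residual module of Section 3.2 (Eq. 3-4, Figure 2c). It is an illustrative approximation rather than the released implementation: normalization layers are omitted, the internal activation choice is an assumption, and the block's input and output widths are assumed to be equal.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LiteResidualBlock(nn.Module):
    """Frozen main branch + learnable bias + lite residual module (sketch of Eq. 4 / Fig. 2c)."""

    def __init__(self, main_branch: nn.Module, channels: int,
                 groups: int = 2, kernel_size: int = 5):
        super().__init__()
        self.main = main_branch
        for p in self.main.parameters():
            p.requires_grad_(False)                       # frozen feature-extractor weights
        self.bias = nn.Parameter(torch.zeros(1, channels, 1, 1))   # learnable bias module
        self.lite = nn.Sequential(                        # downsample -> group conv -> 1x1 conv
            nn.AvgPool2d(2),                              # 2x2 average pooling (0.5R resolution)
            nn.Conv2d(channels, channels, kernel_size,
                      padding=kernel_size // 2, groups=groups, bias=False),
            nn.ReLU(inplace=True),                        # ReLU: only a binary mask is stored (Table 1)
            nn.Conv2d(channels, channels, 1, bias=False),
        )

    def forward(self, x):
        out = self.main(x) + self.bias                    # Eq. (3): F_W(a_i) + b
        res = self.lite(x)                                # residual computed on the reduced activation
        res = F.interpolate(res, size=out.shape[-2:],
                            mode='bilinear', align_corners=False)   # bilinear upsampling
        return out + res                                  # Eq. (4)
```

Only `self.bias` and `self.lite` hold trainable parameters, so back-propagation needs to store just the downsampled input rather than the full activations of the main branch.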
Table 2: Comparison between TinyTL and conventional transfer learning methods (training memory footprint is calculated assuming batch size 8 and the classifier head for Flowers). For the object classification datasets, we report top1 accuracy (%), while for CelebA we report the average top1 accuracy (%) over 40 facial attribute classification tasks. ‘B’ represents Bias and ‘L’ represents LiteResidual. FT-Last fine-tunes only the last layer. FT-Norm+Last fine-tunes the normalization layers and the last layer. FT-Full fine-tunes the full network. The backbone neural network is ProxylessNAS-Mobile, and the resolution is 224 except for ‘TinyTL-L+B@320’, whose resolution is 320. TinyTL consistently outperforms FT-Last and FT-Norm+Last by a large margin with a similar or lower training memory footprint. By increasing the resolution to 320, TinyTL reaches the same level of accuracy as FT-Full while being 6× more memory-efficient.

Method | Train. Mem. | Flowers | Cars | CUB | Food | Pets | Aircraft | CIFAR10 | CIFAR100 | CelebA
FT-Last | 31MB | 90.1 | 50.9 | 73.3 | 68.7 | 91.3 | 44.9 | 85.9 | 68.8 | 88.7
TinyTL-B | 32MB | 93.5 | 73.4 | 75.3 | 75.5 | 92.1 | 63.2 | 93.7 | 78.8 | 90.4
TinyTL-L | 37MB | 95.3 | 84.2 | 76.8 | 79.2 | 91.7 | 76.4 | 96.1 | 80.9 | 91.2
TinyTL-L+B | 37MB | 95.5 | 85.0 | 77.1 | 79.7 | 91.8 | 75.4 | 95.9 | 81.4 | 91.2
TinyTL-L+B@320 | 65MB | 96.8 | 88.8 | 81.0 | 82.9 | 92.9 | 82.3 | 96.1 | 81.5 | -
FT-Norm+Last | 192MB | 94.3 | 77.9 | 76.3 | 77.0 | 92.2 | 68.1 | 94.8 | 80.2 | 90.4
FT-Full | 391MB | 96.8 | 90.2 | 81.0 | 84.6 | 93.0 | 86.0 | 97.1 | 84.1 | 91.4

4 Experiments

4.1 Setups

Datasets. Following common practice [43, 44, 45], we use ImageNet [35] as the pre-training dataset and then transfer the models to 8 downstream object classification tasks, including Cars [41], Flowers [51], Aircraft [40], CUB [52], Pets [53], Food [54], CIFAR10 [55], and CIFAR100 [55]. Besides object classification, we also evaluate TinyTL on human facial attribute classification, where CelebA [56] is the transfer dataset and VGGFace2 [57] is the pre-training dataset.

Model Architecture. To justify the effectiveness of TinyTL, we first apply TinyTL and previous transfer learning methods to the same backbone neural network, ProxylessNAS-Mobile [11]. For each MB-block in ProxylessNAS-Mobile, we insert a lite residual module as described in Section 3.2 and Figure 2 (c). The group number is 2 and the kernel size is 5. We use the ReLU activation since it is more memory-efficient, according to Section 3.1. We replace all BN layers with GN layers to better support small training batch sizes, and we set the number of channels per group to 8 for all GN layers. Following [58], we apply weight standardization [59] to convolution layers that are followed by GN.

For feature extractor adaptation, we build the once-for-all network using the MobileNetV2 design space [10, 11], which contains five stages with gradually decreasing resolution; each stage consists of a sequence of MB-blocks. At the stage level, it supports elastic depth (i.e., 2, 3, 4). At the block level, it supports elastic kernel size (i.e., 3, 5, 7) and elastic width expansion ratio (i.e., 3, 4, 6). Similarly, for each MB-block in the once-for-all network, we insert a lite residual module that supports elastic group number (i.e., 2, 4) and elastic kernel size (i.e., 3, 5).

Training Details. We freeze the memory-heavy modules (the weights of the feature extractor) and only update the memory-efficient modules (biases, lite residual modules, classifier head) during transfer learning.
The models are fine-tuned for 50 epochs using the Adam optimizer [60] with batch size 8 on a single GPU. The initial learning rate is tuned for each dataset, and a cosine schedule [61] is adopted for learning rate decay. We apply 8-bit weight quantization [5] to the frozen weights to reduce the parameter size, which causes a negligible accuracy drop in our experiments. For all compared methods, we also assume 8-bit weight quantization is applied, if eligible, when calculating their training memory footprint. Additionally, as PyTorch does not support explicit fine-grained memory management, we use the theoretically calculated training memory footprint for comparison in our experiments. For simplicity, we assume the batch size is 8 for all compared methods throughout the experiment section.
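The sketch below illustrates this training setup: only the biases, lite residual modules, and classifier head are handed to the Adam optimizer with a cosine schedule, while everything else stays frozen. The toy backbone and the parameter-name patterns are illustrative stand-ins, not the released code.

```python
import torch
import torch.nn as nn

class ToyBackbone(nn.Module):
    """Stand-in for a backbone that has lite residual modules and a classifier head."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
        self.lite = nn.Conv2d(3, 8, 3, padding=1)        # placeholder lite residual module
        self.classifier = nn.Linear(8, num_classes)
    def forward(self, x):
        h = self.features(x) + self.lite(x)
        return self.classifier(h.mean(dim=(2, 3)))

def memory_efficient_parameters(model: nn.Module):
    """Yield only biases, lite residual modules, and the classifier head; freeze the rest."""
    for name, p in model.named_parameters():
        trainable = ('lite' in name) or ('classifier' in name) or name.endswith('.bias')
        p.requires_grad_(trainable)
        if trainable:
            yield p

model = ToyBackbone()
optimizer = torch.optim.Adam(memory_efficient_parameters(model), lr=1e-3)     # lr tuned per dataset
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)   # 50 epochs, cosine decay
```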
Figure 3: Top1 accuracy results of different transfer learning methods under varied resolutions using the same pre-trained neural network (ProxylessNAS-Mobile). With the same level of accuracy, TinyTL achieves 3.9-6.5× memory saving compared to fine-tuning the full network.

4.2 Main Results

Effectiveness of TinyTL. Table 2 reports the comparison between TinyTL and previous transfer learning methods, including: i) fine-tuning the last linear layer [36, 37, 39] (referred to as FT-Last); ii) fine-tuning the normalization layers (e.g., BN, GN) and the last linear layer [42] (referred to as FT-Norm+Last); and iii) fine-tuning the full network [43, 44] (referred to as FT-Full). We also study several variants of TinyTL, including: i) TinyTL-B, which fine-tunes the biases and the last linear layer; ii) TinyTL-L, which fine-tunes the lite residual modules and the last linear layer; and iii) TinyTL-L+B, which fine-tunes the lite residual modules, biases, and the last linear layer. All compared methods use the same pre-trained model but fine-tune different parts of it as discussed above. We report the average accuracy across five runs.

Compared to FT-Last, TinyTL maintains a similar training memory footprint while improving the top1 accuracy by a significant margin.
In particular, TinyTL-L+B improves the top1 accuracy by 34.1% on Cars, by 30.5% on Aircraft, by 12.6% on CIFAR100, by 11.0% on Food, etc., which shows the improved adaptation capacity of our method over FT-Last. Compared to FT-Norm+Last, TinyTL-L+B improves the training memory efficiency by 5.2× while providing up to 7.3% higher top1 accuracy, which shows that our method is not only more memory-efficient but also more effective than FT-Norm+Last. Compared to FT-Full, TinyTL-L+B@320 achieves the same level of accuracy while providing 6.0× training memory saving.

Regarding the comparison between different variants of TinyTL, both TinyTL-L and TinyTL-L+B have clearly better accuracy than TinyTL-B while incurring little memory overhead, which shows that the lite residual modules are essential in TinyTL. Besides, we find that TinyTL-L+B is slightly better than TinyTL-L on most of the datasets while maintaining the same memory footprint. Therefore, we choose TinyTL-L+B as the default.

Figure 3 shows the results under different input resolutions. We observe that simply reducing the input resolution results in significant accuracy drops for FT-Full. In contrast, TinyTL can reduce the memory footprint by 3.9-6.5× while achieving the same or even higher accuracy compared to fine-tuning the full network.

Combining TinyTL and Feature Extractor Adaptation. Table 3 summarizes the results of TinyTL and previously reported transfer learning results, where different backbone neural networks are used as the feature extractor. Combined with feature extractor adaptation, TinyTL achieves 7.5-12.9× memory saving compared to fine-tuning the full Inception-V3, reducing the footprint from 850MB to 66-114MB while providing the same level of accuracy. Additionally, we try updating the last two layers besides the biases and lite residual modules (indicated by †), which results in 2MB of extra training memory footprint. This slightly improves the accuracy, from 90.7% to 91.5% on Cars, from 85.0% to 86.0% on Food, and from 84.8% to 85.4% on Aircraft.
Table 3: Comparison with previous transfer learning results under different backbone neural networks. ‘I-V3’ is Inception-V3; ‘N-A’ is NASNet-A Mobile; ‘M2-1.4’ is MobileNetV2-1.4; ‘R-50’ is ResNet-50; ‘PM’ is ProxylessNAS-Mobile; ‘FA’ represents feature extractor adaptation. † indicates that the last two layers are updated besides the biases and lite residual modules in TinyTL. TinyTL+FA reduces the training memory by 7.5-12.9× without sacrificing accuracy compared to fine-tuning the widely used Inception-V3.

Method | Net | Train. mem. | Reduce ratio | Flowers | Cars | CUB | Food | Pets | Aircraft | CIFAR10 | CIFAR100
FT-Full | I-V3 [44] | 850MB | 1.0× | 96.3 | 91.3 | 82.8 | 88.7 | - | 85.5 | - | -
FT-Full | R-50 [43] | 802MB | 1.1× | 97.5 | 91.7 | - | 87.8 | 92.5 | 86.6 | 96.8 | 84.5
FT-Full | M2-1.4 [43] | 644MB | 1.3× | 97.5 | 91.8 | - | 87.7 | 91.0 | 86.8 | 96.1 | 82.5
FT-Full | N-A [43] | 566MB | 1.5× | 96.8 | 88.5 | - | 85.5 | 89.4 | 72.8 | 96.8 | 83.9
FT-Norm+Last | I-V3 [42] | 326MB | 2.6× | 90.4 | 81.0 | - | - | - | 70.7 | - | -
FT-Last | I-V3 [42] | 94MB | 9.0× | 84.5 | 55.0 | - | - | - | 45.9 | - | -
TinyTL | PM@320 | 65MB | 13.1× | 96.8 | 88.8 | 81.0 | 82.9 | 92.9 | 82.3 | 96.1 | 81.5
TinyTL | FA@256 | 66MB | 12.9× | 96.8 | 89.6 | 80.8 | 83.4 | 93.0 | 82.4 | 96.8 | 82.7
TinyTL | FA@352 | 114MB | 7.5× | 97.4 | 90.7 | 82.4 | 85.0 | 93.4 | 84.8 | - | -
TinyTL | FA@352† | 116MB | 7.3× | - | 91.5 | - | 86.0 | - | 85.4 | - | -

Figure 4: Compared with dynamic activation pruning [31], TinyTL saves memory more effectively.

4.3 Ablation Studies and Discussions

Comparison with Dynamic Activation Pruning. The comparison between TinyTL and dynamic activation pruning [31] is summarized in Figure 4. TinyTL is more effective because it re-designs the transfer learning framework (lite residual module, feature extractor adaptation) rather than pruning an existing architecture. The transfer accuracy of activation pruning drops quickly when the pruning ratio increases beyond 50% (only 2× memory saving). In contrast, TinyTL achieves much higher memory reduction without loss of accuracy.

Initialization for Lite Residual Modules. By default, we use the weights pre-trained on the pre-training dataset to initialize the lite residual modules. This requires having lite residual modules during both the pre-training phase and the transfer learning phase. When applying TinyTL to existing pre-trained neural networks that do not have lite residual modules during the pre-training phase, we need to use another initialization strategy for the lite residual modules during transfer learning.
To verify the effectiveness of TinyTL under this setting, we also evaluate the performance of TinyTL when using random weights [62] to initialize the lite residual modules, except for the scaling parameter of the final normalization layer in each lite residual module, which is initialized with zeros. Table 4 reports the summarized results. We find that using pre-trained weights to initialize the lite residual modules consistently outperforms using random weights. Besides, we also find that TinyTL-RandomL+B still provides highly competitive results on Cars, Food, Aircraft, CIFAR10, CIFAR100, and CelebA. Therefore, given the budget, it is better to use pre-trained weights to initialize the lite residual modules; if not, TinyTL can still be applied and provides competitive results on datasets whose distribution is far from the pre-training dataset.
Table 4: Results of TinyTL under different initialization strategies for the lite residual modules. TinyTL-L+B adds lite residual modules starting from the pre-training phase and uses the pre-trained weights to initialize them during transfer learning. In contrast, TinyTL-RandomL+B uses random weights to initialize the lite residual modules. Using random weights for initialization hurts the performance of TinyTL, but on datasets whose distribution is far from the pre-training dataset, TinyTL-RandomL+B still provides competitive results.

Method | Train. Mem. | Flowers | Cars | CUB | Food | Pets | Aircraft | CIFAR10 | CIFAR100 | CelebA
FT-Last | 31MB | 90.1 | 50.9 | 73.3 | 68.7 | 91.3 | 44.9 | 85.9 | 68.8 | 88.7
TinyTL-RandomL+B | 37MB | 88.0 | 82.4 | 72.9 | 79.3 | 84.3 | 73.6 | 95.7 | 81.4 | 91.2
TinyTL-L+B | 37MB | 95.5 | 85.0 | 77.1 | 79.7 | 91.8 | 75.4 | 95.9 | 81.4 | 91.2
FT-Norm+Last | 192MB | 94.3 | 77.9 | 76.3 | 77.0 | 92.2 | 68.1 | 94.8 | 80.2 | 90.4
FT-Full | 391MB | 96.8 | 90.2 | 81.0 | 84.6 | 93.0 | 86.0 | 97.1 | 84.1 | 91.4

Figure 5: Results of TinyTL when trained with batch size 1. It further reduces the training memory footprint to around 16MB (a typical L3 cache size), making it possible to train in the cache (SRAM) instead of DRAM.

Results of TinyTL under Batch Size 1. Figure 5 shows the results of TinyTL when using a training batch size of 1. We tune the initial learning rate for each dataset while keeping the other training settings unchanged. As our model employs group normalization rather than batch normalization (Section 3.3), we observe little to no loss of accuracy compared to training with batch size 8. Meanwhile, the training memory footprint is further reduced to around 16MB, a typical L3 cache size. This makes it much easier to train in the cache (SRAM), which can greatly reduce energy consumption compared to DRAM training.

5 Conclusion

We proposed Tiny-Transfer-Learning (TinyTL) for memory-efficient on-device learning, which aims to adapt pre-trained models to newly collected data on edge devices. Unlike previous methods that focus on reducing the number of parameters or FLOPs, TinyTL directly optimizes the training memory footprint by fixing the memory-heavy modules (i.e., the weights) while learning memory-efficient bias modules. We further introduce lite residual modules that significantly improve the adaptation capacity of the model with little memory overhead.
Extensive experiments on benchmark datasets consistently show the effectiveness and memory efficiency of TinyTL, paving the way for efficient on-device machine learning.
Broader Impact

The proposed efficient on-device learning technique greatly reduces the training memory footprint of deep neural networks, enabling pre-trained models to be adapted to new data locally on edge devices without leaking that data to the cloud. It can help democratize AI for people in rural areas where the Internet is unavailable or the network condition is poor: they can not only run inference but also fine-tune AI models on their local devices without connecting to cloud servers. This can also benefit privacy-sensitive AI applications, such as health care, smart home, and so on.

Acknowledgements

We thank MIT-IBM Watson AI Lab, NSF CAREER Award #1943349 and NSF Award #2028888 for supporting this research. We thank the MIT Satori cluster for providing the computation resources.

References

[1] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In CVPR, 2018.
[2] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[3] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In CVPR, 2018.
[4] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In NeurIPS, 2015.
[5] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In ICLR, 2016.
[6] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le. Mnasnet: Platform-aware neural architecture search for mobile. In CVPR, 2019.
[7] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for mobilenetv3. In ICCV, 2019.
[8] Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. In CVPR, 2019.
[9] Han Cai, Ji Lin, Yujun Lin, Zhijian Liu, Kuan Wang, Tianzhe Wang, Ligeng Zhu, and Song Han. Automl for architecting efficient and specialized neural networks. IEEE Micro, 2019.
[10] Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. Once for all: Train one network and specialize it for efficient deployment. In ICLR, 2020.
[11] Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In ICLR, 2019.
[12] Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.
[13] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In NeurIPS, 2014.
[14] Vincent Vanhoucke, Andrew Senior, and Mark Z Mao. Improving the speed of neural networks on cpus. In NeurIPS Deep Learning and Unsupervised Feature Learning Workshop, 2011.
[15] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR, 2019.
[16] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In ICCV, 2017.
[17] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In ICCV, 2017.
[18] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In NeurIPS, 2015.
[19] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In CVPR, 2018.
[20] Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. Haq: Hardware-aware automated quantization. In CVPR, 2019.
[21] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360, 2016.
[22] Gao Huang, Shichen Liu, Laurens Van der Maaten, and Kilian Q Weinberger. Condensenet: An efficient densenet using learned group convolutions. In CVPR, 2018.
[23] Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun Wang. Efficient architecture search by network transformation. In AAAI, 2018.
[24] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In ICML, 2019.
[25] Ashish Kumar, Saurabh Goyal, and Manik Varma. Resource-efficient machine learning in 2 kb ram for the internet of things. In ICML, 2017.
[26] Chirag Gupta, Arun Sai Suggala, Ankit Goyal, Harsha Vardhan Simhadri, Bhargavi Paranjape, Ashish Kumar, Saurabh Goyal, Raghavendra Udupa, Manik Varma, and Prateek Jain. Protonn: Compressed and accurate knn for resource-scarce devices. In ICML, 2017.
[27] Don Kurian Dennis, Yash Gaurkar, Sridhar Gopinath, Sachin Goyal, Chirag Gupta, Moksh Jain, Ashish Kumar, Aditya Kusupati, Chris Lovett, Shishir G Patil, Oindrila Saha, and Harsha Vardhan Simhadri. EdgeML: Machine learning for resource-constrained edge devices.
[28] Audrunas Gruslys, Rémi Munos, Ivo Danihelka, Marc Lanctot, and Alex Graves. Memory-efficient backpropagation through time. In NeurIPS, 2016.
[29] Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.
[30] Klaus Greff, Rupesh K Srivastava, and Jürgen Schmidhuber. Highway and residual networks learn unrolled iterative estimation. arXiv preprint arXiv:1612.07771, 2016.
[31] Liu Liu, Lei Deng, Xing Hu, Maohua Zhu, Guoqi Li, Yufei Ding, and Yuan Xie. Dynamic sparse graph for efficient deep learning. In ICLR, 2019.
[32] Naigang Wang, Jungwook Choi, Daniel Brand, Chia-Yu Chen, and Kailash Gopalakrishnan. Training deep neural networks with 8-bit floating point numbers. In NeurIPS, 2018.
[33] Oindrila Saha, Aditya Kusupati, Harsha Vardhan Simhadri, Manik Varma, and Prateek Jain. Rnnpool: Efficient non-linear pooling for ram constrained inference. arXiv preprint arXiv:2002.11921, 2020.
[34] Peiye Liu, Bo Wu, Huadong Ma, Pavan Kumar Chundi, and Mingoo Seok. Memnet: Memory-efficiency guided neural architecture search with augment-trim learning. arXiv preprint arXiv:1907.09569, 2019.
[35] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
[36] Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In BMVC, 2014.
[37] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.
[38] Chuang Gan, Naiyan Wang, Yi Yang, Dit-Yan Yeung, and Alex G Hauptmann. Devnet: A deep event network for multimedia event detection and evidence recounting. In CVPR, pages 2568–2577, 2015.
[39] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. Cnn features off-the-shelf: an astounding baseline for recognition. In CVPR Workshops, 2014.
[40] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
[41] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, 2013.
[42] Pramod Kaushik Mudrakarta, Mark Sandler, Andrey Zhmoginov, and Andrew Howard. K for the price of 1: Parameter efficient multi-task and transfer learning. In ICLR, 2019.
[43] Simon Kornblith, Jonathon Shlens, and Quoc V Le. Do better imagenet models transfer better? In CVPR, 2019.
[44] Yin Cui, Yang Song, Chen Sun, Andrew Howard, and Serge Belongie. Large scale fine-grained categorization and domain-specific transfer learning. In CVPR, 2018.
[45] Pramod Kaushik Mudrakarta, Mark Sandler, Andrey Zhmoginov, and Andrew Howard. K for the price of 1: Parameter-efficient multi-task and transfer learning. In ICLR, 2019.
[46] Jonathan Frankle, David J Schwab, and Ari S Morcos. Training batchnorm and only batchnorm: On the expressive power of random features in cnns. arXiv preprint arXiv:2003.00152, 2020.
[47] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[48] Yuxin Wu and Kaiming He. Group normalization. In ECCV, 2018.
[49] Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. In ICLR Workshop, 2018.
[50] Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853, 2015.
[51] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In Sixth Indian Conference on Computer Vision, Graphics & Image Processing, 2008.
[52] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset. 2011.
[53] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In CVPR, 2012.
[54] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. In ECCV, 2014.
[55] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[56] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Large-scale celebfaces attributes (celeba) dataset. Retrieved August, 2018.
[57] Qiong Cao, Li Shen, Weidi Xie, Omkar M Parkhi, and Andrew Zisserman. Vggface2: A dataset for recognising faces across pose and age. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), 2018.
[58] Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big transfer (bit): General visual representation learning. arXiv preprint arXiv:1912.11370, 2019.
[59] Siyuan Qiao, Huiyu Wang, Chenxi Liu, Wei Shen, and Alan Yuille. Weight standardization. arXiv preprint arXiv:1903.10520, 2019.
[60] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[61] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
[62] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[63] Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and Jian Sun. Single path one-shot neural architecture search with uniform sampling. arXiv preprint arXiv:1904.00420, 2019.

A Details of Feature Extractor Adaptation

Conventional transfer learning chooses the feature extractor according to the pre-training accuracy (e.g., ImageNet accuracy) and uses the same one for all transfer tasks [44, 45]. However, we find this approach sub-optimal, since different target tasks may need very different feature extractors, and high pre-training accuracy does not guarantee good transferability of the pre-trained weights. This is especially critical in our case, where the weights are frozen.

Figure 6: Transfer learning performances of various ImageNet pre-trained models with only the last linear layer trained. The relative accuracy order between different pre-trained models changes significantly between ImageNet and the transfer learning datasets. (Underlying data, top1 accuracy %:)

Model | ImageNet | Flowers102 | Stanford Cars | Aircraft
MobileNetV2 | 72.0 | 91.7 | 51.6 | 44.2
ResNet-34 | 73.3 | 90.6 | 49.2 | 41.3
MnasNet | 74.0 | 90.8 | 47.3 | 41.6
ProxylessNAS | 74.6 | 90.3 | 48.8 | 43.2
MobileNetV3 | 75.2 | 83.2 | 42.4 | 37.4
ResNet-50 | 77.2 | 92.4 | 51.6 | 41.5
ResNet-101 | 78.3 | 92.1 | 53.4 | 41.5
Inception-v3 | 78.8 | 84.5 | 55.0 | 45.9

Figure 6 shows the top1 accuracy of various widely used ImageNet pre-trained models on three transfer datasets when only the last layer is learned, which reflects the transferability of their pre-trained weights. The relative order between different pre-trained models is not consistent with their ImageNet accuracy on all three datasets. This result indicates that ImageNet accuracy is not a good proxy for transferability. Besides, we also find that the same pre-trained model can have very different rankings on different tasks. For instance, Inception-V3 gives poor accuracy on Flowers but provides top results on the other two datasets. Therefore, we need to specialize the feature extractor to best match the target dataset.

In this work, we achieve this by using a pre-trained once-for-all network [10] that comprises many different sub-networks. Specifically, given a once-for-all network pre-trained on ImageNet, we fine-tune it on the target transfer dataset with the weights of the main branches (i.e., MB-blocks) frozen and the other parameters (i.e., biases, lite residual modules, classifier head) updated via gradient descent. In this phase, we randomly sample one sub-network in each training step. The peak memory cost of this phase is 61MB under resolution 224, which is reached when the largest sub-network is sampled.
Regarding the computation cost, the average MAC (forward & backward)⁴ of the sampled sub-nets is (776M + 2510M) / 2 = 1643M per sample, where 776M is the training MAC of the smallest sub-network and 2510M is the training MAC of the largest sub-network. Therefore, the total MAC of this phase on Flowers is 1643M × 2040 × 0.8 × 50 = 134T, where 2040 is the total number of training samples, 0.8 means the once-for-all network is fine-tuned on 80% of the training samples (the remaining 20% is reserved for search), and 50 is the number of training epochs.

Based on the fine-tuned once-for-all network, we collect 500 [sub-net, accuracy] pairs on the validation set (20% randomly sampled training data) and train an accuracy predictor⁵ using the collected data [10]. We employ evolutionary search [63] based on the accuracy predictor to find the sub-network that best matches the target transfer dataset. No back-propagation through the once-for-all network is required in this step, so it incurs no additional memory overhead.

⁴ The training MAC of a sampled sub-network is roughly 2× larger than its inference MAC, rather than 3×, since we do not need to update the weights of the main branches.
⁵ Details of the accuracy predictor are provided in Appendix B.
The primary computation cost of this phase comes from collecting the 500 [sub-net, accuracy] pairs required to train the accuracy predictor. This only involves the forward processes of the sampled sub-nets; no back-propagation is required. The average MAC (forward only) of the sampled sub-nets is (355M + 1182M) / 2 = 768.5M per sample, where 355M is the inference MAC of the smallest sub-network and 1182M is the inference MAC of the largest sub-network. Therefore, the total MAC of this phase on Flowers is 768.5M × 2040 × 0.2 × 500 = 157T, where 2040 is the total number of training samples, 0.2 means the validation set consists of 20% of the training samples, and 500 is the number of measured sub-nets.

Finally, we fine-tune the searched sub-network, with the weights of the main branches frozen and the other parameters updated, using the full training set to get the final results. The memory cost of this phase is 66MB under resolution 256 on Flowers. The total MAC on Flowers is 2190M × 2040 × 1.0 × 50 = 223T, where 2190M is the training MAC, 2040 is the total number of training samples, 1.0 means the full training set is used, and 50 is the number of training epochs.

B Details of the Accuracy Predictor

The accuracy predictor is a three-layer feed-forward neural network with a hidden dimension of 400 and ReLU as the activation function for each layer. It takes the one-hot encoding of the sub-network's architecture as input and outputs the predicted accuracy of the given sub-network. The inference MAC of this accuracy predictor is only 0.37M, which is 3-4 orders of magnitude smaller than the inference MAC of the CNN classification models. The memory footprint of this accuracy predictor is only 5KB. Therefore, both the computation overhead and the memory overhead of the accuracy predictor are negligible. (A code sketch of such a predictor is given below, after Appendix C.)

C Cost Details

Figure 7: On-device training cost on Pets. TinyTL requires 9.9× lower memory cost (assuming the same batch size) and 27× lower computation cost compared to fine-tuning the full MobileNetV2-1.4 [43], while achieving better accuracy.

The on-device training cost of TinyTL and FT-Full on Pets is summarized in Figure 7 (left side), while the memory cost breakdown of TinyTL is provided in Figure 7 (right side). Compared to fine-tuning the full MobileNetV2-1.4, TinyTL not only greatly reduces the activation size but also reduces the parameter size (by applying weight quantization to the frozen parameters) and the training cost (by not updating the weights of the feature extractor and using fewer training steps). Specifically, TinyTL reduces the memory footprint by 9.9× and the training computation by 27× without loss of accuracy. Therefore, TinyTL is not only more memory-efficient but also more computation-efficient.
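As a reference for Appendix B, here is a minimal sketch of such an accuracy predictor: a three-layer MLP with hidden dimension 400 and ReLU activations that maps a one-hot architecture encoding to a predicted accuracy. The encoding dimension, the squared-error objective, and the training loop are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

class AccuracyPredictor(nn.Module):
    """Three-layer MLP (hidden dim 400, ReLU) mapping a one-hot architecture
    encoding to a predicted accuracy (Appendix B). Input dimension is a placeholder."""

    def __init__(self, encoding_dim: int = 128, hidden_dim: int = 400):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(encoding_dim, hidden_dim), nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, arch_encoding: torch.Tensor) -> torch.Tensor:
        return self.net(arch_encoding).squeeze(-1)       # predicted accuracy per architecture

# Example: regress on collected [sub-net encoding, accuracy] pairs (stand-in random data).
predictor = AccuracyPredictor()
encodings = torch.randint(0, 2, (500, 128)).float()      # stand-in for 500 one-hot encodings
accuracies = torch.rand(500)                              # stand-in for measured accuracies
optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-3)
for _ in range(100):
    loss = nn.functional.mse_loss(predictor(encodings), accuracies)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```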